基于GPU的大规模基因片段并行匹配的方法

The new approach of multiple genome sequence matching based on GPU

作者：丁莎(四川大学计算机学院; 四川大学锦江学院计算机学院)；赵士元(四川大学锦江学院计算机学院)；林涛(四川大学计算机学院)

Author：DING Sha(College of Computer Science, Sichuan University; College of Computer Science, Jinjiang College of Sichuan University)；ZHAO Shi-Yuan(College of Computer Science, Jinjiang College of Sichuan University)；LIN Tao(College of Computer Science, Sichuan University)

收稿日期：2016-07-01 年卷（期）页码：2017,54(2):280-286

期刊名称：四川大学学报: 自然科学版

Journal Name：Journal of Sichuan University (Natural Science Edition)

关键字：后缀数组;后缀树; GPU; 基因片段匹配;并行

Key words：suffix array; suffix tree; GPU; genome sequence matching; parallel

基金项目：四川省科技厅支撑项目(2012GZ0091；2013GZX0138)；四川大学青年教师科研启动基金2015SCU11050)

中文摘要

后缀树和后缀数组广泛用于生物信息学领域中,特别是通过启发式算法在对DNA基因片段进行匹配的阶段。本文提出了在GPU 的平台下，利用多核和超多核体系构成的后缀树以及后缀数组并行匹配大规模基因片段，从而加速基因搜索匹配过程。相对于后缀树，后缀数组二分搜素算法具有内存占用少，缓存使用率高等优点。在GPU的性能评估中，后缀数组执行效率明显超过后缀树，后缀数组占用的空间仅为后缀树的20%—30%。相对于CPU的串行实现，后缀树组达到了约99倍的加速比。实验结果表明在基因片段匹配的过程中，基于GPU的后缀数组二分搜索是一种高效且实用的方法。

英文摘要

Suffix trees and suffix arrays have been used widely in bioinformatics applications, especially for DNA sequence alignments in the initial exact match phase of heuristic algorithms. In this paper, a new GPU implementation and optimization of the suffix tree and suffix array on both multi-core and many-core platforms to accelerate multiple genome sequence searching is presented. The comparative performance evaluation between the suffix tree and suffix array is then carried out. The results showed that the suffix array needed only 20–30% of memory space compared with the suffix tree, and that the mean search time of the suffix array was significantly shorter than the mean search time of the suffix tree because of the use of a binary search with coalesced memory access and tile optimization under the GPU architecture. Moreover, the GPU implementation of the suffix array gained a speedup of approximate 99 times compared with the corresponding CPU serial implementation. This study showed that the massively parallel sequence matching algorithm based on suffix array was an efficient approach with the high-performance in the process of multiple DNA sequence matching.

【关闭】

论文摘要

基于GPU的大规模基因片段并行匹配的方法

The new approach of multiple genome sequence matching based on GPU