Interdisciplinary Bio Central
Etc. (Omics (Physiomics/metabolomics/proteomics/genomics) )

Protein Sequence Search based on N-gram Indexing
Mi-Nyeong Hwang1,* and Jinsuk Kim1,*
1System Development Team, Knowledge Information Center, Korea Institute of Science and Technology Information (KISTI), Korea
*Corresponding author
Published: February 28, 2006
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

With advances in experimental techniques in molecular biology, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes increasingly difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method that retrieves similar sequences quickly, precisely, and with accuracy comparable to existing tools. The method regards a protein sequence as text written in a language of 20 amino acid codes and adopts fixed-length N-gram tokens as its indexing scheme for sequence strings. Once such tokens are indexed for all sequences in the database, sequences can be searched with information retrieval algorithms. Using this method, we have developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results such as similar sequences with significant homology, predicted subcellular locations of the query sequence, and major keywords extracted from the annotations of similar sequences. We show experimentally that the N-gram indexing approach significantly reduces retrieval time and that it is as accurate as the currently popular search tool BLAST.
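
The indexing idea described in the abstract can be illustrated with a minimal sketch. It is not the ProSeS implementation: the abstract does not state the token length or the ranking function, so the choice of N = 3, the shared-token count used as a score, the function names, and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def ngrams(sequence, n=3):
    """Return all overlapping fixed-length n-grams of a sequence string."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_index(database, n=3):
    """Build an inverted index: n-gram token -> set of sequence IDs containing it."""
    index = defaultdict(set)
    for seq_id, sequence in database.items():
        for token in ngrams(sequence, n):
            index[token].add(seq_id)
    return index


def search(query, index, n=3, top_k=5):
    """Rank sequences by how many distinct query n-grams they share (assumed score)."""
    scores = Counter()
    for token in set(ngrams(query, n)):
        for seq_id in index.get(token, ()):
            scores[seq_id] += 1
    return scores.most_common(top_k)


# Hypothetical toy database; the sequences are made up for this example.
database = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIGGGG",
    "seqC": "WWWWPPPPLL",
}
index = build_index(database)
hits = search("MKTAYIAK", index)
print(hits)
```

Only sequences that share at least one token with the query are ever touched during a lookup, which is the intuition behind the reduced retrieval time. A production system would likely replace the raw shared-token count with a weighted information retrieval score.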

Keywords: homology search, N-gram indexing, sequence retrieval, sequence search tool, ProSeS
IBC ISSN: 2005-8543