Interdisciplinary Bio Central
Full Report (Bioinformatics/Computational biology/Molecular modeling)

An approach for a substitution matrix based on protein blocks and physiochemical properties of amino acids through PCA
Youngki You1, In Hwan Jang1, Kyungro Lee2, Heon Joo Kim1 and Kwan Hee Lee1,*
1School of Life Science, Handong Global University, Pohang, 791-708, Republic of Korea
2Department of Biotechnology, Yonsei University, Seoul, 120-749, Republic of Korea
*Corresponding author
  Received : February 25, 2014
  Accepted : August 29, 2014
  Published : November 05, 2014
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Amino acid substitution matrices are essential tools for protein sequence analysis, homologous sequence searches in protein databases, and multiple sequence alignment. The PAM matrix was the first widely used amino acid substitution matrix, and the BLOSUM series later succeeded it. Most substitution matrices were developed from the statistical frequency of substitutions between amino acids in blocks representing protein families or groups of related proteins. However, amino acid substitution itself depends on the similarity of the physiochemical properties of the amino acids involved. In this study, a new approach was used to identify the major physiochemical properties underlying multiple sequence alignments. The frequencies of amino acid substitution in a multiple sequence alignment database were merged with selected amino acid attributes from a physiochemical properties database. Principal component analysis of the merged data revealed the major physiochemical properties. Using factor analysis, four principal components were interpreted as flexibility of electronic movement, polarity, negative charge, and structural flexibility. Applying these four components, BAPS was constructed and validated for accuracy. In a comparison of receiver operating characteristic (ROC50) values, BAPS scored slightly lower than BLOSUM and PAM. However, when accuracy was evaluated by comparing multiple sequence alignment results with the structural alignments of two test data sets of known three-dimensional structure from the homologous structure alignment database, BAPS performed comparably to or better than prior matrices, including the PAM, Gonnet, Identity, and Genetic code matrices.
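
As a rough illustration of the principal component step only, the sketch below standardizes a small table and extracts its leading components by power iteration with deflation. It is not the authors' pipeline: the table `toy` holds made-up numbers rather than real substitution frequencies or amino acid attributes, the function names are hypothetical, and the factor analysis and matrix construction stages are not covered.

```python
import math

def standardize(rows):
    """Column-wise z-scores, so attributes on different scales are comparable.
    Assumes every column has nonzero variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def covariance(z):
    """Sample covariance of standardized data (i.e. the correlation matrix)."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)]
            for i in range(m)]

def top_component(cov, iters=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix
    by power iteration, starting from a fixed non-uniform vector."""
    m = len(cov)
    v = [float(i + 1) for i in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left to explain
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m))
                 for i in range(m))
    return eigval, v

def pca(rows, k):
    """Return the k leading (eigenvalue, loading vector) pairs."""
    cov = covariance(standardize(rows))
    m = len(cov)
    components = []
    for _ in range(k):
        val, vec = top_component(cov)
        components.append((val, vec))
        # Deflation: subtract the component just found from the matrix.
        cov = [[cov[i][j] - val * vec[i] * vec[j] for j in range(m)]
               for i in range(m)]
    return components

# Made-up table: rows are items, columns are three hypothetical attributes.
# Columns 0 and 1 are perfectly correlated; column 2 is uncorrelated with both.
toy = [[1, 2, 5], [2, 4, 3], [3, 6, 8], [4, 8, 1], [5, 10, 6]]

if __name__ == "__main__":
    total = len(toy[0])  # trace of a correlation matrix = number of columns
    for val, vec in pca(toy, 2):
        print(round(val / total, 3), [round(x, 3) for x in vec])
```

In this toy table the two correlated columns collapse into a single component, which is the kind of redundancy reduction that would let a large set of attributes be summarized by a few interpretable axes.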

Keywords: BAPS, factor analysis, principal component analysis, scoring matrix, sequence alignment
IBC   ISSN : 2005-8543