Pseudo amino acid composition

In molecular biology, pseudo amino acid composition (PseAAC) is a method introduced by Kuo-Chen Chou to convert a protein sequence into a numerical vector for use with pattern recognition techniques, for example when discriminating between classes of proteins based on their sequences (e.g. between membrane proteins, transmembrane proteins, cytosolic proteins, and other types).[1] The method represented an advance over using the amino acid composition (AAC) alone. Instead, the protein is characterized by a matrix of amino-acid frequencies. This matrix incorporates not only the amino acid composition, but can also incorporate information from local features of the protein sequence.[2]

Due to the success and widespread application of the PseAAC method, it was extended to address sequence-order effects in nucleotide compositions, giving rise to an analogous method called PseKNC (pseudo K-tuple nucleotide composition).[3]

Sequential and discrete models

Two kinds of models are usually used to represent protein samples: the sequential and the discrete (or non-sequential) models.[4] The most elementary sequential model uses the entire amino acid sequence, as expressed by

$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{1}$$

where P represents the amino acid sequence, $L$ is the number of amino acid residues, $R_1$ is the first residue of the protein P, $R_2$ is the second residue, and so forth.

The problem with this approach is that, in sequence-similarity-search-based tools, the query protein often lacks significant homology (sequence similarity) with any known protein in the database. To resolve this problem, discrete models for representing protein samples were proposed. The simplest discrete model uses the amino acid composition (AAC) to represent protein samples. Under the AAC model, the protein P of Eq.1 can also be expressed by

$$\mathbf{P} = [f_1\ f_2\ \cdots\ f_{20}]^{\mathsf{T}} \tag{2}$$

where $f_u$ ($u = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in P, and $\mathsf{T}$ is the transposing operator.[4]
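The AAC vector can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `amino_acid_composition`, the alphabetical ordering of the 20 one-letter codes, and the choice to ignore non-standard residues are all assumptions made for the example.

```python
from collections import Counter

# The 20 native amino acids in one-letter code. The alphabetical ordering is
# an assumption of this sketch; the text does not prescribe an order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return [f_1, ..., f_20], the normalized occurrence frequencies."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Residues outside the 20-letter alphabet (e.g. 'X') are ignored here,
    # which is a simplification made for this example.
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two made-up sequences with the same residues in a different order.
aac_1 = amino_acid_composition("ACDA")
aac_2 = amino_acid_composition("AADC")
```

Here `aac_1` and `aac_2` are identical even though the two sequences differ, which illustrates that the AAC model keeps no record of residue order.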

Pseudo-Amino Acid Composition (PseAAC) model

The primary weakness of the discrete model based on the amino acid composition (AAC) is that the amino acid frequencies alone carry no sequence-order information, that is, information conveyed by the order of the amino acid residues. To avoid this information loss, the concept of PseAAC (pseudo amino acid composition) was proposed.[1]

Under this new model, the first 20 discrete factors, which represent the amino acid frequencies, are retained, and additional discrete factors are included that capture information about sequence order. The sequence-order information is represented by what are called "pseudo components". The number of additional components, beyond the first 20 frequencies, is called λ (or upper-case Λ), so the model has 20+λ components. The upper limit for λ is one less than the length of the shortest protein sample in the dataset.[1] The total number of components (20+λ) may be denoted Ω. Any additional factors can be incorporated so long as they capture or represent information about the sequence order in some way. Typically, these are a series of rank-different correlation factors along the protein chain.[4]

The essence of PseAAC is therefore that, on the one hand, it covers the amino acid composition, while on the other hand it contains information beyond the amino acid composition, and hence can better reflect the features of a protein sequence through a discrete model.
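The limit on λ and the resulting dimension 20+λ can be checked with a short helper. This is an illustrative sketch: the function name `pseaac_dimension` and the example sequences are invented for this purpose.

```python
def pseaac_dimension(sequences, lam):
    """Return the number of components, 20 + lambda, after validating lambda."""
    if not sequences:
        raise ValueError("need at least one sequence")
    # Lambda can be at most one less than the length of the shortest sample.
    max_lam = min(len(s) for s in sequences) - 1
    if not 0 <= lam <= max_lam:
        raise ValueError(f"lambda must be between 0 and {max_lam}")
    return 20 + lam


# Made-up sequences; the shortest has 5 residues, so lambda can be at most 4.
dataset = ["MKTAYIAKQR", "MSDNE", "MKVLAAGIVG"]
omega = pseaac_dimension(dataset, lam=4)
```

With these inputs `omega` is 24, and asking for `lam=5` raises a `ValueError` because the shortest sequence cannot supply a 5th-tier correlation.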

Various modes of formulating the PseAAC vector have also been developed, as summarized in a 2009 review article.[2]

Algorithm

Figure 1. A schematic drawing to show (a) the 1st-tier, (b) the 2nd-tier, and (c) the 3rd-tier sequence-order-correlation mode along a protein sequence, where R1 represents the amino acid residue at the sequence position 1, R2 at position 2, and so forth (cf. Eq.1), and the coupling factors are given by Eq.6. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the 2nd most contiguous residues, and panel (c) that between all the 3rd most contiguous residues.

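According to the PseAAC model, the protein P of Eq.1 can be formulated as

$$\mathbf{P} = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{\mathsf{T}}, \quad (\lambda < L) \tag{3}$$

where the ($20+\lambda$) components are given by

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{4}$$

where $w$ is the weight factor, and $\tau_k$ the $k$-th tier correlation factor that reflects the sequence order correlation between all the $k$-th most contiguous residues, as formulated by

$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L) \tag{5}$$

with

$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ \Phi_q(R_{i+k}) - \Phi_q(R_i) \right]^2 \tag{6}$$

where $\Phi_q(R_i)$ is the $q$-th function of the amino acid $R_i$, and $\Gamma$ the total number of the functions considered. For example, in the original paper by Chou,[1] $\Phi_1(R_i)$, $\Phi_2(R_i)$ and $\Phi_3(R_i)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $R_i$, while $\Phi_1(R_{i+1})$, $\Phi_2(R_{i+1})$ and $\Phi_3(R_{i+1})$ are the corresponding values for the amino acid $R_{i+1}$. Therefore, the total number of functions considered there is $\Gamma = 3$. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional amino acid composition of the protein, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st tier, 2nd tier, ..., and the $\lambda$-th tier sequence order correlation patterns (Figure 1). It is through these additional factors that some important sequence-order effects are incorporated.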

$\lambda$ in Eq.3 is an integer parameter, and choosing a different integer for $\lambda$ leads to a PseAAC of a different dimension.[5]
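The construction in Eq.3 to Eq.6 can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the original implementation: the function names, the alphabetical ordering of residues, and the value `w=0.05` used in the call are choices made for the example, and `toy_props` contains made-up placeholder numbers rather than real hydrophobicity, hydrophilicity, or side-chain mass values. The sketch also applies no preprocessing to the property values and expects sequences that contain only the 20 standard one-letter codes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption of the sketch


def correlation_factor(sequence, k, props):
    """k-th tier correlation factor tau_k (Eq.5), with J taken from Eq.6."""
    length = len(sequence)
    if not 1 <= k < length:
        raise ValueError("k must satisfy 1 <= k < len(sequence)")
    total = 0.0
    for i in range(length - k):
        a, b = sequence[i], sequence[i + k]
        # J_{i,i+k}: mean squared difference over the Gamma property tables.
        total += sum((p[b] - p[a]) ** 2 for p in props) / len(props)
    return total / (length - k)


def pseaac(sequence, lam, w, props):
    """Return the (20 + lam)-dimensional vector of Eq.3, built with Eq.4."""
    length = len(sequence)
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    taus = [correlation_factor(sequence, k, props) for k in range(1, lam + 1)]
    # Shared denominator of Eq.4, so that all components sum to 1.
    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


# PLACEHOLDER property tables: made-up numbers, NOT real physicochemical data.
toy_props = [
    {aa: i / 19 for i, aa in enumerate(AMINO_ACIDS)},
    {aa: (i % 5) / 4 for i, aa in enumerate(AMINO_ACIDS)},
    {aa: (i % 3) / 2 for i, aa in enumerate(AMINO_ACIDS)},
]

# Made-up sequence; w=0.05 is an illustrative weight, not a recommendation.
vector = pseaac("MKTAYIAKQRQISFVK", lam=3, w=0.05, props=toy_props)
```

With `lam=3` the result has 23 components that sum to 1. Unlike the plain amino acid composition, two sequences with the same residues in a different order, such as `"ACDA"` and `"AADC"`, generally produce different vectors, because the correlation factors depend on which residues are neighbours.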

Using Eq.6 is just one of many modes for deriving the correlation factors in PseAAC or its components. Others, such as the physicochemical distance mode[6] and the amphiphilic pattern mode,[7] can also be used to derive different types of PseAAC, as summarized in a 2009 review article.[2] In 2011, the formulation of PseAAC (Eq.3) was extended to the general PseAAC, given by[8]

$$\mathbf{P} = [\psi_1\ \psi_2\ \cdots\ \psi_u\ \cdots\ \psi_\Omega]^{\mathsf{T}} \tag{7}$$

where the subscript $\Omega$ is an integer, and its value and the components $\psi_1, \psi_2, \ldots$ will depend on how the desired information is extracted from the amino acid sequence of P in Eq.1.

The general PseAAC can be used to reflect any desired features according to the targets of research, including core features such as functional domain, sequential evolution, and gene ontology, to improve the prediction quality for the subcellular localization of proteins[9][10] as well as many of their other important attributes.

References

  1. Chou KC (May 2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174. S2CID 28406797.
  2. Chou KC (2009). "Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology". Current Proteomics. 6 (4): 262–274. doi:10.2174/157016409789973707.
  3. Chen W, Lei TY, Jin DC, Lin H, Chou KC (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001.
  4. Chou KC (March 2011). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–247. doi:10.1016/j.jtbi.2010.12.024. ISSN 0022-5193.
  5. Chou KC, Shen HB (November 2007). "Recent progress in protein subcellular location prediction". Anal. Biochem. 370 (1): 1–16. doi:10.1016/j.ab.2007.07.006. PMID 17698024.
  6. Chou KC (November 2000). "Prediction of protein subcellular locations by incorporating quasi-sequence-order effect". Biochem. Biophys. Res. Commun. 278 (2): 477–83. doi:10.1006/bbrc.2000.3815. PMID 11097861.
  7. Chou KC (January 2005). "Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes". Bioinformatics. 21 (1): 10–9. doi:10.1093/bioinformatics/bth466. PMID 15308540.
  8. Chou KC (March 2011). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–47. Bibcode:2011JThBi.273..236C. doi:10.1016/j.jtbi.2010.12.024. PMC 7125570. PMID 21168420.
  9. Chou KC, Shen HB (2008). "Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms". Nat Protoc. 3 (2): 153–62. doi:10.1038/nprot.2007.494. PMID 18274516. S2CID 226104.
  10. Shen HB, Chou KC (February 2008). "PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition". Anal. Biochem. 373 (2): 386–8. doi:10.1016/j.ab.2007.10.012. PMID 17976365.