
Pseudo K-tuple nucleotide composition (PseKNC) is a method for converting a nucleotide sequence (DNA or RNA) into a numerical vector that can be used as input to pattern recognition techniques. The K-tuple is typically a dinucleotide (K = 2) or a trinucleotide (K = 3), in which case the technique is also called PseDNC or PseTNC, respectively.^[1]

The method was derived from an analogous method in proteomics known as PseAAC (Pseudo Amino Acid Composition) that is applied to protein sequences.^[2]

Background

PseAAC

PseAAC (Pseudo Amino Acid Composition) is the proteomics method from which PseKNC was derived.^[2] Before PseAAC, predictions of protein properties relied either on a sequential model, in which the protein is represented by its full amino acid sequence, or on a discrete model, the simplest of which is the amino acid composition: a vector of twenty elements, each representing the frequency of one amino acid in the protein. The discrete model, however, fails to account for sequence-order information. The PseAAC model extends the 20-element vector of the discrete model with λ additional components, each of which captures some sequence-order information, and the extended vector becomes the basis for making predictions.^[3]
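To make the limitation of the discrete model concrete, the following minimal sketch computes the 20-element amino acid composition of two short peptides. The function name `amino_acid_composition` and the example peptides are illustrative assumptions, not part of PseAAC itself. The two peptides contain the same residues in a different order, so they receive identical vectors.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(protein: str) -> list[float]:
    """Return the 20-element vector of normalized amino acid frequencies."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two hypothetical peptides: same residues, different order.
peptide_a = "ACDKKL"
peptide_b = "LKDKCA"

vec_a = amino_acid_composition(peptide_a)
vec_b = amino_acid_composition(peptide_b)

# The composition alone cannot tell the two peptides apart.
print(vec_a == vec_b)  # True
```

Any feature built only from these 20 frequencies is blind to residue order, which is the gap the λ additional components are meant to fill.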

Analogous problem in genomics

Analogously, a discrete model of a nucleotide sequence based on its dinucleotide composition would involve a vector of 16 elements, each representing the frequency of one dinucleotide in the sequence:^[1]

$\mathbf {D} ={\begin{bmatrix}f(AA)&f(AC)&\cdots &f(TT)\end{bmatrix}}^{\mathbf {T} }$

where D is the vector representing the DNA sequence, T is the transpose operator, and f(AA) is the normalized occurrence frequency of AA in the DNA sequence. A trinucleotide representation, with 64 elements, can be denoted as:^[1]

$\mathbf {D} ={\begin{bmatrix}f(AAA)&f(AAC)&\cdots &f(TTT)\end{bmatrix}}^{\mathbf {T} }$

As can be seen, these discrete models fail to consider any global or long-range sequence-order information. To address this for both DNA and RNA sequences, the pseudo K-tuple nucleotide composition or PseKNC was proposed.^[4]^[5]^[6]
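The discrete K-tuple composition vectors described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any published tool: the function name `ktuple_composition` is hypothetical, and it assumes overlapping windows and normalization by the number of counted windows, so that the frequencies sum to 1.

```python
from itertools import product


def ktuple_composition(sequence: str, k: int = 2, alphabet: str = "ACGT") -> dict[str, float]:
    """Normalized frequency of every K-tuple, counted over overlapping windows."""
    sequence = sequence.upper()
    # Enumerate all 4^K tuples in the order AA, AC, ..., TT (for k = 2).
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    n_windows = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with symbols outside the alphabet
            counts[window] += 1
            n_windows += 1
    if n_windows == 0:
        raise ValueError("sequence has no valid K-tuples")
    return {t: counts[t] / n_windows for t in tuples}


freqs = ktuple_composition("ACGTACGT", k=2)
print(len(freqs))   # 16
print(freqs["AC"])  # AC fills 2 of the 7 overlapping windows
```

For an RNA sequence the same function can be called with `alphabet="ACGU"`. Each element depends only on adjacent positions inside one window, so shuffling distant parts of the sequence leaves the vector largely unchanged.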

PseKNC

PseKNC extends the discrete model by adding λ components that represent sequence-order information and physicochemical properties of the nucleotide sequence. The original K-tuple nucleotide composition (KNC) model has 4^K components; for dinucleotides (K = 2) this gives 4² = 16 components. The PseKNC extension therefore has (4^K + λ) components.^[1]
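The text above gives only the dimension of the PseKNC vector, not its formula. The sketch below follows one commonly described formulation for the dinucleotide case: the j-th extra component is the average squared difference in property values between dinucleotides j positions apart, and a weight w balances the extra components against the frequencies. Treat the exact formula as an assumption to be checked against the cited papers. The names `pseknc_sketch`, `lam`, and `w` are hypothetical, and the `PROPERTY` table uses GC fraction and purine fraction as transparent placeholders, not real physicochemical measurements.

```python
from itertools import product

ALPHABET = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=2)]

# Placeholder properties (NOT real physicochemical data): each dinucleotide
# maps to (GC fraction, purine fraction), standing in for measured values.
PROPERTY = {
    d: (sum(b in "GC" for b in d) / 2, sum(b in "AG" for b in d) / 2)
    for d in DINUCLEOTIDES
}


def dinucleotide_frequencies(seq: str) -> list[float]:
    """The 16 normalized dinucleotide frequencies (overlapping windows)."""
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DINUCLEOTIDES]


def coupling(d1: str, d2: str) -> float:
    """Mean squared difference between the property values of two dinucleotides."""
    p1, p2 = PROPERTY[d1], PROPERTY[d2]
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)


def correlation_factor(seq: str, j: int) -> float:
    """Average coupling between all dinucleotide pairs that are j positions apart."""
    n_pairs = len(seq) - j - 1
    total = sum(coupling(seq[i:i + 2], seq[i + j:i + j + 2]) for i in range(n_pairs))
    return total / n_pairs


def pseknc_sketch(seq: str, lam: int = 3, w: float = 0.5) -> list[float]:
    """Return a (16 + lam)-element vector: frequencies plus weighted correlation factors."""
    seq = seq.upper()
    if set(seq) - set(ALPHABET):
        raise ValueError("only A, C, G and T are supported in this sketch")
    if len(seq) < lam + 2:
        raise ValueError("sequence too short for the requested lam")
    freqs = dinucleotide_frequencies(seq)
    thetas = [correlation_factor(seq, j) for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)  # shared normalizer for both parts
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


vector = pseknc_sketch("ACGTTGCAAGCTTAGC", lam=3, w=0.5)
print(len(vector))  # 19
```

With K = 2 and `lam=3` the result has 4² + 3 = 19 elements, matching the (4^K + λ) count. In this sketch a larger `lam` captures correlations over longer distances but needs a longer sequence, since `lam` must not exceed the sequence length minus 2.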

Applications

PseKNC has been used in a wide range of applications.^[7] For example, it has become an integral component of many algorithms designed to predict the locations of recombination hotspots and coldspots from sequence information.^[8]^[9]

Web servers

For the convenience of the scientific community, a freely available web server called PseKNC^[4] and an open-source package called PseKNC-General^[5] were developed in 2013 and 2014, respectively, to convert large-scale sequence datasets into pseudo nucleotide compositions with numerous choices of physicochemical property combinations. PseKNC-General can generate several modes of pseudo nucleotide composition, including the conventional K-tuple nucleotide composition, the Moreau–Broto autocorrelation coefficient, the Moran autocorrelation coefficient, the Geary autocorrelation coefficient, Type I PseKNC, and Type II PseKNC.

Another web server, Pse-in-One, lets users choose among the existing PseAAC and PseKNC methods for protein, RNA, and DNA sequences, along with any of the physicochemical property combinations available for those methods.^[10]

References

[1] Chen, Wei; Lei, Tian-Yu; Jin, Dian-Chuan; Lin, Hao; Chou, Kuo-Chen (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001.
[2] Chou, Kuo-Chen (2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins: Structure, Function, and Genetics. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174. S2CID 28406797.
[3] Chou, Kuo-Chen (2011-03-21). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–247. doi:10.1016/j.jtbi.2010.12.024. ISSN 0022-5193. PMC 7125570.
[4] Chen, Wei; Lei, Tian-Yu; Jin, Dian-Chuan; Lin, Hao; Chou, Kuo-Chen (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001. PMID 24732113.
[5] Chen, Wei; Zhang, Xitong; Brooker, Jordan; Lin, Hao; Zhang, Liqing; Chou, Kuo-Chen (2015). "PseKNC-General: A cross-platform package for generating various modes of pseudo nucleotide compositions". Bioinformatics. 31 (1): 119–20. doi:10.1093/bioinformatics/btu602. PMID 25231908.
[6] Chen, Wei; Lin, Hao; Chou, Kuo-Chen (2015). "Pseudo nucleotide composition or PseKNC: An effective formulation for analyzing genomic sequences". Molecular BioSystems. 11 (10): 2620–34. doi:10.1039/c5mb00155b. PMID 26099739.
[7] Chen, Wei; Lin, Hao; Chou, Kuo-Chen (2015). "Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences". Molecular BioSystems. 11 (10): 2620–2634. doi:10.1039/C5MB00155B. ISSN 1742-206X.
[8] Liu, Bin; Wang, Shanyi; Long, Ren; Chou, Kuo-Chen (2017-01-01). "iRSpot-EL: identify recombination spots with an ensemble learning approach". Bioinformatics. 33 (1): 35–41. doi:10.1093/bioinformatics/btw539. ISSN 1367-4803.
[9] Ye, Dong-Xin; Yu, Jun-Wen; Li, Rui; Hao, Yu-Duo; Wang, Tian-Yu; Yang, Hui; Ding, Hui (2024-06-12). "The Prediction of Recombination Hotspot Based on Automated Machine Learning". Journal of Molecular Biology: 168653. doi:10.1016/j.jmb.2024.168653. ISSN 0022-2836.
[10] Liu, Bin; Liu, Fule; Wang, Xiaolong; Chen, Junjie; Fang, Longyun; Chou, Kuo-Chen (2015-07-01). "Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences". Nucleic Acids Research. 43 (W1): W65–W71. doi:10.1093/nar/gkv458. ISSN 0305-1048. PMC 4489303. PMID 25958395.
