Substitution matrix

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.[1]

Background

In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence

ALEIRYLRD

could mutate into the sequence

ALEINYLRD

in one step, and possibly

AQEINYQRD

over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophilic residue such as arginine is more likely to be replaced by another hydrophilic residue such as glutamine than by a hydrophobic residue such as leucine. (Here, a residue refers to an amino acid stripped of a hydrogen and/or a hydroxyl group and inserted in the polymeric chain of a protein.) This is primarily due to redundancy in the genetic code, which translates similar codons into similar amino acids. Furthermore, mutating an amino acid to a residue with significantly different properties could affect the folding and/or activity of the protein. This type of disruptive substitution is likely to be removed from populations by the action of purifying selection because the substitution has a higher likelihood of rendering a protein nonfunctional.[2]

Given two amino acid sequences, we should be able to say something about how likely they are to be homologous, that is, derived from a common ancestor. If a sequence alignment algorithm can line up the two sequences such that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences are evolutionarily plausible, then the comparison of the sequences should receive a high score.

To this end, we will construct a 20×20 matrix in which the (i, j)th entry is equal to the probability of the ith amino acid being transformed into the jth amino acid in a certain amount of evolutionary time. There are many different ways to construct such a matrix, called a substitution matrix. Here are the most commonly used ones:

Identity matrix

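The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like

$$
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}
$$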

This identity matrix works for aligning very similar amino acid sequences but performs poorly when aligning two distantly related sequences. The probabilities need to be determined in a more rigorous fashion. It turns out that an empirical examination of previously aligned sequences works best.
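As a minimal sketch of scoring with the identity matrix, the following Python function (the name `identity_score` is an illustrative choice, not taken from any library) scores an ungapped alignment by counting identical positions. It is applied to the example sequences from the Background section.

```python
def identity_score(a: str, b: str) -> int:
    """Score an ungapped alignment under the identity matrix:
    1 for each position where the residues are identical, 0 otherwise."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


ancestor = "ALEIRYLRD"
print(identity_score(ancestor, "ALEINYLRD"))  # 8 (one substitution)
print(identity_score(ancestor, "AQEINYQRD"))  # 6 (three substitutions)
```

Every mismatch costs the same here, so a conservative replacement such as leucine to isoleucine is penalized exactly as much as a disruptive one. That is why this scheme loses sensitivity for distant relatives.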

Log-odds matrices

We express the probabilities of transformation in what are called log-odds scores. The score matrix S is defined as

$$
S_{ij} = \log \frac{p_i \, M_{ij}}{p_i \, p_j} = \log \frac{M_{ij}}{p_j} = \log \frac{\text{observed frequency}}{\text{expected frequency}}
$$

where $M_{ij}$ is the probability that amino acid i transforms into amino acid j, and $p_i$, $p_j$ are the frequencies of amino acids i and j. The base of the logarithm is not important, and the same substitution matrix is often expressed in different bases.
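The log-odds calculation can be sketched in a few lines of Python. The sketch takes `M[i][j]` as the probability that residue i is replaced by residue j and `p[j]` as the background frequency of residue j, and computes each score as the logarithm of `M[i][j] / p[j]`. The two-letter alphabet and its numbers are invented purely for illustration, and the function name `log_odds_matrix` is hypothetical.

```python
import math


def log_odds_matrix(M, p, base=2.0):
    """Return S with S[i][j] = log_base(M[i][j] / p[j]).

    M: square matrix of transformation probabilities (all entries > 0 here).
    p: background frequencies of the residues.
    """
    n = len(p)
    return [[math.log(M[i][j] / p[j], base) for j in range(n)]
            for i in range(n)]


# Toy two-residue alphabet (illustrative values only).
p = [0.5, 0.5]
M = [[0.9, 0.1],
     [0.1, 0.9]]

S = log_odds_matrix(M, p)
for row in S:
    print([round(s, 3) for s in row])
```

A positive score means the pair is observed more often than expected by chance, and a negative score means it is observed less often. Changing `base` only rescales every entry by the same constant factor.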

Example matrices

PAM

One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix, was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. Because very closely related homologs are used, the observed mutations are not expected to significantly change the common functions of the proteins. Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.

One PAM unit is defined as the amount of evolutionary change in which 1% of the amino acid positions have been changed. To create a PAM1 substitution matrix, a group of very closely related sequences with mutation frequencies corresponding to one PAM unit is chosen. Based on collected mutational data from this group of sequences, a substitution matrix can be derived. This PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed.

The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and that multiple substitutions can occur at the same site. With this assumption, the PAM2 matrix can be estimated by squaring the PAM1 probability matrix, that is, by multiplying it by itself. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM30 and the PAM70 matrices are used.
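The extrapolation step amounts to raising the PAM1 probability matrix to a power. The sketch below illustrates this in plain Python with a made-up two-state matrix in which each state has a 1% chance of changing per PAM unit. The helper names `mat_mul` and `mat_pow` are illustrative, and real PAM matrices are 20×20 and estimated from data.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Raise a square matrix to a non-negative integer power
    using repeated squaring."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1" over two states (illustrative values only).
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

pam2 = mat_pow(pam1, 2)      # PAM1 multiplied by itself
pam250 = mat_pow(pam1, 250)  # 250 PAM units of change

print(pam2[0])
print(pam250[0])
```

In this toy example the off-diagonal entry of `pam2` is slightly less than 0.02, because some sites change and then change back. The same effect is why 250 PAM units do not mean that every position differs.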

BLOSUM

Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff & Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins and will therefore have lower substitution rates than less conserved regions. To reduce bias from closely related sequences on substitution rates, segments in a block with a sequence identity above a certain threshold were clustered, reducing the weight of each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pair frequencies were then counted between clusters, so pairs were only counted between segments less than 62% identical. One would use a higher-numbered BLOSUM matrix for aligning two closely related sequences and a lower-numbered one for more divergent sequences.

It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.
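The clustering and pair-counting idea described above can be sketched as follows. This is an illustrative simplification, not the original Henikoff and Henikoff program: it assumes single-linkage clustering at the identity threshold and gives every cluster a total weight of one by dividing each contribution by the product of the two cluster sizes. The function names and the three-segment block are hypothetical.

```python
from collections import Counter
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length segments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(segments, threshold):
    """Single-linkage clustering (assumed here): segments whose identity
    is at or above the threshold end up in the same cluster."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(segments)), 2):
        if identity(segments[i], segments[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(segments)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def weighted_pair_counts(segments, threshold=0.62):
    """Count aligned residue pairs between clusters only, down-weighting
    each contribution so that a cluster counts as a single sequence."""
    counts = Counter()
    for ca, cb in combinations(cluster(segments, threshold), 2):
        weight = 1.0 / (len(ca) * len(cb))
        for i in ca:
            for j in cb:
                for x, y in zip(segments[i], segments[j]):
                    counts[tuple(sorted((x, y)))] += weight
    return counts


# Hypothetical block: two identical segments plus one divergent segment.
block = ["ALEIR", "ALEIR", "AQDIN"]
print(weighted_pair_counts(block))
```

The two identical segments fall into one cluster, so each residue pair with the divergent segment is counted once in total rather than twice. Pairs within the cluster are not counted at all.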

Differences between PAM and BLOSUM

  1. PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree: maximum parsimony), whereas the BLOSUM matrices are based on an implicit model of evolution.
  2. The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
  3. The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
  4. Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.

Newer matrices

A number of newer substitution matrices have been proposed to deal with inadequacies in earlier designs.

  • JTT, published in the same year as BLOSUM, also performs clustering and uses an implicit model. This may help reduce the systematic error from maximum parsimony (MP), but also wastes sequence information.[3]
  • WAG (Whelan and Goldman), published in 2001, uses a maximum likelihood estimating procedure instead of any form of MP. The substitution scores are calculated based on the likelihood of a change considering multiple tree topologies derived using neighbor-joining. The scores correspond to a substitution model that also includes amino-acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: the WAG matrix, based on the assumption of the same amino-acid stationary frequencies across all the compared proteins, and the WAG* matrix, with different frequencies for each of the included protein families.[3]

Specialized substitution matrices and their extensions

The real substitution rates in a protein depend not only on the identity of the amino acid, but also on the specific structural or sequence context it is in. Many specialized matrices have been developed for these contexts, such as transmembrane alpha helices,[4] combinations of secondary structure states and solvent accessibility states,[5][6][7] or local sequence-structure contexts.[8] These context-specific substitution matrices generally improve alignment quality at some cost in speed, but are not yet widely used.

Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular BLAST program has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds (CS-BLAST).

Terminology

Although "transition matrix" is often used interchangeably with "substitution matrix" in fields other than bioinformatics, the former term is problematic in bioinformatics. With regard to nucleotide substitutions, "transition" is also used to indicate those substitutions that are between the two-ring purines (A → G and G → A) or between the one-ring pyrimidines (C → T and T → C). Because these substitutions do not require a change in the number of rings, they occur more frequently than the other substitutions. "Transversion" is the term used to indicate the slower-rate substitutions that change a purine to a pyrimidine or vice versa (A ↔ C, A ↔ T, G ↔ C, and G ↔ T).
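The distinction between the two classes of nucleotide substitution can be made concrete with a small classifier. This is a minimal sketch for DNA bases only; the function name `classify_substitution` is an illustrative choice.

```python
PURINES = frozenset("AG")      # two-ring bases
PYRIMIDINES = frozenset("CT")  # one-ring bases


def classify_substitution(a: str, b: str) -> str:
    """Classify a change from base a to base b as 'identity',
    'transition' (purine <-> purine or pyrimidine <-> pyrimidine),
    or 'transversion' (purine <-> pyrimidine)."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if a == b:
        return "identity"
    same_ring_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_ring_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("C", "T"))  # transition
print(classify_substitution("A", "C"))  # transversion
print(classify_substitution("G", "T"))  # transversion
```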

References

  1. ^ Zvelebil, Marketa J. (2008). Understanding bioinformatics. New York: Garland Science. pp. 117–127, 747. ISBN 978-0-8153-4024-9.
  2. ^ Xiong, Jin (2006). Essential Bioinformatics. Cambridge: Cambridge University Press. doi:10.1017/cbo9780511806087.004. ISBN 978-0-511-80608-7.
  3. ^ a b Whelan, Simon; Goldman, Nick (1 May 2001). "A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach". Molecular Biology and Evolution. 18 (5): 691–699. doi:10.1093/oxfordjournals.molbev.a003851. ISSN 0737-4038. PMID 11319253.
  4. ^ Müller, T; Rahmann, S; Rehmsmeier, M (2001). "Non-symmetric score matrices and the detection of homologous transmembrane proteins". Bioinformatics. 17 (Suppl 1): S182–9. doi:10.1093/bioinformatics/17.suppl_1.s182. PMID 11473008.
  5. ^ Rice, DW; Eisenberg, D (1997). "A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence". Journal of Molecular Biology. 267 (4): 1026–38. CiteSeerX 10.1.1.44.1143. doi:10.1006/jmbi.1997.0924. PMID 9135128.
  6. ^ Gong, Sungsam; Blundell, Tom L. (2008). Levitt, Michael (ed.). "Discarding functional residues from the substitution table improves predictions of active sites within three-dimensional structures". PLOS Computational Biology. 4 (10): e1000179. Bibcode:2008PLSCB...4E0179G. doi:10.1371/journal.pcbi.1000179. PMC 2527532. PMID 18833291.
  7. ^ Goonesekere, NC; Lee, B (2008). "Context-specific amino acid substitution matrices and their use in the detection of protein homologs". Proteins. 71 (2): 910–9. doi:10.1002/prot.21775. PMID 18004781. S2CID 27443393.
  8. ^ Huang, YM; Bystroff, C (2006). "Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions". Bioinformatics. 22 (4): 413–22. doi:10.1093/bioinformatics/bti828. PMID 16352653.
