Computational genomics

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data,[1] including both DNA and RNA sequence as well as other "post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence, such as genomic DNA microarrays). Because these data are combined with computational and statistical approaches to understanding the function of genes, as well as with statistical association analysis, the field is also often referred to as computational and statistical genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes (rather than individual genes) to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means of biological discovery.[2]

History

The roots of computational genomics are shared with those of bioinformatics. During the 1960s, Margaret Dayhoff and others at the National Biomedical Research Foundation assembled databases of homologous protein sequences for evolutionary study.[3] Their research developed a phylogenetic tree that determined the evolutionary changes that were required for a particular protein to change into another protein based on the underlying amino acid sequences. This led them to create a scoring matrix that assessed the likelihood of one protein being related to another.

Beginning in the 1980s, databases of genome sequences began to be recorded, which presented new challenges in searching and comparing databases of gene information. Unlike text-searching algorithms used on websites such as Google or Wikipedia, searching for sections of genetic similarity requires finding strings that are not simply identical, but similar. One approach to this problem is the Needleman-Wunsch algorithm (first published in 1970), a dynamic programming algorithm for comparing amino acid sequences with each other, which can be used with scoring matrices derived from the earlier research by Dayhoff. Later, the BLAST algorithm was developed for performing fast, optimized searches of gene sequence databases. BLAST and its derivatives are probably the most widely used algorithms for this purpose.[4]
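The dynamic programming idea behind Needleman-Wunsch can be illustrated with a minimal sketch. The function names (`simple_score`, `needleman_wunsch_score`) and the scoring values (+1 for a match, -1 for a mismatch, -1 per gap) are assumptions chosen for illustration; a realistic protein comparison would replace `simple_score` with a lookup in a substitution matrix such as those derived from Dayhoff's work.

```python
from typing import Callable


def simple_score(a: str, b: str) -> int:
    """Illustrative scoring: +1 for identical residues, -1 otherwise (an assumption)."""
    return 1 if a == b else -1


def needleman_wunsch_score(
    s: str,
    t: str,
    score: Callable[[str, str], int] = simple_score,
    gap: int = -1,
) -> int:
    """Return the optimal global alignment score of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    # dp[i][j] holds the best score for aligning s[:i] with t[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # s[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap  # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align two residues
                dp[i - 1][j] + gap,  # gap in t
                dp[i][j - 1] + gap,  # gap in s
            )
    return dp[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches and one gap
```

This version returns only the score; recovering the alignment itself would require a traceback through the table, which is omitted here for brevity.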

The emergence of the phrase "computational genomics" coincides with the availability of complete sequenced genomes in the mid-to-late 1990s. The first meeting of the Annual Conference on Computational Genomics was organized by scientists from The Institute for Genomic Research (TIGR) in 1998, providing a forum for this speciality and effectively distinguishing this area of science from the more general fields of Genomics or Computational Biology.[citation needed] The first use of this term in scientific literature, according to MEDLINE abstracts, was just one year earlier in Nucleic Acids Research.[5] The final Computational Genomics conference was held in 2006, featuring a keynote talk by Nobel Laureate Barry Marshall, co-discoverer of the link between Helicobacter pylori and stomach ulcers. As of 2014, the leading conferences in the field include Intelligent Systems for Molecular Biology (ISMB) and Research in Computational Molecular Biology (RECOMB).

The development of computer-assisted mathematics (using products such as Mathematica or Matlab) has helped engineers, mathematicians and computer scientists to start working in this domain, and a public collection of case studies and demonstrations is growing, ranging from whole-genome comparisons to gene expression analysis.[6] This has brought in ideas from other disciplines, including concepts from systems and control, information theory, string analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, while students fluent in both biology and computation are being trained in the many courses created in the past few years.

Contributions of computational genomics research to biology

Contributions of computational genomics research to biology include:[2]

Genome comparison

Computational tools have been developed to assess the similarity of genomic sequences. Some rely on alignment-based distances such as Average Nucleotide Identity.[7] These methods are highly specific but computationally slow. Other, alignment-free methods include statistical and probabilistic approaches. One example is Mash,[8] a probabilistic approach using MinHash. In this method, given a number k, a genomic sequence is transformed into a shorter sketch by applying a random hash function to the possible k-mers. For example, if k = 2, sketches of size 4 are being constructed, and the hash function is given by

(AA,0) (AC,8) (AT,2) (AG,14)
(CA,6) (CC,13) (CT,5) (CG,4)
(GA,15) (GC,12) (GT,10) (GG,1)
(TA,3) (TC,11) (TT,9) (TG,7)

the sketch of the sequence

CTGACCTTAACGGGAGACTATGATGACGACCGCAT

is {0,1,1,2}, the four smallest hash values of its k-mers of size 2 (the value 1 appears twice because the k-mer GG occurs twice in the sequence). These sketches are then compared to estimate the fraction of shared k-mers (Jaccard index) of the corresponding sequences. Note that a hash value is a binary number. In a real genomic setting, useful k-mer sizes range from 14 to 21, and the sketches would contain around 1000 values.[8]

By reducing the size of the sequences, even by hundreds of times, and comparing them in an alignment-free way, this method significantly reduces the time needed to estimate the similarity of sequences.
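The toy example above can be reproduced with a short sketch. This is an illustration under stated assumptions, not the Mash implementation: the names `TOY_HASH`, `kmer_hashes`, `bottom_sketch` and `jaccard_estimate` are hypothetical, the hash function is the 2-mer table given above, the second sequence is made up, and the sketch keeps only distinct hash values (treating the sketch as a set).

```python
from typing import Dict, List

# The toy hash function from the table above (2-mer -> hash value).
TOY_HASH: Dict[str, int] = {
    "AA": 0, "AC": 8, "AT": 2, "AG": 14,
    "CA": 6, "CC": 13, "CT": 5, "CG": 4,
    "GA": 15, "GC": 12, "GT": 10, "GG": 1,
    "TA": 3, "TC": 11, "TT": 9, "TG": 7,
}


def kmer_hashes(sequence: str, k: int, hash_table: Dict[str, int]) -> List[int]:
    """Hash every k-mer of the sequence, in order of occurrence."""
    return [hash_table[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


def bottom_sketch(sequence: str, k: int, size: int,
                  hash_table: Dict[str, int] = TOY_HASH) -> List[int]:
    """Keep the `size` smallest distinct hash values (assumption: set semantics)."""
    return sorted(set(kmer_hashes(sequence, k, hash_table)))[:size]


def jaccard_estimate(sketch_a: List[int], sketch_b: List[int], size: int) -> float:
    """Estimate the Jaccard index from two sketches.

    Take the `size` smallest hashes of the merged sketches and count how many
    of them are present in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


SEQUENCE = "CTGACCTTAACGGGAGACTATGATGACGACCGCAT"
OTHER = "AATTGG"  # hypothetical second sequence, for illustration only

sketch_a = bottom_sketch(SEQUENCE, k=2, size=4)
sketch_b = bottom_sketch(OTHER, k=2, size=4)
print(sketch_a)                                  # [0, 1, 2, 3]
print(sorted(kmer_hashes(SEQUENCE, 2, TOY_HASH))[:4])  # [0, 1, 1, 2]
print(jaccard_estimate(sketch_a, sketch_b, size=4))    # 0.75
```

Keeping distinct hash values gives 0, 1, 2, 3 for the example sequence, whereas counting the repeated k-mer GG separately gives 0, 1, 1, 2. With sketches this small the estimate is very coarse; realistic sketch sizes make it far more stable.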

Clustering of genomic data

Clustering is a tool used to simplify statistical analysis of a genomic sample. For example, the authors of [9] developed a tool (BiG-SCAPE) to analyze sequence similarity networks of biosynthetic gene clusters (BGCs). In [10], successive layers of clustering of biosynthetic gene clusters are used in the automated tool BiG-MAP, both to filter redundant data and to identify gene cluster families. This tool profiles the abundance and expression levels of BGCs in microbiome samples.

Biosynthetic gene clusters

Bioinformatic tools have been developed to predict biosynthetic gene clusters, and to determine their abundance and expression, in microbiome samples from metagenomic data.[11] Since metagenomic datasets are large, filtering and clustering them are important parts of these tools. These processes can include dimensionality-reduction techniques, such as MinHash,[8] and clustering algorithms such as k-medoids and affinity propagation. Several metrics and similarity measures have also been developed to compare gene clusters.

Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs). BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine) is a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion.

Kautsar et al. (2021)[12] used BiG-SLiCE to demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. This opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.[12]

Compression algorithms

Genetics compression algorithms are the latest generation of lossless algorithms that compress data (typically sequences of nucleotides) using both conventional compression algorithms and genetic algorithms adapted to the specific data type. In 2012, a team of scientists from Johns Hopkins University published a genetic compression algorithm that does not use a reference genome for compression. HapZipper was tailored for HapMap data and achieves over 20-fold compression (95% reduction in file size), providing 2- to 4-fold better compression while being less computationally intensive than the leading general-purpose compression utilities. For this, Chanda, Elhaik, and Bader introduced MAF-based encoding (MAFE), which reduces the heterogeneity of the dataset by sorting SNPs by their minor allele frequency, thus homogenizing the dataset.[13] Other algorithms developed in 2009 and 2013 (DNAZip and GenomeZip) have compression ratios of up to 1200-fold, allowing 6 billion basepair diploid human genomes to be stored in 2.5 megabytes (relative to a reference genome or averaged over many genomes).[14][15] For a benchmark of genetics/genomics data compressors, see [16].
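The idea of sorting SNPs by minor allele frequency can be sketched as follows. This is only an illustration of the reordering step, not the HapZipper implementation: the 0/1 haplotype matrix layout (rows are haplotypes, columns are SNPs), the example data, and the function names `minor_allele_frequency`, `sort_snps_by_maf` and `restore_order` are all assumptions made for this example.

```python
from typing import List, Tuple


def minor_allele_frequency(column: List[int]) -> float:
    """Frequency of the rarer allele in one SNP column of 0/1 calls."""
    freq_one = sum(column) / len(column)
    return min(freq_one, 1.0 - freq_one)


def sort_snps_by_maf(haplotypes: List[List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Reorder SNP columns by increasing MAF and return the column order used."""
    n_snps = len(haplotypes[0])
    columns = [[row[j] for row in haplotypes] for j in range(n_snps)]
    # sorted() is stable, so SNPs with equal MAF keep their original relative order.
    order = sorted(range(n_snps), key=lambda j: minor_allele_frequency(columns[j]))
    reordered = [[row[j] for j in order] for row in haplotypes]
    return reordered, order


def restore_order(reordered: List[List[int]], order: List[int]) -> List[List[int]]:
    """Undo the reordering, which keeps the overall scheme lossless."""
    restored = []
    for row in reordered:
        original = [0] * len(order)
        for new_pos, old_pos in enumerate(order):
            original[old_pos] = row[new_pos]
        restored.append(original)
    return restored


# Hypothetical data: 4 haplotypes (rows) by 4 SNPs (columns).
haplotypes = [
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
reordered, order = sort_snps_by_maf(haplotypes)
print(order)  # [0, 2, 3, 1]
```

After reordering, SNPs with similar allele frequencies sit next to each other, which is the homogenizing effect described above; the stored column order allows the original layout to be recovered exactly.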

References

  1. ^ Koonin EV (March 2001). "Computational genomics". Current Biology. 11 (5): R155–8. doi:10.1016/S0960-9822(01)00081-1. PMID 11267880. S2CID 17202180.
  2. ^ a b "Computational Genomics and Proteomics at MIT". Archived from the original on 2018-03-22. Retrieved 2006-12-29.
  3. ^ Mount D (2000). Bioinformatics, Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press. pp. 2–3. ISBN 978-0-87969-597-2.
  4. ^ Brown TA (1999). Genomes. Wiley. ISBN 978-0-471-31618-3.
  5. ^ Wagner A (September 1997). "A computational genomics approach to the identification of gene networks". Nucleic Acids Research. 25 (18): 3594–604. doi:10.1093/nar/25.18.3594. PMC 146952. PMID 9278479.
  6. ^ Cristianini N, Hahn M (2006). Introduction to Computational Genomics. Cambridge University Press. ISBN 978-0-521-67191-0.
  7. ^ Konstantinidis KT, Tiedje JM (2005). "Genomic insights that advance the species definition for prokaryotes". Proc Natl Acad Sci U S A. 102 (7): 2567–72. Bibcode:2005PNAS..102.2567K. doi:10.1073/pnas.0409727102. PMC 549018. PMID 15701695.
  8. ^ a b c Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A (2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (32): 14. doi:10.1186/s13059-016-0997-x. PMC 4915045. PMID 27323842.
  9. ^ Navarro-Muñoz J, Selem-Mojica N, Mullowney M, Kautsar S, Tryon J, Parkinson E, De Los Santos E, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A, Dias-Cappelini L, Goering A, Thomson R, Metcalf W, Kelleher N, Barona-Gomez F, Medema M (2020). "A computational framework to explore large-scale biosynthetic diversity". Nat Chem Biol. 16 (1): 60–68. doi:10.1038/s41589-019-0400-9. PMC 6917865. PMID 31768033.
  10. ^ Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M (2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes". mSystems. 6 (5): e00937-21. bioRxiv 10.1101/2020.12.14.422671. doi:10.1128/msystems.00937-21. PMC 8547482. PMID 34581602.
  11. ^ Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M (2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes". bioRxiv. 6 (5): e00937-21. doi:10.1101/2020.12.14.422671. PMC 8547482. PMID 34581602.
  12. ^ a b Kautsar, Satria A; van der Hooft, Justin J J; de Ridder, Dick; Medema, Marnix H (13 January 2021). "BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters". GigaScience. 10 (1): giaa154. doi:10.1093/gigascience/giaa154. PMC 7804863. PMID 33438731.
  13. ^ Chanda P, Bader JS, Elhaik E (27 Jul 2012). "HapZipper: sharing HapMap populations just got easier". Nucleic Acids Research. 40 (20): e159. doi:10.1093/nar/gks709. PMC 3488212. PMID 22844100.
  14. ^ Christley S, Lu Y, Li C, Xie X (Jan 15, 2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–5. doi:10.1093/bioinformatics/btn582. PMID 18996942.
  15. ^ Pavlichin DS, Weissman T, Yona G (September 2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–202. doi:10.1093/bioinformatics/btt362. PMID 23793748.
  16. ^ Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi:10.3390/info7040056.
