Compression of genomic sequencing data

High-throughput sequencing technologies have led to a dramatic decline in genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and the 1001 (Arabidopsis thaliana) Genomes Project. The storage and transfer of this tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods of genomic data compression.

General concepts

While standard data compression tools (e.g., zip and rar) are used to compress sequence data (e.g., the GenBank flat file database), this approach has been criticized as extravagant because it does not specifically exploit the redundancy of genomic data: genomic sequences often contain repetitive content (e.g., microsatellite sequences), and many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.[1][2][3]

Figure 1: The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing the original sequencing data (e.g., reducing the original dataset to only the variations relative to a specified reference sequence); (2) encoding the processed data into binary form; and (3) decoding the data back to text form.

Base variants

With the availability of a reference template, only differences (e.g., single nucleotide substitutions and insertions/deletions) need to be recorded, thereby greatly reducing the amount of information to be stored. The notion of relative compression is especially apparent in genome re-sequencing projects, where the aim is to discover variations in individual genomes. A reference single nucleotide polymorphism (SNP) map, such as dbSNP, can be used to further reduce the number of variants that need to be stored.[4]
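The idea of recording only differences against a reference can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned and of equal length, so it handles substitutions only; insertions and deletions are left out. The function names and the toy sequences are hypothetical and do not come from any of the tools discussed here.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference.

    Assumption: both sequences are pre-aligned and equal in length, so only
    substitutions are detected.
    """
    if len(reference) != len(sample):
        raise ValueError("sketch handles equal-length, pre-aligned sequences only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_variants(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample sequence from the reference plus the stored variants."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


# Toy data (hypothetical): two substitutions relative to the reference.
reference = "ACGTACGTAC"
sample = "ACGAACGTCC"
variants = diff_against_reference(reference, sample)
restored = apply_variants(reference, variants)
```

Only the variant list has to be stored alongside an identifier for the shared reference; the full sample is recovered by replaying the variants.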

Relative genomic coordinates

Another useful idea is to store relative genomic coordinates in lieu of absolute coordinates.[4] For example, representing sequence variant bases in the format ‘Position1Base1Position2Base2…’, ‘123C125T130G’ can be shortened to ‘0C2T5G’, where the integers represent the intervals between consecutive variants. The cost is the modest arithmetic calculation required to recover the absolute coordinates plus the storage of the correction factor (‘123’ in this example).
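The conversion between absolute and relative coordinates can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes positions are sorted in ascending order and that bases are drawn from A, C, G and T, and the function names are hypothetical.

```python
import re


def parse_variants(text: str) -> list[tuple[int, str]]:
    """Split a string such as '123C125T130G' into (integer, base) pairs."""
    return [(int(number), base) for number, base in re.findall(r"(\d+)([ACGT])", text)]


def to_relative(absolute: str) -> tuple[int, str]:
    """Return (correction factor, gap-encoded string) for sorted absolute positions."""
    variants = parse_variants(absolute)
    if not variants:
        return 0, ""
    offset = variants[0][0]
    previous = offset
    parts = []
    for position, base in variants:
        parts.append(f"{position - previous}{base}")
        previous = position
    return offset, "".join(parts)


def to_absolute(offset: int, relative: str) -> str:
    """Recover absolute positions by accumulating the gaps onto the correction factor."""
    position = offset
    parts = []
    for gap, base in parse_variants(relative):
        position += gap
        parts.append(f"{position}{base}")
    return "".join(parts)


offset, relative = to_relative("123C125T130G")
restored = to_absolute(offset, relative)
```

The gaps are typically much smaller than the absolute positions, which is what makes them cheaper to encode in a later step.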

Prior information about the genomes

Further reduction can be achieved if all possible positions of substitutions in a pool of genome sequences are known in advance.[4] For instance, if all locations of SNPs in a human population are known, then there is no need to record variant coordinate information (e.g., ‘123C125T130G’ can be abridged to ‘CTG’). This approach, however, is rarely appropriate because such information is usually incomplete or unavailable.
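As a rough illustration, assume that the encoder and decoder share the same ordered list of known variant sites and that the sample's base is stored for every site in that list. Both assumptions belong to this sketch and are not stated in the source; the function names are hypothetical.

```python
def encode_with_known_sites(known_sites: list[int], sample_bases: dict[int, str]) -> str:
    """Store only the bases, in the agreed order of the known sites."""
    return "".join(sample_bases[position] for position in known_sites)


def decode_with_known_sites(known_sites: list[int], encoded: str) -> dict[int, str]:
    """Reattach coordinates by pairing each stored base with its known site."""
    if len(encoded) != len(known_sites):
        raise ValueError("one base is expected per known site")
    return dict(zip(known_sites, encoded))


# Shared prior knowledge (hypothetical): the three sites from the example above.
known_sites = [123, 125, 130]
sample_bases = {123: "C", 125: "T", 130: "G"}
encoded = encode_with_known_sites(known_sites, sample_bases)
decoded = decode_with_known_sites(known_sites, encoded)
```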

Encoding genomic coordinates

Encoding schemes are used to convert coordinate integers into binary form to provide additional compression gains. Encoding designs, such as the Golomb code and the Huffman code, have been incorporated into genomic data compression tools.[5][6][7][8][9][10] Each encoding scheme requires a corresponding decoding algorithm, and the choice of decoding scheme can affect the efficiency of sequence information retrieval.
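To make the encode/decode pairing concrete, the following Python sketch implements Rice coding, the special case of Golomb coding in which the divisor is a power of two (2^k). It represents bits as a text string of '0' and '1' for readability rather than packing them into bytes, and it is a generic textbook construction, not the implementation used by any tool cited here. The parameter k and the function names are choices made for this sketch.

```python
def rice_encode(value: int, k: int) -> str:
    """Encode a non-negative integer as a unary quotient plus a k-bit remainder."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    unary = "1" * quotient + "0"  # quotient ones, terminated by a zero
    binary = format(remainder, f"0{k}b") if k else ""
    return unary + binary


def rice_decode(bits: str, k: int) -> list[int]:
    """Decode a concatenation of Rice codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":
            quotient += 1
            i += 1
        i += 1  # skip the terminating zero of the unary part
        remainder = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((quotient << k) | remainder)
    return values


# Encode the coordinate gaps 0, 2 and 5 with k = 2.
gaps = [0, 2, 5]
bitstream = "".join(rice_encode(gap, 2) for gap in gaps)
decoded_gaps = rice_decode(bitstream, 2)
```

Small gaps yield short codewords, so a code of this kind pairs naturally with relative coordinates; the decoder must be given the same k as the encoder.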

Algorithm design choices

A universal approach to compressing genomic data may not necessarily be optimal, as a particular method may be more suitable for specific purposes and aims. Thus, several design choices that potentially impact compression performance should be considered.

Reference sequence

Selection of a reference sequence for relative compression can affect compression performance. Choosing a consensus reference sequence over a more specific reference sequence (e.g., the revised Cambridge Reference Sequence) can result in a higher compression ratio because the consensus reference may contain less bias in its data.[4] Knowledge about the source of the sequence being compressed, however, may be exploited to achieve greater compression gains. The idea of using multiple reference sequences has been proposed.[4] Brandon et al. (2009)[4] alluded to the potential use of ethnic group-specific reference sequence templates, using the compression of mitochondrial DNA variant data as an example (see Figure 2). The authors found a biased haplotype distribution in the mitochondrial DNA sequences of Africans, Asians, and Eurasians relative to the revised Cambridge Reference Sequence. Their result suggests that the revised Cambridge Reference Sequence may not always be optimal because a greater number of variants needs to be stored when it is used against data from ethnically distant individuals. Additionally, a reference sequence can be designed based on statistical properties[1][4] or engineered[11][12] to improve the compression ratio.

Encoding schemes

The application of different types of encoding schemes has been explored to encode variant bases and genomic coordinates.[4] Fixed codes, such as the Golomb code and the Rice code, are suitable when the distribution of the variants or coordinates (represented as integers) is well defined. Variable codes, such as the Huffman code, provide a more general entropy encoding scheme when the underlying variant and/or coordinate distribution is not well defined (this is typically the case in genomic sequence data).
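The way a Huffman code adapts to observed frequencies can be sketched with Python's standard library. This is a generic textbook construction, shown under the assumption that symbol counts are taken directly from the data to be encoded; it is not the implementation of any tool listed in this article, and the function name and toy input are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols: str) -> dict[str, str]:
    """Build a prefix code in which frequent symbols receive shorter codewords."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(counts)): "0"}
    # Heap entries: (count, tie-breaker, partial code table for that subtree).
    heap = [(count, index, {symbol: ""})
            for index, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_index = len(heap)
    while len(heap) > 1:
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codewords.
        merged = {symbol: "0" + code for symbol, code in codes_a.items()}
        merged.update({symbol: "1" + code for symbol, code in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, next_index, merged))
        next_index += 1
    return heap[0][2]


# Toy input with a skewed base distribution: A x8, C x4, G x2, T x1.
codes = huffman_code("AAAAAAAACCCCGGT")
code_lengths = {symbol: len(code) for symbol, code in codes.items()}
```

Because the code table is derived from the data, it has to be stored or transmitted together with the encoded stream, which is part of the cost of not fixing the code in advance.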

List of genomic re-sequencing data compression tools

The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes.[4][5][6][7][8][9][10][13] Very close variants or revisions of the same genome can be compressed very efficiently (for example, a compression ratio of 18,133-fold was reported[6] for two revisions of the same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.

Genomic sequencing data compression tools compatible with standard genome sequencing file formats (BAM & FASTQ)
Software | Description | Compression Ratio | Data Used for Evaluation | Approach/Encoding Scheme | Link | Use License | Reference
PetaSuite | Lossless compression tool for BAM and FASTQ.gz files; transparent on-the-fly readback through BAM and FASTQ.gz virtual files | 60% to 90% | Human genome sequences from the 1000 Genomes Project | | https://petagene.com | Commercial | [14]
Genozip | A universal compressor for genomic files; compresses FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF/GTF/GVF, PHYLIP, BED and 23andMe files | [15][16] | Human genome sequences from the 1000 Genomes Project | Genozip extensible framework | http://genozip.com | Commercial, but free for non-commercial use | [17]
Genomic Squeeze (G-SQZ) | Lossless compression tool designed for storing and analyzing sequencing read data | 65% to 76% | Human genome sequences from the 1000 Genomes Project | Huffman coding | http://public.tgen.org/sqz | -Undeclared- | [8]
CRAM (part of SAMtools) | Highly efficient and tunable reference-based compression of sequence data | [18] | European Nucleotide Archive | deflate and rANS | http://www.ebi.ac.uk/ena/software/cram-toolkit | Apache-2.0 | [19]
Genome Compressor (GeCo) | A tool using a mixture of multiple Markov models for compressing reference and reference-free sequences | | Human nuclear genome sequence | Arithmetic coding | http://bioinformatics.ua.pt/software/geco/ or https://pratas.github.io/geco/ | GPLv3 | [13]
GenomSys codecs | Lossless compression of BAM and FASTQ files into the standard format ISO/IEC 23092[20] (MPEG-G) | 60% to 90% | Human genome sequences from the 1000 Genomes Project | Context-adaptive binary arithmetic coding (CABAC) | https://www.genomsys.com | Commercial | [21]
fastafs | Compression of FASTA / UCSC2Bit files into random access compressed archives. Toolkit to mount FASTA files, indices and dictionary files virtually. This allows neat file system (API-like) integration without the need to fully decompress archives for random / partial access. | | FASTA files | Huffman coding as implemented by Zstd | https://github.com/yhoogstrate/fastafs | GPL-v2.0 | [22]

Genomic sequencing data compression tools not compatible with standard genome sequencing file formats
Software | Description | Compression Ratio | Data Used for Evaluation | Approach/Encoding Scheme | Link | Use License | Reference
Genome Differential Compressor (GDC) | LZ77-style tool for compressing multiple genomes of the same species | 180 to 250-fold / 70 to 100-fold | Nuclear genome sequence of human and Saccharomyces cerevisiae | Huffman coding | http://sun.aei.polsl.pl/gdc | GPLv2 | [5]
Genome Re-Sequencing (GRS) | Reference sequence-based tool independent of a reference SNP map or sequence variation information | 159-fold / 18,133-fold / 82-fold | Nuclear genome sequence of human, Arabidopsis thaliana (different revisions of the same genome), and Oryza sativa | Huffman coding | https://web.archive.org/web/20121209070434/http://gmdd.shgmo.org/Computational-Biology/GRS/ | Free of charge for non-commercial use | [6]
Genome Re-sequencing Encoding (GReEn) | Probabilistic copy model-based tool for compressing re-sequencing data using a reference sequence | ~100-fold | Human nuclear genome sequence | Arithmetic coding | http://bioinformatics.ua.pt/software/green/ | -Undeclared- | [7]
DNAzip | A package of compression tools | ~750-fold | Human nuclear genome sequence | Huffman coding | http://www.ics.uci.edu/~dnazip/ | -Undeclared- | [9]
GenomeZip | Compression with respect to a reference genome. Optionally uses external databases of genomic variations (e.g. dbSNP) | ~1200-fold | Human nuclear genome sequence (Watson) and sequences from the 1000 Genomes Project | Entropy coding for approximations of empirical distributions | https://sourceforge.net/projects/genomezip/ | -Undeclared- | [10]

References

  1. Giancarlo, R.; Scaturro, D.; Utro, F. (2009). "Textual data compression in computational biology: A synopsis". Bioinformatics. 25 (13): 1575–1586. doi:10.1093/bioinformatics/btp117. PMID 19251772.
  2. Nalbantoğlu, O. U.; Russell, D. J.; Sayood, K. (2010). "Data Compression Concepts and Algorithms and their Applications to Bioinformatics". Entropy. 12 (1): 34. doi:10.3390/e12010034. PMC 2821113. PMID 20157640.
  3. Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi:10.3390/info7040056.
  4. Brandon, M. C.; Wallace, D. C.; Baldi, P. (2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC 2705231. PMID 19447783.
  5. Deorowicz, S.; Grabowski, S. (2011). "Robust relative compression of genomes with random access". Bioinformatics. 27 (21): 2979–2986. doi:10.1093/bioinformatics/btr505. PMID 21896510.
  6. Wang, C.; Zhang, D. (2011). "A novel compression tool for efficient storage of genome resequencing data". Nucleic Acids Research. 39 (7): e45. doi:10.1093/nar/gkr009. PMC 3074166. PMID 21266471.
  7. Pinho, A. J.; Pratas, D.; Garcia, S. P. (2012). "GReEn: A tool for efficient compression of genome resequencing data". Nucleic Acids Research. 40 (4): e27. doi:10.1093/nar/gkr1124. PMC 3287168. PMID 22139935.
  8. Tembe, W.; Lowey, J.; Suh, E. (2010). "G-SQZ: Compact encoding of genomic sequence and quality data". Bioinformatics. 26 (17): 2192–2194. doi:10.1093/bioinformatics/btq346. PMID 20605925.
  9. Christley, S.; Lu, Y.; Li, C.; Xie, X. (2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–275. doi:10.1093/bioinformatics/btn582. PMID 18996942.
  10. Pavlichin, D. S.; Weissman, T.; Yona, G. (2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–2202. doi:10.1093/bioinformatics/btt362. PMID 23793748.
  11. Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin (2011). "Reference Sequence Construction for Relative Compression of Genomes". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 7024. pp. 420–425. doi:10.1007/978-3-642-24583-1_41. ISBN 978-3-642-24582-4. S2CID 16007637.
  12. Grabowski, Szymon; Deorowicz, Sebastian (2011). "Engineering Relative Compression of Genomes". arXiv:1103.2351 [cs.CE].
  13. Pratas, D.; Pinho, A. J.; Ferreira, P. J. S. G. (2016). "Efficient compression of genomic sequences". Data Compression Conference. Snowbird, Utah.
  14. "The Importance of Data Compression in the Field of Genomics". IEEE Pulse. 2019-04-26. Retrieved 2024-02-22.
  15. Lan, Divon; Llamas, Bastien (14 September 2022). "Genozip 14 - advances in compression of BAM and CRAM files". bioRxiv. doi:10.1101/2022.09.12.507582. S2CID 252357508.
  16. Lan, Divon; Hughes, Daniel S T; Llamas, Bastien (7 July 2023). "Deep FASTQ and BAM co-compression in Genozip 15". bioRxiv. doi:10.1101/2023.07.07.548069. S2CID 259764998.
  17. Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (25 August 2021). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. PMC 8388020. PMID 33585897.
  18. CRAM benchmarking
  19. CRAM format specification (version 3.0)
  20. "ISO/IEC 23092-2:2019 Information technology — Genomic information representation — Part 2: Coding of genomic information". iso.org.
  21. Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid J.; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (27 September 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation". bioRxiv 10.1101/426353.
  22. Hoogstrate, Youri; Jenster, Guido W.; van de Werken, Harmen J. G. (December 2021). "FASTAFS: file system virtualisation of random access compressed FASTA files". BMC Bioinformatics. 22 (1): 535. doi:10.1186/s12859-021-04455-3. PMC 8558547. PMID 34724897.
