Bloom filters in bioinformatics

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

Bloom filters are primarily used in bioinformatics to test the existence of a k-mer in a sequence or set of sequences. The k-mers of the sequence are indexed in a Bloom filter, and any k-mer of the same size can be queried against the Bloom filter. This is a preferable alternative to hashing the k-mers of a sequence with a hash table, particularly when the sequence is very long, since it is very demanding to store large numbers of k-mers in memory.

Applications

Sequence characterization

A visualization of querying a bloom filter of k-mers of a DNA sequence.
A visualization of querying a Bloom filter of k-mers of a DNA sequence. The first step is storing the k-mers of a sequence into a Bloom filter. Querying is done similarly where the query sequence is broken down into its corresponding k-mers, and the k-mers are used to query the Bloom filter.

The preprocessing step in many bioinformatics applications involves classifying sequences, primarily classifying reads from a DNA sequencing experiment. For example, in metagenomic studies it is important to be able to tell if a sequencing read belongs to a new species.[1] and in clinical sequencing projects it is vital to filter out reads from the genomes of contaminating organisms. There are many bioinformatics tools that use Bloom filters to classify reads by querying k-mers of a read to a set of Bloom filters generated from known reference genomes. Some tools that use this method are FACS[2] and BioBloom tools.[3] While these methods may not outclass other bioinformatics classification tools like Kraken,[4] they offer a memory-efficient alternative.

A recent area of research with Bloom filters in sequence characterization is in developing ways to query raw reads from sequencing experiments. For example, how can one determine which reads contain a specific 30-mer in the entire NCBI Sequence Read Archive? This task is similar to that which is accomplished by BLAST, however it involves querying a much larger dataset; while BLAST queries against a database of reference genomes, this task demands that specific reads that contain the k-mer are returned. BLAST and similar tools cannot handle this problem efficiently, therefore Bloom filter based data structures have been implemented to this end. Binary bloom trees[5] are binary trees of Bloom filters that facilitates querying transcripts in large RNA-seq experiments. BIGSI[6] borrows bitsliced signatures from the field of document retrieval to index and query the entirety of microbial and viral sequencing data in the European Nucleotide Archive. The signature of a given dataset is encoded as a set of Bloom filters from that dataset.

Genome assembly

Bloom filters use less memory than hash tables for de Bruijn graphs but do not preserve edge information
Comparison between a hash table and a Bloom filter to store a de Bruijn graph in memory. Note that while edge information may be preserved in a hash table, it is not stored in a Bloom filter, which complicates graph traversal. A Bloom filter of the same size as a hash table will still use less space due to not storing the values of the k-mers themselves.

The memory efficiency of Bloom filters has been used in genome assembly as a way to reduce the space footprint of k-mers from sequencing data. The contribution of Bloom filter based assembly methods is combining Bloom filters and de Bruijn graphs into a structure called a probabilistic de Bruijn graph,[7] which optimizes memory usage at the cost of the false positive rate inherent to Bloom filters. Instead of storing the de Bruijn graph in a hash table, it is stored in a Bloom filter.

Using a Bloom filter to store the de Bruijn graph complicates the graph traversal step to build the assembly, since edge information is not encoded in the Bloom filter. Graph traversal is accomplished by querying the Bloom filter for any of the four possible subsequent k-mers from the current node. For example, if the current node is for the k-mer ACT, then the next node must be for one of the k-mers CTA, CTG, CTC or CTT. If a query k-mer exists in the Bloom filter, then the k-mer is added to the path. Therefore, there are two sources for false positives in querying the Bloom filter when traversing the de Bruijn graph. There is the probability that one or more of the three false k-mers exist elsewhere in the sequencing set to return a false positive, and there is the aforementioned inherent false positive rate of the Bloom filter itself. The assembly tools that use Bloom filters must account for these sources of false positives in their methods. ABySS 2[8] and Minia[9] are examples of assemblers that uses this approach for de novo assembly.

Sequencing error correction

Next-generation sequencing (NGS) methods have allowed the generation of new genome sequences much faster and cheaper than the previous Sanger sequencing methods. However, these methods have a higher error rate,[10][11] which complicates downstream analysis of the sequence and can even give rise to erroneous conclusions. Many methods have been developed to correct the errors in NGS reads, but they use large amounts of memory which makes them impractical for large genomes, such as the human genome. Therefore, tools using Bloom filters have been developed to address these limitations, taking advantage of their efficient memory usage. Musket[12] and  BLESS[13] are examples of such tools. Both methods use the k-mer spectrum approach for error correction. The first step of this approach is to count the multiplicity of k-mers, however while BLESS only uses Bloom filters to store the counts, Musket uses Bloom filters only to count unique k-mers, and stores non-unique k-mers in a hash table, as described in a previous work[14]

RNA-Seq

Bloom filters are also employed in some RNA-Seq pipelines. RNA-Skim[15] clusters RNA transcripts and then uses Bloom filters to find sig-mers: k-mers that are only found in one of the clusters. These sig-mers are then used to estimate the transcript abundance levels. Therefore, it does not analyze every possible k-mer which results in performance and memory-usage improvements, and has been shown to work as well as previous methods.

References

  1. ^ Lundeberg, Joakim; Arvestad, Lars; Andersson, Björn; Allander, Tobias; Käller, Max; Stranneheim, Henrik (2010-07-01). "Classification of DNA sequences using Bloom filters". Bioinformatics. 26 (13): 1595–1600. doi:10.1093/bioinformatics/btq230. ISSN 1367-4803. PMC 2887045. PMID 20472541.
  2. ^ Lundeberg, Joakim; Arvestad, Lars; Andersson, Björn; Allander, Tobias; Käller, Max; Stranneheim, Henrik (2010-07-01). "Classification of DNA sequences using Bloom filters". Bioinformatics. 26 (13): 1595–1600. doi:10.1093/bioinformatics/btq230. ISSN 1367-4803. PMC 2887045. PMID 20472541.
  3. ^ Chu, Justin; Sadeghi, Sara; Raymond, Anthony; Jackman, Shaun D.; Nip, Ka Ming; Mar, Richard; Mohamadi, Hamid; Butterfield, Yaron S.; Robertson, A. Gordon (2014-12-01). "BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters". Bioinformatics. 30 (23): 3402–3404. doi:10.1093/bioinformatics/btu558. ISSN 1367-4811. PMC 4816029. PMID 25143290.
  4. ^ Wood, Derrick E.; Salzberg, Steven L. (2014-03-03). "Kraken: ultrafast metagenomic sequence classification using exact alignments". Genome Biology. 15 (3): R46. doi:10.1186/gb-2014-15-3-r46. ISSN 1474-760X. PMC 4053813. PMID 24580807.
  5. ^ Carl Kingsford; Solomon, Brad (March 2016). "Fast search of thousands of short-read sequencing experiments". Nature Biotechnology. 34 (3): 300–302. doi:10.1038/nbt.3442. ISSN 1546-1696. PMC 4804353. PMID 26854477.
  6. ^ Iqbal, Zamin; McVean, Gil; Rocha, Eduardo P. C.; Bakker, Henk C. den; Bradley, Phelim (February 2019). "Ultrafast search of all deposited bacterial and viral genomic data". Nature Biotechnology. 37 (2): 152–159. doi:10.1038/s41587-018-0010-1. ISSN 1546-1696. PMC 6420049. PMID 30718882.
  7. ^ Brown, C. Titus; Tiedje, James M.; Howe, Adina; Canino-Koning, Rosangela; Hintze, Arend; Pell, Jason (2012-08-14). "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs". Proceedings of the National Academy of Sciences. 109 (33): 13272–13277. arXiv:1112.4193. Bibcode:2012PNAS..10913272P. doi:10.1073/pnas.1121464109. ISSN 0027-8424. PMC 3421212. PMID 22847406.
  8. ^ Birol, Inanc; Warren, Rene L.; Coombe, Lauren; Khan, Hamza; Jahesh, Golnaz; Hammond, S. Austin; Yeo, Sarah; Chu, Justin; Mohamadi, Hamid (2017-05-01). "ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter". Genome Research. 27 (5): 768–777. doi:10.1101/gr.214346.116. ISSN 1088-9051. PMC 5411771. PMID 28232478.
  9. ^ Chikhi, Rayan; Rizk, Guillaume (2013-09-16). "Space-efficient and exact de Bruijn graph representation based on a Bloom filter". Algorithms for Molecular Biology. 8 (1): 22. doi:10.1186/1748-7188-8-22. ISSN 1748-7188. PMC 3848682. PMID 24040893.
  10. ^ Loman, Nicholas J.; Misra, Raju V.; Dallman, Timothy J.; Constantinidou, Chrystala; Gharbia, Saheer E.; Wain, John; Pallen, Mark J. (May 2012). "Performance comparison of benchtop high-throughput sequencing platforms". Nature Biotechnology. 30 (5): 434–439. doi:10.1038/nbt.2198. ISSN 1546-1696. PMID 22522955. S2CID 5300923.
  11. ^ Wang, Xin Victoria; Blades, Natalie; Ding, Jie; Sultana, Razvan; Parmigiani, Giovanni (2012-07-30). "Estimation of sequencing error rates in short reads". BMC Bioinformatics. 13: 185. doi:10.1186/1471-2105-13-185. ISSN 1471-2105. PMC 3495688. PMID 22846331.
  12. ^ Schmidt, Bertil; Schröder, Jan; Liu, Yongchao (2013-02-01). "Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data". Bioinformatics. 29 (3): 308–315. doi:10.1093/bioinformatics/bts690. ISSN 1367-4803. PMID 23202746.
  13. ^ Hwu, Wen-Mei; Ma, Jian; Chen, Deming; Wu, Xiao-Long; Heo, Yun (2014-05-15). "BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads". Bioinformatics. 30 (10): 1354–1362. doi:10.1093/bioinformatics/btu030. ISSN 1367-4803. PMC 6365934. PMID 24451628.
  14. ^ Pellow, David; Filippova, Darya; Kingsford, Carl (2017-06-01). "Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters". Journal of Computational Biology. 24 (6): 547–557. doi:10.1089/cmb.2016.0155. ISSN 1066-5277. PMC 5467106. PMID 27828710.
  15. ^ Zhang, Zhaojun; Wang, Wei (2014-06-15). "RNA-Skim: a rapid method for RNA-Seq quantification at transcript level". Bioinformatics. 30 (12): i283–i292. doi:10.1093/bioinformatics/btu288. ISSN 1367-4803. PMC 4058932. PMID 24931995.

Read other articles:

MRTNama resmiManor Racing MRTKantor pusatBanbury, Oxfordshire, Britania RayaKepala timStephen FitzpatrickDave RyanDirektur teknisLuca FurbattoPat FryJohn McQuilliamNikolas TombazisPembalap terkenalRio HaryantoEsteban OconPascal WehrleinNama sebelumnyaManor Marussia F1 TeamSejarah dalam ajang Formula SatuMesinMercedes-BenzGelar Konstruktor0Gelar Pembalap0Jumlah lomba21Menang0Podium0Poin1Posisi pole0Putaran tercepat0Lomba pertamaGrand Prix Australia 2016Lomba terakhirGrand Prix Abu Dhabi 2016 ...

 

Artikel ini sebatang kara, artinya tidak ada artikel lain yang memiliki pranala balik ke halaman ini.Bantulah menambah pranala ke artikel ini dari artikel yang berhubungan atau coba peralatan pencari pranala.Tag ini diberikan pada Agustus 2020. Jordan Chiles— Pesenam —Nama lengkapJordan Lucella Elizabeth ChilesNegara Amerika SerikatLahir15 April 2001 (umur 22)Tualatin, OregonKota asalVancouver, WashingtonTempat tinggalSpring, TexasDisiplinSenam berirama putriTingkatSen...

 

Pesawat DC-4 Flying Dutchman milik Dutch Dakota Association. Douglas DC-4Douglas DC-4 milik Pacific Western Airlines pada 1959TipePesawat terbang, pesawat transporTerbang perdana14 Februari 1942 (produksi seri)[1]Tahun produksi1942-Agustus 1947Jumlah produksi80[2] DC-4 dan 1,163 C-54/R5DAcuan dasarDouglas DC-4EVarianC-54 Skymaster Canadair North Star Aviation Traders ATL-98 Carvair Douglas DC-4 adalah pesawat penumpang sipil (airliner) sayap rendah (low wing) yang diproduksi o...

陆军第十四集团军炮兵旅陆军旗存在時期1950年 - 2017年國家或地區 中国效忠於 中国 中国共产党部門 中国人民解放军陆军種類炮兵功能火力支援規模约90门火炮直屬南部战区陆军參與戰役1979年中越战争 中越边境冲突 老山战役 成都军区对越轮战 紀念日10月25日 陆军第十四集团军炮兵旅(英語:Artillery Brigade, 14th Army),是曾经中国人民解放军陆军第十四集团军下属...

 

Ne doit pas être confondu avec Garde-corps. Si ce bandeau n'est plus pertinent, retirez-le. Cliquez ici pour en savoir plus. Cet article ne s'appuie pas, ou pas assez, sur des sources secondaires ou tertiaires (décembre 2019). Pour améliorer la vérifiabilité de l'article ainsi que son intérêt encyclopédique, il est nécessaire, quand des sources primaires sont citées, de les associer à des analyses faites par des sources secondaires. Garde du corpsGardes du corps du président fran...

 

Vlade Divac Fiche d’identité Nationalité Serbie Naissance 3 février 1968 (56 ans)Prijepolje, Yougoslavie Taille 2,16 m (7′ 1″) Poids 111 kg (244 lb) Situation en club Numéro 12, 21 Poste Pivot Draft de la NBA Année 1989 Position 26e Franchise Lakers de Los Angeles Carrière professionnelle * SaisonClubMoy. pts 1983-19851986-19891989-19901990-19911991-19921992-19931993-19941994-19951995-19961996-19971997-199819981998-19991999-20002000-20012001-20022002-2003...

British actor (1875–1960) Arthur WontnerBorn(1875-01-21)21 January 1875London, England[1]Died10 July 1960(1960-07-10) (aged 85)London, EnglandOccupationActorYears active1916–1955Spouses Rosecleer Alice Amelia Blanche Kingwell ​ ​(m. 1903; died 1943)​ Florence Eileen Lainchbury ​ ​(m. 1947)​ Arthur Wontner (21 January 1875 – 10 July 1960[2]) was a British actor best known for play...

 

Swedish canoe racer (1942–2021) Gunnar UtterbergGunnar Utterberg competing for Sweden in 1964Personal informationBorn(1942-11-28)28 November 1942Jönköping, SwedenDied12 September 2021(2021-09-12) (aged 78)Height1.82 m (6 ft 0 in)Weight69 kg (152 lb)SportSportCanoe racingClubNyköpings Kanotklubb Medal record Representing  Sweden Olympic Games 1964 Tokyo K-2 1000 m Canoe Sprint European Championships 1967 Duisburg K-2 1000 m 1967 Duisburg K-2 10000 m 1969...

 

1943 speeches delivered by Himmler and referring to the Holocaust Authorised by Himmler himself, this original page of the final edition of his speech made on 4 October 1943 bears the Reichsführer-SS's statements to his audience that the extermination of the Jews, a policy of the Nazi state, is being carried out. The Posen speeches were two speeches made by Heinrich Himmler, the head of the SS of Nazi Germany, on 4 and 6 October 1943 in the town hall of Posen (Poznań), in German-occupied Po...

この記事は検証可能な参考文献や出典が全く示されていないか、不十分です。出典を追加して記事の信頼性向上にご協力ください。(このテンプレートの使い方)出典検索?: コルク – ニュース · 書籍 · スカラー · CiNii · J-STAGE · NDL · dlib.jp · ジャパンサーチ · TWL(2017年4月) コルクを打ち抜いて作った瓶の栓 コルク(木栓、...

 

Jamoat in Sughd Region, TajikistanVorukh Ворух (Russian and Tajik)JamoatMap showing the three main exclaves in Kyrgyzstan. Vorukh is in blue on the bottom left,VorukhLocation in TajikistanCoordinates: 39°51′12″N 70°34′37″E / 39.85333°N 70.57694°E / 39.85333; 70.57694Country TajikistanRegionSughd RegionCityIsfaraArea • Total96.7 km2 (37.3 sq mi)Population (2022) • Total45,000Time zoneUTC+5 (TJT)Official l...

 

Fotografía de pintura de animales de Lascaux. Este artículo detalla una cronología de la historia de la zoología. En la prehistoria El hombre inició la zoología en la prehistoria: a medida que tenía contacto con los seres vivos (en especial los animales) los fue conociendo y utilizando de acuerdo a sus necesidades. La calidad de los conocimientos que tenía acerca de la naturaleza se encuentra en las pinturas rupestres dejadas en cavernas por el hombre de Cro-Magnon. Con la domesticaci...

Artikel ini sebatang kara, artinya tidak ada artikel lain yang memiliki pranala balik ke halaman ini.Bantulah menambah pranala ke artikel ini dari artikel yang berhubungan atau coba peralatan pencari pranala.Tag ini diberikan pada Maret 2016. SMP Negeri 5 BalikpapanInformasiRentang kelasVII, VIII, IXKurikulumKurikulum Tingkat Satuan PendidikanAlamatLokasiJl. Marsma R. Iswahyudi 7, Balikpapan, Kalimantan TimurMoto SMP Negeri (SMPN) 5 Balikpapan, merupakan salah satu Sekolah Menengah Pertama Ne...

 

American political activist Mark MecklerMeckler in 2011BornMark Jay Meckler (1962-03-10) March 10, 1962 (age 62)Los Angeles County, California, U.S.EducationSan Diego State University (BA)University of the Pacific (JD)OccupationPolitical activistKnown forCo-founder of Tea Party Patriots, founder of Citizens for Self-Governance Mark Jay Meckler (born March 10, 1962) is an American political activist, attorney, and business executive.[1] He currently serves as President of Cit...

 

This article is about the Australian channel. For channels in other countries, see List of Syfy TV channels. Television channel SyfyCountryAustraliaBroadcast areaAustraliaProgrammingLanguage(s)EnglishPicture format1080i HDTV(downscaled to 16:9 576i for the SDTV feed)Timeshift serviceSyfy + 2OwnershipOwnerUniversal Networks International(NBCUniversal)Sister channels13th StreetCNBC AustraliaE!Style NetworkUniversal TVEuronewsHistoryLaunched1 January 2014ReplacedSF ChannelClosed31 December 2019R...

Segment of time corresponding to a specific number of beats For other uses, see Bar (disambiguation). Types of bar lines In musical notation, a bar (or measure) is a segment of music bounded by vertical lines, known as bar lines (or barlines), usually indicating one of more recurring beats. The length of the bar, measured by the number of note values it contains, is normally indicated by the time signature. Types of bar lines Regular bar lines consist of a thin vertical line extending from th...

 

صاروخ تجارب من نوع بلاك برانت يحمل أجهزة علمية. صاروخ تجارب (بالإنجليزية: Sounding rocket أو Research rocket) هو صاروخ يستخدم لإجراء الأبحاث العلمية. يحمل الصاروخ الأجهزة العلمية الخاصة بدراسة متعلقة بطبقات الجو العليا ويطلق في مسار لا يصل به إلى مدار حول الأرض. مثال على ذلك قياس الأشعة ا...

 

American painter and sculptor (1848–1912) Francis Davis MilletPhotographed circa 1910Born(1848-11-03)November 3, 1848Mattapoisett, Massachusetts, USDiedApril 15, 1912(1912-04-15) (aged 63)Atlantic OceanOccupation(s)Painter, sculptorSpouseElizabeth Merrill (m. 1879)Children4Signature Francis Davis Millet (November 3, 1848[1] – April 15, 1912) was an American academic classical painter, sculptor, and writer who died in the sinking of the RMS Titanic on April 15, 1912. Early lif...

Voici la liste des médaillés masculins et féminins des épreuves de biathlon aux Jeux olympiques d'hiver depuis 1960. Article connexe : Biathlon aux Jeux olympiques. Hommes Individuel Édition Or Argent Bronze Squaw Valley 1960 Klas Lestander (SWE) Antti Tyrväinen (FIN) Alexander Privalov (URS) Innsbruck 1964 Vladimir Melanine (NOR) Alexander Privalov (URS) Olav Jordet (NOR) Grenoble 1968 Magnar Solberg (NOR) Alexandre Tikhonov (URS) Vladimir Gundartsev (URS) Sapporo 1972 Magnar Solb...

 

1987 compilation album by HawkwindOut & IntakeCompilation album by HawkwindReleasedApril 1987RecordedRockfield Studios, 1982 and 1986, live 1982GenreSpace rockLabelFlicknife RecordsProducerHawkwindHawkwind chronology Hawkwind Anthology(1985) Out & Intake(1987) BBC Radio 1 Live in Concert(1991) Professional ratingsReview scoresSourceRatingAllmusic[1]The Encyclopedia of Popular Music[2]Kerrang! [3] Out and Intake is a 1987 (see 1987 in music) live/studio...