Apache Nutch

Original author(s): Doug Cutting, Mike Cafarella
Developer(s): Apache Software Foundation
Stable release:
  1.x: 1.20 / 24 April 2024[1]
  2.x: 2.4 / 11 October 2019[1]
Repository: Nutch GitHub repository
Written in: Java
Operating system: Cross-platform
Type: Web crawler
License: Apache License 2.0
Website: nutch.apache.org

Apache Nutch is a highly extensible and scalable open-source web crawler software project.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
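
The plug-in idea can be illustrated with a minimal sketch. This is not Nutch's actual extension-point API: the `Parser` interface, the `ParserRegistry` class, and the two registered parsers are hypothetical names chosen for illustration, assuming only that a parser is selected by media type.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class PluginSketch {
    /** Hypothetical extension point: turns raw fetched bytes into plain text. */
    interface Parser {
        String parse(byte[] content);
    }

    /** Hypothetical registry that maps a media type to the parser handling it. */
    static class ParserRegistry {
        private final Map<String, Parser> byMediaType = new HashMap<>();

        void register(String mediaType, Parser parser) {
            byMediaType.put(mediaType, parser);
        }

        Optional<Parser> lookup(String mediaType) {
            return Optional.ofNullable(byMediaType.get(mediaType));
        }
    }

    /** Builds a registry with two toy parsers; a real system would discover plug-ins at runtime. */
    static ParserRegistry defaultRegistry() {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain",
                content -> new String(content, StandardCharsets.UTF_8));
        // Naive tag stripping, for illustration only; not a real HTML parser.
        registry.register("text/html",
                content -> new String(content, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]*>", "").trim());
        return registry;
    }

    public static void main(String[] args) {
        ParserRegistry registry = defaultRegistry();
        byte[] page = "<p>Hello <b>Nutch</b></p>".getBytes(StandardCharsets.UTF_8);
        String text = registry.lookup("text/html")
                .map(parser -> parser.parse(page))
                .orElse("(no parser registered)");
        System.out.println(text);
    }
}
```

Because the core only depends on the interface, a new media type can be supported by registering another implementation without changing the code that calls `lookup`.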

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.
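
The MapReduce programming model that an indexing task can be expressed in is sketched below as a single-process toy. It is an assumption-laden illustration, not Nutch or Hadoop code: the class and method names are hypothetical, everything runs in memory, and a real deployment would spread the map and reduce phases across many machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class MapReduceSketch {
    /** Map phase: emit one (term, docId) pair for every word in a document. */
    static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(term, docId));
            }
        }
        return pairs;
    }

    /** Shuffle and reduce phase: group pairs by term and collect the document IDs. */
    static Map<String, SortedSet<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), k -> new TreeSet<>()).add(pair.getValue());
        }
        return index;
    }

    /** Runs both phases over a small in-memory collection of documents. */
    static Map<String, SortedSet<String>> buildIndex(Map<String, String> docs) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        docs.forEach((docId, text) -> pairs.addAll(map(docId, text)));
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "Nutch crawls the web");
        docs.put("d2", "Hadoop came from Nutch");
        buildIndex(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

The map step only looks at one document at a time and the reduce step only at one term at a time, which is what lets each phase be split across machines.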

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of the same year. Since April 2010, Nutch has been an independent top-level project of the Apache Software Foundation.[2]

In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl.[3]

While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.[citation needed]

Release history

Version Release date Description
1.1 2010-06-06 This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes, and speedups (e.g., to Fetcher2) have also been included.
1.2 2010-10-24 This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including adding timing information to all Tool classes, and implementation of parser timeouts), and bug fixes (fixing an NPE in distributed search, fixing of XML formatting issues per Document fields).
1.3 2011-06-07 This release includes several improvements (improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball—only about 2 MB).
1.4 2011-11-26 This release includes several improvements, among them allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.
1.5 2012-06-07 This release includes upgrades of several major components, including Tika 1.1 and Hadoop 1.0.0, improvements to the LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering and parsing, among others.
2.0 2012-07-07 This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in memory data store and various high-profile SQL stores.
1.5.1 2012-07-10 This release is a maintenance release of the popular 1.5.X mainstream version of Nutch which has been widely adopted within the community.
2.1 2012-10-05 This release continues to provide Nutch users with a simplified Nutch distribution, building on the 2.x development line. As well as addressing about 20 bugs, it offers improved properties for Solr configuration, upgrades to various Gora dependencies, and the option to build indexes in Elasticsearch.
1.6 2012-12-06 This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to set fetchInterval dynamically by MIME type, and enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include upgrades of key dependencies to Tika 1.2 and Automaton 1.11-8.
2.2 2013-06-08 This release, the third in the 2.x series, includes over 30 bug fixes and over 25 improvements. It adds Crawler-Commons, which Nutch now uses for improved robots.txt parsing, and upgrades libraries to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8.
1.7 2013-06-24 This release includes over 20 bug fixes and as many improvements, most notably a new pluggable indexing architecture that currently supports Apache Solr and Elasticsearch. Following the Nutch 2.2 release, parsing of robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.
2.2.1 2013-07-02 This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3; it is predominantly a bug fix for NUTCH-1591 (incorrect conversion of ByteBuffer to String).
1.8 2014-03-17 This release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, and provides over 30 bug fixes as well as 18 improvements.
2.3 2015-01-22 Nutch 2.3 release now comes packaged with a self-contained Apache Wicket-based Web Application. The SQL backend for Gora has been deprecated.[4]
1.10 2015-05-06 This release includes a library upgrade to Tika 1.6 and provides over 46 bug fixes as well as 37 improvements and 12 new features.[5]
1.11 2015-12-07 This release includes library upgrades to Hadoop 2.X and Tika 1.11, and provides over 32 bug fixes as well as 35 improvements and 14 new features.[6]
2.3.1 2016-01-21 This bug-fix release addresses around 40 issues.
1.12 2016-06-18
1.13 2017-04-02
1.14 2017-12-23
1.15 2018-08-09
1.16 2019-10-11
2.4 2019-10-11 Expected to be the last release on the 2.X series, as "no committer is actively working on it".[7]
1.17 2020-07-02
1.18 2021-01-24
1.19 2022-08-22
1.20 2024-04-09

Scalability

IBM Research studied the performance[8] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[9] Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[10]
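
To put that rate in perspective, the short calculation below converts it to a daily figure, which comes to roughly 65 million documents per day. The class and method names are hypothetical, and the one-million-document batch is an arbitrary illustrative size rather than a figure from the crawl.

```java
public class ThroughputSketch {
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Converts an average fetch rate in documents per second to documents per day. */
    static double documentsPerDay(double docsPerSecond) {
        return docsPerSecond * SECONDS_PER_DAY;
    }

    /** Hours needed to fetch the given number of documents at a constant rate. */
    static double hoursToFetch(long documents, double docsPerSecond) {
        return documents / docsPerSecond / 3_600.0;
    }

    public static void main(String[] args) {
        double rate = 755.31; // average rate reported for the ClueWeb09 crawl
        System.out.printf("%.0f documents per day%n", documentsPerDay(rate));
        // 1,000,000 is an arbitrary batch size used only for illustration.
        System.out.printf("%.2f hours per million documents%n", hoursToFetch(1_000_000L, rate));
    }
}
```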

Search engines built with Nutch

See also

  • Hadoop – Java framework that supports distributed applications running on large clusters.

References

  1. "Apache Nutch™ - Downloads". Retrieved 11 June 2024.
  2. "Apache Nutch -". nutch.apache.org.
  3. "Common Crawl's Move to Nutch – Common Crawl – Blog". blog.commoncrawl.org. Retrieved 2015-10-14.
  4. "Nutch 2.3 Release". Apache Nutch News. The Apache Software Foundation. 22 January 2015. Retrieved 18 January 2016.
  5. "Nutch 1.10 Release Notes". ASF JIRA. The Apache Software Foundation. 6 May 2015. Retrieved 18 January 2016.
  6. "Nutch 1.11 Release Notes". ASF JIRA. The Apache Software Foundation. 7 December 2015. Retrieved 18 January 2016.
  7. "Nutch 2.4 Release". Apache Nutch News. The Apache Software Foundation. 11 October 2019. Retrieved 20 May 2022.
  8. "Scalability of the Nutch search engine" (PDF).
  9. "Base Operating System Provisioning and Bringup for a Commercial Supercomputer" (PDF). Archived from the original (PDF) on December 3, 2008.
  10. The Sapphire Web Crawler - Crawl Statistics. Boston.lti.cs.cmu.edu (2008-10-01). Retrieved on 2013-07-21.
  11. "Our Updated Search". Creative Commons. 2004-09-03.
  12. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. Archived from the original on 2010-01-07.
  13. "New CC search UI". Creative Commons. 2006-08-02.
  14. "Where can I get the source code for Wikia Search?". Archived from the original on 2011-11-04. Retrieved 2010-02-12.
  15. "Update on Wikia – doing more of what's working | Jimmy Wales". 31 March 2009.
