BookCorpus

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.^[1] It was the main corpus used to train the initial GPT model by OpenAI,^[2] and has been used as training data for other early large language models including Google's BERT.^[3] The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.^[3]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The paper did not name the site from which the books were scraped, but it is now known to be Smashwords.^[4]^[1] The authors described the corpus as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had been published by self-published ("indie") authors who offered them for free. The books were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service.^[4] The dataset was initially hosted on a University of Toronto webpage.^[4] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.^[1]

References

^[1] Bandy, Jack; Vincent, Nicholas (2021). "Addressing 'Documentation Debt' in Machine Learning Research: A Retrospective Datasheet for BookCorpus". NeurIPS.
^[2] "Improving Language Understanding by Generative Pre-Training" (PDF). Archived (PDF) from the original on January 26, 2021. Retrieved June 9, 2020.
^[3] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
^[4] Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian.
