Language model

A language model is a probabilistic model of a natural language.[1] In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.[2]

Language models are useful for a variety of tasks, including speech recognition[3] (helping prevent predictions of low-probability sequences, such as nonsense), machine translation,[4] natural language generation (generating more human-like text), optical character recognition, route optimization,[5] handwriting recognition,[6] grammar induction,[7] and information retrieval.[8][9]

Large language models, currently their most advanced form, are a combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded the pure statistical models, such as the word n-gram language model.

Pure statistical models

Models based on word n-grams

A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network–based models, which have been superseded by large language models.[10] It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[11] Special tokens, ⟨s⟩ and ⟨/s⟩, are introduced to denote the start and end of a sentence.

To prevent a zero probability being assigned to unseen words, each word's probability is slightly lower than its frequency count in a corpus. To calculate it, various methods were used, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
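As a minimal sketch of the simplest case, the following Python snippet estimates bigram probabilities with add-one smoothing. The function names, the toy corpus, and the use of the strings "<s>" and "</s>" as sentence-boundary tokens are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with boundary tokens."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab

def add_one_prob(prev, word, context_counts, bigram_counts, vocab):
    """P(word | prev) with add-one smoothing: every bigram count is raised by 1."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

# Hypothetical toy corpus, for illustration only.
corpus = ["the rain in Spain", "the rain falls"]
contexts, bigrams, vocab = train_bigram(corpus)
seen = add_one_prob("the", "rain", contexts, bigrams, vocab)
unseen = add_one_prob("the", "Spain", contexts, bigrams, vocab)
```

In this sketch the unseen bigram "the Spain" receives a small non-zero probability, while the seen bigram "the rain" receives somewhat less than its raw relative frequency, which is the trade-off that smoothing makes.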

Exponential

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$$P(w_m \mid w_1, \ldots, w_{m-1}) = \frac{1}{Z(w_1, \ldots, w_{m-1})} \exp\left(a^{T} f(w_1, \ldots, w_m)\right)$$

where $Z(w_1, \ldots, w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_1, \ldots, w_m)$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
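The exponential form can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the feature set (one indicator per unigram and per bigram), the `weights` dictionary standing in for the parameter vector, and the tiny vocabulary are all hypothetical.

```python
import math

def maxent_prob(history, word, vocab, weights):
    """P(word | history) = exp(score) / Z(history), with indicator features.

    `history` must be a non-empty list of previous words. Features that do
    not appear in `weights` contribute a weight of 0.
    """
    def score(candidate):
        # Indicator features that fire for this (history, candidate) pair.
        features = [("unigram", candidate), ("bigram", history[-1], candidate)]
        return sum(weights.get(feature, 0.0) for feature in features)

    # Partition function: normalizes over every candidate next word.
    z = sum(math.exp(score(candidate)) for candidate in vocab)
    return math.exp(score(word)) / z

# Hypothetical vocabulary and weights, for illustration only.
vocab = ["rain", "plain", "Spain"]
weights = {("bigram", "the", "rain"): 2.0}
p_rain = maxent_prob(["the"], "rain", vocab, weights)
```

Because the partition function sums over the whole vocabulary, the probabilities of all candidate words given the same history add up to 1.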

The log-bilinear model is another example of an exponential language model.

Skip-gram model

The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word n-gram language model) faced. Words represented in an embedding vector were not necessarily consecutive anymore, but could leave gaps that are skipped over.[12]

Formally, a k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.

For example, in the input text:

the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
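The enumeration above can be reproduced with a short Python sketch. The function name `skip_grams` is hypothetical, and the code assumes one reading of the definition: at most k words may be skipped between adjacent components. Some formulations instead bound the total number of skipped words; the two readings coincide for n = 2.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Return all length-n subsequences with at most k skipped tokens
    between each pair of adjacent components."""
    result = []
    for indices in combinations(range(len(tokens)), n):
        gaps_ok = all(b - a - 1 <= k for a, b in zip(indices, indices[1:]))
        if gaps_ok:
            result.append(tuple(tokens[i] for i in indices))
    return result

tokens = "the rain in Spain falls mainly on the plain".split()
grams = skip_grams(tokens, n=2, k=1)
```

For this sentence, `grams` holds the 8 ordinary bigrams plus the 7 additional subsequences listed above. Enumerating every index combination is simple but slow for long inputs; it is used here only for clarity.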

In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

$$v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})$$

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[13][14]
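The nearest-neighbor reading of ≈ can be illustrated with a toy example built on the commonly cited king/queen analogy. The two-dimensional vectors below are invented for illustration and do not come from any trained model; using cosine similarity and excluding the three input words from the candidates are also assumptions of this sketch.

```python
import math

# Hypothetical 2-d embeddings: (royalty, gender), with male = +1, female = -1.
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "male": (0.0, 1.0),
    "female": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest to v(a) - v(b) + v(c)."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

nearest = analogy("king", "male", "female", TOY_VECTORS)
```

With these toy vectors the combined vector coincides with the one assigned to "queen", so that word is returned as the nearest neighbor. In trained embeddings the match is only approximate, which is why the nearest-neighbor rule is needed.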

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in recurrent neural network-based language models (also known as continuous space language models).[15] Such continuous space embeddings help to alleviate the curse of dimensionality: the number of possible word sequences increases exponentially with the size of the vocabulary, which in turn causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.[16]

Large language models

A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.[17]

The largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tuned for specific tasks, or be guided by prompt engineering.[18] These models acquire predictive power regarding syntax, semantics, and ontologies[19] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data on which they are trained.[20]

Although language models sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.[21]

Evaluation and benchmarks

The quality of language models is mostly evaluated by comparison with human-created sample benchmarks derived from typical language-oriented tasks. Other, less established quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.[22]

Various data sets have been developed for use in evaluating language processing systems.[23] These include:

  • Corpus of Linguistic Acceptability[24]
  • GLUE benchmark[25]
  • Microsoft Research Paraphrase Corpus[26]
  • Multi-Genre Natural Language Inference
  • Question Natural Language Inference
  • Quora Question Pairs[27]
  • Recognizing Textual Entailment[28]
  • Semantic Textual Similarity Benchmark
  • SQuAD question answering Test[29]
  • Stanford Sentiment Treebank[30]
  • Winograd NLI
  • BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.[31] (LLaMa Benchmark)

References

  1. ^ Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.). Archived from the original on 22 May 2022. Retrieved 24 May 2022.
  2. ^ Rosenfeld, Ronald (2000). "Two decades of statistical language modeling: Where do we go from here?". Proceedings of the IEEE. 88 (8): 1270–1278. doi:10.1109/5.880083. S2CID 10959945.
  3. ^ Kuhn, Roland, and Renato De Mori (1990). "A cache-based natural language model for speech recognition". IEEE transactions on pattern analysis and machine intelligence 12.6: 570–583.
  4. ^ Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). "Semantic parsing as machine translation" Archived 15 August 2020 at the Wayback Machine. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  5. ^ Liu, Yang; Wu, Fanyou; Liu, Zhiyuan; Wang, Kai; Wang, Feiyue; Qu, Xiaobo (2023). "Can language models be used for real-world urban-delivery route optimization?". The Innovation. 4 (6): 100520. doi:10.1016/j.xinn.2023.100520. PMC 10587631.
  6. ^ Pham, Vu, et al (2014). "Dropout improves recurrent neural networks for handwriting recognition" Archived 11 November 2020 at the Wayback Machine. 14th International Conference on Frontiers in Handwriting Recognition. IEEE.
  7. ^ Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). "Grammar induction with neural language models: An unusual replication" Archived 14 August 2022 at the Wayback Machine. arXiv:1808.10000.
  8. ^ Ponte, Jay M.; Croft, W. Bruce (1998). A language modeling approach to information retrieval. Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008.
  9. ^ Hiemstra, Djoerd (1998). A linguistically motivated probabilistic model of information retrieval. Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
  10. ^ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (1 March 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
  11. ^ Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
  12. ^ David Guthrie; et al. (2006). "A Closer Look at Skip-gram Modelling" (PDF). Archived from the original (PDF) on 17 May 2017. Retrieved 27 April 2014.
  13. ^ Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient estimation of word representations in vector space". arXiv:1301.3781 [cs.CL].
  14. ^ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality (PDF). Advances in Neural Information Processing Systems. pp. 3111–3119. Archived (PDF) from the original on 29 October 2020. Retrieved 22 June 2015.
  15. ^ Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks". Archived from the original on 1 November 2020. Retrieved 27 January 2019.
  16. ^ Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. Vol. 3. p. 3881. Bibcode:2008SchpJ...3.3881B. doi:10.4249/scholarpedia.3881. Archived from the original on 26 October 2020. Retrieved 28 August 2015.
  17. ^ "Better Language Models and Their Implications". OpenAI. 14 February 2019. Archived from the original on 19 December 2020. Retrieved 25 August 2019.
  18. ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (December 2020). Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (eds.). "Language Models are Few-Shot Learners" (PDF). Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 1877–1901. Archived (PDF) from the original on 17 November 2023. Retrieved 14 March 2023.
  19. ^ Fathallah, Nadeen; Das, Arunav; De Giorgis, Stefano; Poltronieri, Andrea; Haase, Peter; Kovriguina, Liubov (26 May 2024). NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning (PDF). Extended Semantic Web Conference 2024. Hersonissos, Greece.
  20. ^ Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus. 151 (2): 127–138. doi:10.1162/daed_a_01905. S2CID 248377870. Archived from the original on 17 November 2023. Retrieved 9 March 2023.
  21. ^ Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (9 January 2018). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5. Archived from the original on 16 April 2023. Retrieved 11 December 2021.
  22. ^ Karlgren, Jussi; Schutze, Hinrich (2015), "Evaluating Learning Language Representations", International Conference of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Springer International Publishing, pp. 254–260, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
  23. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (10 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
  24. ^ "The Corpus of Linguistic Acceptability (CoLA)". nyu-mll.github.io. Archived from the original on 7 December 2020. Retrieved 25 February 2019.
  25. ^ "GLUE Benchmark". gluebenchmark.com. Archived from the original on 4 November 2020. Retrieved 25 February 2019.
  26. ^ "Microsoft Research Paraphrase Corpus". Microsoft Download Center. Archived from the original on 25 October 2020. Retrieved 25 February 2019.
  27. ^ Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, vol. 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
  28. ^ Sammons, Mark; Vydiswaran, V.G. Vinod; Roth, Dan. "Recognizing Textual Entailment" (PDF). Archived from the original (PDF) on 9 August 2017. Retrieved 24 February 2019.
  29. ^ "The Stanford Question Answering Dataset". rajpurkar.github.io. Archived from the original on 30 October 2020. Retrieved 25 February 2019.
  30. ^ "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". nlp.stanford.edu. Archived from the original on 27 October 2020. Retrieved 25 February 2019.
  31. ^ Hendrycks, Dan (14 March 2023), Measuring Massive Multitask Language Understanding, archived from the original on 15 March 2023, retrieved 15 March 2023

Further reading

  • J M Ponte; W B Croft (1998). "A Language Modeling Approach to Information Retrieval". Research and Development in Information Retrieval. pp. 275–281. CiteSeerX 10.1.1.117.4237.
  • F Song; W B Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280. CiteSeerX 10.1.1.21.6467.
  • Chen, Stanley; Joshua Goodman (1998). An Empirical Study of Smoothing Techniques for Language Modeling (Technical report). Harvard University. CiteSeerX 10.1.1.131.5458.
