GloVe

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations of words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.[1] Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit interesting linear substructures of the word vector space. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods.

It was developed as an open-source project at Stanford[2] and was launched in 2014. It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec. As of 2022, both approaches are regarded as outdated: contextual models such as ELMo, and Transformer-based models such as BERT, which add multiple neural-network attention layers on top of a word embedding model similar to word2vec, have come to be regarded as the state of the art in NLP.[3]

Definition

"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)[4]

The idea of GloVe is to construct, for each word $i$, two vectors $w_i, \tilde{w}_i$, such that the relative positions of the vectors capture part of the statistical regularities of the word $i$. The statistical regularity is defined as the co-occurrence probabilities. Words that resemble each other in meaning should also resemble each other in co-occurrence probabilities.

Word counting

Let the vocabulary be $V$, the set of all possible words (also known as "tokens"). Punctuation is either ignored or treated as part of the vocabulary, and similarly for capitalization and other typographical details.[1]

If two words occur close to each other, then we say that they occur in the context of each other. For example, if the context length is 3, then we say that in the following sentence

GloVe₁, coined₂ from₃ Global₄ Vectors₅, is₆ a₇ model₈ for₉ distributed₁₀ word₁₁ representation₁₂

the word "model₈" is in the context of "word₁₁" but not in the context of "representation₁₂".

A word is not in the context of itself, so "model₈" is not in the context of the word "model₈". However, if the same word appears again within the context window, that second occurrence does count.

Let $X_{ij}$ be the number of times that the word $j$ appears in the context of the word $i$ over the entire corpus. For example, if the corpus is just "I don't think that that is a problem." we have $X_{\text{that},\text{that}} = 2$, since the first "that" appears in the second one's context, and vice versa.

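Let $X_i = \sum_{j \in V} X_{ij}$ be the number of words in the context of all instances of word $i$. By counting, we have $X_i = 2 \times (\text{context size}) \times \#(\text{occurrences of word } i)$ (except for words occurring right at the start and end of the corpus).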
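The counting rules above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `cooccurrence_counts` is made up for this example, the text is lower-cased, and the final full stop is dropped (one of the tokenization choices mentioned above).

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, context_size):
    """Return counts[i][j]: how often word j occurs within `context_size`
    positions of an occurrence of word i (the position itself is excluded)."""
    counts = defaultdict(Counter)
    for pos, word in enumerate(tokens):
        start = max(0, pos - context_size)
        stop = min(len(tokens), pos + context_size + 1)
        for other in range(start, stop):
            if other != pos:  # a word is not in the context of itself
                counts[word][tokens[other]] += 1
    return counts

# The one-sentence corpus from the text, lower-cased and without punctuation.
tokens = "i don't think that that is a problem".split()
X = cooccurrence_counts(tokens, context_size=3)

# Row totals: the number of context words over all occurrences of each word.
row_totals = {word: sum(row.values()) for word, row in X.items()}

print(X["that"]["that"])   # 2: each "that" is in the other's context
print(row_totals["that"])  # 12 = 2 * (context size 3) * (2 occurrences)
print(row_totals["i"])     # 3: fewer, because "i" starts the corpus
```

The last line shows the boundary effect mentioned in the text: a word at the very start of the corpus has context on one side only.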

Probabilistic modelling

Let $P_{ik} := P(k \mid i) := X_{ik} / X_i$ be the co-occurrence probability. That is, if one samples a random occurrence of the word $i$ in the entire document, and a random word within its context, that word is $k$ with probability $P_{ik}$. Note that $P_{ik} \neq P_{ki}$ in general. For example, in a typical modern English corpus, $P_{\text{ado},\text{much}}$ is close to one, but $P_{\text{much},\text{ado}}$ is close to zero. This is because the word "ado" is almost only used in the context of the archaic phrase "much ado about", but the word "much" occurs in all kinds of contexts.
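The asymmetry can be seen by normalizing each row of a count table by its own total. The sketch below uses hypothetical counts invented purely for illustration (they are not taken from any corpus), and the helper name `cooccurrence_probabilities` is likewise an assumption of this example.

```python
def cooccurrence_probabilities(counts):
    """Turn counts[i][k] into probs[i][k] = counts[i][k] / (total of row i)."""
    probs = {}
    for word, row in counts.items():
        total = sum(row.values())
        probs[word] = {ctx: n / total for ctx, n in row.items()}
    return probs

# Hypothetical counts, invented for illustration only. The raw counts are
# symmetric (ado/much == much/ado == 9), but the row totals differ widely.
toy_counts = {
    "ado": {"much": 9, "about": 9, "the": 2},
    "much": {"ado": 9, "the": 500, "so": 491},
}

P = cooccurrence_probabilities(toy_counts)
print(P["ado"]["much"])  # 9 / 20
print(P["much"]["ado"])  # 9 / 1000
```

Even though the two raw counts are equal, the probabilities differ by a large factor because "much" has a far larger row total than "ado".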

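For example, Table 1 of [1] reports the co-occurrence probabilities, and the ratios between them, of the words "ice" and "steam" with the probe words "solid", "gas", "water" and "fashion" in a 6 billion token corpus.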

Inspecting the table, we see that the words "ice" and "steam" are indistinguishable along the "water" (often co-occurring with both) and "fashion" (rarely co-occurring with either) dimensions, but distinguishable along the "solid" (co-occurring more with "ice") and "gas" (co-occurring more with "steam") dimensions.

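The idea is to learn two vectors $w_i, \tilde{w}_i$ for each word $i$, such that we have a multinomial logistic regression:

$$w_i^{\mathsf T} \tilde{w}_j + b_i + \tilde{b}_j \approx \ln P_{ij}$$

and the terms $b_i, \tilde{b}_j$ are unimportant parameters.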

This means that if the words $i, j$ have similar co-occurrence probabilities $(P_{ik})_{k \in V} \approx (P_{jk})_{k \in V}$, then their vectors should also be similar: $w_i \approx w_j$.
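One way to see the link between probability ratios and vector differences is to write the regression for two words $i$ and $j$ against the same context word $k$ and subtract. The following is a short worked step under the model stated above, not a derivation taken from the original paper:

```latex
\begin{align}
w_i^{\mathsf T}\tilde{w}_k + b_i + \tilde{b}_k &\approx \ln P_{ik}, \\
w_j^{\mathsf T}\tilde{w}_k + b_j + \tilde{b}_k &\approx \ln P_{jk}, \\
(w_i - w_j)^{\mathsf T}\tilde{w}_k + (b_i - b_j) &\approx \ln \frac{P_{ik}}{P_{jk}}.
\end{align}
```

The context bias $\tilde{b}_k$ cancels, so the ratio $P_{ik}/P_{jk}$ is governed by the difference $w_i - w_j$. When the ratio is close to 1 for every context word $k$, the right-hand side is close to 0 for every $k$, which is consistent with $w_i \approx w_j$ and $b_i \approx b_j$.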

Logistic regression

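Naively, logistic regression can be run by minimizing the squared loss:

$$L = \sum_{i, j \in V} \left( w_i^{\mathsf T} \tilde{w}_j + b_i + \tilde{b}_j - \ln P_{ij} \right)^2$$

However, this would be noisy for rare co-occurrences. To fix the issue, the squared loss is weighted so that the loss is slowly ramped up as the absolute number of co-occurrences $X_{ij}$ increases:

$$L = \sum_{i, j \in V} f(X_{ij}) \left( w_i^{\mathsf T} \tilde{w}_j + b_i + \tilde{b}_j - \ln P_{ij} \right)^2$$

where

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

and $x_{\max}, \alpha$ are hyperparameters. In the original paper, the authors found that $x_{\max} = 100, \alpha = 3/4$ seem to work well in practice.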
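The weighting function and the weighted loss can be sketched in plain Python as follows. This is an illustrative evaluation of the loss only, assuming the count-table layout `counts[i][j]` and the function names used here, both chosen for this example; it does not perform any optimization. Pairs that never co-occur are skipped, which matches the weight of zero that the ramp assigns to a count of zero.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Ramp (x / x_max) ** alpha below x_max, constant 1 from x_max upward."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def weighted_loss(counts, w, w_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Weighted squared error between the model score and ln P_ij."""
    total = 0.0
    for i, row in counts.items():
        x_i = sum(row.values())
        for j, x_ij in row.items():
            if x_ij == 0:
                continue  # weight(0) == 0, and ln(0) is undefined
            score = sum(a * c for a, c in zip(w[i], w_tilde[j])) + b[i] + b_tilde[j]
            residual = score - math.log(x_ij / x_i)
            total += weight(x_ij, x_max, alpha) * residual ** 2
    return total

# Tiny hypothetical example: word "a" co-occurs once with "b", three times with "c".
counts = {"a": {"b": 1, "c": 3}}
w = {"a": [0.0, 0.0]}
w_tilde = {"b": [0.0, 0.0], "c": [0.0, 0.0]}

zero_bias = {"b": 0.0, "c": 0.0}
fitted_bias = {"b": math.log(1 / 4), "c": math.log(3 / 4)}  # equals ln P_ab, ln P_ac

loss_at_zero = weighted_loss(counts, w, w_tilde, {"a": 0.0}, zero_bias)
loss_at_fit = weighted_loss(counts, w, w_tilde, {"a": 0.0}, fitted_bias)
print(loss_at_zero > loss_at_fit)
```

With the biases set to the log-probabilities, every residual vanishes and the loss is zero, whereas the all-zero parameters leave a positive loss.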

Use

Once a model is trained, we have 4 trained parameters for each word: $w_i, \tilde{w}_i, b_i, \tilde{b}_i$. The parameters $b_i, \tilde{b}_i$ are irrelevant, and only $w_i, \tilde{w}_i$ are relevant.

The authors recommended using $w_i + \tilde{w}_i$ as the final representation vector for word $i$, because empirically it worked better than $w_i$ or $\tilde{w}_i$ alone.
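Combining the two trained vectors is a simple element-wise sum. The sketch below assumes the vectors are stored as Python lists keyed by word, and uses made-up numbers. The name `final_vectors` is hypothetical.

```python
def final_vectors(w, w_tilde):
    """Return, for every word, the element-wise sum of its two trained vectors."""
    return {
        word: [a + c for a, c in zip(w[word], w_tilde[word])]
        for word in w
    }

# Hypothetical trained parameters for two words (values invented for illustration).
w = {"ice": [0.5, -1.0], "steam": [0.25, 2.0]}
w_tilde = {"ice": [0.25, 1.5], "steam": [-0.25, 1.0]}

vectors = final_vectors(w, w_tilde)
print(vectors["ice"])    # [0.75, 0.5]
print(vectors["steam"])  # [0.0, 3.0]
```

The bias terms are simply discarded at this stage.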

Applications

GloVe can be used to find relations between words, such as synonyms, company-product relations, and zip codes and cities. However, the algorithm is not effective in identifying homographs, i.e., words with the same spelling and different meanings, because it computes a single set of vectors for all words with the same spelling.[5] The algorithm is also used by the spaCy library to build semantic word embedding features and to compute lists of the closest-matching words under distance measures such as cosine similarity and Euclidean distance.[6] GloVe was also used as the word representation framework for online and offline systems designed to detect psychological distress in patient interviews.[7]
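A nearest-neighbour lookup by cosine similarity, the kind of query described above, can be sketched without any library. The vectors below are hypothetical three-dimensional values chosen for illustration (real GloVe vectors typically have tens to hundreds of dimensions), and `most_similar` is a name made up for this example, not the API of any particular library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, top_n=3):
    """Return the `top_n` words whose vectors are closest to `query` by cosine."""
    scored = [
        (cosine_similarity(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]

# Hypothetical word vectors, invented for illustration only.
toy_vectors = {
    "ice": [0.9, 0.1, 0.3],
    "snow": [0.8, 0.2, 0.3],
    "steam": [0.1, 0.9, 0.3],
    "fashion": [-0.2, -0.1, 0.9],
}

print(most_similar("ice", toy_vectors, top_n=2))  # ['snow', 'steam']
```

Because each spelling has exactly one vector, a homograph would receive a single entry in such a table regardless of its sense, which is the limitation noted above.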

References

  1. Pennington, Jeffrey; Socher, Richard; Manning, Christopher (October 2014). Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). "GloVe: Global Vectors for Word Representation". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics: 1532–1543. doi:10.3115/v1/D14-1162.
  2. GloVe: Global Vectors for Word Representation (pdf). Archived 2020-09-03 at the Wayback Machine. "We use our insights to construct a new model for word representation which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model."
  3. Von der Mosel, Julian; Trautsch, Alexander; Herbold, Steffen (2022). "On the validity of pre-trained transformers for natural language processing in the software engineering domain". IEEE Transactions on Software Engineering. 49 (4): 1487–1507. arXiv:2109.04738. doi:10.1109/TSE.2022.3178469. ISSN 1939-3520. S2CID 237485425.
  4. Firth, J. R. (1957). Studies in Linguistic Analysis (PDF). Wiley-Blackwell.
  5. Wenig, Phillip (2019). "Creation of Sentence Embeddings Based on Topical Word Representations: An approach towards universal language understanding". Towards Data Science.
  6. Singh, Mayank; Gupta, P. K.; Tyagi, Vipin; Flusser, Jan; Ören, Tuncer I. (2018). Advances in Computing and Data Sciences: Second International Conference, ICACDS 2018, Dehradun, India, April 20-21, 2018, Revised Selected Papers. Singapore: Springer. p. 171. ISBN 9789811318122.
  7. Abad, Alberto; Ortega, Alfonso; Teixeira, António; Mateo, Carmen; Hinarejos, Carlos; Perdigão, Fernando; Batista, Fernando; Mamede, Nuno (2016). Advances in Speech and Language Technologies for Iberian Languages: Third International Conference, IberSPEECH 2016, Lisbon, Portugal, November 23-25, 2016, Proceedings. Cham: Springer. p. 165. ISBN 9783319491691.
