In natural language processing, a word embedding is a representation of a word used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning.[1] Word embeddings can be obtained using language modeling and feature learning techniques, in which words or phrases from the vocabulary are mapped to vectors of real numbers.
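As a concrete illustration of the "closer in vector space means more similar" property, the minimal sketch below compares word vectors by cosine similarity, the proximity measure most commonly used with embeddings. The three-dimensional values are invented for illustration; real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors; closer to 1 means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors with invented values.
king  = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.15])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))  # ~0.99: nearby in vector space
print(cosine_similarity(king, apple))  # ~0.31: distant in vector space
```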
Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing[9] and sentiment analysis.[10]
Development and history of the approach
In distributional semantics, a quantitative methodological approach for understanding meaning in observed language, word embeddings or semantic feature space models have been used as a knowledge representation for some time.[11] Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by John Rupert Firth,[12] but also has roots in the contemporaneous work on search systems[13] and in cognitive psychology.[14]
The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings arose from the computational challenge of capturing distributional characteristics and using them practically to measure similarity between words, phrases, or entire documents. The first generation of semantic space models was the vector space model for information retrieval.[15][16][17] Such vector space models for words and their distributional data, implemented in their simplest form, result in a very sparse vector space of high dimensionality (cf. curse of dimensionality). Reducing the number of dimensions using linear algebraic methods such as singular value decomposition led to the introduction of latent semantic analysis in the late 1980s and the random indexing approach for collecting word co-occurrence contexts.[18][19][20][21] In 2000, Bengio et al. showed, in a series of papers titled "Neural probabilistic language models", how to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words".[22][23][24]
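To make the dimensionality-reduction step concrete, here is a minimal sketch of a truncated singular value decomposition over a term-document count matrix, the core operation behind latent semantic analysis. The vocabulary and counts are invented assumptions for illustration, not data from any cited study.

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
terms = ["ship", "boat", "ocean", "wood", "tree"]
X = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Truncated SVD: keep only the k largest singular values, turning sparse
# high-dimensional rows into dense low-dimensional term vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # each row is a k-dimensional term embedding

for term, vec in zip(terms, term_vectors):
    print(term, vec.round(2))  # "ship"/"boat" land near each other, far from "tree"
```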
A study published in NeurIPS (NIPS) 2002 introduced the use of both word and document embeddings applying the method of kernel CCA to bilingual (and multi-lingual) corpora, also providing an early example of self-supervised learning of word embeddings.[25]
Word embeddings come in two different styles: one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of the linguistic contexts in which the words occur; these different styles are studied in Lavelli et al., 2004.[26] Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high-dimensional data structures.[27] Most new word embedding techniques after about 2005 rely on a neural network architecture instead of more probabilistic and algebraic models, following foundational work by Yoshua Bengio[28] and colleagues.[29][30]
The approach has been adopted by many research groups after theoretical advances around 2010 on the quality of vectors and the training speed of the models, and after hardware advances allowed a broader parameter space to be explored profitably. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest in word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.[31]
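A minimal sketch of training word2vec embeddings with the open-source gensim library follows. The three-sentence corpus is a stand-in assumption (real training uses very large corpora, so the neighbours printed here are not meaningful), but the API usage reflects how such models are commonly trained.

```python
from gensim.models import Word2Vec

# Tiny tokenized corpus standing in for millions of real sentences.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=1 selects the skip-gram architecture; vector_size is the embedding dimension.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["king"]                     # the learned 50-dimensional vector
print(model.wv.most_similar("king", topn=2))  # nearest neighbours in vector space
```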
Polysemy and homonymy
Historically, one of the main limitations of static word embeddings or word vector space models is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, polysemy and homonymy are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term club is related to the word sense of a club sandwich, clubhouse, golf club, or any other sense that club might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.[32][33]
Most approaches that produce multi-sense embeddings can be divided into two main categories for their word-sense representation: unsupervised and knowledge-based.[34] Based on the word2vec skip-gram, Multi-Sense Skip-Gram (MSSG)[35] performs word-sense discrimination and embedding simultaneously, improving training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG), this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., WordNet, ConceptNet, BabelNet) with word embeddings and word sense disambiguation, Most Suitable Sense Annotation (MSSA)[36] labels word senses through an unsupervised and knowledge-based approach, considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embedding technique, so multi-sense embeddings are produced. The MSSA architecture also allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.[37]
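The "disambiguate first, then embed" pipeline described above can be sketched as follows. This is not the MSSA algorithm itself: NLTK's simplified Lesk disambiguator stands in for MSSA's sliding-window procedure, and the two sentences are invented; the sketch only illustrates how sense-tagged tokens turn a standard embedding technique into a multi-sense one.

```python
# Requires NLTK's WordNet data: nltk.download("wordnet")
from nltk.wsd import lesk
from gensim.models import Word2Vec

def sense_tag(tokens):
    """Replace each token with token#synset when a WordNet sense can be assigned."""
    tagged = []
    for tok in tokens:
        synset = lesk(tokens, tok)  # simplified Lesk disambiguation
        tagged.append(f"{tok}#{synset.name()}" if synset else tok)
    return tagged

sentences = [
    ["i", "deposited", "cash", "at", "the", "bank"],
    ["we", "sat", "on", "the", "bank", "of", "the", "river"],
]

# Step 1: disambiguate, so the two senses of "bank" become distinct tokens.
tagged_sentences = [sense_tag(s) for s in sentences]

# Step 2: any standard embedding technique now yields one vector per word sense.
model = Word2Vec(tagged_sentences, vector_size=25, min_count=1, sg=1)
```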
As of the late 2010s, contextually meaningful embeddings such as ELMo and BERT have been developed.[40] Unlike static word embeddings, these embeddings are at the token level, in that each occurrence of a word has its own embedding. These embeddings better reflect the multi-sense nature of words, because occurrences of a word in similar contexts are situated in similar regions of BERT's embedding space.[41][42]
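A minimal sketch of obtaining such token-level embeddings from a pre-trained BERT model through the Hugging Face transformers library is given below, reusing the "club" example from above: the same surface form receives a different vector in each context.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(sentence: str) -> torch.Tensor:
    """Return one contextual vector per token of the input sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0]  # shape: (num_tokens, 768)

# "club" gets a different embedding in each sentence, unlike a static embedding.
golf_tokens = token_embeddings("I swung the club on the golf course.")
food_tokens = token_embeddings("I ordered a club sandwich for lunch.")
```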
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad.[43] Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad[43] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
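The core idea, treating sequence n-grams as "words", can be sketched as below. The overlapping 3-gram splitting is a simplification of the published ProtVec procedure, and the short sequences are invented for illustration; the original work trained on large protein databases.

```python
from gensim.models import Word2Vec

def to_ngrams(sequence: str, n: int = 3) -> list[str]:
    """Split a biological sequence into overlapping n-grams ("biological words")."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Hypothetical amino-acid sequences standing in for a large protein corpus.
proteins = ["MKTAYIAKQR", "MKVLYAAKQR"]
corpus = [to_ngrams(p) for p in proteins]
# e.g. "MKTAYIAKQR" -> ["MKT", "KTA", "TAY", "AYI", ...]

# With n-grams treated as words, any word embedding method applies unchanged.
model = Word2Vec(corpus, vector_size=20, window=5, min_count=1, sg=1)
```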
Game design
Word embeddings with applications in game design have been proposed by Rabii and Cook[44] as a way to discover emergent gameplay using logs of gameplay data. The process requires transcribing the actions that occur during a game in a formal language and then using the resulting text to create word embeddings. The results presented by Rabii and Cook[44] suggest that the resulting vectors can capture expert knowledge about games like chess that is not explicitly stated in the game's rules.
Sentence embeddings
The idea has been extended to embeddings of entire sentences or even documents, e.g. in the form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of machine translation.[45] A more recent and popular approach for representing sentences is Sentence-BERT, or SentenceTransformers, which modifies pre-trained BERT with the use of Siamese and triplet network structures.[46]
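A minimal sketch of computing sentence embeddings with the sentence-transformers library follows. The checkpoint name is an assumption (one of the commonly distributed pre-trained models); any SentenceTransformers checkpoint would serve.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed pre-trained checkpoint

sentences = [
    "The weather is lovely today.",
    "It is sunny and warm outside.",
    "He drove the truck to the depot.",
]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: near-paraphrases
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: unrelated sentences
```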
fastText is also used to calculate word embeddings for the text corpora in Sketch Engine that are available online.[54]
Ethical implications
Word embeddings may contain the biases and stereotypes present in the training dataset. Bolukbasi et al. point out in the 2016 paper "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" that a publicly available (and popular) word2vec embedding trained on Google News texts (a commonly used data corpus), which consists of text written by professional journalists, still shows disproportionate word associations reflecting gender and racial biases when word analogies are extracted.[55] For example, one of the analogies generated using this embedding is "man is to computer programmer as woman is to homemaker".[56][57]
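Such analogies are typically surfaced by vector arithmetic over the embedding space. The sketch below issues the standard most_similar query with gensim against the Google News word2vec vectors discussed above; the file name is the conventional distribution name and the file must be downloaded separately, and Bolukbasi et al.'s actual method constrains this query further.

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (assumed to be downloaded locally).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Analogy by vector arithmetic: find x such that x ~= programmer - man + woman.
print(vectors.most_similar(positive=["computer_programmer", "woman"],
                           negative=["man"], topn=3))
```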
Research by Jieyu Zhou et al. shows that applying these trained word embeddings without careful oversight likely perpetuates existing bias in society, which is introduced through unaltered training data. Furthermore, word embeddings can even amplify these biases.[58][59]
^Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Distributed Representations of Words and Phrases and their Compositionality". arXiv:1310.4546 [cs.CL].
^Lebret, Rémi; Collobert, Ronan (2013). "Word Embeddings through Hellinger PCA". Conference of the European Chapter of the Association for Computational Linguistics (EACL). Vol. 2014. arXiv:1312.5542.
^Firth, J.R. (1957). "A synopsis of linguistic theory 1930–1955". Studies in Linguistic Analysis: 1–32. Reprinted in F.R. Palmer, ed. (1968). Selected Papers of J.R. Firth 1952–1959. London: Longman.
^Luhn, H.P. (1953). "A New Method of Recording and Searching Information". American Documentation. 4: 14–16. doi:10.1002/asi.5090040104.
^Osgood, C.E.; Suci, G.J.; Tannenbaum, P.H. (1957). The Measurement of Meaning. University of Illinois Press.
^Salton, Gerard (1962). "Some experiments in the generation of word and document associations". Proceedings of the December 4-6, 1962, fall joint computer conference on - AFIPS '62 (Fall). pp. 234–250. doi:10.1145/1461518.1461544. ISBN 9781450378796. S2CID 9937095.
^Karlgren, Jussi; Sahlgren, Magnus (2001). Uesaka, Yoshinori; Kanerva, Pentti; Asoh, Hideki (eds.). "From words to understanding". Foundations of Real-World Intelligence. CSLI Publications: 294–308.
^Sahlgren, Magnus (2005). "An Introduction to Random Indexing". Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE 2005), 16 August, Copenhagen, Denmark.
^Lavelli, Alberto; Sebastiani, Fabrizio; Zanoli, Roberto (2004). Distributional term representations: an experimental comparison. 13th ACM International Conference on Information and Knowledge Management. pp. 615–624. doi:10.1145/1031171.1031284.
^Morin, Fredric; Bengio, Yoshua (2005). "Hierarchical probabilistic neural network language model" (PDF). In Cowell, Robert G.; Ghahramani, Zoubin (eds.). Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. Vol. R5. pp. 246–252.
^"word2vec". Google Code Archive. Retrieved 23 July 2021.
^Reisinger, Joseph; Mooney, Raymond J. (2010). Multi-Prototype Vector-Space Models of Word Meaning. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California: Association for Computational Linguistics. pp. 109–117. ISBN 978-1-932432-65-7. Retrieved October 25, 2019.
^Huang, Eric (2012). Improving word representations via global context and multiple word prototypes. OCLC 857900050.
^Camacho-Collados, Jose; Pilehvar, Mohammad Taher (2018). "From Word to Sense Embeddings: A Survey on Vector Representations of Meaning". arXiv:1805.04032 [cs.CL].
^Neelakantan, Arvind; Shankar, Jeevan; Passos, Alexandre; McCallum, Andrew (2014). "Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 1059–1069. arXiv:1504.06654. doi:10.3115/v1/d14-1113. S2CID 15251438.
^Akbik, Alan; Blythe, Duncan; Vollgraf, Roland (2018). "Contextual String Embeddings for Sequence Labeling". Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics: 1638–1649.
^Li, Jiwei; Jurafsky, Dan (2015). "Do Multi-Sense Embeddings Improve Natural Language Understanding?". Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 1722–1732. arXiv:1506.01070. doi:10.18653/v1/d15-1200. S2CID6222768.
^Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (June 2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics: 4171–4186. doi:10.18653/v1/N19-1423. S2CID 52967399.
^Lucy, Li; Bamman, David (2021). "Characterizing English Variation across Social Media Communities with BERT". Transactions of the Association for Computational Linguistics. 9: 538–556.
^Reif, Emily; Yuan, Ann; Wattenberg, Martin; Viegas, Fernanda B.; Coenen, Andy; Pearce, Adam; Kim, Been (2019). "Visualizing and Measuring the Geometry of BERT". Advances in Neural Information Processing Systems. 32.
^Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992.
^"Embedding Viewer". Embedding Viewer. Lexical Computing. Archived from the original on 8 February 2018. Retrieved 7 Feb 2018.
^Bolukbasi, Tolga; Chang, Kai-Wei; Zou, James; Saligrama, Venkatesh; Kalai, Adam (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings". arXiv:1607.06520 [cs.CL].