Sentence embedding

In natural language processing, a sentence embedding is a representation of a sentence as a vector of numbers which encodes meaningful semantic information.[1][2][3][4][5][6][7]

State-of-the-art embeddings are based on the learned hidden-layer representations of dedicated sentence transformer models. BERT pioneered an approach that prepends a dedicated [CLS] token to each sentence fed into the model; the final hidden state vector of this token encodes information about the sentence and can be fine-tuned for use in sentence classification tasks. In practice, however, BERT's sentence embedding with the [CLS] token achieves poor performance, often worse than simply averaging non-contextual word embeddings. SBERT later achieved superior sentence embedding performance[8] by fine-tuning BERT with a siamese neural network architecture on the SNLI dataset.
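The difference between taking the [CLS] vector and averaging token vectors can be illustrated with a minimal sketch. The hidden states below are invented toy values rather than real model output, and the function names `cls_pooling` and `mean_pooling` are illustrative assumptions, not the API of any particular library. Mean pooling over token states is shown only as a commonly used alternative to [CLS] pooling.

```python
def cls_pooling(hidden_states):
    """Use the final hidden state of the first ([CLS]) token as the sentence vector."""
    return list(hidden_states[0])


def mean_pooling(hidden_states, attention_mask):
    """Average the hidden states of all non-padding tokens."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]


# Toy final-layer hidden states (one 2-dimensional vector per token); values are made up.
hidden_states = [
    [0.5, 0.5],  # [CLS]
    [1.0, 3.0],  # first word token
    [3.0, 1.0],  # second word token
    [9.0, 9.0],  # padding position, excluded by the mask
]
attention_mask = [1, 1, 1, 0]

cls_vector = cls_pooling(hidden_states)
mean_vector = mean_pooling(hidden_states, attention_mask)
print(cls_vector, mean_vector)
```

Either vector can serve as the sentence embedding; which pooling works better depends on how the underlying model was trained.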

Other approaches are loosely based on the idea of distributional semantics applied to sentences. Skip-Thought trains an encoder-decoder structure to predict neighboring sentences; this has been shown to achieve worse performance than approaches such as InferSent or SBERT.

An alternative direction is to aggregate word embeddings, such as those returned by Word2vec, into sentence embeddings. The most straightforward approach is to simply compute the average of word vectors, known as continuous bag-of-words (CBOW).[9] However, more elaborate solutions based on word vector quantization have also been proposed. One such approach is the vector of locally aggregated word embeddings (VLAWE),[10] which demonstrated performance improvements in downstream text classification tasks.
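The averaging approach can be sketched in a few lines. The word vectors below are invented three-dimensional toy values (real Word2vec vectors typically have hundreds of dimensions), and `average_embedding` is a hypothetical helper name used only for this illustration.

```python
# Toy word vectors; the numbers are invented for illustration only.
TOY_WORD_VECTORS = {
    "the": [0.0, 0.0, 0.0],
    "cat": [1.0, 0.0, 2.0],
    "sat": [0.0, 3.0, 1.0],
}


def average_embedding(sentence, word_vectors):
    """Return the element-wise mean of the vectors of the known words in a sentence."""
    tokens = sentence.lower().split()
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("sentence contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


embedding = average_embedding("The cat sat", TOY_WORD_VECTORS)
print(embedding)
```

Because the mean ignores word order, "the cat sat" and "sat the cat" receive the same embedding under this scheme.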

Applications

In recent years, sentence embedding has seen a growing level of interest due to its applications in natural-language-queryable knowledge bases, which use vector indexing for semantic search. LangChain, for instance, uses sentence transformers to index documents. In particular, an index is built by generating embeddings for chunks of documents and storing (document chunk, embedding) tuples. Given a query in natural language, an embedding for the query is then generated. A top-k similarity search is run between the query embedding and the document chunk embeddings to retrieve the most relevant document chunks as context information for question answering tasks. This approach is also known formally as retrieval-augmented generation.[11]
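The indexing and retrieval steps can be sketched as follows. This is a minimal illustration, not LangChain's implementation: the two-dimensional embeddings are invented toy values standing in for real model output, the names `cosine_similarity` and `top_k` are assumptions, and the search is a brute-force scan rather than a dedicated vector index. All vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, index, k):
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    scored = [(cosine_similarity(query_embedding, emb), chunk) for chunk, emb in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# The index stores (document chunk, embedding) tuples; embeddings are toy values.
index = [
    ("chunk about cats", [1.0, 0.0]),
    ("chunk about dogs", [0.8, 0.6]),
    ("chunk about taxes", [0.0, 1.0]),
]

query_embedding = [1.0, 0.1]  # would come from embedding the user's query
results = top_k(query_embedding, index, k=2)
print(results)
```

The retrieved chunks would then be passed to a language model as context for answering the query.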

Though not as predominant as BERTScore, sentence embeddings are commonly used for sentence similarity evaluation, which is often applied when optimizing a large language model's generation parameters by comparing candidate sentences against reference sentences. By using the cosine similarity of the sentence embeddings of candidate and reference sentences as the evaluation function, a grid-search algorithm can be used to automate hyperparameter optimization [citation needed].
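A grid search of this kind might look like the sketch below. Everything model-related here is a hypothetical stand-in: `toy_generate` returns canned text instead of calling a language model, `toy_embed` counts words from a small fixed vocabulary instead of using a sentence-embedding model, and the parameter names `temperature` and `top_p` are assumed examples of generation parameters.

```python
import itertools
import math

VOCAB = ["cat", "dog", "mat", "park", "sat", "ran"]


def toy_embed(sentence):
    """Hypothetical stand-in for a sentence-embedding model: word counts over VOCAB."""
    tokens = sentence.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def toy_generate(temperature, top_p):
    """Hypothetical stand-in for a language model call; returns canned text."""
    if temperature <= 0.5 and top_p >= 0.9:
        return "the cat sat on the mat"
    return "a dog ran in the park"


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


reference = "a cat sat on a mat"
grid = itertools.product([0.2, 0.5, 1.0], [0.8, 0.9])

# Keep the first parameter setting that achieves the highest similarity score.
best_params, best_score = None, -1.0
for temperature, top_p in grid:
    candidate = toy_generate(temperature, top_p)
    score = cosine_similarity(toy_embed(candidate), toy_embed(reference))
    if score > best_score:
        best_params, best_score = (temperature, top_p), score

print(best_params, best_score)
```

In a realistic setting the score would be averaged over many candidate and reference pairs rather than computed on a single sentence.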

Evaluation

One way of testing sentence encodings is to apply them to the Sentences Involving Compositional Knowledge (SICK) corpus[12] for both entailment (SICK-E) and relatedness (SICK-R).

The best results in [13] are obtained using a BiLSTM network trained on the Stanford Natural Language Inference (SNLI) Corpus. The Pearson correlation coefficient for SICK-R is 0.885 and the accuracy for SICK-E is 86.3. A slight improvement over these scores is presented in [14]: 0.888 for SICK-R and 87.8 for SICK-E, using a concatenation of bidirectional gated recurrent units.
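The relatedness metric reported above is the Pearson correlation between model-predicted scores and human relatedness judgments. A minimal sketch of that computation is given below; the score lists are invented toy values, not SICK data, and the way the cited works produce their predicted scores (for example, with a trained regressor on top of the embeddings) is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: human relatedness ratings and model-predicted scores for five sentence pairs.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [0.1, 0.3, 0.5, 0.7, 0.9]

r = pearson(model_scores, human_scores)
print(r)
```

A coefficient near 1 means the model ranks and spaces sentence pairs much as human annotators do; the toy scores here are perfectly linearly related, so the coefficient is at its maximum.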

References

  1. ^ Barkan, Oren; Razin, Noam; Malkiel, Itzik; Katz, Ori; Caciularu, Avi; Koenigstein, Noam (2019). "Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding". arXiv:1908.05161 [cs.LG].
  2. ^ The Current Best of Universal Word Embeddings and Sentence Embeddings
  3. ^ Cer, Daniel; Yang, Yinfei; Kong, Sheng-yi; Hua, Nan; Limtiaco, Nicole; John, Rhomni St.; Constant, Noah; Guajardo-Cespedes, Mario; Yuan, Steve; Tar, Chris; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray (2018). "Universal Sentence Encoder". arXiv:1803.11175 [cs.CL].
  4. ^ Wu, Ledell; Fisch, Adam; Chopra, Sumit; Adams, Keith; Bordes, Antoine; Weston, Jason (2017). "StarSpace: Embed All the Things!". arXiv:1709.03856 [cs.CL].
  5. ^ Arora, Sanjeev; Liang, Yingyu; Ma, Tengyu (2016). "A simple but tough-to-beat baseline for sentence embeddings". openreview:SyK00v5xx.
  6. ^ Trifan, Mircea; Ionescu, Bogdan; Gadea, Cristian; Ionescu, Dan (2015). "A graph digital signal processing method for semantic analysis". 2015 IEEE 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics. pp. 187–192. doi:10.1109/SACI.2015.7208196. ISBN 978-1-4799-9911-8. S2CID 17099431.
  7. ^ Basile, Pierpaolo; Caputo, Annalina; Semeraro, Giovanni (2012). "A Study on Compositional Semantics of Words in Distributional Spaces". 2012 IEEE Sixth International Conference on Semantic Computing. pp. 154–161. doi:10.1109/ICSC.2012.55. ISBN 978-1-4673-4433-3. S2CID 552921.
  8. ^ Reimers, Nils; Gurevych, Iryna (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084 [cs.CL].
  9. ^ Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013-09-06). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
  10. ^ Ionescu, Radu Tudor; Butnaru, Andrei (2019). "Vector of Locally-Aggregated Word Embeddings (VLAWE)". Proceedings of the 2019 Conference of the North. Minneapolis, Minnesota: Association for Computational Linguistics. pp. 363–369. doi:10.18653/v1/N19-1033. S2CID 85500146.
  11. ^ Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv:2005.11401 [cs.CL].
  12. ^ Marelli, Marco; Menini, Stefano; Baroni, Marco; Bentivogli, Luisa; Bernardi, Raffaella; Zamparelli, Roberto (2014). "A SICK cure for the evaluation of compositional distributional semantic models". LREC. pp. 216–223.
  13. ^ Conneau, Alexis; Kiela, Douwe; Schwenk, Holger; Barrault, Loic; Bordes, Antoine (2017). "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data". arXiv:1705.02364 [cs.CL].
  14. ^ Subramanian, Sandeep; Trischler, Adam; Bengio, Yoshua; Pal, Christopher J. (2018). "Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning". arXiv:1804.00079 [cs.CL].
