Language identification

In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.

Overview

There are several statistical approaches to language identification, each using a different technique to classify the data. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as the mutual-information-based distance measure. The same technique can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods.[citation needed] The mutual-information-based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered either novel or better than simpler techniques.
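The compression idea can be sketched in a few lines. This is only an illustration under stated assumptions: zlib stands in for whatever compressor a real system would use, the two reference texts are tiny invented samples, and the function names are hypothetical. The score for a language is the number of extra compressed bytes needed when the input is appended to that language's reference text; the smallest value wins.

```python
import zlib

# Tiny invented reference texts (assumption); real systems use large corpora.
REFERENCE = {
    "english": "the quick brown fox jumps over the lazy dog and then it runs "
               "back to the house where the children are playing in the garden",
    "german": "der schnelle braune fuchs springt über den faulen hund und läuft "
              "dann zurück zum haus wo die kinder im garten spielen",
}


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, text: str) -> int:
    """Extra compressed bytes needed to encode text after the reference."""
    ref = reference.encode("utf-8")
    combined = ref + b" " + text.encode("utf-8")
    return compressed_size(combined) - compressed_size(ref)


def identify_by_compression(text: str, references=REFERENCE) -> str:
    """Return the language whose reference makes text cheapest to compress."""
    return min(references, key=lambda lang: extra_bytes(references[lang], text))


if __name__ == "__main__":
    print(identify_by_compression("the quick brown fox jumps over the lazy dog"))
```

With references this small the scores are noisy, so the sketch is meant to show the mechanics rather than to be used as a working detector.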

Another technique, described by Cavnar and Trenkle (1994) and Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text that needs to be identified, a similar model is built and compared to each stored language model. The most likely language is the one whose model is most similar to the model of the input text. This approach can be problematic when the input text is in a language for which there is no model. In that case, the method may return another, "most similar" language as its result. Input texts composed of several languages, as is common on the Web, are also problematic for any approach.
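A simplified sketch of the character n-gram comparison follows. It ranks n-grams by frequency and compares rank lists with an "out-of-place" distance in the style of Cavnar and Trenkle, but the details are assumptions made for illustration: the n-gram lengths (1 to 3), the profile size, the training sentences, and all function names are hypothetical rather than taken from the cited papers.

```python
from collections import Counter

# Tiny invented training texts (assumption); real profiles need far more data.
TRAINING = {
    "english": "the children are playing in the garden and the dog is sleeping",
    "german": "die kinder spielen im garten und der hund schläft in der sonne",
}


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def identify_by_ngrams(text, profiles):
    """Return the language whose stored profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


if __name__ == "__main__":
    profiles = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}
    print(identify_by_ngrams("the dog is in the garden", profiles))
```

Because the function always returns the nearest stored profile, it also shows the weakness noted above: a text in a language without a model is still assigned to the "most similar" known language.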

For a more recent method, see Řehůřek and Kolkus (2009). This method can detect multiple languages in an unstructured piece of text and works robustly on short texts of only a few words: something that the n-gram approaches struggle with.

An older statistical method by Grefenstette was based on the prevalence of certain function words (e.g., "the" in English).
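The function-word idea can be illustrated with a short sketch. The word lists below are small hand-picked samples chosen for this example (an assumption, not Grefenstette's actual lists), and the function name is hypothetical. The text is assigned to the language whose function words occur most often in it.

```python
import re
from collections import Counter

# Hand-picked sample function words (assumption); real lists are much longer.
FUNCTION_WORDS = {
    "english": {"the", "of", "and", "is", "in", "to", "on", "a"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "une"},
}


def identify_by_function_words(text, word_lists=FUNCTION_WORDS):
    """Return the language with the most function-word hits, or None if there are none."""
    tokens = Counter(re.findall(r"\w+", text.lower()))
    scores = {
        lang: sum(tokens[word] for word in words)
        for lang, words in word_lists.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_by_function_words("the cat is on the roof of the house"))
```

Returning None when no function word matches avoids forcing a guess on texts that contain none of the listed words, such as very short inputs.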

A common non-statistical, intuitive (though highly uncertain) approach is to look for common letter combinations, distinctive diacritics, or punctuation.[1][2]

Identifying similar languages

One of the great bottlenecks of language identification systems is distinguishing between closely related languages. Similar languages such as Bulgarian and Macedonian, or Indonesian and Malay, show significant lexical and structural overlap, making it challenging for systems to discriminate between them.

In 2014, the DSL (Discriminating between Similar Languages) shared task[3] was organized, providing a dataset (Tan et al., 2014) containing 13 different languages (and language varieties) in six language groups: Group A (Bosnian, Croatian, Serbian), Group B (Indonesian, Malaysian), Group C (Czech, Slovak), Group D (Brazilian Portuguese, European Portuguese), Group E (Peninsular Spanish, Argentine Spanish), Group F (American English, British English). The best system reached a performance of over 95% (Goutte et al., 2014). Results of the DSL shared task are described in Zampieri et al. (2014).

Software

  • Apache OpenNLP includes a character n-gram based statistical detector and comes with a model that can distinguish 103 languages
  • Apache Tika contains a language detector for 18 languages

References

  1. ^ Stock, Wolfgang G.; Stock, Mechtild (2013-07-31). Handbook of Information Science. Walter de Gruyter. pp. 180–181. ISBN 978-3-11-023500-5.
  2. ^ Hagiwara, Masato (2021-12-14). Real-World Natural Language Processing: Practical Applications with Deep Learning. Simon and Schuster. pp. 105–106. ISBN 978-1-61729-642-0.
  3. ^ "VarDial Workshop @ COLING 2014".
