Morphological dictionary

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between the surface forms and the lexical forms of words. Surface forms are the word forms found in natural language text. The lexical form corresponding to a surface form is the lemma followed by grammatical information (for example the part of speech, gender and number). In English, give, gives, giving, gave and given are surface forms of the verb give. The lexical form would be "give", verb. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.

Notable examples and formalisms

Universal Morphologies

Inspired by the success of Universal Dependencies for the cross-linguistic annotation of syntactic dependencies, similar efforts have emerged for morphology, e.g., UniMorph[1] and UDer.[2] Both use simple tabular (tab-separated) formats with one form per row, together with its inflection information (UniMorph) or its derivation (UDer):

aalen   aalend  V.PTCP;PRS

aalen   aalen   V;IND;PRS;1;PL

aalen   aalen   V;IND;PRS;3;PL

aalen   aalen   V;NFIN

(UniMorph, German. Columns are LEMMA, FORM, FEATURES)
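Because each row has exactly three tab-separated columns, such a file can be read with a few lines of code. The following is a minimal sketch in Python, assuming the column order LEMMA, FORM, FEATURES given above and semicolon-separated features; the names UniMorphEntry and parse_unimorph_line are illustrative and do not belong to any UniMorph tooling.

```python
from typing import NamedTuple, Tuple


class UniMorphEntry(NamedTuple):
    """One row of a UniMorph-style table (hypothetical helper type)."""
    lemma: str
    form: str
    features: Tuple[str, ...]


def parse_unimorph_line(line: str) -> UniMorphEntry:
    # Assumption: exactly three tab-separated columns per row.
    lemma, form, features = line.rstrip("\n").split("\t")
    # Assumption: individual features are separated by semicolons.
    return UniMorphEntry(lemma, form, tuple(features.split(";")))


# Two of the sample rows above, written with explicit tab characters.
sample = "aalen\taalen\tV;IND;PRS;1;PL\naalen\taalen\tV;NFIN\n"
entries = [parse_unimorph_line(row) for row in sample.splitlines() if row.strip()]
```

Each entry is a full-form pair plus its feature bundle, with no alignment between characters and features.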

In UDer, additional information (part of speech) is encoded within the columns:

abändern_V      Abänderung_Nf   dVN07>

Abarbeiten_Nn   abarbeiten_V    dNV09>

abartig_A       Abartigkeit_Nf  dAN03>

Abart_Nf        abartig_A       dNA05>

abbaggern_V     Abbaggern_Nn    dVN09>

(UDer, German DErivBase 0.5. Columns are BASE, DERIVED, RULE)
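The part-of-speech information embedded in the BASE and DERIVED columns can be separated from the lemma when reading such rows. The sketch below assumes, based only on the sample above, that the tag follows the last underscore of each token and that lemmas contain no whitespace; the function names are hypothetical.

```python
def split_lexeme(token):
    """Split 'Abänderung_Nf' into ('Abänderung', 'Nf').

    Assumption: the part-of-speech tag follows the last underscore.
    """
    lemma, _, pos = token.rpartition("_")
    return lemma, pos


def parse_uder_line(line):
    # Assumption: three whitespace-separated columns (BASE, DERIVED, RULE).
    base, derived, rule = line.split()
    return {
        "base": split_lexeme(base),
        "derived": split_lexeme(derived),
        "rule": rule,
    }


row = parse_uder_line("abändern_V      Abänderung_Nf   dVN07>")
```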

At the time of writing (2021), all of these are non-aligned morphological dictionaries (see below). Their simple format is particularly well suited to machine learning techniques, and UniMorph in particular has been the subject of numerous shared tasks.

Finite State Transducers

Finite State Transducers (FSTs) are a popular technique for the computational handling of morphology, especially inflectional morphology. In rule-based morphological parsers, both the lexicon and the rules are normally formalized as finite state automata and subsequently combined. They thus require morphological dictionaries with specific processing instructions (which often have a linguistic interpretation but, technically, are treated like arbitrary string symbols).[3] Popular FST packages such as SFST[4] (as available from the fst package in Debian and Ubuntu) make it possible to define application-specific file formats for morphological lexica that bundle different pieces of morphological information with every individual morpheme. These are thus aligned morphological dictionaries, but very rich (and also idiosyncratic) in structure.


Sample data from SMOR[5] (German SFST grammar):

<Base_Stems>Aachen<NN><base><nativ><Name-Neut_s>

<Base_Stems>Aal<NN><base><nativ><NMasc_es_e>

<Base_Stems>Aarau<NN><base><nativ><Name-Neut_s>

<Suff_Stems><suffderiv><gebunden><kompos><NN>nom<>:e<>:n<NN><SUFF><kompos><frei>

<Suff_Stems><suffderiv><gebunden><kompos><NN>nom<NN><SUFF><base><frei><NMasc_en_en>

<Suff_Stems><suffderiv><gebunden><kompos><NN>nom<NN><SUFF><deriv><frei>
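To illustrate how the angle-bracket symbols are bundled with the morpheme itself, the following sketch separates the two for a single lexicon line. It is only a surface-level reading of the sample data, not how SFST or SMOR process their lexica: it assumes that every symbol is written between "<" and ">", and it does not interpret pair notations such as "<>:e" in the suffix entries. The function name split_entry is hypothetical.

```python
import re

# A multi-character symbol such as <NN> or <Base_Stems>.
TAG = re.compile(r"<([^<>]*)>")


def split_entry(entry):
    """Return (plain material, list of symbol names) for one lexicon line."""
    tags = TAG.findall(entry)
    material = TAG.sub("", entry)  # whatever is left once the symbols are removed
    return material, tags


stem, tags = split_entry("<Base_Stems>Aal<NN><base><nativ><NMasc_es_e>")
```

For the base stem entries the remaining material is the stem itself, and the symbols carry the word class, origin and inflection class.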

Interlinear Glossed Text editors

Interlinear Glossed Text (IGT) is a popular formalism in language documentation, linguistic typology and other branches of linguistics and the philologies. Although IGT can be created with a conventional text editor, specialized software has been developed for it; notable examples are Toolbox,[6] the FieldWorks Language Explorer (FLEx)[7] and open source alternatives such as Xigt.[8] Toolbox and FLEx support semi-automated annotation by means of an internal morphological dictionary. Whenever a morphological segment is encountered for which an annotation can be found in the dictionary, this annotation is applied. Whenever a morphological segment is newly annotated, the annotation is stored in the dictionary. FLEx and Toolbox provide different editor functionalities for annotating text and editing dictionaries, so that information beyond that found in annotations can be added, but at their core, their formats provide aligned morphological dictionaries.
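The look-up-and-store cycle described above can be sketched as follows. This is a simplified illustration of the general idea, not the implementation or data model of Toolbox or FLEx; the class GlossLexicon and its methods are hypothetical, and each segment is assumed to have at most one stored gloss.

```python
class GlossLexicon:
    """Toy internal dictionary mapping morphological segments to glosses."""

    def __init__(self):
        self._glosses = {}

    def annotate(self, segments):
        # Apply a stored annotation where one exists; None marks segments
        # that still need manual annotation.
        return [(seg, self._glosses.get(seg)) for seg in segments]

    def confirm(self, segment, gloss):
        # A newly entered annotation is stored for later reuse.
        self._glosses[segment] = gloss


lexicon = GlossLexicon()
first_pass = lexicon.annotate(["house", "s"])   # nothing is known yet
lexicon.confirm("house", "house")
lexicon.confirm("s", "PL")
second_pass = lexicon.annotate(["house", "s"])  # glosses are now suggested
```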

FLEx and Xigt are based on XML formats, whereas Toolbox uses a plain text format with idiosyncratic "markers". FLEx and Toolbox are not directly interoperable with each other, but a semi-automated converter from Toolbox to FLEx does exist. Xigt comes with FLEx and Toolbox importers, but is less widely used than either FLEx or Toolbox. The formats of FLEx and Toolbox are not intended for human consumption, nor are they well supported by any processing software other than their native tools.

OntoLex-Morph: A community standard for morphological dictionaries

OntoLex is a community standard for machine-readable dictionaries on the web. In 2019, the OntoLex-Morph module was proposed to facilitate the data modelling of morphology in lexicography, as well as to provide a data model for morphological dictionaries for Natural Language Processing.[9] OntoLex-Morph supports both aligned and non-aligned morphological dictionaries. A specific goal is to establish interoperability among IGT dictionaries, FST lexicons and morphological dictionaries used for machine learning.

Types and structure of morphological dictionaries

Aligned morphological dictionaries

In an aligned morphological dictionary, the correspondence between the surface form and the lexical form of a word is aligned at the character level, for example:

(h,h) (o,o) (u,u) (s,s) (e,e) (s,⟨n⟩) (θ,⟨pl⟩)

Here θ is the empty symbol, ⟨n⟩ signifies "noun", and ⟨pl⟩ signifies "plural".

In the example the left hand side is the surface form (input), and the right hand side is the lexical form (output). This order is used in morphological analysis where a lexical form is generated from a surface form. In morphological generation this order would be reversed.
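The example can be written down as a list of (surface, lexical) symbol pairs. The sketch below, with illustrative names, shows that projecting onto either side and dropping the empty symbol recovers the surface form or the lexical form, which is why the same alignment serves both analysis and generation.

```python
EMPTY = "θ"  # the empty symbol used in the example above

# One entry of an aligned dictionary: a sequence of (surface, lexical) pairs.
alignment = [
    ("h", "h"), ("o", "o"), ("u", "u"), ("s", "s"), ("e", "e"),
    ("s", "⟨n⟩"), (EMPTY, "⟨pl⟩"),
]


def surface_form(pairs):
    """Concatenate the left-hand symbols, skipping the empty symbol."""
    return "".join(s for s, _ in pairs if s != EMPTY)


def lexical_form(pairs):
    """Concatenate the right-hand symbols, skipping the empty symbol."""
    return "".join(l for _, l in pairs if l != EMPTY)


analysis_input = surface_form(alignment)
analysis_output = lexical_form(alignment)
```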

Formally, if Σ is the alphabet of the input symbols, and Γ is the alphabet of the output symbols, an aligned morphological dictionary is a subset A ⊆ L*, where:

L = (Σ ∪ {θ}) × (Γ ∪ {θ})

is the alphabet of all the possible alignments including the empty symbol. That is, an aligned morphological dictionary is a set of strings in L*.

Non-aligned morphological dictionaries (full-form dictionaries)

A non-aligned morphological dictionary (or full-form dictionary) is simply a set of pairs of input and output strings. A non-aligned morphological dictionary would represent the previous example as:

(houses, house⟨n⟩⟨pl⟩)

It is possible to convert a non-aligned dictionary into an aligned dictionary. Besides trivial alignments to the left or to the right, linguistically motivated alignments which align characters to their corresponding morphemes are possible.
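A trivial alignment to the left can be sketched as follows: both strings are split into symbols, paired from the start, and the shorter side is padded with the empty symbol. The code assumes that grammatical tags are written between ⟨ and ⟩ as in the examples above; the function name left_align is illustrative.

```python
import re
from itertools import zip_longest

EMPTY = "θ"
# One symbol is either a complete tag such as ⟨pl⟩ or a single character.
SYMBOL = re.compile(r"⟨[^⟩]*⟩|.")


def left_align(surface, lexical):
    """Pair symbols position by position, padding the shorter side with θ."""
    surface_symbols = list(surface)
    lexical_symbols = SYMBOL.findall(lexical)
    return list(zip_longest(surface_symbols, lexical_symbols, fillvalue=EMPTY))


aligned = left_align("houses", "house⟨n⟩⟨pl⟩")
```

For this pair the trivial left alignment happens to coincide with the aligned example given earlier; a linguistically motivated alignment would generally require knowledge of morpheme boundaries.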

Lexical ambiguities

Frequently there exists more than one lexical form associated with a surface form of a word. For example, "house" may be a noun in the singular, /haʊs/, or may be a verb in the present tense, /haʊz/. As a result of this it is necessary to have a function which relates input strings with their corresponding output strings.

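If D is the dictionary, viewed as a set of pairs (w, w′) of surface and lexical forms, and we define the set of input words E such that E = {w : (w, w′) ∈ D}, the correspondence function τ on E would be defined as τ(w) = {w′ : (w, w′) ∈ D}.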
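In code, such a correspondence function amounts to a mapping from each surface form to the set of all its lexical forms. The following is a minimal sketch over a full-form dictionary; the tags ⟨sg⟩, ⟨v⟩ and ⟨pres⟩ are made up for the illustration, and build_correspondence is a hypothetical name.

```python
from collections import defaultdict


def build_correspondence(pairs):
    """Map every surface form to the set of lexical forms paired with it."""
    table = defaultdict(set)
    for surface, lexical in pairs:
        table[surface].add(lexical)
    return dict(table)


# A tiny full-form dictionary; the tag names are assumptions.
pairs = [
    ("house", "house⟨n⟩⟨sg⟩"),
    ("house", "house⟨v⟩⟨pres⟩"),
    ("houses", "house⟨n⟩⟨pl⟩"),
]
correspondence = build_correspondence(pairs)
```

An ambiguous surface form such as "house" then maps to more than one lexical form, and choosing among them is left to a later disambiguation step.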

References

  1. ^ Kirov, Christo, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui et al. "UniMorph 2.0: universal morphology." In LREC (2018).
  2. ^ Kyjánek, L., Žabokrtský, Z., Ševčíková, M., & Vidra, J. (2019, September). Universal derivations kickoff: a collection of harmonized derivational resources for eleven languages. In Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology (pp. 101-110).
  3. ^ "A Short History of Two-Level Morphology". www.ling.helsinki.fi. Retrieved 2021-11-30.
  4. ^ Schmid, Helmut. "A programming language for finite state transducers." In FSMNLP, vol. 4002, pp. 308-309. 2005.
  5. ^ Schmid, Helmut, Arne Fitschen, and Ulrich Heid. "SMOR: A German computational morphology covering derivation, composition and inflection." In LREC, pp. 1-263. 2004.
  6. ^ "Field Linguist's Toolbox". software.sil.org. 10 May 2017. Retrieved 2021-11-27.
  7. ^ "FieldWorks". software.sil.org. 9 December 2014. Retrieved 2021-11-27.
  8. ^ "XIGT". XIGT. Retrieved 2021-11-27.
  9. ^ Klimek, B., McCrae, J. P., Bosque-Gil, J., Ionov, M., Tauber, J. K., & Chiarcos, C. (2019). Challenges for the representation of morphology in ontology lexicons. Proceedings of eLex.
