Linguistic material used for various types of language research and processing
In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."[1]
According to Bird & Simons (2003),[2] this includes
data, i.e. "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar",[2]
tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data",[2] and
advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data". The latter aspect is usually referred to as "best practices" or "(community) standards".[2]
In a narrower sense, language resource is specifically applied to resources that are available in digital form, and then, "encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management".[1]
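The distinction between the broad and the narrow sense can be illustrated with a small sketch. The class and field names below (`Kind`, `LanguageResource`, `digital`) are hypothetical and chosen only for illustration. Excluding "advice" from the narrow sense is one reading of the quoted definition, which lists only data sets and tools/technologies/services.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """Three-way split following Bird & Simons (2003); names are illustrative."""
    DATA = "data"
    TOOL = "tool"
    ADVICE = "advice"


@dataclass(frozen=True)
class LanguageResource:
    name: str
    kind: Kind
    digital: bool  # assumption: a single flag stands in for "available in digital form"


def is_narrow_sense(resource: LanguageResource) -> bool:
    """Return True for digital data sets and tools (the narrower sense)."""
    return resource.digital and resource.kind in (Kind.DATA, Kind.TOOL)


examples = [
    LanguageResource("shoebox of handwritten index cards", Kind.DATA, digital=False),
    LanguageResource("annotated text corpus", Kind.DATA, digital=True),
    LanguageResource("corpus query tool", Kind.TOOL, digital=True),
    LanguageResource("annotation guidelines", Kind.ADVICE, digital=True),
]

# Keep only the resources that qualify under the narrower definition.
narrow = [r.name for r in examples if is_narrow_sense(r)]
print(narrow)
```

All four examples are language resources in the broad sense; only the digital corpus and the query tool pass the narrower filter.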
Typology
As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LREMap,[3] META-SHARE,[4] and, for data, the LLOD classification). Important classes of language resources include
vocabularies, repositories of linguistic terminology and language metadata, e.g., META-SHARE (for language resource metadata),[4] the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource),[5] or the Glottolog database (identifiers for language varieties and bibliographical database).[6]
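As a rough illustration of how such vocabularies are used, the following sketch attaches a Glottolog identifier to a resource description. The record layout (`ResourceMetadata` and its fields) is hypothetical and is not the META-SHARE schema. The identifier pattern (four alphanumeric characters followed by four digits) and the URL form are assumptions about Glottolog's conventions and should be checked against its documentation.

```python
import re
from dataclasses import dataclass

# Assumed glottocode shape: four alphanumeric characters plus four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


@dataclass
class ResourceMetadata:
    """Hypothetical minimal metadata record for a language resource."""
    title: str
    resource_type: str
    glottocode: str

    def language_uri(self) -> str:
        """Build a Glottolog languoid URL after checking the identifier's shape."""
        if not GLOTTOCODE.fullmatch(self.glottocode):
            raise ValueError(f"not a well-formed glottocode: {self.glottocode!r}")
        return f"https://glottolog.org/resource/languoid/id/{self.glottocode}"


record = ResourceMetadata(
    title="Example corpus",      # placeholder title
    resource_type="corpus",      # free-text type; a real registry would fix the values
    glottocode="stan1293",
)
print(record.language_uri())
```

Referring to a language variety by a shared identifier rather than by name lets independently created metadata records be matched without relying on spelling or naming conventions.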
Language resource publication, dissemination and creation
A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include:
the European Language Resources Association (ELRA, EU-based), and the Linguistic Data Consortium (LDC, US-based), which represent commercial hosting and dissemination platforms for language resources,
the Language Resources and Evaluation Journal (LREJ),[7]
the European Language Grid, a European platform for language technologies (e.g., services), data and resources.
As for the development of standards and best practices for language resources, these are the subject of several community groups and standardization efforts, including
ISO Technical Committee 37: Terminology and other language and content resources (ISO/TC 37), developing standards for all aspects of language resources,
W3C Community Group Best Practices for Multilingual Linked Open Data (BPMLOD),[8] working on best practice recommendations for publishing language resources as Linked Data or in RDF,
W3C Community Group Linked Data for Language Technology (LD4LT),[9] working on linguistic annotations on the web and language resource metadata,
W3C Community Group Ontology-Lexica (OntoLex),[10] working on lexical resources,
McCrae, John P.; Labropoulou, Penny; Gracia, Jorge; Villegas, Marta; Rodríguez-Doncel, Víctor; Cimiano, Philipp (2015). "One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web". In Gandon, Fabien; Guéret, Christophe; Villata, Serena; Breslin, John; Faron-Zucker, Catherine; Zimmermann, Antoine (eds.). The Semantic Web: ESWC 2015 Satellite Events. Lecture Notes in Computer Science. Vol. 9341. Cham: Springer International Publishing. pp. 271–282. doi:10.1007/978-3-319-25639-9_42. ISBN 978-3-319-25639-9.
Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In 6th International Conference on Language Resources and Evaluation (LREC 2008).
Nordhoff, Sebastian (2012), Chiarcos, Christian; Nordhoff, Sebastian; Hellmann, Sebastian (eds.), "Linked Data for Linguistic Diversity Research: Glottolog/Langdoc and ASJP Online", Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata, Springer, pp. 191–200, doi:10.1007/978-3-642-28249-2_18, ISBN 978-3-642-28249-2.