MateCat

MateCat is a web-based computer-assisted translation (CAT) tool, released as open-source software under the Lesser General Public License (LGPL).

Overview

MateCat ("Machine Translation Enhanced Computer Assisted Translation") began as a three-year research project (November 2011 – October 2014) funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 287688.[1] The project received over €2.5 million in European funding.[2]

The project consortium was led by FBK (Fondazione Bruno Kessler), an international research center based in Trento, Italy. The other members were Translated, an AI-based language solution provider founded by Marco Trombetti and Isabelle Andrieu; Université du Maine; and the University of Edinburgh.

CAT tools

CAT tools provide access to translation memories (TMs), terminology databases, concordance tools and, more recently, to machine translation (MT) engines. The integration of suggestions from an MT engine as a complement to TM matches is motivated by recent studies,[3][4][5] which have shown that post-editing MT suggestions improves the level of accuracy in translations.
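The way a CAT tool can present an MT suggestion alongside TM matches can be sketched as follows. This is a simplified illustration, not MateCat's actual ranking logic: the function names, the 0.75 threshold, and the use of Python's `difflib` similarity ratio as a fuzzy-match score are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def fuzzy_match(segment, tm_source):
    """Word-level similarity in [0, 1] between a new segment and a TM source."""
    return SequenceMatcher(None, segment.split(), tm_source.split()).ratio()


def pick_suggestions(segment, tm, mt_output, threshold=0.75):
    """Return TM matches above the threshold (best first), then the MT output.

    `tm` maps source segments to stored translations. The threshold is a
    hypothetical value chosen for illustration only.
    """
    scored = [(fuzzy_match(segment, src), tgt, "TM") for src, tgt in tm.items()]
    suggestions = [s for s in scored if s[0] >= threshold]
    suggestions.sort(reverse=True)
    # The MT suggestion complements the TM matches; it carries no fuzzy score.
    suggestions.append((None, mt_output, "MT"))
    return suggestions
```

With this arrangement the translator always receives at least one suggestion to post-edit, even when the TM contains no sufficiently close match.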

MateCat facilitates the editing of machine translation output and manages the localization workflow. It leverages knowledge of field-specific language (for example, legal terminology) to improve translation suggestions, and also uses machine learning to automatically improve suggestions over time.[6] It is designed to function both as a translation workbench product and as a research platform for integrating new MT functions, running post-editing experiments, and measuring user productivity.

Technology

Statistical MT

MateCat runs as a web server that connects with other services via open APIs: the TM service MyMemory,[7] the commercial Google Translate (GT) service, ModernMT, and a list of Moses-based[8] services specified in a configuration file. While MyMemory and GT are always available, Moses servers have to be installed and set up separately. Moses allows MateCat to extend the GT API to support self-tuning, user-adaptive, and informative MT functions. The open-source version of MateCat natively supports the XLIFF[9] file format, and converters can be configured to support other formats. The tool supports Unicode (UTF-8) encoding, including non-Latin alphabets and right-to-left languages, and handles texts that embed mark-up tags. It supports concordances, terminology databases, and customizable quality estimation components, and provides an API for the Moses Toolkit that can be customized to languages and domains.
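To illustrate the XLIFF format mentioned above, the following minimal sketch reads translation units from an XLIFF 1.2 document with Python's standard library. It is not MateCat's own parser: the sample document and the function name `read_segments` are assumptions made for this example, and inline mark-up tags are simply flattened to their text content.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents, as defined by OASIS.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical sample file with one translated and one untranslated unit.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="it" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Hello world</source>
        <target>Ciao mondo</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Press <g id="1">Save</g> to continue</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def read_segments(xliff_text):
    """Return a list of dicts with the id, source and target of each unit."""
    root = ET.fromstring(xliff_text)
    segments = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.find("x:source", NS)
        target = unit.find("x:target", NS)
        segments.append({
            "id": unit.get("id"),
            # itertext() also collects text nested inside inline tags.
            "source": "".join(source.itertext()),
            "target": "".join(target.itertext()) if target is not None else None,
        })
    return segments
```

Calling `read_segments(SAMPLE)` yields one entry per `trans-unit`; units without a `target` element are reported with a target of `None`, which a workbench could treat as segments still awaiting translation.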

MT support

The tool supports Moses-based servers that provide enhanced CAT-MT communication. In particular, the GT API is augmented in two ways: feedback is sent to the MT engine every time a segment is post-edited, and the MT output is enriched with additional information such as confidence scores and word lattices. The MT server developed in the project supports multi-threading to serve multiple translators, handles text segments that include tags, and adapts to the post-edits performed by each user.[10]

Context-aware translation

MateCat also aims to provide MT suggestions that are consistent not only with the segments already edited but also, in theory, with the whole document. This context information is intended to be embedded in the statistical models and should enable better disambiguation, for instance between lexical alternatives. The context-based models are designed to combine information about recurring terms and expressions extracted during document analysis with the corresponding chosen and confirmed translations as soon as these become available. In particular, translation constraints related to inter-sentence and intra-sentence anaphoric expressions, to syntactic concordances, and to lexical coherence are to be taken into account by means of specific statistical models.

Real-time processing

The core components of traditional MT systems, namely the translation and language models, are generally static: they never change after an initial training phase. This makes them unsuitable for a dynamic environment like the one MateCat provides for translators. To model the dynamic changes described in the two previous sections, MateCat developed data structures that can be updated rapidly as soon as the user supplies a new translation, together with efficient algorithms that perform this adaptation in real time and transparently to the translator. Efficiency is further meant to be improved by multithreading on a single CPU, as well as by distributed computing facilities running on private clusters or computer clouds.
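The idea of a model that is updated as soon as a translation is confirmed can be sketched with a small cache of phrase pairs whose reward decays as newer segments arrive. This is only a toy illustration loosely inspired by the cache-based adaptation cited above,[10] not the data structure used in MateCat or Moses; the class name, the exponential decay, and the `decay` parameter are assumptions.

```python
import math


class TranslationCache:
    """Toy cache of phrase pairs whose reward decays with age."""

    def __init__(self, decay=0.5):
        self.decay = decay   # hypothetical decay rate, chosen for illustration
        self.clock = 0       # number of confirmed post-edits seen so far
        self.entries = {}    # (source_phrase, target_phrase) -> clock at last use

    def update(self, phrase_pairs):
        """Register the phrase pairs extracted from one confirmed post-edit."""
        self.clock += 1
        for pair in phrase_pairs:
            self.entries[pair] = self.clock

    def score(self, source_phrase, target_phrase):
        """Return a reward in (0, 1], or 0.0 if the pair was never seen."""
        last_used = self.entries.get((source_phrase, target_phrase))
        if last_used is None:
            return 0.0
        age = self.clock - last_used
        return math.exp(-self.decay * age)

    def best(self, source_phrase):
        """Return the cached target phrase with the highest current reward."""
        candidates = [
            (self.score(s, t), t) for (s, t) in self.entries if s == source_phrase
        ]
        return max(candidates)[1] if candidates else None
```

Each call to `update` is a constant-time dictionary write per phrase pair, so the cache can be refreshed after every confirmed segment without retraining anything. A decoder could then add `score` as an extra feature so that recently confirmed translations are preferred.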

Edit log

During post-editing, the tool collects timing information for each segment, which is updated every time the segment is opened and closed. For each segment it also records the suggestions that were generated and the one that was actually post-edited. This information is accessible at any time through a link on the Editing Page named Editing Log. The Editing Log page (Figure 1) shows a summary of the editing performed so far on the project, such as the average translation speed, the post-editing effort, and the percentage of top suggestions coming from MT or the TM. In addition, detailed statistics about the edit operations performed are reported for each segment, with segments sorted from the slowest to the fastest in terms of translation speed. The same information, in greater detail, can also be downloaded as a CSV file. While the Editing Log page is useful for monitoring the progress of a translation project in real time, the CSV file is a fundamental source of information for detailed productivity analyses once the project has ended.
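The kind of per-segment statistics described above can be sketched as follows. The record fields, the function names, and in particular the definition of post-editing effort (word-level edit distance between the suggestion and the final text, divided by the length of the longer one) are assumptions made for this example; MateCat's own formulas and CSV columns may differ.

```python
import csv
import io


def word_edit_distance(a, b):
    """Levenshtein distance between two lists of tokens."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def segment_stats(record):
    """Compute translation speed and post-editing effort for one segment."""
    source_words = len(record["source"].split())
    suggestion = record["suggestion"].split()
    final = record["post_edit"].split()
    effort = word_edit_distance(suggestion, final) / max(len(suggestion), len(final), 1)
    seconds = record["seconds"]
    speed = source_words / seconds * 3600 if seconds else 0.0
    return {
        "id": record["id"],
        "words_per_hour": round(speed, 1),
        "pe_effort": round(effort, 3),
    }


def to_csv(records):
    """Return the statistics as CSV text, slowest segment first."""
    rows = sorted((segment_stats(r) for r in records), key=lambda r: r["words_per_hour"])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "words_per_hour", "pe_effort"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Each record passed to `to_csv` is expected to be a dictionary with the keys `id`, `source`, `suggestion`, `post_edit` and `seconds`. An effort of 0 means the suggestion was accepted unchanged, while values close to 1 mean it was almost entirely rewritten.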

Applications

MateCat has been used within the MateCat project to investigate new MT functions[11] and to evaluate them in a real professional setting, in which translators have at their disposal all the sources of information they are used to working with. Thanks to its flexibility and ease of use, the tool has also been used for data collection and for education (a course on CAT technology for students in translation studies). An initial version of the tool was leveraged by the CasmaCat project[12] to create a workbench[13] particularly suited to investigating advanced interaction modalities such as interactive MT, eye tracking, and handwritten input. Currently the tool is employed by the translation agency Translated for its internal translation projects and is being tested by several international companies, both language service providers and IT companies. This has made it possible to collect continuous feedback from hundreds of translators, which, besides helping the developers improve the robustness of the tool, is also influencing the way new MT functions are integrated to give the best help to the end user.


References

  1. ^ José, M., & Machado, B. (2014). Free and open-source software — a translator’s good friend, 3. Retrieved from http://ec.europa.eu/translation/portuguese/magazine
  2. ^ EUROPEAN COMMISSION. (2017). EUROPEAN COMMISSION STAFF WORKING DOCUMENT INTERIM EVALUATION of HORIZON 2020 ANNEX 2. Brussels. Retrieved from http://ec.europa.eu/transparency/regdoc/rep/10102/2017/EN/SWD-2017-221-F1-EN-MAIN-PART-12.PDF
  3. ^ Marcello Federico; Alessandro Cattelan; Marco Trombetti (2012). "Measuring user productivity in machine translation enhanced computer assisted translation" (PDF). In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA). Amta2012.amtaweb.org. Archived from the original (PDF) on 30 October 2014. Retrieved 30 October 2014.
  4. ^ Spence Green; Jeffrey Heer; Christopher D Manning (2013). The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Chi '13. Dl.acm.org. pp. 439–448. doi:10.1145/2470654.2470718. ISBN 9781450318990. S2CID 119828. Retrieved 30 October 2014.
  5. ^ Samuel Läubli; Mark Fishel; Gary Massey; Maureen Ehrensberger-Dow; Martin Volk (2013). "Assessing Post-Editing Efficiency in a Realistic Translation Environment" (PDF). In Michel Simard, Sharon O'Brien and Lucia Specia (eds.), Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice. Nice, France: Mt-archive.info. pp. 83–91. Retrieved 30 October 2014.
  6. ^ "MateCat".
  7. ^ "MyMemory is the world's largest Translation Memory (TM) built collaboratively via MT and human contributions". Mymemory.translated.net. Retrieved 30 October 2014.
  8. ^ "Moses is the most popular open source statistical MT toolkit". Statmt.org. Retrieved 30 October 2014.
  9. ^ "Docs.oasis-open.org". Docs.oasis-open.org. Retrieved 30 October 2014.
  10. ^ Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the MT Summit XIV, pages 35–42, Nice, France, September.
  11. ^ Bertoldi et al., 2013; Cettolo et al., 2013; Turchi et al., 2013; Turchi et al., 2014
  12. ^ "Casmacat.eu". Casmacat.eu. Retrieved 30 October 2014.
  13. ^ Vicent Alabau, Ragnar Bonk, Christian Buck, Michael Carl, Francisco Casacuberta, Mercedes García-Martínez, Jesus Gonzalez, Philipp Koehn, Luis Leiva, Bartolomé Mesa-Lao, Daniel Ortiz, Hervé Saint-Amand, German Sanchis, and Chara Tsoukala. 2013. Advanced computer-aided translation with a web-based workbench. In Proceedings of Workshop on Post-editing Technology and Practice, pages 55–62.