LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit

  • Marjan Van de Kauter, LT3, Language and Translation Technology Team - Ghent University
  • Geert Coorman, LT3, Language and Translation Technology Team - Ghent University
  • Els Lefever, LT3, Language and Translation Technology Team - Ghent University
  • Bart Desmet, LT3, Language and Translation Technology Team - Ghent University
  • Lieve Macken, LT3, Language and Translation Technology Team - Ghent University
  • Véronique Hoste, LT3, Language and Translation Technology Team - Ghent University

Abstract

This paper presents the LeTs Preprocess Toolkit, a suite of robust, high-performance preprocessing modules including part-of-speech taggers, lemmatizers and named entity recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies in various domains (including automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to that of other existing tools.

Published
2013-12-01
How to Cite
Van de Kauter, M., Coorman, G., Lefever, E., Desmet, B., Macken, L., & Hoste, V. (2013). LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit. Computational Linguistics in the Netherlands Journal, 3, 103–120. Retrieved from https://clinjournal.org/clinj/article/view/28
Section
Articles