LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit

Marjan Van de Kauter; Geert Coorman; Els Lefever; Bart Desmet; Lieve Macken; Véronique Hoste

Authors

Marjan Van de Kauter LT3, Language and Translation Technology Team - Ghent University
Geert Coorman LT3, Language and Translation Technology Team - Ghent University
Els Lefever LT3, Language and Translation Technology Team - Ghent University
Bart Desmet LT3, Language and Translation Technology Team - Ghent University
Lieve Macken LT3, Language and Translation Technology Team - Ghent University
Véronique Hoste LT3, Language and Translation Technology Team - Ghent University

Abstract

This paper presents the LeTs Preprocess Toolkit, a suite of robust, high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies in various domains (including automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to that of other existing tools.
