Building an NLP pipeline within a digital publishing workflow

  • Hans Paulussen iMinds-ITEC, KU Leuven, Kortrijk, Belgium
  • Pedro Debevere iMinds-MMLab, UGent, Gent, Belgium
  • Francisco Bonachela Capdevila iMinds-ITEC, KU Leuven, Kortrijk, Belgium
  • Maribel Montero Perez iMinds-ITEC, KU Leuven, Kortrijk, Belgium
  • Martin Vanbrabant iMinds-ITEC, KU Leuven, Kortrijk, Belgium
  • Wesley De Neve iMinds-MMLab, UGent, Gent, Belgium
  • Stefan De Wannemacker iMinds-ITEC, KU Leuven, Kortrijk, Belgium

Abstract

Outside the laboratory environment, NLP tool developers have always had to rely on robust techniques to clean and streamline the many formats in which authentic texts occur. In most cases, the cleaned version was simply the bare text, stripped of all typographical information and tokenised in such a way that even the reconstruction of a simple sentence resulted in a displeasing layout. To integrate NLP output into the production workflow of digital publications, it is necessary to keep track of the original layout. In this paper, we present an example of an NLP pipeline developed to meet the requirements of real-world applications of digital publications.

The NLP pipeline presented here was developed within the framework of the iRead+ project, a cooperative research project between several industrial and academic partners in Flanders. The pipeline aims to enrich texts automatically with word-specific and contextual information in order to create an enhanced reading experience on tablets and to support automatic generation of grammatical exercises. The enriched documents contain both linguistic annotations (part-of-speech and lemmata) and semantic annotations based on the recognition and disambiguation of named entities. The whole enrichment process, provided via a web service, can be integrated into an XML-based production flow. The input of the NLP enrichment engine consists of two documents: a well-formed XML source file and a control file containing XPath expressions describing the nodes in the source file to be annotated and enriched. Because nodes may contain a pre-defined set of mixed data, the original document can be reconstructed with the selected enrichments.
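The two-document input described above can be illustrated with a minimal sketch. It is not the iRead+ implementation: the sample XML, the list of XPath expressions standing in for the control file, the `w` wrapper element, and the lowercased "lemma" are all assumptions made for illustration. Python's standard `xml.etree.ElementTree` supports only a subset of XPath, which is sufficient here. The sketch wraps each word of the selected nodes in an element while leaving every original character, including inline markup and whitespace, in place, so that the original text remains recoverable.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source document with mixed content (text plus inline markup).
SOURCE_XML = """<article>
  <title>Reading in Leuven</title>
  <body>
    <p>Anna reads <i>books</i> daily.</p>
    <note>Not for enrichment.</note>
  </body>
</article>"""

# Stand-in for the control file: XPath expressions selecting nodes to enrich.
# The expressions are assumed to select non-overlapping nodes.
CONTROL_XPATHS = [".//title", ".//p"]


def split_words(text):
    """Turn a string into (leading non-word run, list of <w> elements).

    Each non-word run is stored as the tail of the preceding <w>,
    so no character of the original string is lost.
    """
    lead, words = "", []
    for piece in re.findall(r"\w+|\W+", text or ""):
        if re.match(r"\w", piece):
            # Toy annotation: a real pipeline would call a tagger/lemmatiser.
            w = ET.Element("w", {"lemma": piece.lower()})
            w.text = piece
            words.append(w)
        elif words:
            words[-1].tail = piece
        else:
            lead = piece
    return lead, words


def enrich_node(node):
    """Wrap every word in node (text, inline children, tails) in <w>."""
    original_children = list(node)
    node.text, new_children = split_words(node.text)
    for child in original_children:
        enrich_node(child)  # descend into inline markup such as <i>
        child.tail, tail_words = split_words(child.tail)
        new_children.append(child)
        new_children.extend(tail_words)
    node[:] = new_children


def enrich(source_xml, xpaths):
    """Parse the source and enrich only the nodes the XPaths select."""
    root = ET.fromstring(source_xml)
    for xpath in xpaths:
        for node in root.findall(xpath):
            enrich_node(node)
    return root


if __name__ == "__main__":
    print(ET.tostring(enrich(SOURCE_XML, CONTROL_XPATHS), encoding="unicode"))
```

Because the annotations are added as wrapper elements rather than by replacing the text, concatenating the text content of the enriched tree yields the same string as the source, and nodes not named in the control list (here `note`) are left untouched.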

Published
2014-12-01
How to Cite
Paulussen, H., Debevere, P., Bonachela Capdevila, F., Montero Perez, M., Vanbrabant, M., De Neve, W., & De Wannemacker, S. (2014). Building an NLP pipeline within a digital publishing workflow. Computational Linguistics in the Netherlands Journal, 4, 71-84. Retrieved from https://clinjournal.org/clinj/article/view/41
Section
Articles