Using skipgrams and PoS-based feature selection for patent classification

Eva D'hondt; Suzan Verberne; Niklas Weber; Kees Koster; Lou Boves

Authors

Eva D'hondt Radboud University Nijmegen
Suzan Verberne Radboud University Nijmegen
Niklas Weber Radboud University Nijmegen
Kees Koster Radboud University Nijmegen
Lou Boves Radboud University Nijmegen

Abstract

Until recently, phrases were deemed suboptimal features for text classification because of their sparseness (Lewis 1992). Recent work (Koster et al. 2011; D’hondt et al. Forthcoming) found, however, that for classifying English patent documents, combining phrasal and unigram representations yields significantly better classification results, because phrases are better suited to capturing the Multi-Word Terms (MWTs) that abound in terminology-rich technical patent texts.

In this article, we consider the classification of English patent abstracts at the class level (about 120 classes) of the International Patent Classification (IPC). We compare (a) two types of phrases, bigrams and skipgrams, in their ability to capture meaningful information; and (b) the impact of additionally filtering the classification features by their Part of Speech (PoS). To this end, we performed a series of classification experiments with different phrasal text representations and feature selection settings to determine which representation benefits English patent classification most. We further investigated which type of information (as captured by the PoS-filtered skipgrams) has the most impact during classification.

The results show that combining unigrams and PoS-filtered skipgrams leads to a significant improvement in classification scores over the unigram baseline. Additional experiments show that the most important phrasal features are bigrams, and that additional useful phrases can be captured by allowing at most 2 skips in the skipgram approach. Deeper analysis revealed that noun-noun combinations and, to a lesser extent, adjective-noun combinations are the most informative phrasal features for patent classification.
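
To make the feature types concrete, the following minimal Python sketch extracts skipgrams and filters them by PoS pattern. It is an illustration under stated assumptions, not the authors' implementation: a skipgram is assumed here to be an ordered pair of tokens with at most `max_skips` tokens between them (so `max_skips=0` gives plain bigrams), and the function names, the coarse tag labels (`N`, `ADJ`, `P`) and the hand-tagged example phrase are all hypothetical.

```python
from typing import List, Set, Tuple

# A token is a (word, coarse PoS tag) pair. The tags are hand-assigned for
# this illustration; a real pipeline would obtain them from a PoS tagger.
Token = Tuple[str, str]


def skipgrams(tokens: List[Token], max_skips: int = 2) -> List[Tuple[Token, Token]]:
    """Return ordered token pairs with at most `max_skips` tokens between them.

    With max_skips=0 this reduces to ordinary bigrams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand token may be at most max_skips + 1 positions away.
        last = min(i + max_skips + 1, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            pairs.append((left, tokens[j]))
    return pairs


# Assumed filter: keep only noun-noun and adjective-noun combinations.
KEEP_PATTERNS: Set[Tuple[str, str]] = {("N", "N"), ("ADJ", "N")}


def pos_filter(pairs: List[Tuple[Token, Token]],
               keep: Set[Tuple[str, str]] = KEEP_PATTERNS) -> List[Tuple[str, str]]:
    """Keep the word pairs whose PoS tag pattern is in `keep`."""
    return [(left[0], right[0]) for left, right in pairs
            if (left[1], right[1]) in keep]


# Hypothetical hand-tagged phrase.
phrase = [("optical", "ADJ"), ("fibre", "N"), ("with", "P"),
          ("glass", "N"), ("core", "N")]

bigram_features = pos_filter(skipgrams(phrase, max_skips=0))
skipgram_features = pos_filter(skipgrams(phrase, max_skips=2))
```

In this toy phrase, the bigram setting keeps only the adjacent pairs "optical fibre" and "glass core", while allowing up to 2 skips additionally yields pairs such as "fibre core" that span intervening words. In a combined representation, such filtered pairs would be added to the unigram features of the document.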
