Using skipgrams and PoS-based feature selection for patent classification
Until recently, phrases were deemed suboptimal features for text classification because of their sparseness (Lewis 1992). Recent work (Koster et al. 2011, D’hondt et al. Forthcoming), however, found that for classifying English patent documents, combining phrasal and unigram representations leads to significantly better classification results, because phrases are better suited to capturing the Multi-Word Terms (MWTs) that abound in terminology-rich technical patent texts.
In this article, we consider the task of classifying English patent abstracts at the class level (about 120 classes) of the International Patent Classification (IPC). We compare (a) how well two types of phrases (bigrams and skipgrams) capture meaningful information; and (b) the impact of additionally filtering the classification features by their Part of Speech (PoS). For this purpose we performed a series of classification experiments using different phrasal text representations and feature selection settings to determine which representation is most beneficial to English patent classification. We further investigated which type of information (as captured by the PoS-filtered skipgrams) has the most impact during classification.
The results show that combining unigrams and PoS-filtered skipgrams leads to a significant improvement in classification scores over the unigram baseline. Additional experiments show that the most important phrasal features are bigrams, and that additional useful phrases can be captured by allowing at most 2 skips in the skipgram approach. Deeper analysis revealed that noun-noun combinations and, to a lesser extent, adjective-noun combinations are the most informative phrasal features for patent classification.
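To make the two feature types concrete, the following minimal Python sketch generates skipgrams with at most 2 skips and then filters them by PoS pattern. It is an illustration under stated assumptions, not the extraction pipeline used in the experiments: it assumes a skipgram is an ordered pair of words with up to `max_skips` intervening tokens (so that 0 skips yields ordinary bigrams), and the function names, the tag labels (`ADJ`, `NOUN`, `ADP`) and the hand-tagged example phrase are all hypothetical.

```python
from typing import List, Set, Tuple

TaggedToken = Tuple[str, str]  # (word, PoS tag); tags are hand-assigned here


def skipgrams(tokens: List[TaggedToken], max_skips: int = 2):
    """Return ordered token pairs with at most `max_skips` tokens between them.

    Assumption: with max_skips=0 this reduces to plain bigrams.
    """
    pairs = []
    for i in range(len(tokens)):
        # The partner token may be 1 .. max_skips + 1 positions to the right.
        for j in range(i + 1, min(i + max_skips + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs


def pos_filter(pairs, allowed: Set[Tuple[str, str]]):
    """Keep only word pairs whose (tag, tag) pattern is in `allowed`."""
    return [(a[0], b[0]) for a, b in pairs if (a[1], b[1]) in allowed]


# Hypothetical, hand-tagged example phrase.
tagged = [
    ("rotary", "ADJ"), ("combustion", "NOUN"), ("engine", "NOUN"),
    ("with", "ADP"), ("fuel", "NOUN"), ("injection", "NOUN"),
]

# Noun-noun and adjective-noun patterns, the two types discussed above.
allowed_patterns = {("NOUN", "NOUN"), ("ADJ", "NOUN")}

bigrams = skipgrams(tagged, max_skips=0)
all_skipgrams = skipgrams(tagged, max_skips=2)
filtered = pos_filter(all_skipgrams, allowed_patterns)

for pair in filtered:
    print(pair)
```

In this toy example the filter drops every pair involving the preposition "with", while pairs such as ("combustion", "fuel"), which a bigram representation would miss, are retained.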