Activating Qualified Thesaurus Terms for Automatic Indexing with Taxonomy-Based WSD
Many thesauri contain descriptors that consist of the term proper plus a bracketed suffix explaining the term’s intended interpretation. For instance, the MeSH thesaurus contains the term Polymorphism (Genetics). Depending on the thesaurus, such qualified terms account for 1% to 5% of all descriptors. For automatic indexing based on recognizing term occurrences in free text, these terms are practically useless, because free text rarely if ever contains term references of this form. A naive text annotation method that matches these terms with their bracketed qualifiers stripped off (the ‘bare’ terms) frequently produces wrong interpretations. We investigated to what extent short forms of qualified terms (e.g. Polymorphism) can be disambiguated by looking in their textual environment for concepts that are ontologically related to the represented concept (in this case, Genetic Polymorphism) or to the concept used as qualifier (Genetics).
Using the NLP framework of the Elsevier Fingerprint Engine®, we created a set-up to test disambiguation for 30 qualified terms from the NAL Thesaurus, which we annotated in approximately 1500 scientific abstracts from the agricultural domain found in Scopus®. Based on their ambiguity with respect to the NAL Thesaurus, we distinguished three groups of test terms: terms with unqualified homonyms, terms with qualified homonyms, and terms without homonyms inside the thesaurus. For all three groups, the best results (65-75% recall, 83-93% precision) are obtained when both the concept hosting the qualified term and the qualifier concept are used to identify supporting concepts in the term’s context. Like similar Word Sense Disambiguation (WSD) techniques, our approach is attractive because the system is informed by existing knowledge and therefore does not require large knowledge-intensive investments. At the same time, the system delivers reasonable precision. For these reasons, we will seek to refine it to improve recall.
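The disambiguation idea can be illustrated with a minimal Python sketch. It is not the Fingerprint Engine implementation. It assumes single-word bare terms, a simple token match, and a hypothetical lookup table RELATED that stands in for the thesaurus relations; the related words listed there are invented for illustration. An occurrence of the bare term is accepted only if the context contains at least one concept related to either the host concept or the qualifier concept.

```python
import re

# Hypothetical stand-in for thesaurus relations: concept -> related words.
# These entries are illustrative assumptions, not taken from MeSH or NAL.
RELATED = {
    "Genetic Polymorphism": {"allele", "genotype", "locus"},
    "Genetics": {"gene", "heredity", "genome"},
}

# Matches "Term (Qualifier)" and captures both parts.
QUALIFIED = re.compile(r"^(?P<bare>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_qualified(term):
    """Split 'Polymorphism (Genetics)' into ('Polymorphism', 'Genetics')."""
    match = QUALIFIED.match(term)
    if match is None:
        return term, None
    return match.group("bare"), match.group("qualifier")


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def accept_annotation(text, qualified_term, host_concept, related=RELATED):
    """Accept the bare term only if a supporting concept occurs in the text.

    Supporting concepts are those related to the host concept or to the
    qualifier concept (the combination reported as giving the best results).
    """
    bare, qualifier = split_qualified(qualified_term)
    tokens = tokenize(text)
    if bare.lower() not in tokens:
        return False  # the bare term does not occur at all
    cues = related.get(host_concept, set()) | related.get(qualifier, set())
    return bool(cues & (tokens - {bare.lower()}))


genetic = "The polymorphism at this locus was linked to a rare allele."
software = "Polymorphism lets a subclass override inherited methods."
print(accept_annotation(genetic, "Polymorphism (Genetics)", "Genetic Polymorphism"))
print(accept_annotation(software, "Polymorphism (Genetics)", "Genetic Polymorphism"))
```

In this sketch the first sentence is accepted because "locus" and "allele" support the genetic reading, while the second is rejected for lack of any supporting concept. Requiring such support is what trades some recall for precision.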