Automatic syllabification using segmental conditional random fields

Kseniya Rogova; Kris Demuynck; Dirk Van Compernolle

Authors

Kseniya Rogova KU Leuven
Kris Demuynck Ghent University
Dirk Van Compernolle KU Leuven

Abstract

In this paper we present a statistical approach for the automatic syllabification of phonetic word transcriptions. A syllable bigram language model forms the core of the system. Given the large number of syllables in non-syllabic languages, sparsity is the main issue, especially since the available syllabified corpora tend to be small. Traditional back-off mechanisms solve the sparsity problem only partially. In this work we use a set of features for back-off purposes: on the one hand, probabilities such as consonant cluster probabilities, and on the other hand, a set of rules based on generic syllabification principles such as legality, sonority and maximal onset. To combine these highly correlated features with the baseline bigram feature, we employ segmental conditional random fields (SCRFs) as the statistical framework. The resulting method is versatile: it can be applied to any language and to any amount of data.

The method was tested on various datasets in English and Dutch with dictionary sizes varying between 1 and 60 thousand words. For English, we obtained a word accuracy of 97.96% for supervised syllabification and 91.22% for unsupervised syllabification. When the top-2 generated syllabifications are included for a small fraction of the words, virtually perfect syllabification is obtained in supervised mode.
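
The idea of combining a syllable bigram score with rule-based features can be illustrated with a minimal sketch. It is not the authors' implementation: the phone inventory, the list of legal onsets, the feature names and the hand-set weights are all assumptions made for illustration, and a real SCRF would learn the weights from data rather than take them as input. The sketch enumerates every segmentation of a phone sequence in which each syllable contains exactly one vowel, computes three features per candidate (bigram log-probability, number of legal onsets, total onset length as a stand-in for maximal onset), and returns the candidate with the highest weighted sum.

```python
from itertools import combinations

# Toy inventories (assumptions for illustration, not taken from the paper).
VOWELS = {"a", "e", "i", "o", "u"}
LEGAL_ONSETS = {(), ("t",), ("r",), ("s",), ("k",),
                ("t", "r"), ("s", "t"), ("s", "t", "r")}


def candidates(phones):
    """Yield every segmentation in which each syllable has exactly one vowel."""
    n = len(phones)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            sylls = [tuple(phones[a:b]) for a, b in zip(bounds, bounds[1:])]
            if all(sum(p in VOWELS for p in s) == 1 for s in sylls):
                yield sylls


def onset(syll):
    """Return the consonants that precede the first vowel of a syllable."""
    out = []
    for p in syll:
        if p in VOWELS:
            break
        out.append(p)
    return tuple(out)


def features(sylls, bigram_logp, floor=-10.0):
    """Compute the feature values for one candidate syllabification."""
    padded = [("<s>",), *sylls, ("</s>",)]
    # Unseen syllable bigrams get a fixed floor value (a crude back-off).
    lm = sum(bigram_logp.get((a, b), floor) for a, b in zip(padded, padded[1:]))
    onsets = [onset(s) for s in sylls]
    return {
        "bigram": lm,
        "legal_onset": sum(o in LEGAL_ONSETS for o in onsets),
        "onset_len": sum(len(o) for o in onsets),
    }


def syllabify(phones, bigram_logp, weights):
    """Return the candidate with the highest weighted feature sum."""
    def score(sylls):
        f = features(sylls, bigram_logp)
        return sum(weights[name] * value for name, value in f.items())
    # Raises ValueError if the input contains no vowel (no valid candidate).
    return max(candidates(phones), key=score)


# Hand-set weights; an SCRF would estimate these from training data.
WEIGHTS = {"bigram": 1.0, "legal_onset": 5.0, "onset_len": 1.0}
```

With an empty bigram table, `syllabify(list("ekstra"), {}, WEIGHTS)` relies on the rule features alone and selects the split with the longest legal onset. Once the table contains the syllable bigrams of a competing split, the bigram feature can outweigh the rules, which is the kind of trade-off the learned weights are meant to settle.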
