Exploring the use of pre-trained ASR models for automatic assessment of children’s oral reading
Abstract
Dutch children’s reading skills have been declining consistently for many years. Oral reading fluency, a combination of decoding and word recognition skills, is a fundamental prerequisite for reading competence. Children’s oral reading fluency is often tested through oral word reading tasks, which are time-consuming to carry out as teachers have to administer the tests in a one-on-one setting, indicating word reading correctness on the fly. One possible way of alleviating this workload is to use automatic speech recognition (ASR) to aid in the assessment process. A key concern is that many ASR models struggle with children’s speech. We explored the performance of two pre-trained ASR models, Wav2Vec2.0-CGN and Faster-Whisper-v2, by having them carry out correctness judgements on an oral word reading task, using data from the Children’s Oral Reading Corpus (CHOREC). This corpus contains oral reading data of word lists from native Dutch-speaking primary school children aged 6-12 from Flanders. We compared the results of the ASR models to those of assessors in CHOREC using specificity, recall, accuracy, F1-score, and Matthews correlation coefficient (MCC) as agreement metrics. We then used two different methods to improve on the baseline results, post-correcting the ASR models’ correctness judgements using manually defined error categories. We found that allowing a deviation from the prompt by one error category yielded the best results for the overall metrics. Faster-Whisper-v2 (accuracy = .89; F1-score = .58; MCC = .54) outperformed Wav2Vec2.0-CGN (accuracy = .70; F1-score = .39; MCC = .38). The MCC values show that both ASR models had mild agreement with assessors. We expected the accuracy levels for both models to be lower than the lowest assessor inter-rater accuracy level (.86), but Faster-Whisper-v2 performed better than expected (.89).
However, one should be careful in interpreting this result, since the high accuracy scores are partially due to the imbalanced dataset. We conclude that the performance of standard pre-trained ASR models is promising, but that, given the current quality of the procedure, caution should be exercised in its use. Future research could aim to improve the performance of the whole procedure, e.g. through fine-tuning and validation, and through collaborative research with teachers.
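The agreement metrics named above can all be derived from the confusion matrix between assessor and ASR correctness judgements. As a minimal illustrative sketch (not code from this study, and with the choice of positive class left as an assumption), the computation looks like this:

```python
# Illustrative sketch: agreement metrics between two aligned lists of binary
# correctness judgements. Here 1 is taken as the "positive" class; which class
# counts as positive (correct vs. incorrect readings) is an assumption, not
# something specified by the abstract.
from math import sqrt

def agreement_metrics(reference, predicted):
    """Return specificity, recall, accuracy, F1-score, and MCC,
    treating `reference` (e.g. assessor judgements) as ground truth."""
    tp = sum(r == 1 and p == 1 for r, p in zip(reference, predicted))
    tn = sum(r == 0 and p == 0 for r, p in zip(reference, predicted))
    fp = sum(r == 0 and p == 1 for r, p in zip(reference, predicted))
    fn = sum(r == 1 and p == 0 for r, p in zip(reference, predicted))

    specificity = tn / (tn + fp) if tn + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(reference)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # MCC stays informative under class imbalance, unlike raw accuracy,
    # which is why the abstract warns against over-reading high accuracy.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return specificity, recall, accuracy, f1, mcc
```

On a heavily imbalanced set of judgements, accuracy can be high even when MCC is modest, which mirrors the caveat raised in the conclusion.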