Efficient harvesting of Internet audio for resource-scarce ASR
Date
2011
Author
Davel, Marelie H.
van Heerden, Charl
Kleynhans, Neil
Barnard, Etienne
Abstract
Spoken recordings that have been transcribed for human reading
(e.g. as captions for audiovisual material, or to provide alternative
modes of access to recordings) are widely available in many
languages. Such recordings and transcriptions have proven to be
a valuable source of ASR data in well-resourced languages, but
have not been exploited to a significant extent in under-resourced
languages or dialects. Techniques used to harvest such data typically
assume the availability of a fairly accurate ASR system,
which is generally not available when working with resource-scarce
languages. In this work, we define a process whereby an
ASR corpus is bootstrapped using unmatched ASR models in
conjunction with speech and approximate transcriptions sourced
from the Internet. We introduce a new segmentation technique
based on the use of a phone-internal garbage model, and demonstrate
how this technique (combined with limited filtering) can
be used to develop a large, high-quality corpus in an under-resourced
dialect with minimal effort.
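
The abstract mentions "limited filtering" of harvested segments but does not say how it is done. The following is only an illustrative sketch of one way such a filter might look, not the authors' method. It assumes each aligned segment carries a per-frame alignment score and that segments are kept when the score and duration fall within chosen bounds. The `Segment` class, the `filter_segments` function, the thresholds, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned audio segment (hypothetical structure for illustration)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # approximate transcription for this segment
    score: float  # assumed: average per-frame alignment log-likelihood


def filter_segments(segments, min_score, min_dur=1.0, max_dur=30.0):
    """Keep segments whose duration and alignment score are within bounds.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for seg in segments:
        duration = seg.end - seg.start
        if duration < min_dur or duration > max_dur:
            continue  # too short or too long to be a useful training utterance
        if seg.score < min_score:
            continue  # poor acoustic match: audio and text probably disagree
        kept.append(seg)
    return kept


# Made-up example data, for illustration only.
segments = [
    Segment(0.0, 4.2, "good morning and welcome", -62.5),
    Segment(4.2, 4.6, "uh", -60.0),
    Segment(4.6, 12.0, "the minister said today", -95.0),
]
kept = filter_segments(segments, min_score=-70.0)
```

In this made-up example only the first segment survives: the second is shorter than the minimum duration and the third scores below the threshold. In practice the thresholds would have to be tuned on the data at hand.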
URI
https://researchspace.csir.co.za/dspace/bitstream/handle/10204/5769/Davel_2011.pdf?sequence=1&isAllowed=y
http://hdl.handle.net/10394/26541
Collections
- Faculty of Engineering