dc.contributor.author | Davel, Marelie H. | |
dc.contributor.author | van Heerden, Charl | |
dc.contributor.author | Kleynhans, Neil | |
dc.contributor.author | Barnard, Etienne | |
dc.date.accessioned | 2018-03-07T07:42:10Z | |
dc.date.available | 2018-03-07T07:42:10Z | |
dc.date.issued | 2011 | |
dc.identifier.citation | Marelie H Davel, Charl Van Heerden, Neil Kleynhans and Etienne Barnard, “Efficient harvesting of Internet audio for resource-scarce ASR”, in Proc. Interspeech, pp 3153-3156, Florence, Italy, 2011. [http://engineering.nwu.ac.za/multilingual-speech-technologies-must/publications] | en_US |
dc.identifier.uri | https://researchspace.csir.co.za/dspace/bitstream/handle/10204/5769/Davel_2011.pdf?sequence=1&isAllowed=y | |
dc.identifier.uri | http://hdl.handle.net/10394/26541 | |
dc.description.abstract | Spoken recordings that have been transcribed for human reading
(e.g. as captions for audiovisual material, or to provide alternative
modes of access to recordings) are widely available in many
languages. Such recordings and transcriptions have proven to be
a valuable source of ASR data in well-resourced languages, but
have not been exploited to a significant extent in under-resourced
languages or dialects. Techniques used to harvest such data typically
assume the availability of a fairly accurate ASR system,
which is generally not available when working with resource-scarce
languages. In this work, we define a process whereby an
ASR corpus is bootstrapped using unmatched ASR models in
conjunction with speech and approximate transcriptions sourced
from the Internet. We introduce a new segmentation technique
based on the use of a phone-internal garbage model, and demonstrate
how this technique (combined with limited filtering) can
be used to develop a large, high-quality corpus in an under-resourced
dialect with minimal effort. | en_US |
dc.description.sponsorship | We would like to thank Bryan McAlister and his team, who performed
the initial data collection and processing; and the South
African Centre for High Performance Computing (CHPC), who
made their facilities available for these experiments. | en_US |
dc.language.iso | en | en_US |
dc.publisher | Interspeech 2011 | en_US |
dc.subject | Speech recognition | en_US |
dc.subject | Under-resourced languages | en_US |
dc.subject | Garbage modeling | en_US |
dc.title | Efficient harvesting of Internet audio for resource-scarce ASR | en_US |
dc.type | Presentation | en_US |