dc.contributor.author: Davel, Marelie H.
dc.contributor.author: van Heerden, Charl
dc.contributor.author: Kleynhans, Neil
dc.contributor.author: Barnard, Etienne
dc.date.accessioned: 2018-03-07T07:42:10Z
dc.date.available: 2018-03-07T07:42:10Z
dc.date.issued: 2011
dc.identifier.citation: Marelie H Davel, Charl Van Heerden, Neil Kleynhans and Etienne Barnard, “Efficient harvesting of Internet audio for resource-scarce ASR”, in Proc. Interspeech, pp 3153-3156, Florence, Italy, 2011. [http://engineering.nwu.ac.za/multilingual-speech-technologies-must/publications] [en_US]
dc.identifier.uri: https://researchspace.csir.co.za/dspace/bitstream/handle/10204/5769/Davel_2011.pdf?sequence=1&isAllowed=y
dc.identifier.uri: http://hdl.handle.net/10394/26541
dc.description.abstract: Spoken recordings that have been transcribed for human reading (e.g. as captions for audiovisual material, or to provide alternative modes of access to recordings) are widely available in many languages. Such recordings and transcriptions have proven to be a valuable source of ASR data in well-resourced languages, but have not been exploited to a significant extent in under-resourced languages or dialects. Techniques used to harvest such data typically assume the availability of a fairly accurate ASR system, which is generally not available when working with resource-scarce languages. In this work, we define a process whereby an ASR corpus is bootstrapped using unmatched ASR models in conjunction with speech and approximate transcriptions sourced from the Internet. We introduce a new segmentation technique based on the use of a phone-internal garbage model, and demonstrate how this technique (combined with limited filtering) can be used to develop a large, high-quality corpus in an under-resourced dialect with minimal effort. [en_US]
dc.description.sponsorship: We would like to thank Bryan McAlister and his team, who performed the initial data collection and processing; and the South African Centre for High Performance Computing (CHPC), who made their facilities available for these experiments. [en_US]
dc.language.iso: en [en_US]
dc.publisher: Interspeech 2011 [en_US]
dc.subject: Speech recognition [en_US]
dc.subject: Under-resourced languages [en_US]
dc.subject: Garbage modeling [en_US]
dc.title: Efficient harvesting of Internet audio for resource-scarce ASR [en_US]
dc.type: Presentation [en_US]

