Advanced natural language processing for improved prosody in text–to–speech synthesis

Schlünz, Georg Isaac

dc.contributor.advisor	Barnard, E.
dc.contributor.author	Schlünz, Georg Isaac
dc.date.accessioned	2014-06-09T11:35:21Z
dc.date.available	2014-06-09T11:35:21Z
dc.date.issued	2014
dc.identifier.uri	http://hdl.handle.net/10394/10634
dc.description	PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014	en_US
dc.description.abstract	Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices do not sound very natural, however, lacking in the areas of emphasis and emotion. These qualities are furthermore important to convey meaning and intent beyond that which can be achieved by the vocabulary of words only. Put differently, speech synthesis requires a more comprehensive analysis of its text input beyond the word level to infer the meaning and intent that elicit emphasis and emotion. The synthesised speech then needs to imitate the effects that these textual factors have on the acoustics of human speech. This research addresses these challenges by commencing with a literature study on the state of the art in the fields of natural language processing, text-to-speech synthesis and speech prosody. It is noted that the higher linguistic levels of discourse, information structure and affect are necessary for the text analysis to shape the prosody appropriately for more natural synthesised speech. Discourse and information structure account for meaning, intent and emphasis, and affect formalises the modelling of emotion. The OCC model is shown to be a suitable point of departure for a new model of affect that can leverage the higher linguistic levels. The audiobook is presented as a text and speech resource for the modelling of discourse, information structure and affect because its narrative structure is prosodically richer than the random constitution of a traditional text-to-speech corpus. A set of audiobooks is selected and phonetically aligned for subsequent investigation. The new model of discourse, information structure and affect, called e-motif, is developed to take advantage of the audiobook text.
It is a subjective model that does not specify any particular belief system in order to appraise its emotions, but defines only anonymous affect states. Its cognitive and social features rely heavily on the coreference resolution of the text, but this process is found not to be accurate enough to produce usable feature values. The research concludes with an experimental investigation of the influence of the e-motif features on human speech and synthesised speech. The aligned audiobook speech is inspected for prosodic correlates of the cognitive and social features, revealing that some activity occurs in the intonational domain. However, when the aligned audiobook speech is used in the training of a synthetic voice, the e-motif effects are overshadowed by those of structural features that come standard in the voice building framework.	en_US
dc.language.iso	en	en_US
dc.publisher	North-West University	en_US
dc.subject	Natural language processing	en_US
dc.subject	Text-to-speech synthesis	en_US
dc.subject	Prosody	en_US
dc.subject	Discourse	en_US
dc.subject	Information structure	en_US
dc.subject	Affect	en_US
dc.subject	OCC model	en_US
dc.subject	E-motif	en_US
dc.title	Advanced natural language processing for improved prosody in text–to–speech synthesis	en
dc.type	Thesis	en_US
dc.description.thesistype	Doctoral	en_US
dc.contributor.researchID	21021287 - Barnard, Etienne (Supervisor)

Files in this item

Name: Schlünz_GI.pdf
Size: 2.787Mb
Format: PDF

This item appears in the following Collection(s)

Natural and Agricultural Sciences [2708]
