dc.contributor.advisor: Barnard, E.
dc.contributor.author: Schlünz, Georg Isaac
dc.date.accessioned: 2014-06-09T11:35:21Z
dc.date.available: 2014-06-09T11:35:21Z
dc.date.issued: 2014
dc.identifier.uri: http://hdl.handle.net/10394/10634
dc.description: PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014 [en_US]
dc.description.abstract: Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices do not sound very natural, however, lacking in the areas of emphasis and emotion. These qualities are furthermore important to convey meaning and intent beyond that which can be achieved by the vocabulary of words only. Put differently, speech synthesis requires a more comprehensive analysis of its text input beyond the word level to infer the meaning and intent that elicit emphasis and emotion. The synthesised speech then needs to imitate the effects that these textual factors have on the acoustics of human speech. This research addresses these challenges by commencing with a literature study on the state of the art in the fields of natural language processing, text-to-speech synthesis and speech prosody. It is noted that the higher linguistic levels of discourse, information structure and affect are necessary for the text analysis to shape the prosody appropriately for more natural synthesised speech. Discourse and information structure account for meaning, intent and emphasis, and affect formalises the modelling of emotion. The OCC model is shown to be a suitable point of departure for a new model of affect that can leverage the higher linguistic levels. The audiobook is presented as a text and speech resource for the modelling of discourse, information structure and affect because its narrative structure is prosodically richer than the random constitution of a traditional text-to-speech corpus. A set of audiobooks is selected and phonetically aligned for subsequent investigation. The new model of discourse, information structure and affect, called e-motif, is developed to take advantage of the audiobook text.
It is a subjective model that does not specify any particular belief system in order to appraise its emotions, but defines only anonymous affect states. Its cognitive and social features rely heavily on the coreference resolution of the text, but this process is found not to be accurate enough to produce usable feature values. The research concludes with an experimental investigation of the influence of the e-motif features on human speech and synthesised speech. The aligned audiobook speech is inspected for prosodic correlates of the cognitive and social features, revealing that some activity occurs in the intonational domain. However, when the aligned audiobook speech is used in the training of a synthetic voice, the e-motif effects are overshadowed by those of structural features that come standard in the voice building framework. [en_US]
dc.language.iso: en [en_US]
dc.publisher: North-West University [en_US]
dc.subject: Natural language processing [en_US]
dc.subject: Text-to-speech synthesis [en_US]
dc.subject: Prosody [en_US]
dc.subject: Discourse [en_US]
dc.subject: Information structure [en_US]
dc.subject: Affect [en_US]
dc.subject: OCC model [en_US]
dc.subject: E-motif [en_US]
dc.title: Advanced natural language processing for improved prosody in text–to–speech synthesis [en]
dc.type: Thesis [en_US]
dc.description.thesistype: Doctoral [en_US]
dc.contributor.researchID: 21021287 - Barnard, Etienne (Supervisor)

