Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald

Boloka/Manakin Repository

dc.contributor.author Groenewald, Hendrik Johannes
dc.date.accessioned 2008-11-28T11:35:23Z
dc.date.available 2008-11-28T11:35:23Z
dc.date.issued 2006
dc.identifier.uri http://hdl.handle.net/10394/131
dc.description Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007.
dc.description.abstract A lemmatiser is an important component of various human language technology applications for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this lemmatiser produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an alternative approach to language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia, "Lemma-identifiseerder vir Afrikaans" 'Lemmatiser for Afrikaans'. In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increases as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia.
The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92,8% is obtained when training Lia with the best performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction.
dc.publisher North-West University
dc.subject Lemmatisation en
dc.subject Machine learning en
dc.subject Memory-based learning en
dc.subject Human language technology en
dc.subject Natural language processing en
dc.subject Computer engineering en
dc.subject TIMBL en
dc.subject Afrikaans en
dc.subject Morphology en
dc.title Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald en
dc.type Thesis en
dc.description.thesistype Masters


This item appears in the following Collection(s)

  • ETD@PUK [5164]
    This collection contains the original digitized versions of research conducted at the North-West University (Potchefstroom Campus)
