Outomatiese Setswana lemma-identifisering
Brits, Jeanetta Hendrina
MetadataShow full item record
Within the context of natural language processing, a lemmatiser is one of the most important core technology modules that has to be developed for a particular language. A lemmatiser reduces words in a corpus to the corresponding lemmas of the words in the lexicon. A lemma is defined as the meaningful base form from which other more complex forms (i.e. variants) are derived. Before a lemmatiser can be developed for a specific language, the concept "lemma" as it applies to that specific language should first be defined clearly. This study concludes that, in Setswana, only stems (and not roots) can act independently as words; therefore, only stems should be accepted as lemmas in the context of automatic lemmatisation for Setswana. Five of the seven parts of speech in Setswana could be viewed as closed classes, which means that these classes are not extended by means of regular morphological processes. The two other parts of speech (nouns and verbs) require the implementation of alternation rules to determine the lemma. Such alternation rules were formalised in this study, for the purpose of development of a Setswana lemmatiser. The existing Setswana grammars were used as basis for these rules. Therewith the precision of the formalisation of these existing grammars to lemmatise Setswana words could be determined. The software developed by Van Noord (2002), FSA 6, is one of the best-known applications available for the development of finite state automata and transducers. Regular expressions based on the formalised morphological rules were used in FSA 6 to create finite state transducers. The code subsequently generated by FSA 6 was implemented in the lemmatiser. The metric that applies to the evaluation of the lemmatiser is precision. On a test corpus of 1 000 words, the lemmatiser obtained 70,92%. In another evaluation on 500 complex nouns and 500 complex verbs separately, the lemmatiser obtained 70,96% and 70,52% respectively. Expressed in numbers the precision on 500 complex and simplex nouns was 78,45% and on complex and simplex verbs 79,59%. The quantitative achievement only gives an indication of the relative precision of the grammars. Nevertheless, it did offer analysed data with which the grammars were evaluated qualitatively. The study concludes with an overview of how these results might be improved in the future.
- ETD@PUK