Outomatiese Afrikaanse woordsoortetikettering / deur Suléne Pilon
Any community that wants to be part of technological progress has to ensure that its language or languages have the necessary human language technology resources. These resources include so-called "core technologies", among them part-of-speech taggers. The first part-of-speech tagger for Afrikaans is developed in this research project. It is indicated that three resources (a tag set, a tagging algorithm and annotated training data) are necessary for the development of such a part-of-speech tagger. Since none of these resources exist for Afrikaans, three objectives are formulated for this project, i.e. (a) to develop a linguistically accurate tag set for Afrikaans; (b) to determine which algorithm is the most effective one to use; and (c) to find an effective method for generating annotated Afrikaans training data. To reach the first objective, a unique and language-specific tag set was developed for Afrikaans. The resulting tag set is relatively large and consists of 139 tags. The level of specificity of the tag set can easily be adjusted to make the tag set smaller and less specific. After the development of the tag set, research is done on different approaches to, and techniques that can be used in, the development of a part-of-speech tagger. The available algorithms are evaluated against a set of prerequisites, and in doing so the most effective algorithm for the purposes of this project, TnT, is identified. Bootstrapping is then used to generate training data with the help of the TnT algorithm. This process yields 20,000 correctly annotated words, which provides the annotated training data, the third resource necessary for the development of a part-of-speech tagger. The tagger that is trained with 20,000 words reaches an accuracy of 85.87% when evaluated. The tag set is then simplified to thirteen tags in order to determine the effect that the size of the tag set has on the accuracy of the tagger.
With the reduced tag set, the tagger reaches an accuracy of 93.69%. The main conclusion of this study is that training data of 20,000 words is not enough for the Afrikaans TnT tagger to compete with other state-of-the-art taggers. The tagger and the data developed in this project can be used to generate even more training data in order to develop an optimally accurate Afrikaans TnT tagger. Different techniques might also lead to better results; therefore other algorithms should be tested.
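The comparison between the full and the simplified tag set can be illustrated with a minimal sketch of token-level accuracy. The tags below (for example `NSM` and `NSE`), the function names, and the rule that the first character of a tag gives its coarse category are all assumptions made for illustration; they are not the 139-tag set or the thirteen-tag mapping developed in the study.

```python
from typing import Callable, List, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)


def accuracy(
    gold: List[TaggedToken],
    predicted: List[TaggedToken],
    simplify: Callable[[str], str] = lambda tag: tag,
) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag.

    `simplify` is applied to both tags before comparison, so the same
    tagger output can be scored at different levels of tag specificity.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(
        1
        for (_, g), (_, p) in zip(gold, predicted)
        if simplify(g) == simplify(p)
    )
    return correct / len(gold)


def coarse(tag: str) -> str:
    """Hypothetical simplification: keep only the main word class,
    assumed here to be encoded in the first character of the tag."""
    return tag[0]


# Invented example data; the tags are placeholders, not the study's tag set.
gold = [("die", "LB"), ("honde", "NSM"), ("blaf", "VTH"), ("hard", "BS")]
pred = [("die", "LB"), ("honde", "NSE"), ("blaf", "VTH"), ("hard", "NSE")]

fine_score = accuracy(gold, pred)            # compares full tags
coarse_score = accuracy(gold, pred, coarse)  # compares main classes only
print(fine_score, coarse_score)
```

In this toy data the second token is tagged with the right main class but the wrong subclass, so it counts as an error under the fine-grained tags and as correct under the coarse ones. Errors of this kind are one reason a smaller tag set tends to give a higher accuracy figure for the same tagger output.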
- ETD@PUK