Tswana finite state tokenisation
Tswana, a Bantu language in the Sotho group, is characterised by an agglutinative morphology and a disjunctive orthography, which mainly affects the verb category. In particular, verbal prefixes are usually written disjunctively, while suffixes follow a conjunctive writing style. Therefore, Tswana tokenisation cannot be based solely on whitespace, as it can be in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two finite state tokeniser transducers and a finite state morphological analyser are combined to solve the Tswana (verb) tokenisation problem. The approach has the important advantage of bringing the processing of Tswana, beyond the morphological analysis level, in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. The tokenisation approach is novel and, when implemented and evaluated, yields an F$_1$-score of 95% with respect to a hand tokenised gold standard.
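As an illustration of how an F$_1$-score against a hand tokenised gold standard can be computed, the following minimal sketch compares token boundaries as character offsets, ignoring whitespace so that a token may contain internal spaces (as a disjunctively written verb would). The function names `token_spans` and `tokenisation_f1` and the placeholder strings are assumptions made for this example; the abstract does not state the exact scoring procedure used, so this is one plausible reading, not the paper's method.

```python
def token_spans(tokens):
    """Map tokens to (start, end) character offsets, ignoring all whitespace.

    A token may contain internal spaces; only its non-space characters
    advance the offset, so differently spaced segmentations stay comparable.
    """
    spans = set()
    pos = 0
    for tok in tokens:
        length = len("".join(tok.split()))  # drop whitespace inside the token
        spans.add((pos, pos + length))
        pos += length
    return spans


def tokenisation_f1(gold, predicted):
    """Return (precision, recall, F1) over exactly matching token spans."""
    gold_spans = token_spans(gold)
    pred_spans = token_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder strings (not real Tswana): the gold standard groups "ab cd"
# into one token, while a whitespace-only tokeniser splits it in two.
gold = ["ab cd", "ef"]
whitespace_only = ["ab", "cd", "ef"]
precision, recall, f1 = tokenisation_f1(gold, whitespace_only)
```

In this toy case only the final token matches the gold standard, so splitting on whitespace alone is penalised in both precision and recall.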
- Faculty of Humanities