Comparing support vector machine and multinomial naive Bayes for named entity classification of South African languages
Abstract
In this study, two classical machine learning algorithms, multinomial naive Bayes and support vector machines, are compared when applied to named entity recognition for two South African languages, Afrikaans and English.
The definition of a named entity was based on previous definitions and deliberations in the literature, as well as on the intended purpose of classifying sensitive personal information in textual data. For the purpose of this study, the best algorithm is the one that delivers accurate results while requiring the least time to train the classification model. A binary nominal class was selected for the classifiers, and the standard implementations of the algorithms were used; no parameter optimisation was done.
All the models achieved remarkable results in both ten-fold cross-validation and independent evaluations, with the support vector machine models outperforming the multinomial naive Bayes models. The multinomial naive Bayes models, however, required less time to train and would be better suited to low-resource implementations.
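
The evaluation protocol summarised above (a binary class, default settings, ten-fold cross-validation, and training time as a second criterion) can be illustrated with a minimal sketch. It is not the study's implementation: the toolkit and feature set used in the study are not stated here, so the sketch assumes a small pure-Python multinomial naive Bayes with Laplace smoothing, hypothetical token features (`token_features`), and a made-up toy list of tokens (`SAMPLES`) labelled `NE` or `O`. The support vector machine side of the comparison is omitted for brevity.

```python
import math
import time
from collections import Counter, defaultdict


def token_features(token):
    """Hypothetical surface features for one token (an assumption, not the study's)."""
    return [
        "cap" if token[:1].isupper() else "nocap",
        "suffix=" + token[-2:].lower(),
        "long" if len(token) > 6 else "short",
    ]


def train_mnb(samples, alpha=1.0):
    """Fit multinomial naive Bayes on (feature_list, label) pairs with Laplace smoothing."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in samples:
        class_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    total = sum(class_counts.values())
    return {
        "alpha": alpha,
        "vocab_size": len(vocab),
        "log_priors": {c: math.log(n / total) for c, n in class_counts.items()},
        "counts": feature_counts,
        "totals": {c: sum(feature_counts[c].values()) for c in class_counts},
    }


def predict_mnb(model, features):
    """Return the label with the highest log posterior score."""
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_priors"].items():
        denom = model["totals"][label] + model["alpha"] * model["vocab_size"]
        score = log_prior
        for feature in features:
            score += math.log((model["counts"][label][feature] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(samples, k=10):
    """k-fold cross-validation; returns (accuracy, total training time in seconds)."""
    folds = [samples[i::k] for i in range(k)]
    correct, train_seconds = 0, 0.0
    for i, test_fold in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        start = time.perf_counter()
        model = train_mnb(train)
        train_seconds += time.perf_counter() - start
        correct += sum(predict_mnb(model, feats) == label for feats, label in test_fold)
    return correct / len(samples), train_seconds


# Made-up toy data for illustration only: binary class, NE (named entity) vs O (other).
ENTITIES = ["Pretoria", "Johannesburg", "Thabo", "Annelie", "Kaapstad",
            "Durban", "Sipho", "Stellenbosch", "Pieter", "Naledi"]
OTHERS = ["the", "huis", "running", "en", "water",
          "mooi", "quickly", "boek", "table", "praat"]
SAMPLES = ([(token_features(t), "NE") for t in ENTITIES]
           + [(token_features(t), "O") for t in OTHERS])

if __name__ == "__main__":
    accuracy, seconds = cross_validate(SAMPLES, k=10)
    print(f"ten-fold accuracy: {accuracy:.2f}, training time: {seconds:.6f} s")
```

Running the same `cross_validate` loop with a second classifier in place of `train_mnb` and `predict_mnb` would give the two quantities the comparison rests on: accuracy and training time.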