Efficient training of support vector machines and their hyperparameters
Van Heerden, Charl Johannes
As digital computers become increasingly powerful and ubiquitous, there is a growing need for pattern-recognition algorithms that can handle very large data sets. Support vector machines (SVMs), which are generally viewed as the most accurate classifiers for general-purpose pattern recognition, are somewhat problematic in this respect: as for all classifiers that employ hyperparameters, the behavior of SVMs depends strongly on the particular choice of hyperparameter values, and popular approaches to training SVMs require computationally expensive grid searches to choose these parameters appropriately [1, 2]. Our main objective is therefore to find more efficient ways to train SVM hyperparameters.

We also show that, on non-separable datasets, SVMs do not behave like large margin classifiers. This observation in turn leads us to explore algorithms that do not employ a margin term. Since one of the SVM hyperparameters is a regularization parameter that controls the relative contribution of the margin term and the sum of misclassifications, dropping the margin term leaves one less hyperparameter to be trained.

Because the traditional grid search approach to finding good hyperparameters is expensive yet widely used, we investigate ways in which the hyperparameters can be trained more efficiently, as well as alternative algorithms that are similar to SVMs but have fewer hyperparameters to find. With this goal in mind, we first investigate the scaling and asymptotic behavior of popular SVM hyperparameters on non-separable datasets. We find that the scale factor of the radial basis function (RBF) kernel depends only weakly on the size of the training set, and that the regularization parameter C must assume relatively large values for accurate classification to be achieved.
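As an illustration of why the conventional grid search is expensive (this sketch is not from the thesis; the grid sizes and fold count are illustrative assumptions), the number of full training runs grows multiplicatively with each hyperparameter's grid:

```python
import itertools
import numpy as np

# Illustrative log-spaced grids for the regularization parameter C and
# the RBF scale factor gamma (values chosen for the example, not the thesis).
C_grid = np.logspace(-2, 6, 9)      # 9 candidate C values
gamma_grid = np.logspace(-4, 2, 7)  # 7 candidate RBF scale factors

# Each (C, gamma) pair requires a k-fold cross-validated training run,
# so the total number of SVM trainings is |C_grid| * |gamma_grid| * k.
k_folds = 5
n_trainings = sum(k_folds for _ in itertools.product(C_grid, gamma_grid))
print(n_trainings)  # 9 * 7 * 5 = 315 separate SVM trainings
```

Since each individual SVM training can itself scale super-linearly in the number of samples, this multiplicative factor dominates the overall cost on large datasets.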
The observation regarding C is true for all datasets considered in the thesis when a linear kernel is employed, while for RBF kernels the evidence is not as strong. The preference for large C casts doubt on the large margin classifier (LMC) tag often associated with SVMs, especially with linear kernels. Further investigation confirms our suspicion that minimization of an error term, rather than maximization of the inter-class margin, is responsible for the widely acknowledged excellence of SVM classifiers.

These insights suggest two different approaches to reducing overall SVM training time: SVM hyperparameter training on reduced training sets, and stochastic optimization of a simplified criterion function. Hyperparameter training on reduced training sets is further enhanced by a heuristic for the choice of the RBF scale factor. This enables us to propose a hyperparameter selection algorithm that performs as well as the conventional SVM approach on all classification problems considered in this thesis, while reducing the required training time by several orders of magnitude. Our second approach, stochastic optimization of a simplified criterion, is slightly less accurate on some problems, but reduces the overall training time even further. With training sets consisting of tens of thousands of samples, efficient hyperparameter selection for standard SVMs is the method of choice. Looking to the future, where training-set sizes will inevitably continue to increase, methods such as our stochastic approach will become preferable for a growing proportion of practical problems.
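The margin-free idea can be sketched as stochastic subgradient descent on the plain hinge loss, with no regularization term and therefore no C to tune. The following is a minimal NumPy sketch on synthetic data (the data, step size, and epoch count are illustrative assumptions, not the thesis's actual algorithm or datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian blobs with labels in {-1, +1}.
n = 200
X = np.vstack([rng.normal(-1.0, 0.8, (n // 2, 2)),
               rng.normal(+1.0, 0.8, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

# Stochastic subgradient descent on sum_i max(0, 1 - y_i (w.x_i + b)).
# There is no margin (||w||^2) term, hence no regularization parameter C.
w = np.zeros(2)
b = 0.0
eta = 0.01  # fixed step size, an illustrative choice
for epoch in range(50):
    for i in rng.permutation(n):
        if y[i] * (X[i] @ w + b) < 1.0:  # sample violates the unit margin
            w += eta * y[i] * X[i]
            b += eta * y[i]

accuracy = np.mean(np.sign(X @ w + b) == y)
```

Because the update touches one sample at a time, a pass over the data is linear in the training-set size, which is what makes this style of optimization attractive as datasets grow.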
- Engineering