Classifying recognised speech with deep neural networks
Abstract
We investigate whether word embeddings learned with deep neural networks can assist in the analysis of text produced by a speech-recognition system. In particular, we develop algorithms to identify which words are incorrectly detected by a speech-recognition system in broadcast news. The multilingual corpus used in this investigation contains speech from the eleven official South African languages, as well as Hindi. Popular word embedding algorithms such as word2vec and fastText are investigated and compared with context-specific embedding representations such as doc2vec and with non-context-specific statistical sentence embedding methods such as term frequency-inverse document frequency (TF-IDF), which is used as our baseline method. These embedding methods are then used as fixed-length input representations for a logistic regression classifier and a feedforward neural network classifier. The output is used as an additional categorical input feature to a CatBoost classifier to determine whether the words were correctly recognised. Other methods are also investigated, including one that uses the word embeddings themselves and the cosine similarity between specific keywords to identify whether a specific keyword was correctly detected. When relying only on the speech-text data, the best result was obtained using the TF-IDF document embeddings as input features to a feedforward neural network. Adding the output from the feedforward neural network as an additional feature to the CatBoost classifier did not enhance the classifier’s performance compared to using only the non-textual information provided, although adding the output from a weaker classifier was somewhat beneficial.
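As an illustration of two ideas mentioned in the abstract, fixed-length TF-IDF representations and cosine similarity, the following is a minimal pure-Python sketch on toy data. It is not the implementation used in this work: the function names, the example transcripts, and the smoothed IDF variant are assumptions made for illustration, and the abstract applies cosine similarity to word embeddings of keywords, whereas here it is applied to TF-IDF document vectors for simplicity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one fixed-length TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF (one common variant, assumed here for illustration).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical recognised-speech snippets, invented for this example.
transcripts = [
    "the minister announced the budget today",
    "the budget was announced by the minister",
    "rain is expected along the coast tomorrow",
]
vocab, vectors = tfidf_vectors(transcripts)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Every vector has the same length as the vocabulary, which is what allows it to serve as a fixed-length input to a classifier such as logistic regression or a feedforward network. In this toy data the two budget-related transcripts should score a higher similarity than the budget and weather transcripts.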