dc.contributor.advisor | Huisman, H.M. | |
dc.contributor.advisor | Drevin, G.R. | |
dc.contributor.author | Van Zyl, Petrus Andries | |
dc.date.accessioned | 2018-04-24T09:26:09Z | |
dc.date.available | 2018-04-24T09:26:09Z | |
dc.date.issued | 2015 | |
dc.identifier.uri | http://hdl.handle.net/10394/26820 | |
dc.description | MSc (Computer Science), North-West University, Potchefstroom Campus, 2016 | en_US |
dc.description.abstract | The automatic extraction and handling of information contained on invoice documents holds major
benefits for many businesses, as it could save resources that would otherwise have
been spent on manual extraction. Document Analysis and Recognition (DAR) is a process that
makes use of Optical Character Recognition (OCR) for the recognition and analysis of the
contents of physical documents in order to digitally extract and process the information. It consists
of four steps, namely pre-processing, layout analysis, text recognition, and post-processing.
Pre-processing is used to improve the overall quality of a document image in order to prepare it
for the steps that follow. Techniques used for pre-processing have a direct influence on the
resulting OCR accuracy, as any small deficiencies that pass through this stage are carried through
the rest of the OCR process and ultimately recognized incorrectly. A significant contribution can
be made to the relevant research areas and business communities by revealing which pre-processing
techniques are the most effective for the analysis and recognition of invoice
documents.
To approach this problem, an exploratory study was first conducted. Case studies were
used, during which the owners and CEOs of five DAR-related companies were interviewed.
Transcription and content analysis of these semi-structured interviews allowed prevalent themes
to emerge from the data.
The second study was an experimental investigation. The experiments conducted involved taking
a number of invoice document images, performing various pre-processing techniques on the
images, and measuring the effect of the techniques on the recognition rates. By acquiring the
recognition rates of the different techniques, it was possible to quantitatively compare the
techniques with each other.
It was revealed that many businesses in the DAR industry make use of the same business
process. Much was learnt about the DAR-related software used in the industry, how Intelligent
Character Recognition (ICR) should be approached, and what the best scanning practices are. It
was also discovered that the use of paper-based information and the need for the electronic
processing thereof are increasing, thereby securing the future of the industry. Regarding the
efficiency of pre-processing techniques, it was successfully revealed that some techniques do
perform better than others. In addition, many findings were made regarding the functioning of
some of the techniques used in the experiments. | en_US |
dc.description.sponsorship | National Research Foundation (NRF) | en_US |
dc.language.iso | en | en_US |
dc.publisher | North-West University (South Africa), Potchefstroom Campus | en_US |
dc.subject | Optical character recognition | en_US |
dc.subject | Intelligent character recognition | en_US |
dc.subject | Document analysis and recognition | en_US |
dc.subject | Pre-processing | en_US |
dc.subject | Noise reduction | en_US |
dc.subject | Binarization | en_US |
dc.subject | Exploratory study | en_US |
dc.subject | Experimental investigation | en_US |
dc.subject | Ground truth text | en_US |
dc.title | Evaluation of pre-processing techniques for the analysis and recognition of invoice documents | en_US |
dc.type | Thesis | en_US |
dc.description.thesistype | Masters | en_US |
dc.contributor.researchID | 10066896 - Huisman, Hester Magrietha (Supervisor) | |
dc.contributor.researchID | 10063374 - Drevin, Günther Richard (Supervisor) | |