Classifying spam with generalized additive neural networks
E-mail is an important and convenient communication tool used by many people on a daily basis. For individuals it is an inexpensive way to stay in contact with family and friends around the world. An e-mail address also serves as an online identity when signing up for online services such as social media (Facebook) and social networking (LinkedIn). Companies use e-mail to facilitate communication between employees and to communicate with their clients by sending information such as newsletters, invoice statements and promotional content. E-mail is also used for core business marketing. Unfortunately, some of the benefits of e-mail, such as the ability to send mass e-mails with little effort and at minimal cost to the sender, are abused by users known as spammers. A spammer's incentive for sending unsolicited e-mails in large quantities to an indiscriminate set of recipients is mostly revenue generation. Most spam messages promote products and services, and may be scams or phishing attempts to steal sensitive user information such as banking details and passwords. Currently, more than 55% of all e-mail network traffic consists of unsolicited spam e-mails, which clutter users' inboxes. Traditional spam-filtering approaches have thus far been unsuccessful in solving the spam problem. This is partly because spammers regularly generate new message content, making it difficult for spam filters to classify spam according to a fixed pattern. The main purpose of this study is to determine the feasibility of employing a generalized additive neural network (GANN), built with a specific automated construction algorithm, to filter spam e-mail messages. The GANN is a relatively new supervised machine learning technique capable of recognising complex patterns in data and of adapting to changes over time.
GANN models are suggested for classification problems where it is important to understand the relationship between the input attributes and the expected target value. In this study the definition of spam, the consequences of unmanaged spam and current spam-filtering techniques are investigated. The current state of the spam problem is summarised, followed by a discussion of artificial neural networks with pattern recognition capabilities. Literature related to the GANN is reviewed, with a discussion of both the interactive and the automated construction methodologies. The automated methodology is considered as a possible spam filter to help mitigate the spam problem. A number of spam-filtering experiments are conducted on five publicly available spam corpora (Enron, GenSpam, PU1, SpamAssassin and TREC2005), each with different pre-processing techniques and evaluation measures. The Bagging and Boosting ensemble techniques, which may improve on the GANN's results, are also considered. The GANN and the ensembles are first compared to other spam-filtering techniques applied to the five corpora and then to each other. Results show that the GANN is a feasible spam filter able to mitigate spam e-mails. It compares well to other spam-filtering techniques found in the literature. In addition, both ensemble methods improve on the GANN's results in most cases.
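
The interpretability mentioned above comes from the additive structure of a GANN: each input attribute is passed through its own small univariate network, and the outputs are summed before a link function is applied. The Python sketch below illustrates only that forward pass. It is a minimal illustration, not the construction algorithm or the models used in this study. The function name `gann_forward`, the tanh hidden units, the skip-layer weight, the logistic link and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic link mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def gann_forward(X, params, bias):
    """Forward pass of an illustrative GANN (hypothetical helper).

    X      : array of shape (n_samples, n_inputs)
    params : one tuple (w_in, b_in, w_out, w_skip) per input, where
             w_in, b_in and w_out have shape (n_hidden,) and w_skip is a scalar
    bias   : scalar added to the summed univariate functions

    Returns the predicted probabilities and the per-input contributions f_j(x_j).
    """
    contributions = []
    for j, (w_in, b_in, w_out, w_skip) in enumerate(params):
        x = X[:, j:j + 1]                         # column j, shape (n, 1)
        hidden = np.tanh(x * w_in + b_in)         # shape (n, n_hidden)
        f_j = hidden @ w_out + w_skip * X[:, j]   # univariate function of x_j only
        contributions.append(f_j)
    contributions = np.column_stack(contributions)  # shape (n, n_inputs)
    logit = bias + contributions.sum(axis=1)        # additive predictor
    return sigmoid(logit), contributions

# Illustrative data: 4 messages described by 3 made-up numeric attributes.
rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 2
X = rng.normal(size=(4, n_inputs))
params = [
    (rng.normal(size=n_hidden), rng.normal(size=n_hidden),
     rng.normal(size=n_hidden), rng.normal())
    for _ in range(n_inputs)
]
bias = 0.1

probs, contributions = gann_forward(X, params, bias)
```

Because each column of `contributions` depends on a single input, plotting that column against its attribute would show how the attribute alone shifts the predicted spam score, which is the kind of input-target relationship the text refers to.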