Clustering Spam Emails Using Hadoop and FP-Trees



Kirillov, Danil




Spam email is a persistent and ongoing problem. The sheer volume of spam makes analysing it, for example for campaign identification, very tedious. A spam email cluster is an abstraction of numerous spam emails, so clustering aids analysis by allowing emails to be examined in bulk. One such clustering algorithm, proposed by Han et al., is built around a data model called the FP-Tree. We have implemented it in a practical system. We also adapted the algorithm to harness the storage and processing capabilities of Hadoop, a distributed computing framework, and improved it to produce fewer but more dissimilar clusters. Extensive experiments and evaluations were conducted. Results show that our implementation performs better in most circumstances than an original implementation by Dihn et al. This thesis presents the system design, improvements, and evaluation of our implementation of the FP-Tree based spam email clustering algorithm.


Computer Science




Carleton University

Thesis Degree Name: Master of Computer Science

Thesis Degree Level: 

Thesis Degree Discipline: Computer Science

Parent Collection: Theses and Dissertations

Items in CURVE are protected by copyright, with all rights reserved, unless otherwise indicated. They are made available with permission from the author(s).