Creator:
Date:
Abstract:
Spam email is a persistent and ongoing problem. The sheer volume of spam makes analysing individual messages tedious, for example when trying to identify spam campaigns. A spam email cluster is an abstraction of numerous spam emails, so clustering helps by allowing them to be analysed in bulk. One such clustering algorithm, proposed by Han et al., is built around a data structure called the FP-Tree. We have implemented it in a practical system. We also adapted the algorithm to harness the storage and processing capabilities of Hadoop, a distributed computing framework, and improved it to produce fewer but more dissimilar clusters. Extensive experiments and evaluations were conducted. Results show that our implementation performs better in most circumstances than an original implementation by Dihn et al. This thesis presents the system design, improvements, and evaluation of our implementation of the FP-Tree based spam email clustering algorithm.