Clustering Spam Emails Using Hadoop and FP-Trees
- Abstract
Spam email is a persistent and ongoing problem. The sheer volume of spam makes analysing it tedious, for example when identifying spam campaigns. A spam email cluster is an abstraction of numerous spam emails, so clustering allows them to be analysed in bulk. One such clustering algorithm, proposed by Han et al., is built around a data model called the FP-Tree. We have implemented it in a practical system and adapted it to harness the storage and processing capabilities of Hadoop, a distributed computing framework. We also improved the algorithm to produce fewer but more dissimilar clusters. Extensive experiments and evaluations were conducted. The results show that our implementation performs better in most circumstances than the original implementation by Dinh et al. This thesis presents the system design, improvements, and evaluation of our implementation of the FP-Tree based spam email clustering algorithm.
- Rights Notes
Copyright © 2017 the author(s). Theses may be used for non-commercial research, educational, or related academic purposes only. Such uses include personal study, research, scholarship, and teaching. Theses may only be shared by linking to Carleton University Institutional Repository and no part may be used without proper attribution to the author. No part may be used for commercial purposes directly or indirectly via a for-profit platform; no adaptation or derivative works are permitted without consent from the copyright owner.
- Date Created
- 2017
Items
Title | Date Uploaded | Visibility |
---|---|---|
kirillov-clusteringspamemailsusinghadoopandfptrees.pdf | 2023-05-05 | Public |