Automated Discovery of Big Data Workload Types



Shahmirza, Anousheh




Big data workload characterization is an integral part of big data workload prediction and of auto-tuning big data applications. Because big data frameworks and applications can be applied in many different ways, their workloads can belong to various categories. In this research, clustering techniques are applied to detect Apache Spark and Hadoop workloads independently of historical data. The clustering techniques are compared using several evaluation metrics, and those with the highest performance are presented. The DBSCAN algorithm showed the best performance and adequacy, with 71% Purity and 80% Windows Type Accuracy (Awt). The Incremental DBSCAN algorithm and Den-Stream (an online version of DBSCAN) are then presented as the most practical methods for automating big data workload discovery. Finally, a scheme is provided that integrates these algorithms with methods for self-discovering their hyperparameters, so that the procedure is fully automated.
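As an illustration of the Purity metric named above, the following is a minimal sketch using the standard textbook definition: each cluster is credited with its most common true label, and the credited points are divided by the total. It is not taken from the thesis. The function name `purity`, the sample labels, and the choice to treat DBSCAN-style noise points (label -1) as one ordinary cluster are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, true_labels):
    """Return clustering purity: the fraction of points that carry the
    majority true label of the cluster they were assigned to."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    if len(cluster_labels) == 0:
        raise ValueError("at least one point is required")

    # Count true labels inside each cluster.
    # Assumption: noise (label -1) is treated like any other cluster.
    members = defaultdict(Counter)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster][truth] += 1

    # Each cluster contributes the size of its majority class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_labels)


# Hypothetical cluster assignments and workload types, for illustration only.
clusters = [0, 0, 0, 1, 1, -1]
truth = ["spark", "spark", "hadoop", "hadoop", "hadoop", "spark"]
score = purity(clusters, truth)
print(f"purity = {score:.3f}")
```

Purity rewards many small clusters (one cluster per point always scores 1.0), which is one reason to read it alongside a second metric, as the abstract does with Awt.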


Computer Science




Carleton University

Thesis Degree Name: 

Master of Computer Science

Thesis Degree Level: 


Thesis Degree Discipline: 

Computer Science

Parent Collection: 

Theses and Dissertations

Items in CURVE are protected by copyright, with all rights reserved, unless otherwise indicated. They are made available with permission from the author(s).