Automated Discovery of Big Data Workload Types

Abstract
  • Big data workload characterization is an integral part of big data workload prediction and of auto-tuning big data applications. Because big data frameworks and applications can be applied in many different ways, these workloads can belong to various categories. This research applies clustering techniques to detect Apache Spark and Hadoop workload types without relying on historical data. The clustering techniques are compared across several evaluation metrics, and the best-performing ones are presented. The DBSCAN algorithm showed the best performance and adequacy, with a Purity of 71% and a Windows Type Accuracy (Awt) of 80%. The Incremental DBSCAN algorithm and Den-Stream (an online version of DBSCAN) are then presented as the most practical methods for automating big data workload discovery. Finally, a scheme is provided that integrates these algorithms with methods for self-discovering their hyperparameters, so that the procedure is fully automated.
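
The Purity metric cited in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of purity (the fraction of points whose true label matches the majority label of their assigned cluster); the thesis may define or weight it differently. The function name `purity`, the workload names, and the choice to treat DBSCAN's noise label (-1) as an ordinary cluster are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, true_labels):
    """Return the fraction of points that carry the majority true label
    of the cluster they were assigned to (standard purity definition)."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have equal length")
    if not cluster_labels:
        raise ValueError("at least one point is required")

    # Count the true labels observed inside each cluster.
    members = defaultdict(Counter)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster][truth] += 1

    # Sum the size of the majority class in every cluster.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in members.values()
    )
    return majority_total / len(cluster_labels)


# Hypothetical cluster assignments; -1 mimics a DBSCAN noise point and is
# treated here as its own cluster (an assumption, not the thesis's rule).
clusters = [0, 0, 0, 1, 1, -1]
workloads = ["sort", "sort", "wordcount", "pagerank", "pagerank", "sort"]
score = purity(clusters, workloads)
```

In this toy input, five of the six points match the majority label of their cluster, so the score is 5/6. A real evaluation would take the cluster labels from a DBSCAN run over workload feature vectors rather than from a hand-written list.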

Rights Notes
  • Copyright © 2021 the author(s). Theses may be used for non-commercial research, educational, or related academic purposes only. Such uses include personal study, research, scholarship, and teaching. Theses may only be shared by linking to Carleton University Institutional Repository and no part may be used without proper attribution to the author. No part may be used for commercial purposes directly or indirectly via a for-profit platform; no adaptation or derivative works are permitted without consent from the copyright owner.

Date Created
  • 2021
