Automated Discovery of Big Data Workload Types

Abstract

Big data workload characterization is an essential part of big data workload prediction and of auto-tuning big data applications. Because big data frameworks and applications can be applied in many different ways, these workloads can belong to various categories. In this research, clustering techniques are applied to detect Apache Spark and Hadoop workloads independent of historical data. The techniques are compared using several evaluation metrics, and the best-performing ones are introduced. The DBSCAN algorithm showed the best performance and adequacy, with a Purity of 71% and a Windows Type Accuracy (Awt) of 80%. The Incremental DBSCAN algorithm and Den-Stream (an online version of DBSCAN) are then presented as the most practical methods for automating big data workload discovery. Finally, a scheme is provided that integrates these algorithms with methods for self-discovering their hyperparameters, so that the procedure is fully automated.
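As an illustration of the Purity metric named in the abstract, the following is a minimal sketch that assumes the standard definition (the fraction of samples that belong to the majority class of their cluster); the thesis itself may define or compute it differently. The function name `purity`, the `noise_label` parameter, and the workload names are hypothetical and chosen only for this example. DBSCAN-style algorithms commonly mark unclustered points with the label -1, and how such points are scored is an assumption here.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, true_labels, noise_label=None):
    """Fraction of samples that match the majority class of their cluster.

    Assumption: if noise_label is given, points carrying that label stay in
    the denominator but are never counted as correctly grouped.
    """
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    if not cluster_labels:
        raise ValueError("at least one sample is required")

    # Count how often each true class appears inside each cluster.
    members = defaultdict(Counter)
    for cluster, truth in zip(cluster_labels, true_labels):
        if cluster == noise_label:
            continue
        members[cluster][truth] += 1

    # Each cluster contributes the size of its largest class.
    matched = sum(max(counts.values()) for counts in members.values())
    return matched / len(cluster_labels)


# Hypothetical cluster assignments (-1 = noise) and workload types.
clusters = [0, 0, 0, 1, 1, 1, -1, 1]
workloads = ["sort", "sort", "join", "pagerank",
             "pagerank", "pagerank", "join", "sort"]

print(purity(clusters, workloads))                  # noise treated as a cluster
print(purity(clusters, workloads, noise_label=-1))  # noise treated as a miss
```

Treating noise as a miss gives a stricter score, which may be closer to what is wanted when every window of a workload trace is expected to receive a type.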

Rights Notes

Copyright © 2021 the author(s). Theses may be used for non-commercial research, educational, or related academic purposes only. Such uses include personal study, research, scholarship, and teaching. Theses may only be shared by linking to Carleton University Institutional Repository and no part may be used without proper attribution to the author. No part may be used for commercial purposes directly or indirectly via a for-profit platform; no adaptation or derivative works are permitted without consent from the copyright owner.

Title	Date Uploaded	Visibility
shahmirza-automateddiscoveryofbigdataworkloadtypes.pdf	2023-05-05	Public