Creator:
Date:
Abstract:
Big data workload characterization is an essential part of big data workload prediction and of auto-tuning big data applications. Because big data frameworks and applications can be used in many different ways, these workloads can belong to a wide variety of categories. In this research, clustering techniques are applied to detect Apache Spark and Hadoop workloads independently of historical data. The techniques are compared on several evaluation metrics, and the best-performing ones are presented. The DBSCAN algorithm shows the best performance and adequacy, with 71% Purity and 80% Windows Type Accuracy (Awt). The Incremental DBSCAN algorithm and Den-Stream (an online version of DBSCAN) are then presented as the most practical methods for automating big data workload discovery. Finally, a scheme is provided that integrates these algorithms with methods for self-discovering their hyperparameters, so that the procedure is fully automated.
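
For readers unfamiliar with the Purity metric reported above, the following is a minimal sketch of the standard definition: each cluster is credited with the size of its largest ground-truth class, and the total is divided by the number of points. The function name, the workload type names, and the cluster ids are hypothetical and chosen only for illustration. Counting the DBSCAN noise label (-1) as one more cluster is an assumption, and the paper may handle noise points differently.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one point is required")

    # Count the ground-truth labels that fall into each cluster.
    members = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster][truth] += 1

    # Each cluster contributes the size of its largest ground-truth class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Hypothetical workload types and cluster ids, for illustration only.
# Assumption: the noise label -1 is counted as one more cluster.
truth = ["sort", "sort", "sort", "join", "join", "pagerank", "pagerank", "pagerank"]
clusters = [0, 0, 1, 1, 1, 2, 2, -1]
score = purity(truth, clusters)
```

A purity of 1.0 means every cluster contains a single workload type. Purity tends to rise as the number of clusters grows, which is one reason to report it alongside a second metric.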