Statistical Evaluation of Malware Classification Algorithms



Zhu, Lu




Classifying malware with learning algorithms is common in the information security community. In this thesis, the performance of five learning algorithms on malware classification is evaluated statistically.

The study is based on the malicious file collection released by Microsoft, in which 10K labeled malware instances (250 GB) were provided. Following the work of Ahmadi et al. (2016a), 1801 features in 13 feature categories were extracted, and the volume of the extracted data set was reduced to 90 MB.

Five learning algorithms were run on the reduced data set and on a standardized data set, and were evaluated for accuracy and logloss. Multivariate analysis of variance (MANOVA), univariate analysis of variance (ANOVA), and interaction plots were used to assess the performance of the algorithms while controlling for the effect of the data set used. The analyses showed that XGBoost was the best classification algorithm in terms of both accuracy and logloss.
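The two evaluation measures named above can be illustrated with a minimal sketch. It assumes the usual definitions: accuracy as the fraction of correctly labeled instances, and multiclass logloss as the mean negative log of the probability assigned to the true class. The abstract does not give the thesis's exact formulas, and the function names, the clipping constant `eps`, and the toy data below are all hypothetical.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    probs[i][k] is the predicted probability that sample i belongs to class k.
    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs
    (the value of eps is an assumption, not taken from the thesis).
    """
    if len(y_true) != len(probs):
        raise ValueError("y_true and probs must have the same length")
    total = 0.0
    for t, row in zip(y_true, probs):
        p = min(max(row[t], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Toy example with three classes and four samples (made-up numbers).
y_true = [0, 2, 1, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.5, 0.3, 0.2],
]
# Predicted label = class with the highest predicted probability.
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]

acc = accuracy(y_true, y_pred)
ll = multiclass_logloss(y_true, probs)
print(f"accuracy = {acc:.2f}, logloss = {ll:.4f}")
```

Higher accuracy is better and lower logloss is better. Logloss also penalizes confident wrong predictions, so the two measures can rank classifiers differently, which is one reason to analyze them jointly.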


Computer Science




Carleton University

Thesis Degree Name: Master of Science

Thesis Degree Level: 


Thesis Degree Discipline: Probability and Statistics

Parent Collection: Theses and Dissertations
