This thesis compares statistical algorithms paired with dissimilarity measures for their ability to identify clusters in benchmark binary datasets.
The techniques examined are visualization, classification, and clustering. To explore the data visually for clusters, we used parallel coordinates plots and heatmaps. The classification algorithms were neural networks and classification trees. The clustering algorithms were partitioning around centroids, partitioning around medoids, hierarchical agglomerative clustering, and hierarchical divisive clustering.
The clustering algorithms were evaluated on their ability to identify the optimal number of clusters. The "goodness" of the resulting clustering structures was assessed, and the clustering results were compared with known classes in the data using purity and entropy measures.
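As an illustration of the two external measures named above, the following minimal sketch computes purity and entropy for a clustering against known class labels. It assumes the common textbook definitions: purity as the fraction of observations in the majority class of their cluster, and entropy as the size-weighted average of the within-cluster class entropy, in bits. The function names and the toy labels are hypothetical, and the thesis may use a different normalization.

```python
from collections import Counter
from math import log2


def _class_counts_by_cluster(clusters, classes):
    """Map each cluster label to a Counter of the known classes it contains."""
    table = {}
    for cluster, known_class in zip(clusters, classes):
        table.setdefault(cluster, Counter())[known_class] += 1
    return table


def purity(clusters, classes):
    """Fraction of observations that belong to the majority class of their cluster."""
    table = _class_counts_by_cluster(clusters, classes)
    return sum(max(counts.values()) for counts in table.values()) / len(clusters)


def entropy(clusters, classes):
    """Size-weighted average of the class entropy (in bits) within each cluster."""
    table = _class_counts_by_cluster(clusters, classes)
    n = len(clusters)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        within = -sum((m / size) * log2(m / size) for m in counts.values())
        total += (size / n) * within
    return total


# Hypothetical toy data: six observations, two clusters, two known classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["a", "a", "b", "b", "b", "b"]
print(purity(toy_clusters, toy_classes))
print(entropy(toy_clusters, toy_classes))
```

Under these definitions, higher purity and lower entropy both indicate closer agreement between the clusters and the known classes.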
An experimental design was employed to test whether the algorithms and/or dissimilarity measures had a statistically significant effect on the optimal number of clusters chosen by our methods, and whether the algorithms and dissimilarity measures performed differently from one another.