Thawing of permafrost has implications for global climate, but collecting ground observations to monitor its evolution is resource-intensive, and observations are consequently sparse. Computational models are powerful tools for filling the information gaps that must be closed to inform sound policy decisions, but in the study of permafrost such models are often validated using simple ad hoc comparisons against observations. In this study, six ground-surface temperature (GST) models are compared against observations, and their performance is ranked using multiple mathematical accordance measures. Model ranks vary slightly between accordance measures. Subdividing the observations temporally by season and spatially by ground type results in greater rank variability and reveals biases in the models and accordance measures. These results demonstrate the importance of a well-defined objective to guide the selection of the accordance measures and observation data subsets used to select a "best model" on which policy decisions can be based.