MicroRNAs (miRNAs) are short (18–23 nt), non-coding RNAs that play central roles in cellular regulation by modulating the post-transcriptional expression of messenger RNA (mRNA) transcripts. It has previously been estimated that 60–90% of all mammalian mRNAs may be targeted by miRNAs. Given this biological importance, the ability to accurately predict miRNA sequences is highly valuable. Computational methods for miRNA prediction are either genomic sequence-based (de novo) or analyze transcriptomic data arising from next-generation sequencing (NGS) experiments. Unfortunately, existing methods of de novo miRNA prediction often fail when applied to non-model species and are not well suited to genome-scale data sets. Furthermore, existing methods of NGS-based miRNA prediction do not incorporate all known lines of evidence for miRNA prediction, instead focusing on either sequence-based or expression-based features of putative miRNAs.
This thesis makes contributions to the state of the art of miRNA prediction that directly address the issues highlighted above. First, we develop a framework for the generation of species-specific training data sets. Three different forms of classifiers using diverse feature sets are trained and evaluated using the framework. Significant gains in precision and recall are achieved over existing methods, as measured using four diverse species from different phyla. The framework is subsequently applied to develop miRNA predictors in two successful genome-wide miRNA prediction studies, resulting in the discovery of 155 novel miRNAs and demonstrating the real-world applicability of this work. Second, we introduce a genome-scanning miRNA prediction model that optimizes miRNA prediction for realistic experimental conditions. This model quantifies the performance of elements of the miRNA prediction pipeline, including pre-filtering stages, whose impact was previously ignored. This comprehensive evaluation framework enables significant increases in prediction performance over the state of the art through the use of updated RNA secondary structure parameters. Finally, we develop an NGS-based miRNA prediction method that improves on state-of-the-art performance by integrating all known lines of evidence that discriminate miRNA from non-miRNA. This prediction method substantially outperforms two leading existing methods on data sets from five NGS experiments across three species, and is shown to generalize to hold-out data sets.