In today’s age of “big data” and “omics” research, biologists face two challenges: sharing their results with the larger community in an interpretable and reusable format, and integrating their experimental data and findings with the prevailing hypotheses that govern their field. Publicly funded biological data curation and warehousing centers have emerged to address the former, but it remains a challenge to sift relevant information out of these resources, to integrate it in a scalable way for assessing biological hypotheses, and to disseminate the results of this process.
To address these challenges, I have developed, implemented and evaluated a semi-automated system for biological hypothesis evaluation that uses semantic technologies to reason over existing experimental data and knowledge. Chapter 1 presents the motivation, driving hypothesis and objectives for this doctoral thesis, as well as a brief review of the Semantic Web and of automated systems for hypothesis formulation and evaluation. In Chapter 2, I present HyQue, a Semantic Web tool for evaluating scientific hypotheses, including the system architecture and a prototype implementation for evaluating hypotheses about yeast metabolism. In Chapter 3, I describe efforts to publish and integrate biological data on the Semantic Web through the Bio2RDF project, a key data source for HyQue that enables browsing, querying and downloading of over 3 billion statements from more than 25 life sciences databases. In Chapter 4, I describe the ovopub, a linked data model for capturing provenance on the Semantic Web, as well as its implementation and application to Bio2RDF data. The ovopub provides a simple model for describing basic elements of linked data provenance, and enables provenance-based querying and filtering over biological linked data. In Chapter 5, I describe the application of HyQue to evaluating hypotheses about the role of C. elegans genes in aging. By retrieving and evaluating domain-specific evidence from multiple sources, HyQue correctly identified known lifespan-related genes, as well as 24 candidate aging-related genes. Chapter 6 summarizes the contributions of this thesis and proposes future work.