RADIANT: Rapid Development and Distribution of Statistical Tools for High-Throughput Sequencing Data

Overview

RADIANT is a research project funded by the 7th Framework Programme of the European Commission. The project started in December 2012 and brings together ten project partners from five European countries.

The aim of the RADIANT project is to develop new statistical analysis tools for solving open problems in high-throughput sequencing (HTS) data analysis. HTS is a rapidly evolving family of technologies with many applications, including the genetics of both rare and common diseases and understanding how disease mechanisms progress. RADIANT aims to provide an integrated computational framework for HTS data analysis that is robust and user-friendly, and to provide tools to benchmark experimental protocols and statistical methods. The project will establish training materials and an extensive training programme for the rapid dissemination of these new tools to the biomedical community.

The project is sponsored by EU FP7-HEALTH, Project Ref 305626, "Rapid development and distribution of statistical tools for high-throughput sequencing data", and is a collaboration with Magnus Rattray of University of Manchester, Korbinian Grote of Genomatix Software GmbH, Germany, Alessandro Guffanti of Genomia, Italy, Wolfgang Huber of EMBL Heidelberg, Diego di Bernardo of Fondazione Telethon, Mattia Pelizzola of Istituto Italiano di Tecnologia, Jean-Philippe Vert of ARMINES, France, Klaus Mauch of Insilico Biotechnology, Jean-Marie Mouillon of Fluxome Sciences A/S, Mark Robinson of University of Zurich, Switzerland, and Simon Tavare of University of Cambridge.

Personnel at Sheffield

Zhenwen Dai, post-doctoral research assistant

Teo de Campos, post-doctoral research assistant

James Hensman, MRC Fellow

Publications

The following publications have provided background to our work in this project.

Conference Papers

J. Hensman, M. Zwiessele and N. D. Lawrence. (2014) "Tilted variational Bayes" in S. Kaski and J. Corander (eds) Proceedings of the Seventeenth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 33, Iceland. [Software][Google Scholar Search]

R. Andrade-Pacheco, J. Hensman and N. D. Lawrence. (2014) "Hybrid discriminative-generative approaches with Gaussian processes" in S. Kaski and J. Corander (eds) Proceedings of the Seventeenth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 33, Iceland. [Software][Google Scholar Search]

Abstract

Machine learning practitioners are often faced with a choice between a discriminative and a generative approach to modelling. Here, we present a model based on a hybrid approach that breaks down some of the barriers between the discriminative and generative points of view, allowing continuous dimensionality reduction of hybrid discrete-continuous data, discriminative classification with missing inputs and manifold learning informed by class labels.

Related References

J. Hensman, N. D. Lawrence and M. Rattray. (2013) "Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters" in BMC Bioinformatics 14 (252) [DOI][Google Scholar Search]

Abstract

Background

Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications.

Results

We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method's capacity for missing data imputation, data fusion and clustering. The method can impute data which is missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method's ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications.

Conclusion

The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in Python, and are available from the authors' website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.

N. Fusi, C. Lippert, K. Borgwardt, N. D. Lawrence and O. Stegle. (2013) "Detecting regulatory gene-environment interactions with unmeasured environmental factors" in Bioinformatics [DOI][Google Scholar Search]

Abstract

Motivation: Genomic studies have revealed a substantial heritable component of the transcriptional state of the cell. To fully understand the genetic regulation of gene expression variability, it is important to study the effect of genotype in the context of external factors such as alternative environmental conditions. In model systems, explicit environmental perturbations have been considered for this purpose, allowing to directly test for environment-specific genetic effects. However, such experiments are limited to species that can be profiled in controlled environments, hampering their use in important systems such as human. Moreover, even in seemingly tightly regulated experimental conditions, subtle environmental perturbations cannot be ruled out, and hence unknown environmental influences are frequent. Here, we propose a model-based approach to simultaneously infer unmeasured environmental factors from gene expression profiles and use them in genetic analyses, identifying environment-specific associations between polymorphic loci and individual gene expression traits.

Results: In extensive simulation studies, we show that our method is able to accurately reconstruct environmental factors and their interactions with genotype in a variety of settings. We further illustrate the use of our model in a real-world dataset in which one environmental factor has been explicitly experimentally controlled. Our method is able to accurately reconstruct the true underlying environmental factor even if it's not given as an input, allowing to detect genuine genotype-environment interactions. In addition to the known environmental factor, we find unmeasured factors involved in novel genotype-environment interactions. Our results suggest that interactions with both known and unknown environmental factors significantly contribute to gene expression variability.

Availability: Software available at http://ml.sheffield.ac.uk/qtl/limmi

Contact: oliver.stegle@ebi.ac.uk, nicolo.fusi@sheffield.ac.uk

J. Hensman, M. Rattray and N. D. Lawrence. (2012) "Fast variational inference in the conjugate exponential family" in P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (eds) Advances in Neural Information Processing Systems. [PDF][Google Scholar Search]

J. Hensman, N. Fusi and N. D. Lawrence. (2013) "Gaussian processes for big data" in A. Nicholson and P. Smyth (eds) Uncertainty in Artificial Intelligence, AUAI Press. [PDF][Google Scholar Search]

N. Fusi, O. Stegle and N. D. Lawrence. (2012) "Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies" in PLoS Computational Biology 8, e1002330 [Software][PDF][DOI][Google Scholar Search]

Abstract

Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies are hidden confounding factors, such as unobserved covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious false associations or mask real genetic association signals. Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals from confounding variation. We applied our model and compared it to existing methods on different datasets and biological systems. PANAMA consistently performs better than alternative methods, and finds in particular substantially more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies. A software implementation of PANAMA is freely available online at http://ml.sheffield.ac.uk/qtl/.

This document last modified Tuesday, 04-Feb-2014 10:28:46 UTC