New Model-Based Ordination Data Exploration Tools For Microbiome Studies

High-throughput sequencing technologies allow easy characterisation of the human microbiome, but the statistical methods for analysing microbiome data are still in their infancy. Data exploration often relies on classical dimension reduction methods such as Principal Coordinate Analysis (PCoA), which is essentially Multidimensional Scaling (MDS) applied to ecologically relevant distance measures between the vectors of relative abundances of the microorganisms (e.g. Bray-Curtis distance).
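As a point of reference, the classical pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the function names `bray_curtis` and `pcoa` and the toy count table are hypothetical, and only NumPy is used.

```python
import numpy as np

def bray_curtis(rel):
    """Pairwise Bray-Curtis dissimilarity between the rows of a non-negative matrix."""
    n = rel.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(rel[i] - rel[j]).sum()
            den = (rel[i] + rel[j]).sum()
            d[i, j] = d[j, i] = num / den
    return d

def pcoa(d, k=2):
    """Classical MDS: double-centre the squared distances, then eigendecompose."""
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    vals, vecs = np.linalg.eigh(b)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]        # keep the k largest
    vals, vecs = vals[order], vecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return vecs * np.sqrt(np.clip(vals, 0.0, None))

# Hypothetical toy count table: 4 samples (rows) x 3 taxa (columns).
counts = np.array([[10, 0, 5],
                   [20, 0, 10],
                   [0, 8, 2],
                   [1, 30, 9]], dtype=float)
rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundances
dist = bray_curtis(rel)
coords = pcoa(dist, k=2)                          # one 2-D point per sample
```

The sketch only shows the mechanics of the distance-based approach; it contains no model for the sampling variability of the counts.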
We will demonstrate that these classical visualisation methods fail to deal with microbiome-specific issues such as variability due to library-size differences and overdispersion. Next, we propose a new technique based on a negative binomial regression model with log-link, which relies on the connection between correspondence analysis and the log-linear RC(M) models of Goodman (Annals of Statistics, vol. 13, 1985); see also Zhu et al. (Ecological Modelling, vol. 187, 2005). Instead of assuming a Poisson distribution for the counts, a negative binomial distribution is assumed. To better account for library-size effects, we adopt a different weighting scheme, which naturally arises from the parameterisation of the model. An iterative parameter estimation method is proposed and implemented in R. The new method is illustrated on several example datasets and empirically evaluated in a simulation study. We conclude that our method is more successful at discovering structure in microbiome datasets than conventional methods.
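For orientation, the general form of a log-linear RC(M) model with a negative binomial response can be written as follows. This is a sketch in generic notation, not necessarily the exact parameterisation used in the presentation: the symbols, and in particular the taxon-specific dispersion \phi_j, are assumptions made for illustration.

```latex
% X_{ij}: count of taxon j in sample i;  M: number of ordination dimensions
X_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi_j), \qquad
\operatorname{Var}(X_{ij}) = \mu_{ij} + \phi_j \mu_{ij}^2,
\qquad
\log \mu_{ij} = u_i + v_j + \sum_{m=1}^{M} \psi_m \, r_{im} \, s_{jm}
```

Here u_i and v_j are the sample (library size) and taxon main effects, r_{im} and s_{jm} are the sample and taxon scores in dimension m, and \psi_m measures the importance of that dimension. The scores require normalisation and orthogonality constraints for identifiability; letting \phi_j tend to zero recovers the Poisson case.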
In the second part of the presentation we extend the model-based method to a constrained ordination method by using sample-specific covariate data. The method looks for a two-dimensional visualisation that optimally discriminates between species with respect to their sensitivity to environmental conditions. Again we build upon results of Zhu et al. (2005) and Zhang and Thas (Statistical Modelling, vol. 12, 2012). The method is illustrated on real data.
All methods are available as an R package.