Multiple Sample Hypothesis Testing Of The Human Microbiome Through Evolutionary Trees

Next generation sequencing technologies make it possible to survey microbial communities by sequencing nucleic acid material extracted from multiple samples. The metagenomic read counts are summarized as empirical distributions on a reference phylogenetic tree. The distance between them is evaluated by the Kantorovich-Rubinstein (kr) metric, equivalent to the commonly used weighted UniFrac distance on a tree. This paper proposes a method to test the hypothesis that two sets of samples have the same microbial composition. The asymptotic distributions of the kr distance between the two Frechet means and  the Frechet variances are derived and are shown to be independent. The test statistic is dened as the ratio of those distances and it’s shown to follow an asymptotic F-distribution. Its generality stems from the fact that the test is nonparametric and requires no assumptions on the probability distributions of the count data. It is also applicable for varying set sizes and sample sizes. It is an extension of Evans and Matsen (2012) who suggest a test to only compare two single samples. The computational eciency of the proposed test comes from the exact asymptotic distribution of the proposed test statistic. Extensive data analysis shows that the test is signicantly faster than the permutation-based multivariate analysis of variance using distance matrices (permanova) (McArdle and Anderson, 2001). At the same time, it has correct type 1 errors and comparable power, which makes it preferable in the analysis of large scale microbiome data.