Comparison of placenta samples with contamination controls does not provide evidence for a distinct placenta microbiota, Lauder et al., 2016, https://doi.org/10.1186/s40168-016-0172-3.
Load needed packages:
#load packages
packages <- c("decontam", "ggplot2", "reshape2", "phyloseq", "dada2", "gridExtra")
sapply(packages, require, character.only = TRUE)
## decontam ggplot2 reshape2 phyloseq dada2 gridExtra
## TRUE TRUE TRUE TRUE TRUE TRUE
#settings
theme_set(theme_bw())
options(stringsAsFactors = FALSE)
path.in <- "~/DecontamManuscript/Analyses/LauderPlacenta" # CHANGE ME
#color palettes
cbbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7","#000000")
methodPalette <- c(Prevalence="turquoise3", Frequency="orangered1", Combined="darkorchid4")
prevalencePalette <- c("2" = "#edf8e9", "3-5" = "#bae4b3", "6-10" = "#74c476", "11+" = "#238b45")
This section analyzes the Lauder et al. placenta biopsy amplicon sequencing data: “The data set supporting the results of this article is available in the NCBI SRA repository, PRJNA309332 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA309332].”
Once downloaded, the amplicon sequencing data was processed using the dada2 R package, producing a table of exact amplicon sequence variants (ASVs). Those processed data objects are included as part of this reproducible analysis in the LauderPlacenta
directory. In addition, the DADA2 processing steps are included here as well for completeness, but commented out:
#path.fastq <- "~/Desktop/Contamination/Lauder"
#fns <- list.files(path.fastq, pattern="fastq.gz", full.names=TRUE) # Just forward reads?
## Filter and Trim
#plotQualityProfile(fns[1:3])
#filts <- file.path(path.fastq, "filtered", basename(fns))
#out <- filterAndTrim(fns, filts, maxEE=2, truncLen=260, multithread=TRUE)
#summary(out[,2]/out[,1]) # 98%+ passing
## Read in sample data
#df <- read.csv(file.path(path.fastq, "SraRunInfo.csv"), header=TRUE, stringsAsFactors = FALSE)
#names(filts) <- sapply(strsplit(basename(filts), "_"), `[`, 1)
#identical(df$Run, names(filts)) # TRUE
## Process
#drp <- derepFastq(filts, verbose=TRUE)
#names(drp) <- df$Run
#err <- learnErrors(drp, multithread=TRUE)
## pool=TRUE to go after even the rarest ASVs in the hunt for the placenta microbiome
#dd <- dada(drp, err=err, pool=TRUE, multithread=TRUE)
#sta <- makeSequenceTable(dd)
#dim(sta) # 69 7060 -> 69 7667
#st <- removeBimeraDenovo(sta, method="consensus", multithread=TRUE)
#dim(st); sum(st)/sum(sta) # 69 4644 0.9049186 -> 69 5469 0.9172217
#tax <- assignTaxonomy(st, "~/tax/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
#save(df, sta, st, tax, file=file.path(path.in, "processed_plac.rda"))
Read in the ASV tables and taxonomies inferred by DADA2:
load(file.path(path.in, "processed_plac.rda"))
ft <- sweep(st, 1, rowSums(st), "/")
Read in the sample metadata:
samdf <- read.csv(file.path(path.in, "Biosamples.csv"), header=TRUE, stringsAsFactors = FALSE)
samdf <-samdf[match(df$BioSample, samdf$BioSample),]
rownames(samdf) <- df$Run
identical(samdf$BioSample, df$BioSample) # TRUE
## [1] TRUE
identical(rownames(samdf), rownames(st)) # TRUE
## [1] TRUE
table(samdf$Prep, samdf$BodySite)
##
## Air Swab Extraction PlacentaFS PlacentaMS Saliva Sterile Swab
## Mobio 5 6 6 6 6 5
## PSP 6 2 6 6 6 3
##
## Vaginal Swab
## Mobio 0
## PSP 6
neg.sites <- c("Air Swab", "Extraction", "Sterile Swab")
plac.sites <- c("PlacentaFS", "PlacentaMS")
samdf$neg <- samdf$BodySite %in% neg.sites
samdf$plac <- samdf$BodySite %in% plac.sites
samdf$Type <- samdf$BodySite
samdf$Type[samdf$neg] <- "Negative"
samdf$Type[samdf$plac] <- "Placenta"
samdf$Reads <- rowSums(st)
Note that Vaginal Swab samples were only collected for the PSP kit, and not the Mobio kit.
Perform ordinations and look at the clustering by BodySite and by Prep (~ the kit used): There are four clusters in the ordination: (1) Saliva samples, both kits, (2) Vaginal Swab samples, PSP kit, (3) Placenta/Control samples, Mobio kit, (4) Placenta/Control samples, PSP kit.
Taking a closer look at the negative controls:
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.05329157
## Run 1 stress 0.05434578
## Run 2 stress 0.05241102
## ... New best solution
## ... Procrustes: rmse 0.02870685 max resid 0.1015258
## Run 3 stress 0.0549791
## Run 4 stress 0.05167007
## ... New best solution
## ... Procrustes: rmse 0.007816957 max resid 0.02948436
## Run 5 stress 0.05757953
## Run 6 stress 0.05471614
## Run 7 stress 0.05756912
## Run 8 stress 0.05303113
## Run 9 stress 0.05506131
## Run 10 stress 0.05167006
## ... New best solution
## ... Procrustes: rmse 1.722686e-05 max resid 4.811504e-05
## ... Similar to previous best
## Run 11 stress 0.05771409
## Run 12 stress 0.05482555
## Run 13 stress 0.06113077
## Run 14 stress 0.05167006
## ... Procrustes: rmse 1.661768e-05 max resid 4.977017e-05
## ... Similar to previous best
## Run 15 stress 0.05241102
## Run 16 stress 0.05355152
## Run 17 stress 0.05167008
## ... Procrustes: rmse 3.14523e-05 max resid 9.961566e-05
## ... Similar to previous best
## Run 18 stress 0.05470529
## Run 19 stress 0.05310278
## Run 20 stress 0.05395494
## *** Solution reached
The distribution of read counts in negative controls is completly non-overlapping with the much greater distribution in vaginal/saliva samples. The placenta samples largely follow the air/sterils swab distribution, with a couple of high-read-count outliers. The negative controls totally separate by kit, and the different typers of negative controls are intermixed in ordinations. We’ll use them inerchangably.
In this case, it is reasonable to simply pool the placenta/negatives across the kits. With more samples, independent processing would be better, but sample numbers are so low here, and the non-contaminants we are testing for are not kit dependent:
out <- isNotContaminant(subset_samples(ps, neg | plac), neg="neg", method="prevalence", detailed=TRUE)
out$ind <- seq(nrow(out))
ggplot(data=out, aes(x=ind, y=p)) + geom_point() + scale_y_log10() + geom_hline(yintercept=0.1, color="red") + geom_hline(yintercept=0.01, color="red") + theme_bw()
## Warning: Removed 3556 rows containing missing values (geom_point).
summary(out$p)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.108 0.195 0.280 0.471 1.000 3556
sum(out$p < 0.1, na.rm=TRUE); sum(out$p < 0.01, na.rm=TRUE); sum(p.adjust(out$p)<0.5, na.rm=TRUE)
## [1] 411
## [1] 46
## [1] 4
So 46 have a raw score (which is a p-value here since this is the prevalence method) of less than 0.01, and if we want an FDR < 50% we only get 4 hits. It’s worth requiring a higher total prevalence here though to somewhat lessen the multiple-hypothesis burden:
out5 <- out[out$prev>=5,]
out5$p.adjust <- p.adjust(out5$p)
dim(out5);sum(out5$p.adjust<0.5)
## [1] 810 8
## [1] 6
plot(sort(out5$p.adjust)[1:25], ylab="Adjusted P-Value", xlab="Rank")
abline(0.5, 0, col="red")
# save 2.6x3 in PDF
So there are 6 “believable” hits if using an FDR cutoff. Let’s just go ahead and annotate the top 20:
sq20 <- rownames(out5)[order(out5$p)[1:20]]
anndf <- data.frame(Padjusted = out5[sq20,"p.adjust"],
Kingdom = tax[sq20,"Kingdom"], Genus = tax[sq20,"Genus"],
Placenta=colSums(st[samdf$Type == "Placenta",sq20]>0),
Negative=colSums(st[samdf$Type == "Negative",sq20]>0),
Saliva=colSums(st[samdf$Type == "Saliva",sq20]>0),
Vagina=colSums(st[samdf$Type == "Vaginal Swab",sq20]>0))
rownames(anndf) <- NULL
anndf
## Padjusted Kingdom Genus Placenta Negative
## 1 0.0000118023 Eukaryota <NA> 19 0
## 2 0.1141456713 Eukaryota <NA> 11 0
## 3 0.1736283311 Eukaryota <NA> 9 0
## 4 0.1736283311 Eukaryota <NA> 9 0
## 5 0.4654711246 Bacteria Ruminococcaceae_NK4A214_group 8 0
## 6 0.4654711246 Eukaryota <NA> 8 0
## 7 1.0000000000 Bacteria Streptococcus 7 0
## 8 1.0000000000 Bacteria Lactobacillus 7 0
## 9 1.0000000000 Bacteria Lactobacillus 8 1
## 10 1.0000000000 Bacteria <NA> 6 0
## 11 1.0000000000 Bacteria <NA> 6 0
## 12 1.0000000000 Bacteria <NA> 6 0
## 13 1.0000000000 Bacteria Prevotella_9 6 0
## 14 1.0000000000 Bacteria Rikenellaceae_RC9_gut_group 6 0
## 15 1.0000000000 Bacteria Subdoligranulum 6 0
## 16 1.0000000000 Bacteria Micrococcus 7 1
## 17 1.0000000000 Bacteria <NA> 7 1
## 18 1.0000000000 Bacteria Ruminococcaceae_UCG-002 7 1
## 19 1.0000000000 Bacteria Ruminococcaceae_NK4A214_group 7 1
## 20 1.0000000000 Bacteria Christensenellaceae_R-7_group 7 1
## Saliva Vagina
## 1 0 1
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 1
## 7 12 1
## 8 0 0
## 9 3 5
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 1
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 1
## 18 0 0
## 19 0 0
## 20 0 0
And use BLAST against nt as well, excluding environmental and unculturables:
From the BLAST results (most annotations are from 100% identity and 100% coverage hits)….
There is a pretty clean story here. At the FDR<0.5 cutoff, 5/6 of the “significant” non-contaminants are human DNA. This demonstrates the method is working, non-target human DNA is indeed a non-contaminant, i.e. it is truly present in the placental samples. However, it also demonstrates that there is little to no evidence for a placenta microbiome here, at least a core microbiome that is common across placental environments, as spuriously amplified human DNA is (almost) all we are finding.
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 dada2_1.8.0 Rcpp_0.12.17 phyloseq_1.22.3
## [5] reshape2_1.4.3 ggplot2_3.0.0 decontam_1.1.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.38.0 jsonlite_1.5
## [3] splines_3.4.3 foreach_1.4.4
## [5] RcppParallel_4.4.0 assertthat_0.2.0
## [7] stats4_3.4.3 latticeExtra_0.6-28
## [9] Rsamtools_1.30.0 GenomeInfoDbData_1.0.0
## [11] yaml_2.1.19 pillar_1.2.3
## [13] backports_1.1.2 lattice_0.20-35
## [15] glue_1.2.0 digest_0.6.15
## [17] RColorBrewer_1.1-2 GenomicRanges_1.30.3
## [19] XVector_0.18.0 colorspace_1.3-2
## [21] htmltools_0.3.6 Matrix_1.2-14
## [23] plyr_1.8.4 pkgconfig_2.0.1
## [25] ShortRead_1.36.1 zlibbioc_1.24.0
## [27] purrr_0.2.5 scales_0.5.0
## [29] BiocParallel_1.12.0 tibble_1.4.2
## [31] mgcv_1.8-23 IRanges_2.12.0
## [33] withr_2.1.2 SummarizedExperiment_1.8.1
## [35] BiocGenerics_0.24.0 lazyeval_0.2.1
## [37] survival_2.42-3 magrittr_1.5
## [39] evaluate_0.10.1 nlme_3.1-137
## [41] MASS_7.3-50 hwriter_1.3.2
## [43] vegan_2.5-2 tools_3.4.3
## [45] data.table_1.11.4 matrixStats_0.53.1
## [47] stringr_1.3.1 S4Vectors_0.16.0
## [49] munsell_0.4.3 cluster_2.0.7-1
## [51] DelayedArray_0.4.1 bindrcpp_0.2.2
## [53] Biostrings_2.46.0 ade4_1.7-11
## [55] compiler_3.4.3 GenomeInfoDb_1.14.0
## [57] rlang_0.2.1 rhdf5_2.22.0
## [59] grid_3.4.3 RCurl_1.95-4.10
## [61] iterators_1.0.9 biomformat_1.6.0
## [63] igraph_1.2.1 labeling_0.3
## [65] bitops_1.0-6 rmarkdown_1.9
## [67] gtable_0.2.0 codetools_0.2-15
## [69] multtest_2.34.0 R6_2.2.2
## [71] GenomicAlignments_1.14.2 knitr_1.20
## [73] dplyr_0.7.5 bindr_0.1.1
## [75] rprojroot_1.3-2 permute_0.9-4
## [77] ape_5.1 stringi_1.2.2
## [79] parallel_3.4.3 tidyselect_0.2.4