Comparison of placenta samples with contamination controls does not provide evidence for a distinct placenta microbiota, Lauder et al., 2016, https://doi.org/10.1186/s40168-016-0172-3.

Load needed packages:

#load packages
packages <- c("decontam", "ggplot2", "reshape2", "phyloseq", "dada2", "gridExtra")
sapply(packages, require, character.only = TRUE)
##  decontam   ggplot2  reshape2  phyloseq     dada2 gridExtra 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
#settings
theme_set(theme_bw())
options(stringsAsFactors = FALSE)
path.in <- "~/DecontamManuscript/Analyses/LauderPlacenta" # CHANGE ME

#color palettes
cbbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7","#000000")
methodPalette <- c(Prevalence="turquoise3", Frequency="orangered1", Combined="darkorchid4")
prevalencePalette <- c("2" = "#edf8e9", "3-5" = "#bae4b3", "6-10" = "#74c476", "11+" = "#238b45")

Process 16S Amplicon Data

This section analyzes the Lauder et al. placenta biopsy amplicon sequencing data: “The data set supporting the results of this article is available in the NCBI SRA repository, PRJNA309332 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA309332].”

Once downloaded, the amplicon sequencing data was processed using the dada2 R package, producing a table of exact amplicon sequence variants (ASVs). Those processed data objects are included as part of this reproducible analysis in the LauderPlacenta directory. In addition, the DADA2 processing steps are included here as well for completeness, but commented out:

#path.fastq <- "~/Desktop/Contamination/Lauder"
#fns <- list.files(path.fastq, pattern="fastq.gz", full.names=TRUE) # Just forward reads?
## Filter and Trim
#plotQualityProfile(fns[1:3])
#filts <- file.path(path.fastq, "filtered", basename(fns))
#out <- filterAndTrim(fns, filts, maxEE=2, truncLen=260, multithread=TRUE)
#summary(out[,2]/out[,1]) # 98%+ passing
## Read in sample data
#df <- read.csv(file.path(path.fastq, "SraRunInfo.csv"), header=TRUE, stringsAsFactors = FALSE)
#names(filts) <- sapply(strsplit(basename(filts), "_"), `[`, 1)
#identical(df$Run, names(filts)) # TRUE
## Process
#drp <- derepFastq(filts, verbose=TRUE)
#names(drp) <- df$Run
#err <- learnErrors(drp, multithread=TRUE)
## pool=TRUE to go after even the rarest ASVs in the hunt for the placenta microbiome
#dd <- dada(drp, err=err, pool=TRUE, multithread=TRUE)
#sta <- makeSequenceTable(dd)
#dim(sta) # 69 7060 -> 69 7667
#st <- removeBimeraDenovo(sta, method="consensus", multithread=TRUE)
#dim(st); sum(st)/sum(sta) # 69 4644   0.9049186 -> 69 5469  0.9172217
#tax <- assignTaxonomy(st, "~/tax/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
#save(df, sta, st, tax, file=file.path(path.in, "processed_plac.rda"))

Analayze Data

Read in the ASV tables and taxonomies inferred by DADA2:

load(file.path(path.in, "processed_plac.rda"))
ft <- sweep(st, 1, rowSums(st), "/")

Read in the sample metadata:

samdf <- read.csv(file.path(path.in, "Biosamples.csv"), header=TRUE, stringsAsFactors = FALSE)
samdf <-samdf[match(df$BioSample, samdf$BioSample),]
rownames(samdf) <- df$Run
identical(samdf$BioSample, df$BioSample) # TRUE
## [1] TRUE
identical(rownames(samdf), rownames(st)) # TRUE
## [1] TRUE
table(samdf$Prep, samdf$BodySite)
##        
##         Air Swab Extraction PlacentaFS PlacentaMS Saliva Sterile Swab
##   Mobio        5          6          6          6      6            5
##   PSP          6          2          6          6      6            3
##        
##         Vaginal Swab
##   Mobio            0
##   PSP              6
neg.sites <- c("Air Swab", "Extraction", "Sterile Swab")
plac.sites <- c("PlacentaFS", "PlacentaMS")
samdf$neg <- samdf$BodySite %in% neg.sites
samdf$plac <- samdf$BodySite %in% plac.sites
samdf$Type <- samdf$BodySite
samdf$Type[samdf$neg] <- "Negative"
samdf$Type[samdf$plac] <- "Placenta"
samdf$Reads <- rowSums(st)

Note that Vaginal Swab samples were only collected for the PSP kit, and not the Mobio kit.

Perform ordinations and look at the clustering by BodySite and by Prep (~ the kit used): There are four clusters in the ordination: (1) Saliva samples, both kits, (2) Vaginal Swab samples, PSP kit, (3) Placenta/Control samples, Mobio kit, (4) Placenta/Control samples, PSP kit.

Taking a closer look at the negative controls:

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.05329157 
## Run 1 stress 0.05434578 
## Run 2 stress 0.05241102 
## ... New best solution
## ... Procrustes: rmse 0.02870685  max resid 0.1015258 
## Run 3 stress 0.0549791 
## Run 4 stress 0.05167007 
## ... New best solution
## ... Procrustes: rmse 0.007816957  max resid 0.02948436 
## Run 5 stress 0.05757953 
## Run 6 stress 0.05471614 
## Run 7 stress 0.05756912 
## Run 8 stress 0.05303113 
## Run 9 stress 0.05506131 
## Run 10 stress 0.05167006 
## ... New best solution
## ... Procrustes: rmse 1.722686e-05  max resid 4.811504e-05 
## ... Similar to previous best
## Run 11 stress 0.05771409 
## Run 12 stress 0.05482555 
## Run 13 stress 0.06113077 
## Run 14 stress 0.05167006 
## ... Procrustes: rmse 1.661768e-05  max resid 4.977017e-05 
## ... Similar to previous best
## Run 15 stress 0.05241102 
## Run 16 stress 0.05355152 
## Run 17 stress 0.05167008 
## ... Procrustes: rmse 3.14523e-05  max resid 9.961566e-05 
## ... Similar to previous best
## Run 18 stress 0.05470529 
## Run 19 stress 0.05310278 
## Run 20 stress 0.05395494 
## *** Solution reached

The distribution of read counts in negative controls is completly non-overlapping with the much greater distribution in vaginal/saliva samples. The placenta samples largely follow the air/sterils swab distribution, with a couple of high-read-count outliers. The negative controls totally separate by kit, and the different typers of negative controls are intermixed in ordinations. We’ll use them inerchangably.

Look for non-contaminants

In this case, it is reasonable to simply pool the placenta/negatives across the kits. With more samples, independent processing would be better, but sample numbers are so low here, and the non-contaminants we are testing for are not kit dependent:

out <- isNotContaminant(subset_samples(ps, neg | plac), neg="neg", method="prevalence", detailed=TRUE)
out$ind <- seq(nrow(out))
ggplot(data=out, aes(x=ind, y=p)) + geom_point() + scale_y_log10() + geom_hline(yintercept=0.1, color="red") + geom_hline(yintercept=0.01, color="red") + theme_bw()
## Warning: Removed 3556 rows containing missing values (geom_point).

summary(out$p)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.108   0.195   0.280   0.471   1.000    3556
sum(out$p < 0.1, na.rm=TRUE); sum(out$p < 0.01, na.rm=TRUE); sum(p.adjust(out$p)<0.5, na.rm=TRUE)
## [1] 411
## [1] 46
## [1] 4

So 46 have a raw score (which is a p-value here since this is the prevalence method) of less than 0.01, and if we want an FDR < 50% we only get 4 hits. It’s worth requiring a higher total prevalence here though to somewhat lessen the multiple-hypothesis burden:

out5 <- out[out$prev>=5,]
out5$p.adjust <- p.adjust(out5$p)
dim(out5);sum(out5$p.adjust<0.5)
## [1] 810   8
## [1] 6
plot(sort(out5$p.adjust)[1:25], ylab="Adjusted P-Value", xlab="Rank")
abline(0.5, 0, col="red")

# save 2.6x3 in PDF

So there are 6 “believable” hits if using an FDR cutoff. Let’s just go ahead and annotate the top 20:

sq20 <- rownames(out5)[order(out5$p)[1:20]]
anndf <- data.frame(Padjusted = out5[sq20,"p.adjust"],
                    Kingdom = tax[sq20,"Kingdom"], Genus = tax[sq20,"Genus"],
                    Placenta=colSums(st[samdf$Type == "Placenta",sq20]>0),
                    Negative=colSums(st[samdf$Type == "Negative",sq20]>0),
                    Saliva=colSums(st[samdf$Type == "Saliva",sq20]>0),
                    Vagina=colSums(st[samdf$Type == "Vaginal Swab",sq20]>0))

rownames(anndf) <- NULL
anndf
##       Padjusted   Kingdom                         Genus Placenta Negative
## 1  0.0000118023 Eukaryota                          <NA>       19        0
## 2  0.1141456713 Eukaryota                          <NA>       11        0
## 3  0.1736283311 Eukaryota                          <NA>        9        0
## 4  0.1736283311 Eukaryota                          <NA>        9        0
## 5  0.4654711246  Bacteria Ruminococcaceae_NK4A214_group        8        0
## 6  0.4654711246 Eukaryota                          <NA>        8        0
## 7  1.0000000000  Bacteria                 Streptococcus        7        0
## 8  1.0000000000  Bacteria                 Lactobacillus        7        0
## 9  1.0000000000  Bacteria                 Lactobacillus        8        1
## 10 1.0000000000  Bacteria                          <NA>        6        0
## 11 1.0000000000  Bacteria                          <NA>        6        0
## 12 1.0000000000  Bacteria                          <NA>        6        0
## 13 1.0000000000  Bacteria                  Prevotella_9        6        0
## 14 1.0000000000  Bacteria   Rikenellaceae_RC9_gut_group        6        0
## 15 1.0000000000  Bacteria               Subdoligranulum        6        0
## 16 1.0000000000  Bacteria                   Micrococcus        7        1
## 17 1.0000000000  Bacteria                          <NA>        7        1
## 18 1.0000000000  Bacteria       Ruminococcaceae_UCG-002        7        1
## 19 1.0000000000  Bacteria Ruminococcaceae_NK4A214_group        7        1
## 20 1.0000000000  Bacteria Christensenellaceae_R-7_group        7        1
##    Saliva Vagina
## 1       0      1
## 2       0      0
## 3       0      0
## 4       0      0
## 5       0      0
## 6       0      1
## 7      12      1
## 8       0      0
## 9       3      5
## 10      0      0
## 11      0      0
## 12      0      0
## 13      0      1
## 14      0      0
## 15      0      0
## 16      0      0
## 17      0      1
## 18      0      0
## 19      0      0
## 20      0      0

And use BLAST against nt as well, excluding environmental and unculturables:

From the BLAST results (most annotations are from 100% identity and 100% coverage hits)….

  1. Homo sapiens
  2. Homo sapiens
  3. Homo sapiens
  4. Homo sapiens
  5. uncult Ruminococcaceae
  6. Homo sapiens

Conclusions

There is a pretty clean story here. At the FDR<0.5 cutoff, 5/6 of the “significant” non-contaminants are human DNA. This demonstrates the method is working, non-target human DNA is indeed a non-contaminant, i.e. it is truly present in the placental samples. However, it also demonstrates that there is little to no evidence for a placenta microbiome here, at least a core microbiome that is common across placental environments, as spuriously amplified human DNA is (almost) all we are finding.

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gridExtra_2.3   dada2_1.8.0     Rcpp_0.12.17    phyloseq_1.22.3
## [5] reshape2_1.4.3  ggplot2_3.0.0   decontam_1.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.38.0             jsonlite_1.5              
##  [3] splines_3.4.3              foreach_1.4.4             
##  [5] RcppParallel_4.4.0         assertthat_0.2.0          
##  [7] stats4_3.4.3               latticeExtra_0.6-28       
##  [9] Rsamtools_1.30.0           GenomeInfoDbData_1.0.0    
## [11] yaml_2.1.19                pillar_1.2.3              
## [13] backports_1.1.2            lattice_0.20-35           
## [15] glue_1.2.0                 digest_0.6.15             
## [17] RColorBrewer_1.1-2         GenomicRanges_1.30.3      
## [19] XVector_0.18.0             colorspace_1.3-2          
## [21] htmltools_0.3.6            Matrix_1.2-14             
## [23] plyr_1.8.4                 pkgconfig_2.0.1           
## [25] ShortRead_1.36.1           zlibbioc_1.24.0           
## [27] purrr_0.2.5                scales_0.5.0              
## [29] BiocParallel_1.12.0        tibble_1.4.2              
## [31] mgcv_1.8-23                IRanges_2.12.0            
## [33] withr_2.1.2                SummarizedExperiment_1.8.1
## [35] BiocGenerics_0.24.0        lazyeval_0.2.1            
## [37] survival_2.42-3            magrittr_1.5              
## [39] evaluate_0.10.1            nlme_3.1-137              
## [41] MASS_7.3-50                hwriter_1.3.2             
## [43] vegan_2.5-2                tools_3.4.3               
## [45] data.table_1.11.4          matrixStats_0.53.1        
## [47] stringr_1.3.1              S4Vectors_0.16.0          
## [49] munsell_0.4.3              cluster_2.0.7-1           
## [51] DelayedArray_0.4.1         bindrcpp_0.2.2            
## [53] Biostrings_2.46.0          ade4_1.7-11               
## [55] compiler_3.4.3             GenomeInfoDb_1.14.0       
## [57] rlang_0.2.1                rhdf5_2.22.0              
## [59] grid_3.4.3                 RCurl_1.95-4.10           
## [61] iterators_1.0.9            biomformat_1.6.0          
## [63] igraph_1.2.1               labeling_0.3              
## [65] bitops_1.0-6               rmarkdown_1.9             
## [67] gtable_0.2.0               codetools_0.2-15          
## [69] multtest_2.34.0            R6_2.2.2                  
## [71] GenomicAlignments_1.14.2   knitr_1.20                
## [73] dplyr_0.7.5                bindr_0.1.1               
## [75] rprojroot_1.3-2            permute_0.9-4             
## [77] ape_5.1                    stringi_1.2.2             
## [79] parallel_3.4.3             tidyselect_0.2.4