Load packages.

library(decontam); packageVersion("decontam")
## [1] '1.18.0'
library(ggplot2); packageVersion("ggplot2")
## [1] '3.4.1'
library(phyloseq); packageVersion("phyloseq")
## [1] '1.42.0'

Import Curated Data

Note, the imported data has already been curated in the companion scripts FelineBileCuration.Rmd and FelineBileCulture.Rmd.

load(file="../RDS/bile_curated.rda")
if(!identical(rownames(tax.filt), colnames(st.filt))) stop("Table mismatch: tax.filt, st.filt")
if(!identical(rownames(df), rownames(st))) stop("Mismatched samples in df and st.")
if(!identical(rownames(df), rownames(st.filt))) stop("Mismatched samples in df and st.")
ft.filt <- sweep(st.filt, 1, rowSums(st.filt), "/")

Taxon-by-Taxon Comparison of Culture and 16S Results

In terms of the specific taxa detected by culture, is there any concordance with the frequencies of those same taxa as measured by the 16S community sequencing approach? We’ll start this investigation by augmenting the sample metadata with the proportion of each of the taxa tested for by culture – E. coli, Clostridium spp., Enterococcus, Streptococcus and Peptostreptococcus – as measured by 16S sequencing. Since each of these is effectively a genus-level taxon we will use the Silva taxonomic assignments at the genus level to identify the corresponding ASVs. E. coli appears as the genus Escherichia-Shigella in these taxonomic assignments. The true taxonomic range of the culture assay for Clostridium spp. is rather unclear, but we will associate it with the genus designation of Clostridium sensu stricto 1.

is.ecoli <- tax.filt[,"Genus"] %in% "Escherichia-Shigella"
is.clost <- tax.filt[,"Genus"] %in% "Clostridium sensu stricto 1"
is.entero <- tax.filt[,"Genus"] %in% "Enterococcus"
is.strep <- tax.filt[,"Genus"] %in% "Streptococcus"
is.pepto <- tax.filt[,"Genus"] %in% "Peptostreptococcus"
c(Ecoli=sum(is.ecoli), Clost=sum(is.clost), Entero=sum(is.entero), Strep=sum(is.strep), Pepto=sum(is.pepto))
##  Ecoli  Clost Entero  Strep  Pepto 
##      2      3      2      6      0

All of these genera except for Peptostreptococcus were detected in the 16S sequencing results. With a relatively small number of ASVs representing each genus. Inspecting the distribution of the 16S-measured proportions of these genera across the samples, and relative to the culture results.

df$EcoliProportion <- rowSums(ft.filt[,is.ecoli])
df$ClostProportion <- rowSums(ft.filt[,is.clost])
df$EnteroProportion <- rowSums(ft.filt[,is.entero])
df$StrepProportion <- rowSums(ft.filt[,is.strep])
df$PeptoProportion <- rowSums(ft.filt[,is.pepto])
ggplot(data=df[order(df$EcoliProportion),], 
       aes(x=seq(nrow(df)), y=EcoliProportion, color=Ecoli)) +
  geom_point() +
  theme(axis.title.x=element_blank()) + 
  ylab("Ecoli Proportion (16S)") +
  labs(color="Ecoli Present (Culture)")

Strong concordance between culture positivity for E. coli and measured 16S proportion. One outlier sample.

ggplot(data=df[order(df$ClostProportion),], 
       aes(x=seq(nrow(df)), y=ClostProportion, color=Clost)) +
  geom_point() +
  theme(axis.title.x=element_blank()) + 
  ylab("Clostridium Proportion (16S)") +
  labs(color="Clostridium Present (Culture)")

Agreement again, albeit there was only one sample in which Clostridium was detected by culture.

ggplot(data=df[order(df$EnteroProportion),], 
       aes(x=seq(nrow(df)), y=EnteroProportion, color=Enter)) +
  geom_point() +
  theme(axis.title.x=element_blank()) + 
  ylab("Enterococcus Proportion (16S)") +
  labs(color="Enterococcus Present (Culture)")

No concordance here, although again only one culture-positive sample.

ggplot(data=df[order(df$StrepProportion),], 
       aes(x=seq(nrow(df)), y=StrepProportion, color=Strep)) +
  geom_point() +
  theme(axis.title.x=element_blank()) + 
  ylab("Streptococcus Proportion (16S)") +
  labs(color="Streptococcus Present (Culture)")

No concordance, but just one culture-positive sample.

And as a reminder, Peptostreptococcus was not identified in the 16S data.

Overall, the culture positivity results and the 16S relative abundance results are largely concordant, especially for E. coli, the only taxon for which multiple samples were positive by culture.

Community-level Comparison of Culture and 16S Results

We now consider the community-level 16S results, and compare them to our culture results. We start by looking at “alpha-diversity” in the 16S sequencing data. Our expectation is that most of the samples are just contamination, and will have a relatively consistent and high level of diversity reflecting the spectrum of contaminants, but that some samples with a legitimate infection will be lower in diversity because they are dominated by one or a small number of real species infecting the bile.

Alpha-diversity

Calcluate Shannon and Simpson diversities for all our samples.

if(!identical(rownames(df), rownames(ft.filt))) stop("Table mismatch: df, ft")
df$Shannon <- apply(ft.filt, 1, function(x) sum(-x*log(x), na.rm=TRUE))
df$Simpson <- apply(ft.filt, 1, function(x) 1-sum(x*x))
ggplot(data=df, aes(x=Shannon, y=Simpson, color=NegativeControl)) + geom_point()

As expected there is a lot of correlation between the two diversity measures, especially in the low-diversity samples that seem to be separating from the negative control samples (and the bulk of the bile samples).

Looking into the relationship between diversity and culture positivity, while also plotting library size to check on any influence from uncertainly measured (low depth) samples.

ggplot(data=df[seq(nrow(df), 1, -1),], aes(x=Shannon, y=LibrarySize, color=Culture_Positive)) + geom_point() + xlab("Shannon Diversity (16S)")

That is strikingly clear. Save for two outliers, alpha-diversity separates samples into two modes: A low-diversity mode that is entirely culture-positive, and a high-diversity mode that is entirely (save the one exception) culture-negative and that includes all of the negative control samples. Library size is not confounding this signal.

Let’s check on those two outlier culture-positive samples.

df[df$Culture_Positive %in% TRUE & df$Shannon > 2, 
   c("SampleType", "Organisms", "Bactibilia", "Culture_Positive", "Ecoli", "Shannon", "LibrarySize")]
##         SampleType          Organisms   Bactibilia Culture_Positive Ecoli
## 719-515   Positive Peptostreptococcus Nobactibilia             TRUE FALSE
## 720-513   Positive     Staphepidermis Nobactibilia             TRUE FALSE
##          Shannon LibrarySize
## 719-515 4.655519       94239
## 720-513 4.609725       72306

As a reminder, Bactibilia and E. coli culture positivity are fully concordant, and there are two samples which were culture positive for some bacteria, but not for E. coli:

table(Ecoli_Positive=df$Ecoli, Bactibilia_Positive=df$Bactibilia_Positive)
##               Bactibilia_Positive
## Ecoli_Positive FALSE TRUE
##          FALSE    48    0
##          TRUE      0    6
table(Ecoli_Positive=df$Ecoli, Culture_Positive=df$Culture_Positive)
##               Culture_Positive
## Ecoli_Positive FALSE TRUE
##          FALSE    47    2
##          TRUE      0    6

Thus, these two low-diversity “culture positive” outliers are exactly those two samples that were culture-positive samples but no E. coli was detected by culture, and Bactibilia was not detected by microscopy. Or, put another way, there is full concordance in identifying the same set of six samples by microscopic identification of bactibilia, by culture-based detection of E. coli, and by 16S-based identification of low-diversity samples.

For the record, the six samples are:

sam.infected <- df$SampleID[df$Ecoli %in% TRUE]
sam.infected
## [1] "716-516" "716-520" "716-521" "718-515" "720-517" "721-520"

Creating the publication formatted and labelled figure.

df$Fig3 <- "Bile Samples"
df$Fig3[df$NegativeControl] <- "Negative Control Samples"
fig3 <- ggplot(data=df[seq(nrow(df), 1, -1),],
               aes(x=Shannon, y=LibrarySize, color=Ecoli)) + 
  geom_point() + 
  facet_wrap(~Fig3) +
  xlab("Shannon Diversity (16S)") + ylab("Sequencing Depth") +
  theme_bw()
fig3

ggsave("../Figures/Figure3.pdf", fig3, width=6, height=4, units="in")

Beta-diversity

We’ll also take a look at how the samples organize using ordination and the Bray-Curtis community dissimilarity metric.

ps <- phyloseq(sample_data(df), 
               otu_table(ft.filt, taxa_are_rows=FALSE), 
               tax_table(tax.filt))
ord <- ordinate(ps, method="MDS", distance="bray")
f <- plot_ordination(ps, ord, color="Ecoli") + facet_wrap(~SampleType)
f

The six samples identified by culture, microscopy and alpha-diversity also stand out in the Bray-Curtis ordination. It is also clear that the negative control samples intermingle with the rest of the bile (Positive) samples, supporting the idea that those samples are mostly just off-target amplification/contamination, with little to no real signal.

Now combining this into one panel with publication formatting and labeling.

df$Fig2 <- "Negative Control" # Technical negative controls, all types
df$Fig2[df$Patient.Control %in% "Control"] <- "Control case" # Real samples, from healthy "control" cats
df$Fig2[df$Patient.Control %in% "Patient"] <- "Clinical case" # Real samples, from clinical cases
df$Fig2[df$Patient.Control %in% "Patient" & df$Ecoli] <- "Clinical case (E. coli culture positive)" # Real samples, from clinical cases, that were E. coli positive
df$Fig2 <- factor(df$Fig2, levels=c("Negative Control", "Control case", "Clinical case", "Clinical case (E. coli culture positive)"))
# Sanity check
table(df$Fig2, df$Patient.Control, useNA="ifany")
##                                           
##                                            Control ControlTechnical Patient
##   Negative Control                               0                2       0
##   Control case                                  17                0       0
##   Clinical case                                  0                0      34
##   Clinical case (E. coli culture positive)       0                0       6
##                                           
##                                            PatientTechnical Technical <NA>
##   Negative Control                                        2         9    4
##   Control case                                            0         0    0
##   Clinical case                                           0         0    0
##   Clinical case (E. coli culture positive)                0         0    0
table(df$Fig2, df$NegativeControl, useNA="ifany")
##                                           
##                                            FALSE TRUE
##   Negative Control                             0   17
##   Control case                                17    0
##   Clinical case                               34    0
##   Clinical case (E. coli culture positive)     6    0

Sanity check passed.

scale.fig2 <- scale_color_manual(values=c("Negative Control"="black", 
                                          "Control case"="dodgerblue", 
                                          "Clinical case"="darkorange3",
                                          "Clinical case (E. coli culture positive)"="chartreuse3"))

ps.fig2 <- phyloseq(sample_data(df), 
               otu_table(ft.filt, taxa_are_rows=FALSE), 
               tax_table(tax.filt))
ord.fig2 <- ordinate(ps.fig2, method="MDS", distance="bray")
fig2 <- plot_ordination(ps.fig2, ord.fig2, color="Fig2") +
  scale.fig2 +
  theme_bw() + 
  theme(legend.title = element_blank(), 
        legend.position = c(0.7, 0.84),
        legend.spacing.y = unit(0.0, 'in'),
        legend.background = element_blank(),
        legend.key = element_blank(),
        legend.box.spacing = unit(0.0, "in"))
fig2

ggsave("../Figures/Figure2.pdf", fig2, width=6, height=4, units="in") ## uncomment to regenerate figure 

Composition

Finally, what is in the putatitive contaminant background? And what is in those six standout samples?

ps.contam <- subset_samples(ps, !Ecoli %in% TRUE)
ps.infect <- subset_samples(ps, Ecoli %in% TRUE)
genera.tot <- sort(tapply(colSums(ft.filt), tax.filt[,"Genus"], sum), decreasing=TRUE)
genera.contam <- sort(tapply(colSums(ft.filt[!df$Ecoli %in% TRUE,]), tax.filt[,"Genus"], sum), decreasing=TRUE)
genera.infect <- sort(tapply(colSums(ft.filt[df$Ecoli %in% TRUE,]), tax.filt[,"Genus"], sum), decreasing=TRUE)
head(genera.contam, n=20)
##                 Anoxybacillus                   Bacteroides 
##                    10.8275051                     4.9363788 
##                 Ileibacterium                    Dubosiella 
##                     4.1332786                     1.9949405 
##               Bifidobacterium             Ligilactobacillus 
##                     1.8639262                     1.1732673 
##     Coriobacteriaceae UCG-002 Lachnospiraceae NK4A136 group 
##                     1.1013964                     1.0996733 
##                  Helicobacter                     Alistipes 
##                     1.0381067                     1.0172288 
##                 Paenibacillus                        Enorma 
##                     0.9505098                     0.9422189 
##                 Brevibacillus                   Geobacillus 
##                     0.8971653                     0.8959039 
##                 Enterorhabdus                 Mucispirillum 
##                     0.8687622                     0.8290046 
##   Rikenellaceae RC9 gut group               Subdoligranulum 
##                     0.7703743                     0.7572878 
##                     Olsenella                Anaerobacillus 
##                     0.7552610                     0.7255287

Anoxybacillus is the most abundant across the “non-infected” samples, followed by a ong tail of other genera.

head(genera.infect, n=10)
##        Escherichia-Shigella                    Bacillus 
##                 3.734797210                 0.517202487 
##                      Rothia                 Bacteroides 
##                 0.446614995                 0.413100951 
##                 Actinomyces Clostridium sensu stricto 1 
##                 0.337290579                 0.269699840 
##               Streptococcus                Enterococcus 
##                 0.139063342                 0.044901842 
##             Subdoligranulum             Bifidobacterium 
##                 0.002805228                 0.002041338

These samples look very different. As might be expected (since all six were culture-positive for E. coli), Esherichia-Shigella is the most abundant genus. This is followed by 6-7 other genera, before dropping off dramatically to minimal levels.

Create a bar plot of the composition in the six “infected” samples, keeping just the genera with >0.5% average abundance:

genera.infect.keep <- names(genera.infect)[genera.infect/6 > 0.005]
ps.infect.top <- subset_taxa(ps.infect, Genus %in% genera.infect.keep)
plot_bar(ps.infect.top, fill="Genus")

Interestingly, while E. coli is the largest contributor to most of these samples, it is often appearing as what might be a co-infection with 1-3 other bacterial genera. And in sample 716-520, it is not E. coli that dominates the sample, but rather a mixture of Bacillus and Rothia.

sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phyloseq_1.42.0 ggplot2_3.4.1   decontam_1.18.0
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.58.0         sass_0.4.5             jsonlite_1.8.4        
##  [4] splines_4.2.3          foreach_1.5.2          bslib_0.4.2           
##  [7] highr_0.10             stats4_4.2.3           GenomeInfoDbData_1.2.9
## [10] yaml_2.3.7             pillar_1.9.0           lattice_0.20-45       
## [13] glue_1.6.2             digest_0.6.31          XVector_0.38.0        
## [16] colorspace_2.1-0       htmltools_0.5.5        Matrix_1.5-3          
## [19] plyr_1.8.8             pkgconfig_2.0.3        zlibbioc_1.44.0       
## [22] scales_1.2.1           tibble_3.2.1           mgcv_1.8-42           
## [25] generics_0.1.3         farver_2.1.1           IRanges_2.32.0        
## [28] cachem_1.0.7           withr_2.5.0            BiocGenerics_0.44.0   
## [31] cli_3.6.1              survival_3.5-5         magrittr_2.0.3        
## [34] crayon_1.5.2           evaluate_0.20          fansi_1.0.4           
## [37] nlme_3.1-162           MASS_7.3-58.3          vegan_2.6-4           
## [40] textshaping_0.3.6      tools_4.2.3            data.table_1.14.8     
## [43] lifecycle_1.0.3        stringr_1.5.0          Rhdf5lib_1.20.0       
## [46] S4Vectors_0.36.2       munsell_0.5.0          cluster_2.1.4         
## [49] Biostrings_2.66.0      ade4_1.7-22            compiler_4.2.3        
## [52] jquerylib_0.1.4        GenomeInfoDb_1.34.9    systemfonts_1.0.4     
## [55] rlang_1.1.0            rhdf5_2.42.0           grid_4.2.3            
## [58] RCurl_1.98-1.12        iterators_1.0.14       rhdf5filters_1.10.1   
## [61] biomformat_1.26.0      rstudioapi_0.14        igraph_1.4.1          
## [64] bitops_1.0-7           labeling_0.4.2         rmarkdown_2.21        
## [67] gtable_0.3.3           codetools_0.2-19       multtest_2.54.0       
## [70] DBI_1.1.3              reshape2_1.4.4         R6_2.5.1              
## [73] knitr_1.42             dplyr_1.1.1            fastmap_1.1.1         
## [76] utf8_1.2.3             ragg_1.2.5             permute_0.9-7         
## [79] ape_5.7-1              stringi_1.7.12         parallel_4.2.3        
## [82] Rcpp_1.0.10            vctrs_0.6.1            tidyselect_1.2.0      
## [85] xfun_0.38