Setup

Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), and read in filenames:

## [1] '1.18.0'
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## /Users/bcallah/LoopData/16S/CallahanMeat/CB28_contig_list_trimmed.fq 
##                                                                42342 
##  /Users/bcallah/LoopData/16S/CallahanMeat/CB4_contig_list_trimmed.fq 
##                                                                34415 
##  /Users/bcallah/LoopData/16S/CallahanMeat/GB8_contig_list_trimmed.fq 
##                                                                43609 
## /Users/bcallah/LoopData/16S/CallahanMeat/GT10_contig_list_trimmed.fq 
##                                                                29603 
##  /Users/bcallah/LoopData/16S/CallahanMeat/GT5_contig_list_trimmed.fq 
##                                                                29938 
##  /Users/bcallah/LoopData/16S/CallahanMeat/GT8_contig_list_trimmed.fq 
##                                                                49218

QA, Filtering and Trimming

First do basic QA on these data, and then apply the DADA2 filtering and trimming workflow with the parameters selected from the detailed inspection of the Zymo mock community. Note that this data was generated alongside the Fecal data, so all parameters used there are expected to translate.

plotComplexity(fn)

plotQualityProfile(fn)

Remove the primers (and any flanking sequence) from the reads, and filter out reads that don’t contain both primers:

FWD <- "AGAGTTTGATCMTGGC" # Loop 16S forward primer
REV <- "TACCTTGTTACGACTT" # Loop 16S reverse primer
nop <- file.path(path, "nop", basename(fn))
out <- removePrimers(fn, nop, FWD, rc(REV), verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## 276 sequences out of 42342 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/CB28_contig_list_trimmed.fq
## Read in 42342, output 15651 (37%) filtered sequences.
## 198 sequences out of 34415 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/CB4_contig_list_trimmed.fq
## Read in 34415, output 15374 (44.7%) filtered sequences.
## 923 sequences out of 43609 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GB8_contig_list_trimmed.fq
## Read in 43609, output 7529 (17.3%) filtered sequences.
## 120 sequences out of 29603 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GT10_contig_list_trimmed.fq
## Read in 29603, output 14713 (49.7%) filtered sequences.
## 183 sequences out of 29938 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GT5_contig_list_trimmed.fq
## Read in 29938, output 13889 (46.4%) filtered sequences.
## 304 sequences out of 49218 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GT8_contig_list_trimmed.fq
## Read in 49218, output 20759 (42.2%) filtered sequences.

As in the fecal samples, only a modest (~40%) proportion of sequences are passing the primer screen.

Filter the sequences and enforce minimum/maximum lengths appropriate for full-length 16S. Note that we are enforcing maxEE=1, as that was determined to be a better filter than maxEE=2 in the Zymo mock community data.

filt <- file.path(path, "filtered", basename(fn))
track <- filterAndTrim(nop, filt, maxEE=1, minLen=1400, maxLen=1600, verbose=TRUE)
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/CB28_contig_list_trimmed.fq
## Read in 15651, output 14327 (91.5%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/CB4_contig_list_trimmed.fq
## Read in 15374, output 14070 (91.5%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GB8_contig_list_trimmed.fq
## Read in 7529, output 6143 (81.6%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GT10_contig_list_trimmed.fq
## Read in 14713, output 13526 (91.9%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GT5_contig_list_trimmed.fq
## Read in 13889, output 12866 (92.6%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GT8_contig_list_trimmed.fq
## Read in 20759, output 18636 (89.8%) filtered sequences.

Final inspection of the quality profile:

plotQualityProfile(filt)

Final progress of reads through full-length primer enforcement and filtering/trimming:

cbind(raw=out[,1], full.len=out[,2], filtered=track[,2])
##                               raw full.len filtered
## CB28_contig_list_trimmed.fq 42342    15651    14327
## CB4_contig_list_trimmed.fq  34415    15374    14070
## GB8_contig_list_trimmed.fq  43609     7529     6143
## GT10_contig_list_trimmed.fq 29603    14713    13526
## GT5_contig_list_trimmed.fq  29938    13889    12866
## GT8_contig_list_trimmed.fq  49218    20759    18636

Few reads lost to quality filtering. Note that GB8 in particular had a large fraction (~80%) of reads lost due to a failure to detect both primers.

Denoising

Learn the error rates:

err.rds <- file.path(path.rds, "err_16S_Meat.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(err.rds)) {
  err <- learnErrors(filt, multi=TRUE, verbose=0)
  saveRDS(err, err.rds)
}
err <- readRDS(err.rds)
plotErrors(err, nominalQ=TRUE)

Denoise the filtered data into ASVs using current LoopSeq appropriate “high-sensitivity” settings:

dd <- dada(filt, err, OMEGA_A=1e-10, DETECT_SINGLETONS=TRUE, multi=TRUE, verbose=0)
dd
## $CB28_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 388 sequence variants were inferred from 2044 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $CB4_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 673 sequence variants were inferred from 2960 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $GB8_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 281 sequence variants were inferred from 1167 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $GT10_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 448 sequence variants were inferred from 2144 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $GT5_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 563 sequence variants were inferred from 2411 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $GT8_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 318 sequence variants were inferred from 2552 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16

Make sequence table and remove chimeras:

sta <- makeSequenceTable(dd)
st <- removeBimeraDenovo(sta, minFoldParentOverAbundance=4.5, multi=TRUE, verbose=TRUE)
## Identified 188 bimeras out of 2380 input sequences.

Assign taxonomy:

tax.rds <- file.path(path.rds, "tax_16S_Meat.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(tax.rds)) {
  tax <- assignTaxonomy(st, "~/tax/silva_nr_v132_train_set.fa.gz", minBoot=80, multi=TRUE)
  saveRDS(tax, tax.rds)
}
tax <- readRDS(tax.rds)
if(!identical(getSequences(tax), getSequences(st))) stop("Taxonomy mismatch.")

Species-level assignment of CDC foodborne pathogens

Here we are going to focus on 6 species of foodborne pathogens that are known to commonly occur in retail meat according to the CDC:

Campylobacter: https://www.cdc.gov/campylobacter/faq.html Clostridium perfringens: https://www.cdc.gov/foodsafety/diseases/clostridium-perfringens.html (some strains) E. coli: https://www.cdc.gov/ecoli/ (STEC, EHEC, ETEC, Diarrheagenic types) Listeria monocytogenes: https://www.cdc.gov/listeria/index.html Salmonella enterica: https://www.cdc.gov/salmonella/ (some serotypes, esp Enteritidis, Typhimurium, Newport, and Javiana) Yersinia enterocolitica: https://www.cdc.gov/yersinia/

First we’ll look for the presence of these bacterial genera in our samples (which to this point have only been assigned taxonomy down to the genus level).

genera.path <- c("Campylobacter",
                 "Clostridium_sensu_stricto_1", # The Silva 132 genus of C. perfringens
                 "Escherichia/Shigella",
                 "Listeria",
                 "Salmonella",
                 "Yersinia")
genera.path %in% tax[,"Genus"]
## [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Campylobacter and Listeria were not detected in our data. We would like to make species-level assignments for the four pathogenic genera that were in our data, to determine if the pathogen species is there and not a congeneric. This will be done by hand via manual inspection of BLAST searches of the relevant ASVs, so as to avoid any questions about how effectively current species-level assignment tools perform.

Yersinia (enterocolitica)

sq.yers <- getSequences(st)[tax[,"Genus"] %in% "Yersinia"]
names(sq.yers) <- paste0("Yersinia", seq_along(sq.yers))
unname(colSums(st[,sq.yers]))
##   [1] 2326 2286  823  321  228  210  165  161  141  139  131  119  117   33   32
##  [16]   25   24   22   19   13   11   10   10    9    9    8    7    5    5    4
##  [31]    4    4    3    3    3    3    3    2    2    2    2    2    2    2    2
##  [46]    2    2    2    2    2    2    2    1    1    1    1    1    1    1    1
##  [61]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##  [76]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##  [91]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [106]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [121]    1    1    1    1

The abundance of Yersinia ASVs drops off quickly. For convenience we’ll just do by-hand species assignments for the top 32 (i.e. all Yersinia ASVs with at least an abundance of 4 across all samples).

## dada2:::pfasta(sq.yers[1:32])
## BLAST against nt
spec.yers <- c("enterocolitica", # 100
               "enterocolitica",
               "enterocolitica",
               "intermedia", # 100
               "intermedia", # 100
               "intermedia",
               "intermedia",
               "intermedia",
               "kristensenii", # 100
               "intermedia", # 100  -- 10th seq
               "intermedia", # 100
               "intermedia",
               "intermedia",
               "intermedia", #100
               "intermedia",
               "intermedia",
               "intermedia",
               "intermedia", #100
               "enterocolitica", # 100
               "frederiksenii", # -- 20th seq
               "enterocolitica",
               "enterocolitica",
               "enterocolitica",
               "frederiksenii", # 10-off
               "intermedia",
               "enterocolitica", # end of query not included (5% of read)
               "kristensenii",
               "enterocolitica",
               "intermedia", # 8-off of intermedia/kristensenii/aldovae, but cooccur patterns say intermedia
               "enterocolitica", # -- 30th seq
               "enterocolitica",
               "frederiksenii")
names(spec.yers) <- sq.yers[1:length(spec.yers)]

All Yersinia ASVs were able to be relatively unambiguously assigned to a specific species, with the arguable exception of Yerisinia29, which was resolved by looking at the co-occurence patterns between that ASV and other unambiguously assigned Yerisina ASVs.

Y. enterocolitica is the main foodborne pathogen of concern here, the other Yersinias are not or very rarely associated with disease in humans.

Salmonella

sq.salm <- getSequences(tax)[tax[,"Genus"] %in% "Salmonella"]
uncolname <- function(x) { y <- x; colnames(y) <- NULL; y }
uncolname(st[,sq.salm])
##                             [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## CB28_contig_list_trimmed.fq    0    0    0    0    0    0    0
## CB4_contig_list_trimmed.fq     0    0    0    0    0    0    0
## GB8_contig_list_trimmed.fq     0    0    0    0    0    0    0
## GT10_contig_list_trimmed.fq   28   27   26   23   22   19   19
## GT5_contig_list_trimmed.fq     0    0    0    0    0    0    0
## GT8_contig_list_trimmed.fq     0    0    0    0    0    0    0

Interesting, Salmonella only shows up in one sample, and in very even counts across 7 ASVs. That is consistent with the 7 rrn copies that Salmonella has according to rrndb. So, what is that strain?

## dada2:::pfasta(sq.salm)
## BLAST against nt
spec.salm <- c("enterica",
               "enterica",
               "enterica",
               "enterica",
               "enterica",
               "enterica",
               "enterica")
names(spec.salm) <- sq.salm

All Salmonella ASVs are unambiguously S. enterica subsp. enterica.

Within that (sub-)species, some ASVs (e.g. 1 and 2) are not very discriminative, but Salmonella enterica subsp. enterica serovar Newport strain CFSAN003387, complete genome (https://www.ncbi.nlm.nih.gov/nucleotide/CP016014.1?report=genbank&log$=nucltop&blast_rank=37&RID=9CCPUEXF014) is the best match (usually tied with many others) for every single ASV. All are exact matches except ASV4 (2-off). ASV6 is the key intra-species disciminator, as this Newport accession is the only exact match to that ASV.

Clostridium (perfringens)

sq.Css1 <- getSequences(st)[tax[,"Genus"] %in% "Clostridium_sensu_stricto_1"]
uncolname(st[,sq.Css1])
##                             [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CB28_contig_list_trimmed.fq    0    0    0    0    0    0    0    0    0     0
## CB4_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GB8_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GT10_contig_list_trimmed.fq  251  148  125  125  118  118  108  105   97     3
## GT5_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GT8_contig_list_trimmed.fq     0    0    0    0    0    0    3    0    0     0
##                             [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## CB28_contig_list_trimmed.fq     0     1     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq     0     0     1     1     1     1     1     1
## GT5_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT8_contig_list_trimmed.fq      2     0     0     0     0     0     0     0
##                             [,19] [,20] [,21] [,22]
## CB28_contig_list_trimmed.fq     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0
## GT10_contig_list_trimmed.fq     1     1     1     0
## GT5_contig_list_trimmed.fq      0     0     0     0
## GT8_contig_list_trimmed.fq      0     0     0     1

This genus only shows up in sample GT10 with any frequency, but also shows up at low numbers in CB28 and GT8.

## dada2:::pfasta(sq.Css1) # C. perfringens group
## BLAST against nt
spec.Css1 <- c("perfringens", # 100
"perfringens",
"perfringens",
"perfringens", # 100
"perfringens",
"perfringens",
"perfringens",
"perfringens",
"perfringens",
"septicum", # SEQ 10
"perfringens",
"perfringens",
"septicum",
"septicum",
"perfringens",
"perfringens",
"perfringens", # 23-off
"perfringens",
"perfringens",
"perfringens", # SEQ 20
"septicum",
"perfringens")
names(spec.Css1) <- sq.Css1

Most of the ASVs are unambiguously assigned to C. perfringens, but some are clearly the related C. septicum (not a pathogen concern).

Escherichia/Shigella coli

sq.ec<- getSequences(tax)[tax[,"Genus"] %in% "Escherichia/Shigella"]
unname(colSums(st[,sq.ec]))
##   [1] 1408 1107 1071  822  757  713  607  520  409  407  401  368  280  260  245
##  [16]  238  233  230  219  215  209  206  170  101   77   71   70   62   59   43
##  [31]   34   34   32   31   31   24   24   24   16   15   14   13   13   12   11
##  [46]   11   10    9    9    8    8    8    8    7    6    6    6    6    6    5
##  [61]    5    5    5    4    4    4    4    4    3    3    3    3    3    3    3
##  [76]    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
##  [91]    2    2    2    2    2    2    2    2    2    2    2    2    2    1    1
## [106]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [121]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [136]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [151]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [166]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [181]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [196]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [211]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [226]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [241]    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## [256]    1

The abundance of Escherichia/Shigella ASVs drops off quickly. Using the same standard as before, we’ll take a look at the first 68 ASVs, representing all ASVs with at least 4 reads study-wide.

## dada2:::pfasta(sq.ec[1:68])
## BLAST against nt
# First 60 all e. coli except 10/39/41/42/51/54/55/68, Shigella also i top hits
# 66: E.coli and E. fergusonii in top hits

The results here require some quick unpacking. First, the vast majority of these ASVs have E. coli and only E. coli among their top hits and can be unambiguously assigned to the E. coli species. Sequences 10, 39, 41, 42, 51, 54, 55 and 68 have both E. coli and some species of Shigella among their top hits. However, given the (in)famous status of Shigella as a parphyletic genus entirely contained within the E. coli species, and that this well-known inconsistency only remains due to the historical and clinical usage of those names, we simply consider these as E. coli here. Finally, a single sequences 66 has both E. coli and E. fergusonii in its top hits, thus cannot be unambiguously assigned to the species level.

spec.ec <- rep("coli", 68)
spec.ec[66] <- NA
names(spec.ec) <- sq.ec[1:68]
uncolname(st[,sq.ec[1:68]])
##                             [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CB28_contig_list_trimmed.fq    0    0    0    0    0    0    0    0    0     0
## CB4_contig_list_trimmed.fq     0    0    0    0    0    0    0    1    0     0
## GB8_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GT10_contig_list_trimmed.fq   55  783  500  254    0  510  581    0  296     6
## GT5_contig_list_trimmed.fq     9    0    5    0    0    0   26    0    0    10
## GT8_contig_list_trimmed.fq  1344  324  566  568  757  203    0  519  113   391
##                             [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq   295     0   280   260   245   233   233   221
## GT5_contig_list_trimmed.fq      1     0     0     0     0     5     0     9
## GT8_contig_list_trimmed.fq    105   368     0     0     0     0     0     0
##                             [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq     0   215   205   206   152   101    77    69
## GT5_contig_list_trimmed.fq      0     0     4     0    18     0     0     2
## GT8_contig_list_trimmed.fq    219     0     0     0     0     0     0     0
##                             [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq    63    62    59    43    34    34    32    31
## GT5_contig_list_trimmed.fq      7     0     0     0     0     0     0     0
## GT8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
##                             [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq    31    24    24    24    16    15    14    13
## GT5_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
##                             [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq    13     0    11    11    10     9     9     8
## GT5_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT8_contig_list_trimmed.fq      0    12     0     0     0     0     0     0
##                             [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq     8     8     8     0     6     6     0     0
## GT5_contig_list_trimmed.fq      0     0     0     7     0     0     6     6
## GT8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
##                             [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66]
## CB28_contig_list_trimmed.fq     0     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq     0     5     5     0     0     0     0     0
## GT5_contig_list_trimmed.fq      0     0     0     5     0     4     0     0
## GT8_contig_list_trimmed.fq      6     0     0     0     5     0     4     4
##                             [,67] [,68]
## CB28_contig_list_trimmed.fq     0     0
## CB4_contig_list_trimmed.fq      0     0
## GB8_contig_list_trimmed.fq      0     0
## GT10_contig_list_trimmed.fq     0     0
## GT5_contig_list_trimmed.fq      0     0
## GT8_contig_list_trimmed.fq      4     4

Essentially no E. coli in the first 3 samples. E. coli abundance is high in GT10 and GT8, however, with the numbers of ASVs present in each suggesting multiple E. coli strains present in each sample (E. coli has 7 copies of the 16S rRNA gene, so at most 7 alleles per strain).

Figures

First let’s plot the relative abundances and identities of the CDC key foodborne pathogens in these 6 samples: Yersinia enterolitica, E. coli (Escherichia/Shigella), Salmonella enterica, Clostridium perfringens (Clostridium_sensu_stricto_1), Campylobacter and Listeria monocytogenes.

tt <- tax
tt <- cbind(tt, Species=NA)
tt[names(spec.yers),"Species"] <- spec.yers
tt[names(spec.salm),"Species"] <- spec.salm
tt[names(spec.Css1),"Species"] <- spec.Css1
tt[names(spec.ec),"Species"] <- spec.ec
table(tt[,"Species"], useNA="ifany")
## 
##           coli       enterica enterocolitica  frederiksenii     intermedia 
##             67              7             11              3             16 
##   kristensenii    perfringens       septicum           <NA> 
##              2             18              4           2064

OK, so we have our taxonomy table with species defined for all the genera containing the 6 species of interest, only 4 of which are present. Let’s make a relative abundance barplot showing their relative abundances of these CDC species-of-interest in these samples:

ft <- sweep(st, 1, rowSums(st), "/")
spec.nice <- c("enterocolitica"="Y. enterocolitica", 
               "perfringens"="C. perfringens", 
               "enterica"="S. enterica", 
               "coli"="E. coli", 
               "monocytogenes"="L. monocytogenes", 
               "Campylobacter"="Campylobacter sp.")
tt[!tt[,"Species"] %in% names(spec.nice),"Species"] <- NA # Drop non-pathogen species
tt[!is.na(tt[,"Species"]), "Species"] <- spec.nice[tt[!is.na(tt[,"Species"]), "Species"]]
suppressWarnings(ftp <- t(rowsum(t(ft), group=tt[,"Species"]))) 
# Throws a warning because of the NAs in species, this is OK
ftp <- ftp[,colnames(ftp) %in% spec.nice]
ftp
##                             C. perfringens      E. coli S. enterica
## CB28_contig_list_trimmed.fq   7.039775e-05 0.000000e+00      0.0000
## CB4_contig_list_trimmed.fq    0.000000e+00 7.273256e-05      0.0000
## GB8_contig_list_trimmed.fq    0.000000e+00 0.000000e+00      0.0000
## GT10_contig_list_trimmed.fq   9.153963e-02 4.887957e-01      0.0125
## GT5_contig_list_trimmed.fq    0.000000e+00 9.846740e-03      0.0000
## GT8_contig_list_trimmed.fq    3.525264e-04 3.238543e-01      0.0000
##                             Y. enterocolitica
## CB28_contig_list_trimmed.fq       0.379232665
## CB4_contig_list_trimmed.fq        0.005600407
## GB8_contig_list_trimmed.fq        0.000000000
## GT10_contig_list_trimmed.fq       0.000000000
## GT5_contig_list_trimmed.fq        0.003335186
## GT8_contig_list_trimmed.fq        0.000000000
dfp <- cbind(data.frame(Sample=substr(rownames(ftp), 1, 3), stringsAsFactors=FALSE), 
             data.frame(ftp, stringsAsFactors=FALSE))
dfpm <- melt(dfp, id.vars="Sample", variable.name="Species", value.name="Proportion") ### WARNING: Auto-converts the variable name to a FACTOR
dfpm$Species <- as.character(dfpm$Species)
dfpm$Species <- gsub("[.]{2}", ". ", dfpm$Species)
dfpm$Species <- factor(dfpm$Species, levels=spec.nice)
fill.scale.path <- scale_fill_manual(values=c("C. perfringens"="#1b9e77", "E. coli"="#d95f02", "S. enterica"="#7570b3", 
                                              "Y. enterocolitica"="#e7298a", "L. monocytogenes"="tomato4", "Campylobacter sp."="#e6ab02"),
                                     name="Pathogenic\nspecies", drop=FALSE)
ppp <- ggplot(data=dfpm, aes(x=Sample, y=Proportion, fill=Species)) + geom_col(width=0.4) + ylim(0,1) + 
  fill.scale.path + theme_bw() + theme(panel.grid=element_blank()) + theme(legend.text=element_text(face="italic"))
ppp

ggsave(file.path(path.fig, "Meat_PathogenProportions.pdf"), ppp, width=6, height=2.5, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Meat_PathogenProportions.png"), ppp, width=6, height=2.5, units="in")

Let’s return to C. perfringens as an example of high-res full complement “ribotyping” that is possible with this method. Recall the distribution of C. perfringens ASVs:

sq.Cperf <- sq.Css1[spec.Css1 %in% "perfringens"]
uncolname(st[,sq.Cperf])
##                             [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CB28_contig_list_trimmed.fq    0    0    0    0    0    0    0    0    0     0
## CB4_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GB8_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GT10_contig_list_trimmed.fq  251  148  125  125  118  118  108  105   97     0
## GT5_contig_list_trimmed.fq     0    0    0    0    0    0    0    0    0     0
## GT8_contig_list_trimmed.fq     0    0    0    0    0    0    3    0    0     2
##                             [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## CB28_contig_list_trimmed.fq     1     0     0     0     0     0     0     0
## CB4_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GB8_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT10_contig_list_trimmed.fq     0     1     1     1     1     1     1     0
## GT5_contig_list_trimmed.fq      0     0     0     0     0     0     0     0
## GT8_contig_list_trimmed.fq      0     0     0     0     0     0     0     1

We are interested ine visualizing the “ribotype” of the strain in Sample GT10 that spans C. perfringens ASVs 1-9, with a 2:1(x8) ratio of alleles consistent with the 10 copies of the 16S rRNA gene in C. perfringens genomes.

sq.Cpstrain <- sq.Cperf[1:9]
dna.Cpstrain <- DNAStringSet(sq.Cpstrain)
rna.Cpstrain <- DNAStringSet(dna.Cpstrain)
aln <- AlignSeqs(rna.Cpstrain) # align with RNA secondary structure
## Determining distance matrix based on shared 9-mers:
## ================================================================================
## 
## Time difference of 0 secs
## 
## Clustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Aligning Sequences:
## ================================================================================
## 
## Time difference of 0.04 secs
## 
## Iteration 1 of 2:
## 
## Determining distance matrix based on alignment:
## ================================================================================
## 
## Time difference of 0 secs
## 
## Reclustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0 secs
## 
## Realigning Sequences:
## ================================================================================
## 
## Time difference of 0.04 secs
## 
## Alignment converged - skipping remaining iteration.
names(aln) <- paste0("Cperf", seq_along(sq.Cpstrain))
aln
## DNAStringSet object of length 9:
##     width seq                                               names               
## [1]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf1
## [2]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf2
## [3]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf3
## [4]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf4
## [5]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf5
## [6]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf6
## [7]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf7
## [8]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf8
## [9]  1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf9

Scannings 100nt chunks to see where some diffs are for the publication plot:

lst <- min(nchar(aln))
for(start in seq(1, 1500, 100)) {
  print(ggmsa(aln, start, min(lst, start+99), font = NULL, color = 'Shapely_NT'))
}

Almost all the diffs are in V1/V2 it seems (about postiions 50-75 and 150-190 in these sequences, which start at ~E. coli coordinate position 27, which overlap the V1 and V2 E. coli positions according to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2562909/).

Maybe just plot V2 using those coords:

pmsa <- ggmsa(aln, 137-27, 242-27, font = NULL, color = 'Shapely_NT') + 
  ggtitle(expression(italic("C. perfringens")~"alleles, V2 hypervariable region")) +
  theme(axis.text.y=element_blank()) + theme(plot.title = element_text(hjust = 0.5))
pmsa

ggsave(file.path(path.fig, "Meat_CPerf_MSA.pdf"), pmsa, width=6, height=1.2, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Meat_CPerf_MSA.png"), pmsa, width=6, height=1.2, units="in")

Add the abundances as a column to the right:

df.ab <- data.frame(Name=factor(names(aln), levels=sort(names(aln), decreasing=TRUE)), 
                    Sequence=sq.Cpstrain, 
                    Abundance=st["GT10_contig_list_trimmed.fq", sq.Cpstrain])
pab <- ggplot(df.ab, aes(x=Abundance, y=Name)) +
  geom_col() +
  theme_minimal() +
  theme(panel.grid=element_blank()) +
  theme(axis.text=element_blank()) +
  ylab("") + xlab("Relative\nAbundance")
pab

ggsave(file.path(path.fig, "Meat_CPerf_MSA_ab.pdf"), pab, width=1, height=1.2, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Meat_CPerf_MSA_ab.png"), pab, width=1, height=1.2, units="in")