Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), and read in filenames:
## [1] '1.18.0'
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## /Users/bcallah/LoopData/16S/CallahanMeat/CB28_contig_list_trimmed.fq
## 42342
## /Users/bcallah/LoopData/16S/CallahanMeat/CB4_contig_list_trimmed.fq
## 34415
## /Users/bcallah/LoopData/16S/CallahanMeat/GB8_contig_list_trimmed.fq
## 43609
## /Users/bcallah/LoopData/16S/CallahanMeat/GT10_contig_list_trimmed.fq
## 29603
## /Users/bcallah/LoopData/16S/CallahanMeat/GT5_contig_list_trimmed.fq
## 29938
## /Users/bcallah/LoopData/16S/CallahanMeat/GT8_contig_list_trimmed.fq
## 49218
First do basic QA on these data, and then apply the DADA2 filtering and trimming workflow with the parameters selected from the detailed inspection of the Zymo mock community. Note that this data was generated alongside the Fecal data, so all parameters used there are expected to translate.
plotComplexity(fn)
plotQualityProfile(fn)
Remove the primers (and any flanking sequence) from the reads, and filter out reads that don’t contain both primers:
FWD <- "AGAGTTTGATCMTGGC" # Loop 16S forward primer
REV <- "TACCTTGTTACGACTT" # Loop 16S reverse primer
nop <- file.path(path, "nop", basename(fn))
out <- removePrimers(fn, nop, FWD, rc(REV), verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## 276 sequences out of 42342 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/CB28_contig_list_trimmed.fq
## Read in 42342, output 15651 (37%) filtered sequences.
## 198 sequences out of 34415 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/CB4_contig_list_trimmed.fq
## Read in 34415, output 15374 (44.7%) filtered sequences.
## 923 sequences out of 43609 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GB8_contig_list_trimmed.fq
## Read in 43609, output 7529 (17.3%) filtered sequences.
## 120 sequences out of 29603 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GT10_contig_list_trimmed.fq
## Read in 29603, output 14713 (49.7%) filtered sequences.
## 183 sequences out of 29938 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GT5_contig_list_trimmed.fq
## Read in 29938, output 13889 (46.4%) filtered sequences.
## 304 sequences out of 49218 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/nop/GT8_contig_list_trimmed.fq
## Read in 49218, output 20759 (42.2%) filtered sequences.
As in the fecal samples, only a modest (~40%) proportion of sequences are passing the primer screen.
Filter the sequences and enforce minimum/maximum lengths appropriate for full-length 16S. Note that we are enforcing maxEE=1
, as that was determined to be a better filter than maxEE=2
in the Zymo mock community data.
filt <- file.path(path, "filtered", basename(fn))
track <- filterAndTrim(nop, filt, maxEE=1, minLen=1400, maxLen=1600, verbose=TRUE)
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/CB28_contig_list_trimmed.fq
## Read in 15651, output 14327 (91.5%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/CB4_contig_list_trimmed.fq
## Read in 15374, output 14070 (91.5%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GB8_contig_list_trimmed.fq
## Read in 7529, output 6143 (81.6%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GT10_contig_list_trimmed.fq
## Read in 14713, output 13526 (91.9%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GT5_contig_list_trimmed.fq
## Read in 13889, output 12866 (92.6%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanMeat/filtered/GT8_contig_list_trimmed.fq
## Read in 20759, output 18636 (89.8%) filtered sequences.
Final inspection of the quality profile:
plotQualityProfile(filt)
Final progress of reads through full-length primer enforcement and filtering/trimming:
cbind(raw=out[,1], full.len=out[,2], filtered=track[,2])
## raw full.len filtered
## CB28_contig_list_trimmed.fq 42342 15651 14327
## CB4_contig_list_trimmed.fq 34415 15374 14070
## GB8_contig_list_trimmed.fq 43609 7529 6143
## GT10_contig_list_trimmed.fq 29603 14713 13526
## GT5_contig_list_trimmed.fq 29938 13889 12866
## GT8_contig_list_trimmed.fq 49218 20759 18636
Few reads lost to quality filtering. Note that GB8 in particular had a large fraction (~80%) of reads lost due to a failure to detect both primers.
Learn the error rates:
err.rds <- file.path(path.rds, "err_16S_Meat.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(err.rds)) {
err <- learnErrors(filt, multi=TRUE, verbose=0)
saveRDS(err, err.rds)
}
err <- readRDS(err.rds)
plotErrors(err, nominalQ=TRUE)
Denoise the filtered data into ASVs using current LoopSeq appropriate “high-sensitivity” settings:
dd <- dada(filt, err, OMEGA_A=1e-10, DETECT_SINGLETONS=TRUE, multi=TRUE, verbose=0)
dd
## $CB28_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 388 sequence variants were inferred from 2044 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $CB4_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 673 sequence variants were inferred from 2960 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $GB8_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 281 sequence variants were inferred from 1167 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $GT10_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 448 sequence variants were inferred from 2144 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $GT5_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 563 sequence variants were inferred from 2411 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $GT8_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 318 sequence variants were inferred from 2552 input unique sequences.
## Key parameters: OMEGA_A = 1e-10, OMEGA_C = 1e-40, BAND_SIZE = 16
Make sequence table and remove chimeras:
sta <- makeSequenceTable(dd)
st <- removeBimeraDenovo(sta, minFoldParentOverAbundance=4.5, multi=TRUE, verbose=TRUE)
## Identified 188 bimeras out of 2380 input sequences.
Assign taxonomy:
tax.rds <- file.path(path.rds, "tax_16S_Meat.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(tax.rds)) {
tax <- assignTaxonomy(st, "~/tax/silva_nr_v132_train_set.fa.gz", minBoot=80, multi=TRUE)
saveRDS(tax, tax.rds)
}
tax <- readRDS(tax.rds)
if(!identical(getSequences(tax), getSequences(st))) stop("Taxonomy mismatch.")
Here we are going to focus on 6 species of foodborne pathogens that are known to commonly occur in retail meat according to the CDC:
Campylobacter: https://www.cdc.gov/campylobacter/faq.html Clostridium perfringens: https://www.cdc.gov/foodsafety/diseases/clostridium-perfringens.html (some strains) E. coli: https://www.cdc.gov/ecoli/ (STEC, EHEC, ETEC, Diarrheagenic types) Listeria monocytogenes: https://www.cdc.gov/listeria/index.html Salmonella enterica: https://www.cdc.gov/salmonella/ (some serotypes, esp Enteritidis, Typhimurium, Newport, and Javiana) Yersinia enterocolitica: https://www.cdc.gov/yersinia/
First we’ll look for the presence of these bacterial genera in our samples (which to this point have only been assigned taxonomy down to the genus level).
genera.path <- c("Campylobacter",
"Clostridium_sensu_stricto_1", # The Silva 132 genus of C. perfringens
"Escherichia/Shigella",
"Listeria",
"Salmonella",
"Yersinia")
genera.path %in% tax[,"Genus"]
## [1] FALSE TRUE TRUE FALSE TRUE TRUE
Campylobacter and Listeria were not detected in our data. We would like to make species-level assignments for the four pathogenic genera that were in our data, to determine if the pathogen species is there and not a congeneric. This will be done by hand via manual inspection of BLAST searches of the relevant ASVs, so as to avoid any questions about how effectively current species-level assignment tools perform.
sq.yers <- getSequences(st)[tax[,"Genus"] %in% "Yersinia"]
names(sq.yers) <- paste0("Yersinia", seq_along(sq.yers))
unname(colSums(st[,sq.yers]))
## [1] 2326 2286 823 321 228 210 165 161 141 139 131 119 117 33 32
## [16] 25 24 22 19 13 11 10 10 9 9 8 7 5 5 4
## [31] 4 4 3 3 3 3 3 2 2 2 2 2 2 2 2
## [46] 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
## [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [91] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [121] 1 1 1 1
The abundance of Yersinia ASVs drops off quickly. For convenience we’ll just do by-hand species assignments for the top 32 (i.e. all Yersinia ASVs with at least an abundance of 4 across all samples).
## dada2:::pfasta(sq.yers[1:32])
## BLAST against nt
spec.yers <- c("enterocolitica", # 100
"enterocolitica",
"enterocolitica",
"intermedia", # 100
"intermedia", # 100
"intermedia",
"intermedia",
"intermedia",
"kristensenii", # 100
"intermedia", # 100 -- 10th seq
"intermedia", # 100
"intermedia",
"intermedia",
"intermedia", #100
"intermedia",
"intermedia",
"intermedia",
"intermedia", #100
"enterocolitica", # 100
"frederiksenii", # -- 20th seq
"enterocolitica",
"enterocolitica",
"enterocolitica",
"frederiksenii", # 10-off
"intermedia",
"enterocolitica", # end of query not included (5% of read)
"kristensenii",
"enterocolitica",
"intermedia", # 8-off of intermedia/kristensenii/aldovae, but cooccur patterns say intermedia
"enterocolitica", # -- 30th seq
"enterocolitica",
"frederiksenii")
names(spec.yers) <- sq.yers[1:length(spec.yers)]
All Yersinia ASVs were able to be relatively unambiguously assigned to a specific species, with the arguable exception of Yerisinia29, which was resolved by looking at the co-occurence patterns between that ASV and other unambiguously assigned Yerisina ASVs.
Y. enterocolitica is the main foodborne pathogen of concern here, the other Yersinias are not or very rarely associated with disease in humans.
sq.salm <- getSequences(tax)[tax[,"Genus"] %in% "Salmonella"]
uncolname <- function(x) { y <- x; colnames(y) <- NULL; y }
uncolname(st[,sq.salm])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 28 27 26 23 22 19 19
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 0
Interesting, Salmonella only shows up in one sample, and in very even counts across 7 ASVs. That is consistent with the 7 rrn copies that Salmonella has according to rrndb. So, what is that strain?
## dada2:::pfasta(sq.salm)
## BLAST against nt
spec.salm <- c("enterica",
"enterica",
"enterica",
"enterica",
"enterica",
"enterica",
"enterica")
names(spec.salm) <- sq.salm
All Salmonella ASVs are unambiguously S. enterica subsp. enterica.
Within that (sub-)species, some ASVs (e.g. 1 and 2) are not very discriminative, but Salmonella enterica subsp. enterica serovar Newport strain CFSAN003387, complete genome (https://www.ncbi.nlm.nih.gov/nucleotide/CP016014.1?report=genbank&log$=nucltop&blast_rank=37&RID=9CCPUEXF014) is the best match (usually tied with many others) for every single ASV. All are exact matches except ASV4 (2-off). ASV6 is the key intra-species disciminator, as this Newport accession is the only exact match to that ASV.
sq.Css1 <- getSequences(st)[tax[,"Genus"] %in% "Clostridium_sensu_stricto_1"]
uncolname(st[,sq.Css1])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 251 148 125 125 118 118 108 105 97 3
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 3 0 0 0
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## CB28_contig_list_trimmed.fq 0 1 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 0 0 1 1 1 1 1 1
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 2 0 0 0 0 0 0 0
## [,19] [,20] [,21] [,22]
## CB28_contig_list_trimmed.fq 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0
## GT10_contig_list_trimmed.fq 1 1 1 0
## GT5_contig_list_trimmed.fq 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 1
This genus only shows up in sample GT10 with any frequency, but also shows up at low numbers in CB28 and GT8.
## dada2:::pfasta(sq.Css1) # C. perfringens group
## BLAST against nt
spec.Css1 <- c("perfringens", # 100
"perfringens",
"perfringens",
"perfringens", # 100
"perfringens",
"perfringens",
"perfringens",
"perfringens",
"perfringens",
"septicum", # SEQ 10
"perfringens",
"perfringens",
"septicum",
"septicum",
"perfringens",
"perfringens",
"perfringens", # 23-off
"perfringens",
"perfringens",
"perfringens", # SEQ 20
"septicum",
"perfringens")
names(spec.Css1) <- sq.Css1
Most of the ASVs are unambiguously assigned to C. perfringens, but some are clearly the related C. septicum (not a pathogen concern).
sq.ec<- getSequences(tax)[tax[,"Genus"] %in% "Escherichia/Shigella"]
unname(colSums(st[,sq.ec]))
## [1] 1408 1107 1071 822 757 713 607 520 409 407 401 368 280 260 245
## [16] 238 233 230 219 215 209 206 170 101 77 71 70 62 59 43
## [31] 34 34 32 31 31 24 24 24 16 15 14 13 13 12 11
## [46] 11 10 9 9 8 8 8 8 7 6 6 6 6 6 5
## [61] 5 5 5 4 4 4 4 4 3 3 3 3 3 3 3
## [76] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [91] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [136] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [151] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [166] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [181] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [196] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [226] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [241] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [256] 1
The abundance of Escherichia/Shigella ASVs drops off quickly. Using the same standard as before, we’ll take a look at the first 68 ASVs, representing all ASVs with at least 4 reads study-wide.
## dada2:::pfasta(sq.ec[1:68])
## BLAST against nt
# First 60 all e. coli except 10/39/41/42/51/54/55/68, Shigella also i top hits
# 66: E.coli and E. fergusonii in top hits
The results here require some quick unpacking. First, the vast majority of these ASVs have E. coli and only E. coli among their top hits and can be unambiguously assigned to the E. coli species. Sequences 10, 39, 41, 42, 51, 54, 55 and 68 have both E. coli and some species of Shigella among their top hits. However, given the (in)famous status of Shigella as a parphyletic genus entirely contained within the E. coli species, and that this well-known inconsistency only remains due to the historical and clinical usage of those names, we simply consider these as E. coli here. Finally, a single sequences 66 has both E. coli and E. fergusonii in its top hits, thus cannot be unambiguously assigned to the species level.
spec.ec <- rep("coli", 68)
spec.ec[66] <- NA
names(spec.ec) <- sq.ec[1:68]
uncolname(st[,sq.ec[1:68]])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 1 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 55 783 500 254 0 510 581 0 296 6
## GT5_contig_list_trimmed.fq 9 0 5 0 0 0 26 0 0 10
## GT8_contig_list_trimmed.fq 1344 324 566 568 757 203 0 519 113 391
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 295 0 280 260 245 233 233 221
## GT5_contig_list_trimmed.fq 1 0 0 0 0 5 0 9
## GT8_contig_list_trimmed.fq 105 368 0 0 0 0 0 0
## [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 0 215 205 206 152 101 77 69
## GT5_contig_list_trimmed.fq 0 0 4 0 18 0 0 2
## GT8_contig_list_trimmed.fq 219 0 0 0 0 0 0 0
## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 63 62 59 43 34 34 32 31
## GT5_contig_list_trimmed.fq 7 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 31 24 24 24 16 15 14 13
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 13 0 11 11 10 9 9 8
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 12 0 0 0 0 0 0
## [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 8 8 8 0 6 6 0 0
## GT5_contig_list_trimmed.fq 0 0 0 7 0 0 6 6
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 0 5 5 0 0 0 0 0
## GT5_contig_list_trimmed.fq 0 0 0 5 0 4 0 0
## GT8_contig_list_trimmed.fq 6 0 0 0 5 0 4 4
## [,67] [,68]
## CB28_contig_list_trimmed.fq 0 0
## CB4_contig_list_trimmed.fq 0 0
## GB8_contig_list_trimmed.fq 0 0
## GT10_contig_list_trimmed.fq 0 0
## GT5_contig_list_trimmed.fq 0 0
## GT8_contig_list_trimmed.fq 4 4
Essentially no E. coli in the first 3 samples. E. coli abundance is high in GT10 and GT8, however, with the numbers of ASVs present in each suggesting multiple E. coli strains present in each sample (E. coli has 7 copies of the 16S rRNA gene, so at most 7 alleles per strain).
First let’s plot the relative abundances and identities of the CDC key foodborne pathogens in these 6 samples: Yersinia enterolitica, E. coli (Escherichia/Shigella), Salmonella enterica, Clostridium perfringens (Clostridium_sensu_stricto_1), Campylobacter and Listeria monocytogenes.
tt <- tax
tt <- cbind(tt, Species=NA)
tt[names(spec.yers),"Species"] <- spec.yers
tt[names(spec.salm),"Species"] <- spec.salm
tt[names(spec.Css1),"Species"] <- spec.Css1
tt[names(spec.ec),"Species"] <- spec.ec
table(tt[,"Species"], useNA="ifany")
##
## coli enterica enterocolitica frederiksenii intermedia
## 67 7 11 3 16
## kristensenii perfringens septicum <NA>
## 2 18 4 2064
OK, so we have our taxonomy table with species defined for all the genera containing the 6 species of interest, only 4 of which are present. Let’s make a relative abundance barplot showing their relative abundances of these CDC species-of-interest in these samples:
ft <- sweep(st, 1, rowSums(st), "/")
spec.nice <- c("enterocolitica"="Y. enterocolitica",
"perfringens"="C. perfringens",
"enterica"="S. enterica",
"coli"="E. coli",
"monocytogenes"="L. monocytogenes",
"Campylobacter"="Campylobacter sp.")
tt[!tt[,"Species"] %in% names(spec.nice),"Species"] <- NA # Drop non-pathogen species
tt[!is.na(tt[,"Species"]), "Species"] <- spec.nice[tt[!is.na(tt[,"Species"]), "Species"]]
suppressWarnings(ftp <- t(rowsum(t(ft), group=tt[,"Species"])))
# Throws a warning because of the NAs in species, this is OK
ftp <- ftp[,colnames(ftp) %in% spec.nice]
ftp
## C. perfringens E. coli S. enterica
## CB28_contig_list_trimmed.fq 7.039775e-05 0.000000e+00 0.0000
## CB4_contig_list_trimmed.fq 0.000000e+00 7.273256e-05 0.0000
## GB8_contig_list_trimmed.fq 0.000000e+00 0.000000e+00 0.0000
## GT10_contig_list_trimmed.fq 9.153963e-02 4.887957e-01 0.0125
## GT5_contig_list_trimmed.fq 0.000000e+00 9.846740e-03 0.0000
## GT8_contig_list_trimmed.fq 3.525264e-04 3.238543e-01 0.0000
## Y. enterocolitica
## CB28_contig_list_trimmed.fq 0.379232665
## CB4_contig_list_trimmed.fq 0.005600407
## GB8_contig_list_trimmed.fq 0.000000000
## GT10_contig_list_trimmed.fq 0.000000000
## GT5_contig_list_trimmed.fq 0.003335186
## GT8_contig_list_trimmed.fq 0.000000000
dfp <- cbind(data.frame(Sample=substr(rownames(ftp), 1, 3), stringsAsFactors=FALSE),
data.frame(ftp, stringsAsFactors=FALSE))
dfpm <- melt(dfp, id.vars="Sample", variable.name="Species", value.name="Proportion") ### WARNING: Auto-converts the variable name to a FACTOR
dfpm$Species <- as.character(dfpm$Species)
dfpm$Species <- gsub("[.]{2}", ". ", dfpm$Species)
dfpm$Species <- factor(dfpm$Species, levels=spec.nice)
fill.scale.path <- scale_fill_manual(values=c("C. perfringens"="#1b9e77", "E. coli"="#d95f02", "S. enterica"="#7570b3",
"Y. enterocolitica"="#e7298a", "L. monocytogenes"="tomato4", "Campylobacter sp."="#e6ab02"),
name="Pathogenic\nspecies", drop=FALSE)
ppp <- ggplot(data=dfpm, aes(x=Sample, y=Proportion, fill=Species)) + geom_col(width=0.4) + ylim(0,1) +
fill.scale.path + theme_bw() + theme(panel.grid=element_blank()) + theme(legend.text=element_text(face="italic"))
ppp
ggsave(file.path(path.fig, "Meat_PathogenProportions.pdf"), ppp, width=6, height=2.5, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Meat_PathogenProportions.png"), ppp, width=6, height=2.5, units="in")
Let’s return to C. perfringens as an example of high-res full complement “ribotyping” that is possible with this method. Recall the distribution of C. perfringens ASVs:
sq.Cperf <- sq.Css1[spec.Css1 %in% "perfringens"]
uncolname(st[,sq.Cperf])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CB28_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 251 148 125 125 118 118 108 105 97 0
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 3 0 0 2
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## CB28_contig_list_trimmed.fq 1 0 0 0 0 0 0 0
## CB4_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GB8_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT10_contig_list_trimmed.fq 0 1 1 1 1 1 1 0
## GT5_contig_list_trimmed.fq 0 0 0 0 0 0 0 0
## GT8_contig_list_trimmed.fq 0 0 0 0 0 0 0 1
We are interested ine visualizing the “ribotype” of the strain in Sample GT10 that spans C. perfringens ASVs 1-9, with a 2:1(x8) ratio of alleles consistent with the 10 copies of the 16S rRNA gene in C. perfringens genomes.
sq.Cpstrain <- sq.Cperf[1:9]
dna.Cpstrain <- DNAStringSet(sq.Cpstrain)
rna.Cpstrain <- DNAStringSet(dna.Cpstrain)
aln <- AlignSeqs(rna.Cpstrain) # align with RNA secondary structure
## Determining distance matrix based on shared 9-mers:
## ================================================================================
##
## Time difference of 0 secs
##
## Clustering into groups by similarity:
## ================================================================================
##
## Time difference of 0.01 secs
##
## Aligning Sequences:
## ================================================================================
##
## Time difference of 0.04 secs
##
## Iteration 1 of 2:
##
## Determining distance matrix based on alignment:
## ================================================================================
##
## Time difference of 0 secs
##
## Reclustering into groups by similarity:
## ================================================================================
##
## Time difference of 0 secs
##
## Realigning Sequences:
## ================================================================================
##
## Time difference of 0.04 secs
##
## Alignment converged - skipping remaining iteration.
names(aln) <- paste0("Cperf", seq_along(sq.Cpstrain))
aln
## DNAStringSet object of length 9:
## width seq names
## [1] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf1
## [2] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf2
## [3] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf3
## [4] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf4
## [5] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf5
## [6] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf6
## [7] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf7
## [8] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf8
## [9] 1439 TCAGGATGAACGCTGGCGGCGTG...AGGTAGGGTCAGCGATTGGGGTG Cperf9
Scannings 100nt chunks to see where some diffs are for the publication plot:
lst <- min(nchar(aln))
for(start in seq(1, 1500, 100)) {
print(ggmsa(aln, start, min(lst, start+99), font = NULL, color = 'Shapely_NT'))
}
Almost all the diffs are in V1/V2 it seems (about postiions 50-75 and 150-190 in these sequences, which start at ~E. coli coordinate position 27, which overlap the V1 and V2 E. coli positions according to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2562909/).
Maybe just plot V2 using those coords:
pmsa <- ggmsa(aln, 137-27, 242-27, font = NULL, color = 'Shapely_NT') +
ggtitle(expression(italic("C. perfringens")~"alleles, V2 hypervariable region")) +
theme(axis.text.y=element_blank()) + theme(plot.title = element_text(hjust = 0.5))
pmsa
ggsave(file.path(path.fig, "Meat_CPerf_MSA.pdf"), pmsa, width=6, height=1.2, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Meat_CPerf_MSA.png"), pmsa, width=6, height=1.2, units="in")
Add the abundances as a column to the right:
df.ab <- data.frame(Name=factor(names(aln), levels=sort(names(aln), decreasing=TRUE)),
Sequence=sq.Cpstrain,
Abundance=st["GT10_contig_list_trimmed.fq", sq.Cpstrain])
pab <- ggplot(df.ab, aes(x=Abundance, y=Name)) +
geom_col() +
theme_minimal() +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
ylab("") + xlab("Relative\nAbundance")
pab
ggsave(file.path(path.fig, "Meat_CPerf_MSA_ab.pdf"), pab, width=1, height=1.2, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Meat_CPerf_MSA_ab.png"), pab, width=1, height=1.2, units="in")