Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), and read in filenames:
## [1] '1.18.0'
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## This is vegan 2.5-7
##
## Attaching package: 'ape'
## The following object is masked from 'package:ShortRead':
##
## zoom
## The following object is masked from 'package:Biostrings':
##
## complement
## /Users/bcallah/LoopData/16S/CallahanFecal/R3.1_contig_list_trimmed.fq
## 111798
## /Users/bcallah/LoopData/16S/CallahanFecal/R9.3_contig_list_trimmed.fq
## 23405
## /Users/bcallah/LoopData/16S/CallahanFecal/R9.4_contig_list_trimmed.fq
## 35471
First do basic QA on these data, and then apply the DADA2 filtering and trimming workflow with the parameters selected from the detailed inspection of the Zymo mock community.
plotComplexity(fn)
plotQualityProfile(fn)
These profiles look consistent with what we saw in the Zymo mock – no low complexity issues, and high qualities throughout the 16S sequence (i.e. ignoring the >1500nt tail). The one difference, especially in the R3.1 is that the length distribution is different, around a half of the reads in that library appear not to extend to the full-length of the 16S gene. This may be a common way in which quality variation in Loop libraries becomes evidence, as molecules that are sequenced to insufficient coverage are likely to end up being assembled into incomplete contigs that don’t cover the entire targeted amplicon.
Enter primers, and confirm their presence and the overall orientation of the reads. Code is adapated from the DADA2 ITS tutorial workflow:
FWD <- "AGAGTTTGATCMTGGC" # Loop 16S forward primer
REV <- "TACCTTGTTACGACTT" # Loop 16S reverse primer
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer)
orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
RevComp = Biostrings::reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
for(f in fn) {
cat("Primers detected in", basename(f), "\n")
print(rbind(FWD.Primer = sapply(allOrients(FWD), primerHits, fn = f),
REV.Primer = sapply(allOrients(REV), primerHits, fn = f)))
cat("Out of", length(readFastq(f)), "total reads.\n\n")
}
## Primers detected in R3.1_contig_list_trimmed.fq
## Forward Complement Reverse RevComp
## FWD.Primer 67942 0 0 2044
## REV.Primer 658 0 0 20771
## Out of 111798 total reads.
##
## Primers detected in R9.3_contig_list_trimmed.fq
## Forward Complement Reverse RevComp
## FWD.Primer 20954 0 0 125
## REV.Primer 59 0 0 11718
## Out of 23405 total reads.
##
## Primers detected in R9.4_contig_list_trimmed.fq
## Forward Complement Reverse RevComp
## FWD.Primer 30735 0 0 244
## REV.Primer 100 0 0 15714
## Out of 35471 total reads.
Primers in the expected orientations, and yes it looks like the R3.1 library will take a big hit from incomplete amplicon reconstruction, with the REV primer only being detected in <20% of the total reads, whereas the other libraries are closer to 50%. It is worth noting that all of these libraries have a much lower fraction of primer-complete reads than the Zymo mock community data had (~90%).
Remove the primers (and any flanking sequence) from the reads, and filter out reads that don’t contain both primers:
nop <- file.path(path, "nop", basename(fn))
out <- removePrimers(fn, nop, FWD, rc(REV), verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## 2139 sequences out of 111798 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/nop/R3.1_contig_list_trimmed.fq
## Read in 111798, output 18311 (16.4%) filtered sequences.
## 109 sequences out of 23405 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/nop/R9.3_contig_list_trimmed.fq
## Read in 23405, output 11563 (49.4%) filtered sequences.
## 223 sequences out of 35471 are being reverse-complemented.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/nop/R9.4_contig_list_trimmed.fq
## Read in 35471, output 15320 (43.2%) filtered sequences.
Filter the sequences and enforce minimum/maximum lengths appropriate for full-length 16S. Note that we are enforcing maxEE=1
, as that was determined to be a better filter than maxEE=2
in the Zymo mock community data.
filt <- file.path(path, "filtered", basename(fn))
track <- filterAndTrim(nop, filt, maxEE=1, minLen=1400, maxLen=1600, verbose=TRUE)
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/filtered/R3.1_contig_list_trimmed.fq
## Read in 18311, output 15209 (83.1%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/filtered/R9.3_contig_list_trimmed.fq
## Read in 11563, output 10642 (92%) filtered sequences.
## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/filtered/R9.4_contig_list_trimmed.fq
## Read in 15320, output 14152 (92.4%) filtered sequences.
Final inspection of the quality profile:
plotQualityProfile(filt)
Final progress of reads through filtering and trimming:
cbind(raw=out[,1], primers=out[,2], filtered=track[,2])
## raw primers filtered
## R3.1_contig_list_trimmed.fq 111798 18311 15209
## R9.3_contig_list_trimmed.fq 23405 11563 10642
## R9.4_contig_list_trimmed.fq 35471 15320 14152
The large loss of reads due to a lack of detectable primers, especially in the R3.1 library, is notable. It seems likely that optimizing the ratio of Loop molecules to total output reads will be an important part of maximizing throughput in the form of the most high-quality Loop reads per run.
Learn the error rates:
err.rds <- file.path(path.rds, "err_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(err.rds)) {
err <- learnErrors(filt, multi=TRUE, verbose=0)
saveRDS(err, err.rds)
}
err <- readRDS(err.rds)
plotErrors(err, nominalQ=TRUE)
Denoise the filtered data into ASVs using current DADA2 defaults:
dd <- dada(filt, err, multi=TRUE, verbose=0)
dd
## $R3.1_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 123 sequence variants were inferred from 2598 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $R9.3_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 385 sequence variants were inferred from 2498 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $R9.4_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 394 sequence variants were inferred from 2882 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
Make sequence table and remove chimeras:
sta <- makeSequenceTable(dd)
st <- removeBimeraDenovo(sta, minFoldParentOverAbundance=4.5, multi=TRUE, verbose=TRUE)
## Identified 33 bimeras out of 739 input sequences.
Assign taxonomy:
tax.rds <- file.path(path.rds, "tax_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(tax.rds)) {
tax <- assignTaxonomy(st, "~/tax/silva_nr_v132_train_set.fa.gz", minBoot=80, multi=TRUE)
saveRDS(tax, tax.rds)
}
tax <- readRDS(tax.rds)
if(!identical(getSequences(tax), getSequences(st))) stop("Taxonomy mismatch.")
table(tax[,"Phylum"], useNA="ifany")
##
## Actinobacteria Bacteroidetes Epsilonbacteraeota Firmicutes
## 12 53 1 583
## Fusobacteria Proteobacteria Verrucomicrobia
## 2 46 9
The usual suspects at the Phylum level at least.
Currently, PacBio is the gold standard for full-length 16S gene sequencing. Previously, we sequenced these same three fecal samples as part of our earlier investigation of the accuracy of PacBio full-length 16S sequencing. We processed those samples through the DADA2 workflow in that paper, and have included the sequence table obtained from a set of 12 fecal samples (including the 3 resequenced here using Loop full-length 16S) as part of this repository.
Importing PacBio results, and coordinating sample names between the Loop and PacBio results:
pb1.rds <- file.path(path.rds, "PacBio_Fecal_st1.rds")
st1.pb <- removeBimeraDenovo(readRDS(pb1.rds), minFoldParentOverAbundance=4.5,
multi=TRUE, verbose=TRUE)
## Identified 3 bimeras out of 1108 input sequences.
pb2.rds <- file.path(path.rds, "PacBio_Fecal_st2.rds")
st2.pb <- removeBimeraDenovo(readRDS(pb2.rds), minFoldParentOverAbundance=4.5,
multi=TRUE, verbose=TRUE)
## Identified 8 bimeras out of 1411 input sequences.
sam.names <- sapply(strsplit(rownames(st), "_"), `[`, 1) # e.g. R3.1_contig_list_trimmed.fq
rownames(st) <- sam.names; rownames(sta) <- sam.names
rownames(st1.pb) <- gsub("^Callahan_16S_2pM-Cell1_", "", rownames(st1.pb)) # Remove prefix
rownames(st1.pb) <- gsub("[.]fastq[.]gz$", "", rownames(st1.pb)) # Remove suffix
rownames(st1.pb) <- gsub("_", "", rownames(st1.pb)) # e.g. R_3.1
rownames(st2.pb) <- gsub("_", "", rownames(st2.pb)) # e.g. R_3.1
if(!all(rownames(st) %in% rownames(st1.pb))) stop("Sample name mismatch (1).")
if(!all(rownames(st) %in% rownames(st2.pb))) stop("Sample name mismatch (2).")
st1.pb <- st1.pb[rownames(st),]
st2.pb <- st2.pb[rownames(st),]
In the PacBio protocol, a slightly different set of primers is used:
FWD.pacb <- "AGRGTTYGATYMTGGCTCAG"
FWD.loop <- "AGAGTTTGATCMTGGC"
REV.pacb <- "RGYTACCTTGTTACGACTT"
REV.loop <- "TACCTTGTTACGACTT"
The result is that the Loop sequences have an additional 4 nucleotides at the start of the reads that are absent in the PacBio data. To remedy that, we will simply remove the first 4nts from all the Loop sequences:
sq.loop.full <- getSequences(st)
sq.loop.trunc <- substr(sq.loop.full, 5, nchar(sq.loop.full))
any(duplicated(sq.loop.trunc)) # FALSE
## [1] FALSE
if(mean(grepl("^TCAG", sq.loop.full)) < 0.95) stop("Trim issue, expected 4nt sequence not at start of these sequences.")
st.loop <- st; tax.loop <- tax
colnames(st.loop) <- sq.loop.trunc; rownames(tax.loop) <- sq.loop.trunc
As a first pass, looking at the overlap in ASVs and reads between Loop and PacBio full-length 16S data (replicate 2) on each sample:
setASV <- function(unq1, unq2, nms) {
unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
rval <- c(sum(!names(unq1) %in% names(unq2)),
sum(names(unq1) %in% names(unq2)),
sum(!names(unq2) %in% names(unq1)))
names(rval) <- c(nms[[1]], "Shared", nms[[2]])
rval
}
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
## Loop Shared PB1
## R3.1 40 68 9
## R9.3 113 262 74
## R9.4 172 214 48
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
## Loop Shared PB2
## R3.1 26 82 16
## R9.3 69 306 151
## R9.4 131 255 113
t(sapply(sam.names, function(nm) setASV(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
## PB1 Shared PB2
## R3.1 10 67 31
## R9.3 53 283 174
## R9.4 50 212 156
Half or more of the denoised ASVs are shared across the methods. How about on a per-read basis?
setReads <- function(unq1, unq2, nms) {
unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
rval <- c(sum(unq1[!names(unq1) %in% names(unq2)]),
sum(unq1[names(unq1) %in% names(unq2)]),
sum(unq2[names(unq2) %in% names(unq1)]),
sum(unq2[!names(unq2) %in% names(unq1)]))
names(rval) <- c(nms[[1]], paste0("Shared", nms[[1]]), paste0("Shared", nms[[2]]), nms[[2]])
rval
}
totReads <- function(st1, st2) { as.matrix(cbind(rowSums(st1), rowSums(st1), rowSums(st2), rowSums(st2))) }
tab.loop.pb1 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
tab.loop.pb1
## Loop SharedLoop SharedPB1 PB1
## R3.1 410 14364 14861 111
## R9.3 1139 8853 8960 769
## R9.4 1903 11659 7366 450
tab.loop.pb1/totReads(st.loop, st1.pb)
## Loop SharedLoop SharedPB1 PB1
## R3.1 0.02775146 0.9722485 0.9925862 0.007413839
## R9.3 0.11399119 0.8860088 0.9209580 0.079042039
## R9.4 0.14031854 0.8596815 0.9424258 0.057574207
tab.loop.pb2 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
tab.loop.pb2
## Loop SharedLoop SharedPB2 PB2
## R3.1 347 14427 25717 371
## R9.3 490 9502 16094 1827
## R9.4 1204 12358 11870 994
tab.loop.pb2/totReads(st.loop, st2.pb)
## Loop SharedLoop SharedPB2 PB2
## R3.1 0.02348721 0.9765128 0.9857789 0.0142211
## R9.3 0.04903923 0.9509608 0.8980526 0.1019474
## R9.4 0.08877747 0.9112225 0.9227301 0.0772699
tab.pb1.pb2 <- t(sapply(sam.names, function(nm) setReads(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
tab.pb1.pb2
## PB1 SharedPB1 SharedPB2 PB2
## R3.1 203 14769 25510 578
## R9.3 462 9267 15721 2200
## R9.4 556 7260 11109 1755
tab.pb1.pb2/totReads(st1.pb, st2.pb)
## PB1 SharedPB1 SharedPB2 PB2
## R3.1 0.01355864 0.9864414 0.9778442 0.02215578
## R9.3 0.04748689 0.9525131 0.8772390 0.12276101
## R9.4 0.07113613 0.9288639 0.8635728 0.13642724
A very high fraction of reads (>85%) occur in ASVs identified consistently by both methods. How do things look at the whole community level, when we ordinate these three samples?
Combine the sequence tables together:
st.loop.lab <- st.loop
rownames(st.loop.lab) <- gsub("$", ".loop", rownames(st.loop.lab))
st1.pb.lab <- st1.pb
rownames(st1.pb.lab) <- gsub("$", ".pb1", rownames(st1.pb.lab))
st2.pb.lab <- st2.pb
rownames(st2.pb.lab) <- gsub("$", ".pb2", rownames(st2.pb.lab))
st.both.lab <- mergeSequenceTables(st.loop.lab, st1.pb.lab, st2.pb.lab)
ft.both.lab <- sweep(st.both.lab, 1, rowSums(st.both.lab), "/")
Perform ordination:
bc <- vegdist(ft.both.lab, "bray")
mds <- pcoa(bc)
df <- data.frame(mds$vectors)
df$Sample <- substr(rownames(df), 1, 4)
df$Technology <- sapply(strsplit(rownames(df), "[.]"), `[`, 3)
gg_color_hue <- function(n) {
hues = seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100)[1:n]
}
clr3 <- gg_color_hue(3)
df$Technology <- c(loop="LoopSeq", pb1="PacBio CCS\n(S/P2-C2/5.0)\n", pb2="PacBio CCS\n(S/P3-C3/5.0)")[df$Technology]
tech.color.scale <- c("LoopSeq"=clr3[[1]], "PacBio CCS\n(S/P2-C2/5.0)\n"=clr3[[2]],
"PacBio CCS\n(S/P3-C3/5.0)"="darkgreen")
ppb <- ggplot(data=df, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() +
scale_color_manual(values=tech.color.scale)
ppb + geom_text(alpha=0.8)
labdf <- df[df$Technology=="LoopSeq",]
labdf$Axis.1 <- 0.8*labdf$Axis.1 # pull labels towards middle of plot
labdf$Axis.2 <- 0.8*labdf$Axis.2
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black")
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.pdf"),
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"),
width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.png"),
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"),
width=5, height=3, units="in")
They fall almost exactly on top of each other. Worth calling out that two of these samples were longitudinal samples from the same person too, so inter-subject and within-subject temporal variability is represented here, as well as the technical replication amongst the same technology (PacBio).
Comparing the measurement dissimilarities between Loop and PacBio and between the PacBio technical replicates:
as.matrix(bc)[df$Sample=="R3.1", df$Sample=="R3.1"]
## R3.1.loop R3.1.pb1 R3.1.pb2
## R3.1.loop 0.0000000 0.13655146 0.13894991
## R3.1.pb1 0.1365515 0.00000000 0.04260862
## R3.1.pb2 0.1389499 0.04260862 0.00000000
as.matrix(bc)[df$Sample=="R9.3", df$Sample=="R9.3"]
## R9.3.loop R9.3.pb1 R9.3.pb2
## R9.3.loop 0.0000000 0.1974966 0.1773402
## R9.3.pb1 0.1974966 0.0000000 0.1733080
## R9.3.pb2 0.1773402 0.1733080 0.0000000
as.matrix(bc)[df$Sample=="R9.4", df$Sample=="R9.4"]
## R9.4.loop R9.4.pb1 R9.4.pb2
## R9.4.loop 0.0000000 0.2299891 0.2098496
## R9.4.pb1 0.2299891 0.0000000 0.1815628
## R9.4.pb2 0.2098496 0.1815628 0.0000000
In Sample R3.1 the PacBio technical replicates are highly similar, but in the other two samples the Loop measurement is about as similar to each PacBio technical replicate as the PacBio technical replicates are to each other!
Also testing Jaccard distance to ensure this pattern is not entirely driven by high abundance ASVs.
jc <- vegdist(ft.both.lab, "jaccard")
mds.jc <- pcoa(jc)
df.jc <- df # Just replace the vectors with the new jaccard derived numbers
df.jc[,paste0("Axis.", seq(8))] <- data.frame(mds.jc$vectors)
ppj <- ggplot(data=df.jc, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() +
scale_color_manual(values=tech.color.scale)
ppj + geom_text(alpha=0.8)
labdf.jc <- df.jc[df.jc$Technology=="LoopSeq",]
labdf.jc$Axis.1 <- 0.8*labdf.jc$Axis.1 # pull labels towards middle of plot
labdf.jc$Axis.2 <- 0.8*labdf.jc$Axis.2
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black")
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.pdf"),
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"),
width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.png"),
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"),
width=5, height=3, units="in")
Exact same qualitative patterns with Jaccard distance as well.
Natural samples re closer in complexity and composition thatn are mock communities to the types of samples in which people want to employ technologies like LoopSeq full-length. However, unlike in mock communities, we do not know the true compposition of natural samples, and thus accuracy is more difficult to determine in natural samples. Here we will attempt to evaluate the accuracy of LoopSeq FL 16S in natural human fecal samples by leveraging the variable levels at which different sections of the 16S rRNA gene are evolutionarily conserved. In short, we expect the differences between legitimate biological variants to preferentially be found in the variable regions of the 16S rRNA gene, while artefactual variation introduced by technical error processes will not.
We’ll use [the external program ssu-align(http://eddylab.org/software/ssu-align/) for this purpose, which uses Infernal to align potentially large collectiosn of SSU rRNA sequences.
First we’ll run the program on an example sequence, to show its output:
ssu.exe <- "~/Desktop/bin/ssu-align-0.1.1/src/ssu-align" # CHANGE ME...
ssu.dir <- "/usr/local/share/ssu-align-0.1.1"
export.str <- paste0('export SSUALIGNDIR="', ssu.dir, '";') # Prepend to all ssu-align calls
# i <- 2 has just 1 sub, for testing
d1 <- dd[[1]]
i <- 2
bs <- d1$birth_subs[d1$birth_subs$clust==i,]
sq <- d1$sequence[d1$clustering$birth_from[i]] # Positions are in birth (ref) seq
if(!dir.exists("temp")) dir.create("temp")
tf <- "temp/tempfile.fa"
writeFasta(sq, tf)
td <- "temp/tmp"
system(paste(export.str, ssu.exe, '-f', tf, td))
At the end of the program, the key file we are interested in is the bacterial alignment, which is the alignment of the input sequences against the bacterial SSU RNA model. Inspecting the formate of that file:
print(substr(readLines("temp/tmp/tmp.bacteria.stk"), 1, 80))
## [1] "# STOCKHOLM 1.0"
## [2] "#=GF AU Infernal s0.1.1"
## [3] ""
## [4] "1 -------------------------UCAGAACGAACGCUGGCGGCGUGGAUAAGACAUGCAAGUCGA"
## [5] "#=GR 1 PP .........................******************************************"
## [6] "#=GC SS_cons ::::::::::<<<<<_______>>>>>,{{{{-{{{{{{,{{{{{{{{{----{{{-{{{,,<<<--"
## [7] "#=GC RF UUuaAaUGGAGAGUUUGAUCCUGGCUCAGgAUGAACGCUGGCGGCGuGCuUAACACAUGCAAGUCGA"
## [8] "//"
There are four lines of output. The first is the aligned sequence itself, converted to the RNA ACGU code. The second line gives “posterior probabilities” of the confidence of the alignment at that position. The third gives the secondary structure consensus at that position, i.e. is there expected to be a basepairing (indicated by various brackets) or not (gaps). Finally, the fourth line is what we are interested in, this shows the level of conservation of that nucleotide position in the model, with well-conserved positions in uppercase, and less-conserved positions in lower-case.
So, we can use this file to determine the fraction of differences between the second dada-denoised ASV and the first that occurred in conserved positions.
bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
caps <- c("A", "C", "G", "U")
ln <- readLines("temp/tmp/tmp.bacteria.stk")
sqa <- strsplit(ln[[4]], "")[[1]]
refcons <- strsplit(ln[[7]], "")[[1]]
keep <- sqa %in% bps
sqa <- sqa[keep]
if(!gsub("U", "T", toupper(paste0(sqa, collapse=""))) == sq) { # TRUE
stop("Aligned sequence not matching input sequence!")
}
sqcons <- refcons[keep] %in% caps
bs$cons <- sqcons[bs$pos]
table(sqcons)/sum(table(sqcons))
## sqcons
## FALSE TRUE
## 0.1710709 0.8289291
table(bs$cons)/sum(table(bs$cons))
##
## FALSE TRUE
## 0.4 0.6
So in this example, there are a much higher fraction of subsitutions in non-conserved read positions in the differences between the top two ASVs than would be expected by chance!
Functionalizing this operation, and then extending it to a larger set of denoised ASVs:
get_subcons <- function(i, dd, tf = "temp/tempfile.fa", td = "temp/tmp", ssu=ssu.exe, export=export.str, return.all.pos=FALSE) {
bs <- dd$birth_subs[dd$birth_subs$clust==i,]
i.from <- dd$clustering$birth_from[[i]] # Sub positions are in ASV that this one was born from
bs$from <- i.from
sq <- dd$sequence[[i.from]]
writeLines(c(paste0(">Sq", i.from), sq), tf)
system(paste(export, ssu, '-f', tf, td), ignore.stdout=TRUE)
bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
caps <- c("A", "C", "G", "U")
alfile <- file.path(td, paste0(basename(td), ".bacteria.stk"))
if(file.exists(alfile)) {
ln <- readLines(file.path(td, "tmp.bacteria.stk"))
sqa <- strsplit(ln[[4]], "")[[1]]
refcons <- strsplit(ln[[7]], "")[[1]]
keep <- sqa %in% bps
sqa <- sqa[keep]
sqa.tr <- gsub("U", "T", toupper(paste0(sqa, collapse="")))
PAD <- 0
if(!sqa.tr == sq) {
### cat("\n", sq, sqa.tr, sep="\n")
if(sqa.tr == substr(sq,1,nchar(sqa.tr))) {
cat("Aligned sequence is missing some trailing nucleotides:", nchar(sq)-nchar(sqa.tr), "\n")
} else if(sqa.tr == substr(sq,1+nchar(sq)-nchar(sqa.tr),nchar(sq))) {
PAD <- nchar(sq)-nchar(sqa.tr)
cat("Aligned sequence is missing some beginning nucleotides:", PAD, "\n")
} else {
stop("Input sequence didn't match aligned sequence.")
}
}
sqcons <- refcons[keep] %in% caps
if(PAD > 0) { # NEED to fix for when aligned sequence starts later...
sqcons <- c(rep(NA, PAD), sqcons)
}
bs$cons <- sqcons[bs$pos]
} else { # Didn't match database, give NAs to all
bs$cons <- NA
}
if(return.all.pos) { # Return vector of conservation at all positions, not those associated with the birth substitutions
return(sqcons)
}
else { return(bs) } # Return birth_subs data.frame, now with conservation state of each position added in the $cons column
}
nseqs <- function(x) length(getSequences(x))
Testing the function:
rbind(get_subcons(3, d1), get_subcons(4, d1), get_subcons(5, d1))
## pos ref sub qual clust from cons
## 306 53 G A 39 3 2 FALSE
## 307 114 T C 34 3 2 FALSE
## 308 114 T C 34 4 2 FALSE
## 309 241 A T 36 4 2 FALSE
## 310 241 T C 33 5 4 FALSE
Output is matching the expected format, this is a data.frame
with information on the position, substitution type and quality score associated with various dada-denoised ASVs. Interestingly, all of the substitutions between these denoised ASVs and the more abundant ASVs from which they were disciminated occurrsed in non-conserved base positions! Now to see if that pattern is something that holds up over the dataset.
First, running dada
with a very aggressive sensitivity setting. Our goal here is to use dada
to order ASVs by the diagnostic likelihood that they are true (under the DADA2 error model). We will then evaluate the conservation patterns of those ordered ASVs. First, to generate them:
dd2 <- dada(filt, err, OMEGA_A=1e-2, DETECT_SINGLETONS=TRUE, multi=TRUE, verbose=0)
dd2
## $R3.1_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 577 sequence variants were inferred from 2598 input unique sequences.
## Key parameters: OMEGA_A = 0.01, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $R9.3_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 1192 sequence variants were inferred from 2498 input unique sequences.
## Key parameters: OMEGA_A = 0.01, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $R9.4_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 1242 sequence variants were inferred from 2882 input unique sequences.
## Key parameters: OMEGA_A = 0.01, OMEGA_C = 1e-40, BAND_SIZE = 16
Now let’s run our identication of conservation patterns on all these ASVs:
cons.rds <- file.path(path.rds, "cons_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(cons.rds)) {
system.time(cons <- lapply(names(dd2), function(nm) {
di <- dd2[[nm]]
lapply(seq(2, nseqs(di)), get_subcons, dd=di)
}))
names(cons) <- names(dd2)
saveRDS(cons, cons.rds)
system('rm -Rf temp/tmp')
system('rm temp/*')
}
cons <- readRDS(cons.rds)
if(!all(sapply(cons, length) == sapply(dd2, function(x) nseqs(x)-1))) stop("Number of cons ASVs mismatch.")
Now to curate the SSU conservation data into a data.frame:
clustdfs <- lapply(names(dd2), function(nm) {
di <- dd2[[nm]]; coni <- cons[[nm]]
rdf <- di$clustering[-1,] # Dropping first ASV for simplicity
rdf$cons <- sapply(coni, function(x) sum(x$cons, na.rm=TRUE))
# NAs occur when only part of the ASVs aligns to the ssu-align RNA model
rdf$vars <- sapply(coni, function(x) sum(!x$cons, na.rm=TRUE))
rdf$nsub <- sapply(coni, function(x) nrow(x))
rdf$ncat <- rdf$cons + rdf$vars # The number of substitutions categorized as conserved/variable
rdf$var.frac <- rdf$vars/rdf$ncat
rdf$index <- seq(2,nrow(rdf)+1)
rdf$sample <- nm
rdf
})
names(clustdfs) <- names(dd2)
For comparison, we also want to know the average fraction of positions that are conserved if chosen at random (rather than at the positions of birth substutitions). To calculate that:
ii <- sample(nseqs(d1), 20)
cons.all <- lapply(ii, get_subcons, dd=d1, return.all.pos=TRUE)
## Aligned sequence is missing some trailing nucleotides: 1
sapply(cons.all, mean)
## [1] 0.8289291 0.8127544 0.8104839 0.8289291 0.8110289 0.8127946 0.8289291
## [8] 0.8054813 0.8104839 0.8122034 0.8104839 0.8086957 0.8289291 0.8161866
## [15] 0.8289291 0.8127946 0.8134328 0.8160604 0.8110289 0.8289291
Yep. Quite consistent, a bit over 80% of the substitutions are in conserved positions.
Now some exploration of the distribution of the ratio of substitutions in conserved and variable regions, as a function of the order of the ASVs (which reflects decreasing confidence by DADA2 they are not seqeuencing error) and by the abundance of the denoised ASVs:
# Use the median over our random sample as a plugin estimate for the fraction of conserved sites in each sequence
raw.cons.frac <- median(sapply(cons.all, mean))
raw.var.frac <- 1-raw.cons.frac
clustdf <- do.call(rbind, clustdfs)
clustdf$nnt <- nchar(clustdf$sequence)
clustdf$frac.sub.cons <- clustdf$cons/(clustdf$nnt*raw.cons.frac)
clustdf$frac.sub.var <- clustdf$vars/(clustdf$nnt*raw.var.frac)
clustdf$logp <- log10(clustdf$birth_pval)
clustdf$rlogp <- log10(clustdf$birth_pval + min(clustdf$birth_pval[clustdf$birth_pval>0]))
clustdf$sample <- paste("Sample", sapply(strsplit(clustdf$sample, "_"), `[`, 1))
clustdf$abundance <- as.numeric(clustdf$abundance)
theme_set(theme_bw())
hist(clustdf$ncat, 100)
p <- ggplot(data=clustdf, aes(x=frac.sub.cons, y=frac.sub.var)) +
xlab("Substitution Rate at Conserved Positions") + ylab("Substitution Rate at\nVariable Positions") +
geom_abline(col="grey80", intercept=0, slope=1) + theme(aspect.ratio=1)
sz.scale <- scale_size_continuous(breaks=c("1"=1, "10"=10, "100"=100, "1000"=1000), name="ASV abundance")
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale + scale_x_log10() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis
p4 <- p + aes(size=abundance) + geom_point(alpha=0.2) + facet_wrap(~sample) + sz.scale
p4
ggsave(file.path(path.fig, "Fecal_VarVsCons.pdf"), p4, width=7, heigh=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_VarVsCons.png"), p4, width=7, heigh=3, units="in")
#require(hexbin)
#p + geom_hex()
#p + geom_bin2d()
There are two modes here. The first mode contains ASVs with very large numbers of substitution differences (>100) from the ASV from which they were discriminated. In this large-difference mode, the number of substitutiosn at conserved positions exceed the number of substitutions at variable positions by about a factor of 1.5x. The second mode contains ASVs with small to moderate numbers of substitution differences (<100, mostly <10) from the ASV from which they were disciminated. In this small-difference mode, the number of substitutions at conserved positions seems to be slightly less than at variable positions. Remember that ~80% of the positions are conserved (red-dashed line) so this is strongly skewed against what would be expected by chance.
Does this enrichment of substitutions at variable positions hold for even the most difficult to identify ASVs, singletons and single-substitution variants?
is.var.1diff <- as.logical(clustdf$vars[clustdf$ncat==1])
table(is.var.1diff)
## is.var.1diff
## FALSE TRUE
## 116 124
frac.var.1abund <- clustdf$var.frac[clustdf$abundance==1]
hist(frac.var.1abund, n=100)
Yep, for both categories the number of substitutions at variable positions is about equal to those at conserved positions.
Our goals here are to determine the level to which biological differences can be discriminated from Loop sequencing errors in real complex data from these human fecal samples. Thus, we are primarily interested in that small-substitution part of the data – it has already been established that the per-nucleotide error rate in LoopSeq data is extremely low and is not introducing hundreds of substitutions. So, focusing in on that part of the data, is there a trend towards decreasing accuracy (i.e. a lower fraction of substutions at variable positions) as we push the sensitivity of ASV inference (i.e. look at lower-certainty ASVs)?
p1 <- ggplot(data=clustdf[clustdf$ncat==1,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p1 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
p2 <- ggplot(data=clustdf[clustdf$ncat==2,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p2 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
pdef <- ggplot(data=clustdf[clustdf$abundance>2 & clustdf$birth_pval < 1e-40,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
pdef + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$ncat>10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$ncat<10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$abundance > 1 & clustdf$ncat>5 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance, weight=ncat)) + geom_point() + geom_smooth() + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$abundance == 1 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$ncat<2,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$ncat<3,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$ncat<4,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data=clustdf[clustdf$ncat<5,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There is some dropoff in the fraction of substitutions at variable positions as we get to the later ASVs identified using these very high sensitivity parameters, but not that much really.