Setup

Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), and read in filenames:

## [1] '1.18.0'

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

## This is vegan 2.5-7

## 
## Attaching package: 'ape'

## The following object is masked from 'package:ShortRead':
## 
##     zoom

## The following object is masked from 'package:Biostrings':
## 
##     complement

## /Users/bcallah/LoopData/16S/CallahanFecal/R3.1_contig_list_trimmed.fq 
##                                                                111798 
## /Users/bcallah/LoopData/16S/CallahanFecal/R9.3_contig_list_trimmed.fq 
##                                                                 23405 
## /Users/bcallah/LoopData/16S/CallahanFecal/R9.4_contig_list_trimmed.fq 
##                                                                 35471

QA, Filtering and Trimming

First do basic QA on these data, and then apply the DADA2 filtering and trimming workflow with the parameters selected from the detailed inspection of the Zymo mock community.

plotComplexity(fn)

plotQualityProfile(fn)

These profiles look consistent with what we saw in the Zymo mock – no low complexity issues, and high qualities throughout the 16S sequence (i.e. ignoring the >1500nt tail). The one difference, especially in the R3.1 is that the length distribution is different, around a half of the reads in that library appear not to extend to the full-length of the 16S gene. This may be a common way in which quality variation in Loop libraries becomes evidence, as molecules that are sequenced to insufficient coverage are likely to end up being assembled into incomplete contigs that don’t cover the entire targeted amplicon.

Enter primers, and confirm their presence and the overall orientation of the reads. Code is adapated from the DADA2 ITS tutorial workflow:

FWD <- "AGAGTTTGATCMTGGC" # Loop 16S forward primer
REV <- "TACCTTGTTACGACTT" # Loop 16S reverse primer
allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna), 
        RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}
primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
for(f in fn) {
  cat("Primers detected in", basename(f), "\n")
  print(rbind(FWD.Primer = sapply(allOrients(FWD), primerHits, fn = f), 
              REV.Primer = sapply(allOrients(REV), primerHits, fn = f)))
  cat("Out of", length(readFastq(f)), "total reads.\n\n")
}

## Primers detected in R3.1_contig_list_trimmed.fq 
##            Forward Complement Reverse RevComp
## FWD.Primer   67942          0       0    2044
## REV.Primer     658          0       0   20771
## Out of 111798 total reads.
## 
## Primers detected in R9.3_contig_list_trimmed.fq 
##            Forward Complement Reverse RevComp
## FWD.Primer   20954          0       0     125
## REV.Primer      59          0       0   11718
## Out of 23405 total reads.
## 
## Primers detected in R9.4_contig_list_trimmed.fq 
##            Forward Complement Reverse RevComp
## FWD.Primer   30735          0       0     244
## REV.Primer     100          0       0   15714
## Out of 35471 total reads.

Primers in the expected orientations, and yes it looks like the R3.1 library will take a big hit from incomplete amplicon reconstruction, with the REV primer only being detected in <20% of the total reads, whereas the other libraries are closer to 50%. It is worth noting that all of these libraries have a much lower fraction of primer-complete reads than the Zymo mock community data had (~90%).

Remove the primers (and any flanking sequence) from the reads, and filter out reads that don’t contain both primers:

nop <- file.path(path, "nop", basename(fn))
out <- removePrimers(fn, nop, FWD, rc(REV), verbose=TRUE)

## Multiple matches to the primer(s) in some sequences. Using the longest possible match.

## 2139 sequences out of 111798 are being reverse-complemented.

## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/nop/R3.1_contig_list_trimmed.fq

## Read in 111798, output 18311 (16.4%) filtered sequences.

## 109 sequences out of 23405 are being reverse-complemented.

## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/nop/R9.3_contig_list_trimmed.fq

## Read in 23405, output 11563 (49.4%) filtered sequences.

## 223 sequences out of 35471 are being reverse-complemented.

## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/nop/R9.4_contig_list_trimmed.fq

## Read in 35471, output 15320 (43.2%) filtered sequences.

Filter the sequences and enforce minimum/maximum lengths appropriate for full-length 16S. Note that we are enforcing maxEE=1, as that was determined to be a better filter than maxEE=2 in the Zymo mock community data.

filt <- file.path(path, "filtered", basename(fn))
track <- filterAndTrim(nop, filt, maxEE=1, minLen=1400, maxLen=1600, verbose=TRUE)

## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/filtered/R3.1_contig_list_trimmed.fq

## Read in 18311, output 15209 (83.1%) filtered sequences.

## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/filtered/R9.3_contig_list_trimmed.fq

## Read in 11563, output 10642 (92%) filtered sequences.

## Overwriting file:/Users/bcallah/LoopData/16S/CallahanFecal/filtered/R9.4_contig_list_trimmed.fq

## Read in 15320, output 14152 (92.4%) filtered sequences.

Final inspection of the quality profile:

plotQualityProfile(filt)

Final progress of reads through filtering and trimming:

cbind(raw=out[,1], primers=out[,2], filtered=track[,2])

##                                raw primers filtered
## R3.1_contig_list_trimmed.fq 111798   18311    15209
## R9.3_contig_list_trimmed.fq  23405   11563    10642
## R9.4_contig_list_trimmed.fq  35471   15320    14152

The large loss of reads due to a lack of detectable primers, especially in the R3.1 library, is notable. It seems likely that optimizing the ratio of Loop molecules to total output reads will be an important part of maximizing throughput in the form of the most high-quality Loop reads per run.

Denoising

Learn the error rates:

err.rds <- file.path(path.rds, "err_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(err.rds)) {
  err <- learnErrors(filt, multi=TRUE, verbose=0)
  saveRDS(err, err.rds)
}
err <- readRDS(err.rds)
plotErrors(err, nominalQ=TRUE)

Denoise the filtered data into ASVs using current DADA2 defaults:

dd <- dada(filt, err, multi=TRUE, verbose=0)
dd

## $R3.1_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 123 sequence variants were inferred from 2598 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $R9.3_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 385 sequence variants were inferred from 2498 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $R9.4_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 394 sequence variants were inferred from 2882 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Make sequence table and remove chimeras:

sta <- makeSequenceTable(dd)
st <- removeBimeraDenovo(sta, minFoldParentOverAbundance=4.5, multi=TRUE, verbose=TRUE)

## Identified 33 bimeras out of 739 input sequences.

Assign taxonomy:

tax.rds <- file.path(path.rds, "tax_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(tax.rds)) {
  tax <- assignTaxonomy(st, "~/tax/silva_nr_v132_train_set.fa.gz", minBoot=80, multi=TRUE)
  saveRDS(tax, tax.rds)
}
tax <- readRDS(tax.rds)
if(!identical(getSequences(tax), getSequences(st))) stop("Taxonomy mismatch.")
table(tax[,"Phylum"], useNA="ifany")

## 
##     Actinobacteria      Bacteroidetes Epsilonbacteraeota         Firmicutes 
##                 12                 53                  1                583 
##       Fusobacteria     Proteobacteria    Verrucomicrobia 
##                  2                 46                  9

The usual suspects at the Phylum level at least.

Comparison to PacBio full-length 16S sequencing results

Currently, PacBio is the gold standard for full-length 16S gene sequencing. Previously, we sequenced these same three fecal samples as part of our earlier investigation of the accuracy of PacBio full-length 16S sequencing. We processed those samples through the DADA2 workflow in that paper, and have included the sequence table obtained from a set of 12 fecal samples (including the 3 resequenced here using Loop full-length 16S) as part of this repository.

Importing PacBio results, and coordinating sample names between the Loop and PacBio results:

pb1.rds <- file.path(path.rds, "PacBio_Fecal_st1.rds")
st1.pb <- removeBimeraDenovo(readRDS(pb1.rds), minFoldParentOverAbundance=4.5, 
                             multi=TRUE, verbose=TRUE)

## Identified 3 bimeras out of 1108 input sequences.

pb2.rds <- file.path(path.rds, "PacBio_Fecal_st2.rds")
st2.pb <- removeBimeraDenovo(readRDS(pb2.rds), minFoldParentOverAbundance=4.5, 
                             multi=TRUE, verbose=TRUE)

## Identified 8 bimeras out of 1411 input sequences.

sam.names <- sapply(strsplit(rownames(st), "_"), `[`, 1) # e.g. R3.1_contig_list_trimmed.fq
rownames(st) <- sam.names; rownames(sta) <- sam.names
rownames(st1.pb) <- gsub("^Callahan_16S_2pM-Cell1_", "", rownames(st1.pb)) # Remove prefix
rownames(st1.pb) <- gsub("[.]fastq[.]gz$", "", rownames(st1.pb)) # Remove suffix
rownames(st1.pb) <- gsub("_", "", rownames(st1.pb)) # e.g. R_3.1
rownames(st2.pb) <- gsub("_", "", rownames(st2.pb)) # e.g. R_3.1
if(!all(rownames(st) %in% rownames(st1.pb))) stop("Sample name mismatch (1).")
if(!all(rownames(st) %in% rownames(st2.pb))) stop("Sample name mismatch (2).")
st1.pb <- st1.pb[rownames(st),]
st2.pb <- st2.pb[rownames(st),]

In the PacBio protocol, a slightly different set of primers is used:

FWD.pacb <- "AGRGTTYGATYMTGGCTCAG" 
FWD.loop <- "AGAGTTTGATCMTGGC"
REV.pacb <- "RGYTACCTTGTTACGACTT"
REV.loop <-    "TACCTTGTTACGACTT"

The result is that the Loop sequences have an additional 4 nucleotides at the start of the reads that are absent in the PacBio data. To remedy that, we will simply remove the first 4nts from all the Loop sequences:

sq.loop.full <- getSequences(st)
sq.loop.trunc <- substr(sq.loop.full, 5, nchar(sq.loop.full))
any(duplicated(sq.loop.trunc)) # FALSE

## [1] FALSE

if(mean(grepl("^TCAG", sq.loop.full)) < 0.95) stop("Trim issue, expected 4nt sequence not at start of these sequences.")
st.loop <- st; tax.loop <- tax
colnames(st.loop) <- sq.loop.trunc; rownames(tax.loop) <- sq.loop.trunc

As a first pass, looking at the overlap in ASVs and reads between Loop and PacBio full-length 16S data (replicate 2) on each sample:

setASV <- function(unq1, unq2, nms) {
  unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
  unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
  rval <- c(sum(!names(unq1) %in% names(unq2)), 
            sum(names(unq1) %in% names(unq2)),
            sum(!names(unq2) %in% names(unq1)))
  names(rval) <- c(nms[[1]], "Shared", nms[[2]])
  rval
}
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))

##      Loop Shared PB1
## R3.1   40     68   9
## R9.3  113    262  74
## R9.4  172    214  48

t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))

##      Loop Shared PB2
## R3.1   26     82  16
## R9.3   69    306 151
## R9.4  131    255 113

t(sapply(sam.names, function(nm) setASV(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))

##      PB1 Shared PB2
## R3.1  10     67  31
## R9.3  53    283 174
## R9.4  50    212 156

Half or more of the denoised ASVs are shared across the methods. How about on a per-read basis?

setReads <- function(unq1, unq2, nms) {
  unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
  unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
  rval <- c(sum(unq1[!names(unq1) %in% names(unq2)]), 
            sum(unq1[names(unq1) %in% names(unq2)]),
            sum(unq2[names(unq2) %in% names(unq1)]),
            sum(unq2[!names(unq2) %in% names(unq1)]))
  names(rval) <- c(nms[[1]], paste0("Shared", nms[[1]]), paste0("Shared", nms[[2]]), nms[[2]])
  rval
}
totReads <- function(st1, st2) { as.matrix(cbind(rowSums(st1), rowSums(st1), rowSums(st2), rowSums(st2))) }

tab.loop.pb1 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
tab.loop.pb1

##      Loop SharedLoop SharedPB1 PB1
## R3.1  410      14364     14861 111
## R9.3 1139       8853      8960 769
## R9.4 1903      11659      7366 450

tab.loop.pb1/totReads(st.loop, st1.pb)

##            Loop SharedLoop SharedPB1         PB1
## R3.1 0.02775146  0.9722485 0.9925862 0.007413839
## R9.3 0.11399119  0.8860088 0.9209580 0.079042039
## R9.4 0.14031854  0.8596815 0.9424258 0.057574207

tab.loop.pb2 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
tab.loop.pb2

##      Loop SharedLoop SharedPB2  PB2
## R3.1  347      14427     25717  371
## R9.3  490       9502     16094 1827
## R9.4 1204      12358     11870  994

tab.loop.pb2/totReads(st.loop, st2.pb)

##            Loop SharedLoop SharedPB2       PB2
## R3.1 0.02348721  0.9765128 0.9857789 0.0142211
## R9.3 0.04903923  0.9509608 0.8980526 0.1019474
## R9.4 0.08877747  0.9112225 0.9227301 0.0772699

tab.pb1.pb2 <- t(sapply(sam.names, function(nm) setReads(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
tab.pb1.pb2

##      PB1 SharedPB1 SharedPB2  PB2
## R3.1 203     14769     25510  578
## R9.3 462      9267     15721 2200
## R9.4 556      7260     11109 1755

tab.pb1.pb2/totReads(st1.pb, st2.pb)

##             PB1 SharedPB1 SharedPB2        PB2
## R3.1 0.01355864 0.9864414 0.9778442 0.02215578
## R9.3 0.04748689 0.9525131 0.8772390 0.12276101
## R9.4 0.07113613 0.9288639 0.8635728 0.13642724

A very high fraction of reads (>85%) occur in ASVs identified consistently by both methods. How do things look at the whole community level, when we ordinate these three samples?

Combine the sequence tables together:

st.loop.lab <- st.loop
rownames(st.loop.lab) <- gsub("$", ".loop", rownames(st.loop.lab))
st1.pb.lab <- st1.pb
rownames(st1.pb.lab) <- gsub("$", ".pb1", rownames(st1.pb.lab))
st2.pb.lab <- st2.pb
rownames(st2.pb.lab) <- gsub("$", ".pb2", rownames(st2.pb.lab))
st.both.lab <- mergeSequenceTables(st.loop.lab, st1.pb.lab, st2.pb.lab)
ft.both.lab <- sweep(st.both.lab, 1, rowSums(st.both.lab), "/")

Perform ordination:

bc <- vegdist(ft.both.lab, "bray")
mds <- pcoa(bc)
df <- data.frame(mds$vectors)
df$Sample <- substr(rownames(df), 1, 4)
df$Technology <- sapply(strsplit(rownames(df), "[.]"), `[`, 3)
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
clr3 <- gg_color_hue(3)
df$Technology <- c(loop="LoopSeq", pb1="PacBio CCS\n(S/P2-C2/5.0)\n", pb2="PacBio CCS\n(S/P3-C3/5.0)")[df$Technology]
tech.color.scale <- c("LoopSeq"=clr3[[1]], "PacBio CCS\n(S/P2-C2/5.0)\n"=clr3[[2]], 
                      "PacBio CCS\n(S/P3-C3/5.0)"="darkgreen")
ppb <- ggplot(data=df, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() + 
  scale_color_manual(values=tech.color.scale)
ppb + geom_text(alpha=0.8)

labdf <- df[df$Technology=="LoopSeq",]
labdf$Axis.1 <- 0.8*labdf$Axis.1 # pull labels towards middle of plot
labdf$Axis.2 <- 0.8*labdf$Axis.2
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black")

ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.pdf"), 
       ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"), 
       width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.png"), 
       ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"), 
       width=5, height=3, units="in")

They fall almost exactly on top of each other. Worth calling out that two of these samples were longitudinal samples from the same person too, so inter-subject and within-subject temporal variability is represented here, as well as the technical replication amongst the same technology (PacBio).

Comparing the measurement dissimilarities between Loop and PacBio and between the PacBio technical replicates:

as.matrix(bc)[df$Sample=="R3.1", df$Sample=="R3.1"]

##           R3.1.loop   R3.1.pb1   R3.1.pb2
## R3.1.loop 0.0000000 0.13655146 0.13894991
## R3.1.pb1  0.1365515 0.00000000 0.04260862
## R3.1.pb2  0.1389499 0.04260862 0.00000000

as.matrix(bc)[df$Sample=="R9.3", df$Sample=="R9.3"]

##           R9.3.loop  R9.3.pb1  R9.3.pb2
## R9.3.loop 0.0000000 0.1974966 0.1773402
## R9.3.pb1  0.1974966 0.0000000 0.1733080
## R9.3.pb2  0.1773402 0.1733080 0.0000000

as.matrix(bc)[df$Sample=="R9.4", df$Sample=="R9.4"]

##           R9.4.loop  R9.4.pb1  R9.4.pb2
## R9.4.loop 0.0000000 0.2299891 0.2098496
## R9.4.pb1  0.2299891 0.0000000 0.1815628
## R9.4.pb2  0.2098496 0.1815628 0.0000000

In Sample R3.1 the PacBio technical replicates are highly similar, but in the other two samples the Loop measurement is about as similar to each PacBio technical replicate as the PacBio technical replicates are to each other!

Also testing Jaccard distance to ensure this pattern is not entirely driven by high abundance ASVs.

jc <- vegdist(ft.both.lab, "jaccard")
mds.jc <- pcoa(jc)
df.jc <- df # Just replace the vectors with the new jaccard derived numbers
df.jc[,paste0("Axis.", seq(8))] <- data.frame(mds.jc$vectors)
ppj <- ggplot(data=df.jc, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() + 
  scale_color_manual(values=tech.color.scale)
ppj + geom_text(alpha=0.8)

labdf.jc <- df.jc[df.jc$Technology=="LoopSeq",]
labdf.jc$Axis.1 <- 0.8*labdf.jc$Axis.1 # pull labels towards middle of plot
labdf.jc$Axis.2 <- 0.8*labdf.jc$Axis.2
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black")

ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.pdf"), 
       ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"), 
       width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.png"), 
       ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"), 
       width=5, height=3, units="in")

Exact same qualitative patterns with Jaccard distance as well.

Demonstration of LoopSeq Accuracy using 16S conservation patterns

Natural samples re closer in complexity and composition thatn are mock communities to the types of samples in which people want to employ technologies like LoopSeq full-length. However, unlike in mock communities, we do not know the true compposition of natural samples, and thus accuracy is more difficult to determine in natural samples. Here we will attempt to evaluate the accuracy of LoopSeq FL 16S in natural human fecal samples by leveraging the variable levels at which different sections of the 16S rRNA gene are evolutionarily conserved. In short, we expect the differences between legitimate biological variants to preferentially be found in the variable regions of the 16S rRNA gene, while artefactual variation introduced by technical error processes will not.

We’ll use [the external program ssu-align(http://eddylab.org/software/ssu-align/) for this purpose, which uses Infernal to align potentially large collectiosn of SSU rRNA sequences.

First we’ll run the program on an example sequence, to show its output:

ssu.exe <- "~/Desktop/bin/ssu-align-0.1.1/src/ssu-align" # CHANGE ME...
ssu.dir <- "/usr/local/share/ssu-align-0.1.1"
export.str <- paste0('export SSUALIGNDIR="', ssu.dir, '";') # Prepend to all ssu-align calls
# i <- 2 has just 1 sub, for testing
d1 <- dd[[1]]
i <- 2
bs <- d1$birth_subs[d1$birth_subs$clust==i,]
sq <- d1$sequence[d1$clustering$birth_from[i]] # Positions are in birth (ref) seq
if(!dir.exists("temp")) dir.create("temp")
tf <- "temp/tempfile.fa"
writeFasta(sq, tf)
td <- "temp/tmp"
system(paste(export.str, ssu.exe, '-f', tf, td))

At the end of the program, the key file we are interested in is the bacterial alignment, which is the alignment of the input sequences against the bacterial SSU RNA model. Inspecting the formate of that file:

print(substr(readLines("temp/tmp/tmp.bacteria.stk"), 1, 80))

## [1] "# STOCKHOLM 1.0"                                                                 
## [2] "#=GF AU Infernal s0.1.1"                                                         
## [3] ""                                                                                
## [4] "1            -------------------------UCAGAACGAACGCUGGCGGCGUGGAUAAGACAUGCAAGUCGA"
## [5] "#=GR 1 PP    .........................******************************************"
## [6] "#=GC SS_cons ::::::::::<<<<<_______>>>>>,{{{{-{{{{{{,{{{{{{{{{----{{{-{{{,,<<<--"
## [7] "#=GC RF      UUuaAaUGGAGAGUUUGAUCCUGGCUCAGgAUGAACGCUGGCGGCGuGCuUAACACAUGCAAGUCGA"
## [8] "//"

There are four lines of output. The first is the aligned sequence itself, converted to the RNA ACGU code. The second line gives “posterior probabilities” of the confidence of the alignment at that position. The third gives the secondary structure consensus at that position, i.e. is there expected to be a basepairing (indicated by various brackets) or not (gaps). Finally, the fourth line is what we are interested in, this shows the level of conservation of that nucleotide position in the model, with well-conserved positions in uppercase, and less-conserved positions in lower-case.

So, we can use this file to determine the fraction of differences between the second dada-denoised ASV and the first that occurred in conserved positions.

bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
caps <- c("A", "C", "G", "U")
ln <- readLines("temp/tmp/tmp.bacteria.stk")
sqa <- strsplit(ln[[4]], "")[[1]]
refcons <- strsplit(ln[[7]], "")[[1]]
keep <- sqa %in% bps
sqa <- sqa[keep]
if(!gsub("U", "T", toupper(paste0(sqa, collapse=""))) == sq) { # TRUE
  stop("Aligned sequence not matching input sequence!")
}
sqcons <- refcons[keep] %in% caps
bs$cons <- sqcons[bs$pos]
table(sqcons)/sum(table(sqcons))

## sqcons
##     FALSE      TRUE 
## 0.1710709 0.8289291

table(bs$cons)/sum(table(bs$cons))

## 
## FALSE  TRUE 
##   0.4   0.6

So in this example, there are a much higher fraction of subsitutions in non-conserved read positions in the differences between the top two ASVs than would be expected by chance!

Functionalizing this operation, and then extending it to a larger set of denoised ASVs:

get_subcons <- function(i, dd, tf = "temp/tempfile.fa", td = "temp/tmp", ssu=ssu.exe, export=export.str, return.all.pos=FALSE) {
  bs <- dd$birth_subs[dd$birth_subs$clust==i,]
  i.from <- dd$clustering$birth_from[[i]] # Sub positions are in ASV that this one was born from
  bs$from <- i.from
  sq <- dd$sequence[[i.from]]
  writeLines(c(paste0(">Sq", i.from), sq), tf)
  system(paste(export, ssu, '-f', tf, td), ignore.stdout=TRUE)
  bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
  caps <- c("A", "C", "G", "U")
  alfile <- file.path(td, paste0(basename(td), ".bacteria.stk"))
  if(file.exists(alfile)) {
    ln <- readLines(file.path(td, "tmp.bacteria.stk"))
    sqa <- strsplit(ln[[4]], "")[[1]]
    refcons <- strsplit(ln[[7]], "")[[1]]
    keep <- sqa %in% bps
    sqa <- sqa[keep]
    sqa.tr <- gsub("U", "T", toupper(paste0(sqa, collapse="")))
    PAD <- 0
    if(!sqa.tr == sq) {
      ### cat("\n", sq, sqa.tr, sep="\n")
      if(sqa.tr == substr(sq,1,nchar(sqa.tr))) {
        cat("Aligned sequence is missing some trailing nucleotides:", nchar(sq)-nchar(sqa.tr), "\n")
      } else if(sqa.tr == substr(sq,1+nchar(sq)-nchar(sqa.tr),nchar(sq))) {
        PAD <- nchar(sq)-nchar(sqa.tr)
        cat("Aligned sequence is missing some beginning nucleotides:", PAD, "\n")
      } else { 
        stop("Input sequence didn't match aligned sequence.")
      }
    }
    sqcons <- refcons[keep] %in% caps
    if(PAD > 0) { # NEED to fix for when aligned sequence starts later...
      sqcons <- c(rep(NA, PAD), sqcons)
    }
    bs$cons <- sqcons[bs$pos] 
  } else { # Didn't match database, give NAs to all
    bs$cons <- NA
  }
  if(return.all.pos) { # Return vector of conservation at all positions, not those associated with the birth substitutions
    return(sqcons)
  }
  else { return(bs) } # Return birth_subs data.frame, now with conservation state of each position added in the $cons column
}
nseqs <- function(x) length(getSequences(x))

Testing the function:

rbind(get_subcons(3, d1), get_subcons(4, d1), get_subcons(5, d1))

##     pos ref sub qual clust from  cons
## 306  53   G   A   39     3    2 FALSE
## 307 114   T   C   34     3    2 FALSE
## 308 114   T   C   34     4    2 FALSE
## 309 241   A   T   36     4    2 FALSE
## 310 241   T   C   33     5    4 FALSE

Output is matching the expected format, this is a data.frame with information on the position, substitution type and quality score associated with various dada-denoised ASVs. Interestingly, all of the substitutions between these denoised ASVs and the more abundant ASVs from which they were disciminated occurrsed in non-conserved base positions! Now to see if that pattern is something that holds up over the dataset.

First, running dada with a very aggressive sensitivity setting. Our goal here is to use dada to order ASVs by the diagnostic likelihood that they are true (under the DADA2 error model). We will then evaluate the conservation patterns of those ordered ASVs. First, to generate them:

dd2 <- dada(filt, err, OMEGA_A=1e-2, DETECT_SINGLETONS=TRUE, multi=TRUE, verbose=0)
dd2

## $R3.1_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 577 sequence variants were inferred from 2598 input unique sequences.
## Key parameters: OMEGA_A = 0.01, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $R9.3_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 1192 sequence variants were inferred from 2498 input unique sequences.
## Key parameters: OMEGA_A = 0.01, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $R9.4_contig_list_trimmed.fq
## dada-class: object describing DADA2 denoising results
## 1242 sequence variants were inferred from 2882 input unique sequences.
## Key parameters: OMEGA_A = 0.01, OMEGA_C = 1e-40, BAND_SIZE = 16

Now let’s run our identication of conservation patterns on all these ASVs:

cons.rds <- file.path(path.rds, "cons_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(cons.rds)) {
  system.time(cons <- lapply(names(dd2), function(nm) {
    di <- dd2[[nm]]
    lapply(seq(2, nseqs(di)), get_subcons, dd=di)
  }))
  names(cons) <- names(dd2)
  saveRDS(cons, cons.rds)
  system('rm -Rf temp/tmp')
  system('rm temp/*')
}

cons <- readRDS(cons.rds)
if(!all(sapply(cons, length) == sapply(dd2, function(x) nseqs(x)-1))) stop("Number of cons ASVs mismatch.")

Now to curate the SSU conservation data into a data.frame:

clustdfs <- lapply(names(dd2), function(nm) {
  di <- dd2[[nm]]; coni <- cons[[nm]]
  rdf <- di$clustering[-1,] # Dropping first ASV for simplicity
  rdf$cons <- sapply(coni, function(x) sum(x$cons, na.rm=TRUE)) 
  # NAs occur when only part of the ASVs aligns to the ssu-align RNA model
  rdf$vars <- sapply(coni, function(x) sum(!x$cons, na.rm=TRUE))
  rdf$nsub <- sapply(coni, function(x) nrow(x))
  rdf$ncat <- rdf$cons + rdf$vars # The number of substitutions categorized as conserved/variable
  rdf$var.frac <- rdf$vars/rdf$ncat
  rdf$index <- seq(2,nrow(rdf)+1)
  rdf$sample <- nm
  rdf
})
names(clustdfs) <- names(dd2)

For comparison, we also want to know the average fraction of positions that are conserved if chosen at random (rather than at the positions of birth substutitions). To calculate that:

ii <- sample(nseqs(d1), 20)
cons.all <- lapply(ii, get_subcons, dd=d1, return.all.pos=TRUE)

## Aligned sequence is missing some trailing nucleotides: 1

sapply(cons.all, mean)

##  [1] 0.8289291 0.8127544 0.8104839 0.8289291 0.8110289 0.8127946 0.8289291
##  [8] 0.8054813 0.8104839 0.8122034 0.8104839 0.8086957 0.8289291 0.8161866
## [15] 0.8289291 0.8127946 0.8134328 0.8160604 0.8110289 0.8289291

Yep. Quite consistent, a bit over 80% of the substitutions are in conserved positions.

Now some exploration of the distribution of the ratio of substitutions in conserved and variable regions, as a function of the order of the ASVs (which reflects decreasing confidence by DADA2 they are not seqeuencing error) and by the abundance of the denoised ASVs:

# Use the median over our random sample as a plugin estimate for the fraction of conserved sites in each sequence
raw.cons.frac <- median(sapply(cons.all, mean))
raw.var.frac <- 1-raw.cons.frac
clustdf <- do.call(rbind, clustdfs)
clustdf$nnt <- nchar(clustdf$sequence)
clustdf$frac.sub.cons <- clustdf$cons/(clustdf$nnt*raw.cons.frac)
clustdf$frac.sub.var <- clustdf$vars/(clustdf$nnt*raw.var.frac)
clustdf$logp <- log10(clustdf$birth_pval)
clustdf$rlogp <- log10(clustdf$birth_pval + min(clustdf$birth_pval[clustdf$birth_pval>0]))
clustdf$sample <- paste("Sample", sapply(strsplit(clustdf$sample, "_"), `[`, 1))
clustdf$abundance <- as.numeric(clustdf$abundance)
theme_set(theme_bw())
hist(clustdf$ncat, 100)

p <- ggplot(data=clustdf, aes(x=frac.sub.cons, y=frac.sub.var)) + 
  xlab("Substitution Rate at Conserved Positions") + ylab("Substitution Rate at\nVariable Positions") +
  geom_abline(col="grey80", intercept=0, slope=1) + theme(aspect.ratio=1)
sz.scale <- scale_size_continuous(breaks=c("1"=1, "10"=10, "100"=100, "1000"=1000), name="ASV abundance")
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale

p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale + scale_x_log10() + scale_y_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous y-axis

p4 <- p + aes(size=abundance) + geom_point(alpha=0.2) + facet_wrap(~sample) + sz.scale
p4

ggsave(file.path(path.fig, "Fecal_VarVsCons.pdf"), p4, width=7, heigh=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_VarVsCons.png"), p4, width=7, heigh=3, units="in")
#require(hexbin)
#p + geom_hex()
#p + geom_bin2d()

There are two modes here. The first mode contains ASVs with very large numbers of substitution differences (>100) from the ASV from which they were discriminated. In this large-difference mode, the number of substitutiosn at conserved positions exceed the number of substitutions at variable positions by about a factor of 1.5x. The second mode contains ASVs with small to moderate numbers of substitution differences (<100, mostly <10) from the ASV from which they were disciminated. In this small-difference mode, the number of substitutions at conserved positions seems to be slightly less than at variable positions. Remember that ~80% of the positions are conserved (red-dashed line) so this is strongly skewed against what would be expected by chance.

Does this enrichment of substitutions at variable positions hold for even the most difficult to identify ASVs, singletons and single-substitution variants?

is.var.1diff <- as.logical(clustdf$vars[clustdf$ncat==1])
table(is.var.1diff)

## is.var.1diff
## FALSE  TRUE 
##   116   124

frac.var.1abund <- clustdf$var.frac[clustdf$abundance==1]
hist(frac.var.1abund, n=100)

Yep, for both categories the number of substitutions at variable positions is about equal to those at conserved positions.

Our goals here are to determine the level to which biological differences can be discriminated from Loop sequencing errors in real complex data from these human fecal samples. Thus, we are primarily interested in that small-substitution part of the data – it has already been established that the per-nucleotide error rate in LoopSeq data is extremely low and is not introducing hundreds of substitutions. So, focusing in on that part of the data, is there a trend towards decreasing accuracy (i.e. a lower fraction of substutions at variable positions) as we push the sensitivity of ASV inference (i.e. look at lower-certainty ASVs)?

p1 <- ggplot(data=clustdf[clustdf$ncat==1,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p1 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

p2 <- ggplot(data=clustdf[clustdf$ncat==2,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p2 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

pdef <- ggplot(data=clustdf[clustdf$abundance>2 & clustdf$birth_pval < 1e-40,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
pdef + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$ncat>10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$ncat<10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$abundance > 1 & clustdf$ncat>5 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance, weight=ncat)) + geom_point() + geom_smooth() + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$abundance == 1 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$ncat<2,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$ncat<3,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$ncat<4,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=clustdf[clustdf$ncat<5,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There is some dropoff in the fraction of substitutions at variable positions as we get to the later ASVs identified using these very high sensitivity parameters, but not that much really.

Evaluationg LoopSeq 16S on Human Fecal Samples

BJC

Setup

QA, Filtering and Trimming

Denoising

Comparison to PacBio full-length 16S sequencing results

Demonstration of LoopSeq Accuracy using 16S conservation patterns