ErrorFreeByLength

Setup

Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), define the other relevant file paths for this analysis.

## [1] '1.18.0'

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

Fungal Analysis

We are analyzing 18SITS amplicon data of DNA extracted from 6 fungal isolate obtained from the ATCC: Saccharomyces cerevisiae Meyen ex E.C. Hansen (ATCC 201389D-5), Aspergillus oryzae var. oryzae (ATCC 42149D-2), Candida albicans (Robin) Berkhout (ATCC10231D-5), Trichoderma reesei Simmons (ATCC 13631D-2), Kluyveromyces lactis (Dombrowski) van der Walt (ATCC 8585D-5) and Penicillium chrysogenum Thom (ATCC 10106D-2). These samples were all sequenced on a common LoopSeq run.

Define the fungal fastq file paths.

fn <- list.files(path=path.fun, pattern=".fq.gz", full.names=TRUE)
fn

## [1] "/Users/bcallah/LoopData/18SITS/Aoryzae_sample_B1_contig_list_trimmed.fq.gz"     
## [2] "/Users/bcallah/LoopData/18SITS/Calbicans_sample_C1_contig_list_trimmed.fq.gz"   
## [3] "/Users/bcallah/LoopData/18SITS/Klactis_sample_G1_contig_list_trimmed.fq.gz"     
## [4] "/Users/bcallah/LoopData/18SITS/Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz"
## [5] "/Users/bcallah/LoopData/18SITS/Scerevisiae_sample_A1_contig_list_trimmed.fq.gz" 
## [6] "/Users/bcallah/LoopData/18SITS/Treesei_sample_D1_contig_list_trimmed.fq.gz"

Define the Loop 18S-ITS Mycobiome primers (from their documentation).

FWD <- "TACCTGGTTGATYCTGCCAGT"
REV1 <- "CTBTTVCCKCTTCACTCG"
REV2 <- "GGTTGGTTTCTTTTCCT"
REV3 <- "TAAATTACAACTCGGAC"
REV4 <- "TCCTCCGCTTWTTGWTWTGC"
REV <- c(REV1, REV2, REV3, REV4)

Right now removePrimers does not handle multiple distinct primers like exist here for the reverse primers. So we’ll do it by hand. First running primer removal with all four combinations of the FWD primer with each REV primer.

nop <- lapply(seq_along(REV), function(i) file.path(dirname(fn), paste0("nop", i), basename(fn)))
#nop <- lapply(seq_along(REV), function(i) file.path(path.fun, paste0("nop", i)))
out <- lapply(seq_along(REV), function(i) {
  removePrimers(fn, nop[[i]], FWD, rc(REV[[i]]))
})

## Read in 3131, output 0 (0%) filtered sequences.

## Read in 5874, output 0 (0%) filtered sequences.

## Read in 6274, output 0 (0%) filtered sequences.

## Some input samples had no reads pass the primer detection.

## Read in 3131, output 0 (0%) filtered sequences.

## Read in 5874, output 0 (0%) filtered sequences.

## Read in 13472, output 0 (0%) filtered sequences.

## Read in 6274, output 0 (0%) filtered sequences.

## Some input samples had no reads pass the primer detection.

out

## [[1]]
##                                                  reads.in reads.out
## Aoryzae_sample_B1_contig_list_trimmed.fq.gz         37585         2
## Calbicans_sample_C1_contig_list_trimmed.fq.gz        3131         0
## Klactis_sample_G1_contig_list_trimmed.fq.gz          5874         0
## Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz    18830         2
## Scerevisiae_sample_A1_contig_list_trimmed.fq.gz     13472         2
## Treesei_sample_D1_contig_list_trimmed.fq.gz          6274         0
## 
## [[2]]
##                                                  reads.in reads.out
## Aoryzae_sample_B1_contig_list_trimmed.fq.gz         37585      9303
## Calbicans_sample_C1_contig_list_trimmed.fq.gz        3131       806
## Klactis_sample_G1_contig_list_trimmed.fq.gz          5874      1692
## Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz    18830      4357
## Scerevisiae_sample_A1_contig_list_trimmed.fq.gz     13472      4395
## Treesei_sample_D1_contig_list_trimmed.fq.gz          6274      1397
## 
## [[3]]
##                                                  reads.in reads.out
## Aoryzae_sample_B1_contig_list_trimmed.fq.gz         37585         1
## Calbicans_sample_C1_contig_list_trimmed.fq.gz        3131         0
## Klactis_sample_G1_contig_list_trimmed.fq.gz          5874         0
## Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz    18830         3
## Scerevisiae_sample_A1_contig_list_trimmed.fq.gz     13472         0
## Treesei_sample_D1_contig_list_trimmed.fq.gz          6274         0
## 
## [[4]]
##                                                  reads.in reads.out
## Aoryzae_sample_B1_contig_list_trimmed.fq.gz         37585     24487
## Calbicans_sample_C1_contig_list_trimmed.fq.gz        3131      2295
## Klactis_sample_G1_contig_list_trimmed.fq.gz          5874      4127
## Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz    18830     13466
## Scerevisiae_sample_A1_contig_list_trimmed.fq.gz     13472      9658
## Treesei_sample_D1_contig_list_trimmed.fq.gz          6274      4330

So REV1 and REV3 have no hits in this data (may be matches to fungi outside the 6 species we are assaying here?). So we’ll ignore those. But need to combine the results from REV2 and REV4, without duplicating sequences that could have been matches to both primers. (note to self, should augment removePrimers to do this itself when a list of FWD or REV primers is provided)

nop24 <- file.path(dirname(fn), "nop", basename(fn))
for(i in seq_along(fn)) {
  srq2 <- readFastq(nop[[2]][i])
  srq4 <- readFastq(nop[[4]][i])
  id4.in2 <- id(srq4) %in% id(srq2)
  srqo <- append(srq2, srq4[!id4.in2])
  if(!dir.exists(dirname(nop24[i]))) dir.create(dirname(nop24[i]))
  if(file.exists(nop24[i])) file.remove(nop24[i])
  writeFastq(srqo, nop24[i], width=20000L)
}
tot <- cbind(reads.in=out[[1]][,1], all.out=sapply(nop24, function(f) length(getSequences(f))))
tot

##                                                  reads.in all.out
## Aoryzae_sample_B1_contig_list_trimmed.fq.gz         37585   24534
## Calbicans_sample_C1_contig_list_trimmed.fq.gz        3131    2310
## Klactis_sample_G1_contig_list_trimmed.fq.gz          5874    4141
## Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz    18830   13507
## Scerevisiae_sample_A1_contig_list_trimmed.fq.gz     13472    9681
## Treesei_sample_D1_contig_list_trimmed.fq.gz          6274    4337

A pretty high fraction of full-length (i.e. FWD primer ot REV primer) 18SITS sequences in these data actually, 2/3rds or more in each sample.

Checking on sequence complexities and overall quality.

plotComplexity(nop24)

plotQualityProfile(nop24)

Everything looks good, no issues with low complexity sequences or sequence quality. Going to perform standard quality filtering, but without the length screen we imposed on full-length 16S since this amplicon varies so much more in length.

filt <- file.path(dirname(fn), "filtered", basename(fn))
track <- filterAndTrim(nop24, filt, maxEE=1, multi=TRUE)
track

##                                                  reads.in reads.out
## Aoryzae_sample_B1_contig_list_trimmed.fq.gz         24534     24018
## Calbicans_sample_C1_contig_list_trimmed.fq.gz        2310      2279
## Klactis_sample_G1_contig_list_trimmed.fq.gz          4141      4099
## Pchrysogenum_sample_H1_contig_list_trimmed.fq.gz    13507     13340
## Scerevisiae_sample_A1_contig_list_trimmed.fq.gz      9681      9574
## Treesei_sample_D1_contig_list_trimmed.fq.gz          4337      4290

As expected, very few reads lost to filtering. Proceed with denoising, first by learning the error rates.

err <- learnErrors(filt, multi=TRUE, verbose=0)
plotErrors(err, nominalQ=TRUE)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

Looks good.

Denoise:

dd <- dada(filt, err, multi=TRUE, verbose=0)
names(dd) <- sapply(strsplit(basename(filt), "_"), `[`, 1)
dd

## $Aoryzae
## dada-class: object describing DADA2 denoising results
## 3 sequence variants were inferred from 3109 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $Calbicans
## dada-class: object describing DADA2 denoising results
## 1 sequence variants were inferred from 287 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $Klactis
## dada-class: object describing DADA2 denoising results
## 1 sequence variants were inferred from 529 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $Pchrysogenum
## dada-class: object describing DADA2 denoising results
## 1 sequence variants were inferred from 1731 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $Scerevisiae
## dada-class: object describing DADA2 denoising results
## 1 sequence variants were inferred from 1031 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
## 
## $Treesei
## dada-class: object describing DADA2 denoising results
## 1 sequence variants were inferred from 595 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Remarkably, just a single legitimate ASV is being inferred in all these samples (albeit with default rather than sensitive settings), save A. oryzae sample with 3 ASVs inferred.

dd[[1]]$clustering$abundance

## [1] 23995     8     5

And even in the A. oryzae sample, all of the read weight is in the first ASV.

Our working assumption is that this single ASV is the correct sequence for each of these fungal genomes (this data is likely better than the available reference genomes). So, what fraction of the reads exactly match the single ASV being defined in all samples and are therefore error-free? (code accounts for possible length-variation due to different primers hitting the same sequence, i.e. FWD-REV2 or FWD-REV4 amplifications of the same sequence could be different lengths).

sq.correct <- sapply(dd, function(x) x$sequence[[1]])
names(sq.correct) <- names(dd)
drp <- derepFastq(nop24)
names(drp) <- names(dd)
fundf <- data.frame(Length=nchar(sq.correct),
                    ErrorFree=sapply(names(dd), function(f) {
                      is.id <- sq.correct[[f]] == substr(getSequences(drp[[f]]), 1, nchar(sq.correct[[f]]))
                      sum(drp[[f]]$uniques[is.id], na.rm=TRUE)/sum(drp[[f]]$uniques)
                      }),
                    Species=names(dd), stringsAsFactors=FALSE)
ggplot(data=fundf, aes(x=Length, y=ErrorFree, label=Species)) + geom_label() + theme_bw()

In each case a single sequence in the data (modulo lenth variation between the two REV primers) accounts for over 80% of the reads, and we therefore conservatively estimate that represents the error-free fraction of the full-length sequence data, with the acceptance that we may be incorrectly counting some minor allelic variants as incorrect given that many fungi have hundred(s) of 18SITS gene copies.

Genomic Analysis

Here we are analyzing randomly amplified segments from DNA isolated from bacterial isolates obtained from the ATCC: Nitrosomonas europaea (CATCC 19718D-5), Desulfovibrio desulfuricans (ATCC 27774D-5), and Salinispora tropica (ATCC CNB-440D-5). The reference genomes for each of these strains was downloaded from NCBI and stores in the Docs/ directory of this repository with the following namings scheme: Neuropaee_genome.fasta.gz.

Define the filanames to the LoopSeq fastq files and the reference genomes.

fn.gen <- list.files(path.gen, pattern=".fq$", full.names=TRUE)
names(fn.gen) <- sapply(strsplit(basename(fn.gen), "_"), `[`, 1)
spc <- names(fn.gen)
fn.gen

##                                                              Ddesulfuricans 
## "/Users/bcallah/LoopData/Genomic/Ddesulfuricans_1410_sample_contig_list.fq" 
##                                                                   Neuropaea 
##      "/Users/bcallah/LoopData/Genomic/Neuropaea_1409_sample_contig_list.fq" 
##                                                                    Stropica 
##       "/Users/bcallah/LoopData/Genomic/Stropica_1411_sample_contig_list.fq"

Define file paths to the reference fastas for the genomes of each of these species.

fa.ref <- list.files(path.doc, pattern=".fasta$", full.names=TRUE)
names(fa.ref) <- sapply(strsplit(basename(fa.ref), "_"), `[`, 1)
fa.ref

##                     Ddesulfuricans                          Neuropaea 
## "Docs/Ddesulfuricans_genome.fasta"      "Docs/Neuropaea_genome.fasta" 
##                           Stropica                               Zymo 
##       "Docs/Stropica_genome.fasta"  "Docs/Zymo_Bacteria_16S_v2.fasta"

Compare all the LoopSeq reads to the reference genomes, and record whether they exactly matched or not.

lapply(spc, function(s) {
  ref <- readFasta(fa.ref[s])
  ref.seq <- sread(ref)
  queries <- getSequences(fn.gen[[s]])
  exact <- sapply(queries, function(q) {
    vcountPattern(q, ref.seq) > 0 || vcountPattern(rc(q), ref.seq) > 0
  })
  data.frame(Exact=exact, Length=nchar(queries), Species=s, stringsAsFactors=FALSE)
}) -> hits
genome.hits <- do.call(rbind, hits)

Calculate summary accuracies and median lengths for the read length window of 4000-6000 (representative of ~5Kb target sequences size).

bh5 <- genome.hits[genome.hits$Length > 4000 & genome.hits$Length < 6000,]
gendf <- data.frame(Length=tapply(bh5$Length, bh5$Species, median), # Median lengths of reads in this window
                    ErrorFree=tapply(bh5$Exact, bh5$Species, mean),  # Exact match fraction
                    Species=names(tapply(bh5$Length, bh5$Species, median)),
                    stringsAsFactors=FALSE)
gendf

##                Length ErrorFree        Species
## Ddesulfuricans   4824 0.7314510 Ddesulfuricans
## Neuropaea        4827 0.7536769      Neuropaea
## Stropica         4842 0.6376904       Stropica

Figure showing Error-free Fraction versus Length

Now to integrate the results for the error-free fractions of reads by length from the different technologies and types of evidence we are including.

Define data.frame that contains the measured values that will be plotted as points. Values for error-free fractions on the Zymo mock community are hard-copied here from our analaysis of LoopSeq data and our analysis of PacBio CCS data. In both cases the error-free fraction is calculated relative to the fraction of full-length reads (i.e. reads in which both the forward and reverse primers could be detected) before any quality filtering is applied.

errfree.16S.loop <- 0.92342982 # Copied from the analysis of the Zymo mock community in this manuscript
errfree.16S.pbccs <- 0.5223684 # Copied from the analysis of the ZYmo mock community in Callahan et al. 2019
len.16S <- 1450
# Make data.frame with the point evidence
dfp <- cbind(fundf, Evidence="Fungal 18S-ITS", Technology="LoopSeq")
dfp <- rbind(dfp, data.frame(Length=len.16S, ErrorFree=errfree.16S.loop, Species=NA, 
                             Evidence="Zymo 16S", Technology="LoopSeq"))
dfp <- rbind(dfp, data.frame(Length=len.16S, ErrorFree=errfree.16S.pbccs, Species=NA,
                             Evidence="Zymo 16S", Technology="PacBio CCS"))
dfp <- rbind(dfp, cbind(gendf, Evidence="Genomic", Technology="LoopSeq"))

Now create data.frame of projected error rates as a function of the evidence from the Zymo full-length 16S analysis for LoopSeq and PacBio CCS, or the manufacturer’s reported error rate for Oxford Nanopore.

er.loop <- -log(errfree.16S.loop)/len.16S
er.pbccs <- -log(errfree.16S.pbccs)/len.16S
er.ont <- 0.06 # Approximate number for ONT R10 chemistry error rate as reported by the manufacturer

nts <- seq(6000)
dfer <- rbind(data.frame(Length=nts, ErrorFree=exp(-nts*er.loop), Species=NA, Evidence="Zymo 16S", Technology="LoopSeq"),
              data.frame(Length=nts, ErrorFree=exp(-nts*er.pbccs), Species=NA, Evidence="Zymo 16S", Technology="PacBio CCS"),
              data.frame(Length=nts, ErrorFree=exp(-nts*er.ont), Species=NA, Evidence="Zymo 16S", Technology="Oxford Nanopore"))

Combine the collated data into a publication figure showing the relationship between read-length and the error-free fraction.

#colnames(dfer)[colnames(dfer) == "Evidence"] <- "Sequence_Type"
perrfree <- ggplot(data=dfer, aes(x=Length, y=ErrorFree, shape=Evidence, color=Technology)) + 
  geom_point(data=dfp) + ylim(0, NA) + theme_bw() + 
  scale_shape_manual(name="Sequence Type",
                     values=c("Fungal 18S-ITS"="triangle", "Genomic"="square", "Zymo 16S"="circle")) +
  guides(color = guide_legend(order = 1, override.aes = list(linetype="blank")), shape = guide_legend(order = 0)) +
  geom_line(linetype="dashed") +
  ylab("Fraction Error-free Reads") + xlab("Amplicon Length")
perrfree

ggsave(file.path(path.fig, "perrfree.pdf"), perrfree, 
       width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "perrfree.png"), perrfree, 
       width=6, height=4, units="in")

ErrorFreeByLength

BJC

Setup

Fungal Analysis

Genomic Analysis

Figure showing Error-free Fraction versus Length