Introduction
Welcome to the insect R package, a tool for assigning taxon IDs to amplicon libraries using informatic sequence classification trees. The insect learning algorithm takes a set of reference sequences (obtained from GenBank and/or other sources) to build a classification tree, which is then used to assign taxonomic IDs to a set of query sequences (e.g. those generated from an NGS platform such as Illumina MiSeq). The learning and classification functions are best suited to computing environments with multiple processors and access to large amounts of memory; however, most modest-sized datasets can be processed on standard personal computers if time is available. While not a prerequisite, insect is designed to be used in conjunction with the ape package (Paradis et al., 2004; Paradis, 2012), which features memory-efficient binary formats for DNA and amino acid sequences (“DNAbin” and “AAbin” object types), and the dada2 package (Callahan et al., 2016) which contains essential functions for de-noising high-throughput sequencing data and other important pre-processing steps.
The insect package can be used to analyze environmental DNA (eDNA) meta-barcode libraries as well as single-source NGS/Sanger amplicon sequences.
The most time-consuming and memory-intensive stage of the insect work-flow generally involves training the classifier. For example, the COI classifier used in this tutorial, which was built from the MIDORI UNIQUE reference trainingset consisting of nearly a million COI barcode sequences, took around three days to run on a 24 x multithread. The insect classification trees are amplicon specific, so a unique tree is generally required for each primer set. However, trees are already available for some of the more commonly used barcoding primers here:
Marker | Target | Primers | Source | Version | Date | Download |
---|---|---|---|---|---|---|
12S | Fish | MiFishUF/MiFishUR (Miya et al 2015) | GenBank | 1 | 20181111 | RDS (9MB) |
16S | Marine crustaceans | Crust16S_F/Crust16S_R (Berry et al 2017) | GenBank | 4 | 20180626 | RDS (7.1 MB) |
16S | Marine fish | Fish16sF/16s2R (Berry et al 2017; Deagle et al 2007) | GenBank | 4 | 20180627 | RDS (6.8MB) |
18S | Marine eukaryotes | 18S_1F/18S_400R (Pochon et al 2017) | SILVA_132_LSUParc, GenBank | 5 | 20180709 | RDS (11.8 MB) |
18S | Marine eukaryotes | 18S_V4F/18S_V4R (Stat et al 2017) | GenBank | 4 | 20180525 | RDS (11.5 MB) |
23S | Algae | p23SrV_f1/p23SrV_r1 (Sherwood & Presting 2007) | SILVA_132_LSUParc | 1 | 20180715 | RDS (26.9MB) |
COI | Metazoans | mlCOIintF/jgHCO2198 (Leray et al 2013) | Midori, GenBank | 5 | 20181124 | RDS (140 MB) |
ITS2 | Cnidarians and sponges | scl58SF/scl28SR (Wilkinson et al in prep) | GenBank | 5 | 20180920 | RDS (6.6 MB) |
The insect package also includes functions for downloading, trimming and filtering reference datasets (including a “virtual PCR” tool and an annotation quality filter), building a hierarchical taxonomy database, and training the classifier; however, these methods are beyond the scope of this introductory tutorial. New classification trees and updates are frequently added to the collection, so please feel free to suggest a barcoding primer set with which to train a classifier and we will endeavor to add it to the list.
The INSECT learning and classification algorithms
To learn a classification tree, a reference sequence dataset is first
obtained from GenBank and/or other sources from which barcode sequences
with accurate taxon IDs are available. These sequences are filtered to
remove any with obvious taxonomic labeling issues, and trimmed to retain
only the region of interest using the virtualPCR
function (sequences
that do not span the entire amplicon region are removed). the learn
function then splits the training sequences into two subsets, maximizing
the dissimilarity between the two groups. A profile hidden Markov model
is then derived for each group (see Durbin et al. (1998) for a detailed
description of these models). The partitioning and model training
procedure then continues recursively, splitting the reference sequences
into smaller and smaller subsets while adding new nodes and models to
the classification tree.
Once the classifier has been trained, query sequences obtained from the specified primer set can be assigned taxonomic IDs along with probabilistic confidence values. The classification algorithm works as follows: starting from the root node of the classification tree, the likelihood of the query sequence (the full probability of the sequence given a particular model) is computed for each of the models at the two immediate child nodes using the forward algorithm (see Durbin et al. (1998)). The competing likelihood values are then compared by computing their Akaike weights (see Johnson and Omland, 2004). If one model is overwhelmingly more likely to have produced the sequence than the other, that child node is selected and the classification is updated to reflect the lowest common taxonomic rank of the training sequences belonging to the node.
This procedure is repeated recursively, descending down the tree until
either an inconclusive result is returned from a model comparison test
(i.e. the Akaike weight is lower than a pre-defined threshold, e.g.
0.9), or a terminal leaf node is reached, at which point a species-level
ID is generally returned. The classify
function outputs the taxon
name, rank and ID number (i.e. NCBI taxon ID, WoRMS aphia ID, or other
identifier depending on the taxonomy database used in the training
step), along with the final Akaike weight value, which can be
interpreted as a confidence score (between 0 and 1, with values close to
1 indicating high confidence). Note that the default behavior is for the
Akaike weight to ‘decay’ as it moves down the tree, by computing the
cumulative product of all preceding Akaike weight values. This is
perhaps an overly conservative approach, but it minimizes the chance of
mis-classifying or over-classifying the query sequences.
A worked example
This tutorial demonstrates the insect work-flow using an example dataset of COI sequences derived from autonomous reef monitoring structures (ARMS) in Ofu, American Samoa, amplified using the metazoan COI barcoding primers mlCOIintF and jgHCO2198 (GGWACWGGWTGAACWGTWTAYCCYCC and TAIACYTCIGGRTGICCRAARAAYCA, respectively; Leray et al. (2013)).
The dataset was first de-noised and tabulated using the DADA2 pipeline, to produce a table of chimera-free amplicon sequence variants (ASVs). The most abundant 16 ASVs are included in the insect package as an example dataset (see below).
If using other tools for de-noising, trimming, merging, etc, the data
can be read in using either the readFASTA
or readFASTQ
functions to
produce a “DNAbin” object compatible with the insect classifier.
Loading the package and data
First, make sure that the devtools, ape and seqinr packages are installed and up to date. Then install and load the latest development version of the insect package from GitHub as follows:
devtools::install_github("shaunpwilkinson/insect")
library(insect)
The COI classifier was trained on the MIDORI UNIQUE 20180221 dataset, and uses information from both the DNA read and the translated amino acid sequence (EBI5 invertebrate mitochondrial translation table) to assign taxonomy to query sequences. You can download the classifier from the link in the table above; the file is quite large (~ 140 MB), so make sure there is a good internet connection available.
Alternatively, the classifier can be downloaded to the current working directory as follows:
download.file("https://www.dropbox.com/s/dvnrhnfmo727774/classifier.rds?dl=1",
destfile = "classifier.rds", mode = "wb")
Classifying sequences
To assign taxon IDs to the table of amplicon sequence variants (ASVs)
produced by DADA2, we first extract the sequences stored as column names
in the seqtab.nochim
matrix (see the DADA2
tutorial for
instructions on how to produce this table).
Once the sequences are stored in memory as a “DNAbin” object, we can optionally nullify the column names in the table to avoid flooding the console with long sequence strings:
## read in the example seqtab.nochim ASV table
data(samoa)
## get sequences from table column names
x <- char2dna(colnames(samoa))
## name the sequences sequentially
names(x) <- paste0("ASV", seq_along(x))
## optionally remove column names that can flood the console when printed
colnames(samoa) <- NULL
The next step is to load the classifier. Note that this ‘insect’ class object is just a large dendrogram with additional attributes for classifying sequences including profile HMMs and taxonomic information:
classifier <- readRDS("classifier.rds")
classifier
names(attributes(classifier))
#> 'dendrogram' with 2 branches and 113833 members total, at height 70
#> [1] "k" "height" "midpoint" "members" "class"
#> [6] "taxonomy" "clade" "frame" "remainder" "sequences"
#> [11] "minscore" "seqlengths" "pointers" "key" "kmers"
#> [16] "minlength" "maxlength" "model" "trainingset" "alternative"
#> [21] "nunique" "ntotal" "taxID" "numcode"
The final step is to assign taxon IDs and confidence values to each ASV.
The classify
function may take a minute or so to process these
sequences, since it uses a computationally intensive dynamic programming
algorithm to find the likelihood values of each sequence given the
models at each node of the tree. The exception is when ping
is not
FALSE
and there is an exact match or high similarity
between the query sequence and at least one of the sequences in the
training dataset (e.g. >= 99% if ping = 0.99
). In this case the
function simply returns the common ancestor of the matching sequences
without a confidence score. To stay on the safe side, we will keep
ping = 1
(i.e. only sequences with 100% identity are considered
matches).
The classify
function can also be run in parallel by setting the
cores
argument to 2 or more depending on the number available (setting
cores = "autodetect"
will automatically run on one less than the
number available). If choosing this option for the large COI classifier,
please ensure that there is at least 2 GB of available RAM per
processor. Classification times can vary, and depend on several factors
including the number of unique sequences in the dataset, the size of the
classifier, the length of the input sequences, the processing speed,
number of processors used, etc. The average time to ID COI sequences
using the classifier above is approximately 3 - 4 seconds per unique
sequence per processor. For example, a dataset containing 1000 unique
sequences would take around an hour to process on a single processor,
half an hour on two, and so on.
In the following code we set tabulize = FALSE
(the default setting)
since the DADA2 output table already contains the sequence counts and we
are only classifying the unique sequence variants. In the case where a
list of sequences containing replicates is to be processed, users can
prefix the sequence names with their respective sample names, delimited
with an underscore (e.g. “sample001_sequence001”) and set
tabulize = TRUE
. In this case, the classify
function will
automatically count and dereplicate the sequences, producing an output
table with one column of sequence counts for each sample.
longDF <- classify(x, classifier, threshold = 0.8)
For DADA2 users, the ASV abundance table can now be transposed and appended to the table of taxonomic information if required:
longDF <- cbind(longDF, t(samoa))
representative | taxID | taxon | rank | score | kingdom | phylum | class | order | family | genus | species | OFU04A-100_S64 | OFU04A-1_S13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ASV1 | 2806 | Florideophyceae | class | 0.9981 | Florideophyceae | 156 | 5600 | ||||||
ASV2 | 6379 | Chaetopterus | genus | 1.0000 | Metazoa | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus | 4496 | 0 | |
ASV3 | 2806 | Florideophyceae | class | 0.9989 | Florideophyceae | 28 | 3267 | ||||||
ASV4 | 2172821 | Multicrustacea | superclass | 1.0000 | Metazoa | Arthropoda | 3203 | 0 | |||||
ASV5 | 131567 | cellular organisms | no rank | 0.9952 | 3024 | 0 | |||||||
ASV6 | 2806 | Florideophyceae | class | 0.9981 | Florideophyceae | 20 | 2409 | ||||||
ASV7 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | 2379 | 0 | ||
ASV8 | 116571 | Podoplea | superorder | 0.9995 | Metazoa | Arthropoda | Hexanauplia | 2156 | 104 | ||||
ASV9 | 2806 | Florideophyceae | class | 0.9482 | Florideophyceae | 0 | 2149 | ||||||
ASV10 | 1 | root | no rank | NA | 2091 | 0 | |||||||
ASV11 | 115834 | Hesionidae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Hesionidae | 1905 | 6 | ||
ASV12 | 1443949 | Corallinophycidae | subclass | 0.9910 | Florideophyceae | 87 | 1757 | ||||||
ASV13 | 33213 | Bilateria | no rank | 1.0000 | Metazoa | 27 | 1800 | ||||||
ASV14 | 131567 | cellular organisms | no rank | 0.9952 | 1729 | 9 | |||||||
ASV15 | 2806 | Florideophyceae | class | 0.9993 | Florideophyceae | 0 | 1725 | ||||||
ASV16 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | 1481 | 0 |
Any sequences that return exact hits with at least one training sequence
(or near matches if ping = 0.99
or similar) are assigned a score of
NA
. For hybrid DNA/AA classifiers such as the COI version used above,
non-translatable sequences are also automatically assigned a score of
NA
, as is the case for ASV10 in the table above.
For a more succinct output we can aggregate the table to only include one row for each unique taxon as follows:
taxa <- aggregate(longDF[3:12], longDF["taxID"], head, 1)
counts <- aggregate(longDF[13:ncol(longDF)], longDF["taxID"], sum)
shortDF <- merge(taxa, counts, by = "taxID")
taxID | taxon | rank | score | kingdom | phylum | class | order | family | genus | species | OFU04A-100_S64 | OFU04A-1_S13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | root | no rank | NA | 2091 | 0 | |||||||
2806 | Florideophyceae | class | 0.9981 | Florideophyceae | 204 | 15150 | ||||||
6379 | Chaetopterus | genus | 1.0000 | Metazoa | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus | 4496 | 0 | |
33213 | Bilateria | no rank | 1.0000 | Metazoa | 27 | 1800 | ||||||
39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | 3860 | 0 | ||
115834 | Hesionidae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Hesionidae | 1905 | 6 | ||
116571 | Podoplea | superorder | 0.9995 | Metazoa | Arthropoda | Hexanauplia | 2156 | 104 | ||||
131567 | cellular organisms | no rank | 0.9952 | 4753 | 9 | |||||||
1443949 | Corallinophycidae | subclass | 0.9910 | Florideophyceae | 87 | 1757 | ||||||
2172821 | Multicrustacea | superclass | 1.0000 | Metazoa | Arthropoda | 3203 | 0 |
As shown in the above example, many of the sequences return fairly
uninformative taxon IDs (e.g. ‘cellular organisms’). This is a fairly
typical feature of eDNA datasets that can contain a large number of
novel sequences that are dissimilar to anything recorded in the
reference database(s). Note that query sequences with high similarity to
reference sequences can also occasionally produce uninformative
classifications due to inconclusive model comparison tests at top-level
nodes. This may be circumvented by reducing the threshold
parameter or
setting decay = FALSE
; however, users are advised against the
excessive relaxation of these parameters since it may increase the
chance of returning erroneous classifications (these tend to be very
rare when using the default values). Further testing and optimization
may help to address some of these best-practice considerations, and will
be a focus of future research.
This introduction to the insect package has outlined the steps involved in taxonomic identification of amplicon sequence variants (ASVs) using a pre-built classification tree. The next tutorial will deal with downloading and curating a primer-specific local sequence database and using it to build a classification tree.
Please feel free to email the author directly with any feedback or questions at shaunpwilkinson AT gmail DOT com. Bug reports can also be directed to the GitHub issues page.
Acknowledgements
This software was developed with funding from a Rutherford Foundation Postdoctoral Research Fellowship from the Royal Society of New Zealand. Unpublished COI data care of Molly Timmers (NOAA).
References
Callahan,B.J. et al. (2016) DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods, 13, 581–583.
Durbin,R. et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Johnson,J.B. and Omland,K.S. (2004) Model selection in ecology and evolution. Trends in Ecology and Evolution, 19, 101–108.
Leray,M. et al. (2013) A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in Zoology, 10, 34.
Paradis,E. (2012) Analysis of Phylogenetics and Evolution with R. Second Edition. Springer, New York.
Paradis,E. et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.