Accurate supervised classification of environmental DNA

Environmental DNA (eDNA) meta-barcoding offers unprecedented insights into the ecology of biological communities, enabling comprehensive assessments of biodiversity from only trace amounts of genetic material. Yet several analytical challenges remain, particularly in incorporating robust statistical inference into the assignment of taxonomic identities. False discovery rates at low taxonomic ranks are often well above those generally considered acceptable in biology, leading to questionable conclusions being reported in the literature. In this project, we focus on developing new supervised machine-learning algorithms that incorporate full probabilistic models to identify the taxonomic sources of environmental DNA amplicon sequences. These models offer strict control of false discovery rates and often improve recall (resolution/sensitivity), but at the cost of computational efficiency. As remote parallel computing services become more widely available, that cost is easier to absorb, and a shift in focus toward precision and statistical interpretability favors the incorporation of full probabilistic models in supervised taxonomic classification.