Extrapolation of Urn Models via Poissonization: Accurate Measurements of the Microbial Unknown

The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower-bounds on the alpha-diversit…

Authors: Manuel Lladser, Raul Gouet, Jens Reeder

Manuel E. Lladser (1,*), Raúl Gouet (2), Jens Reeder (3)

(1) Department of Applied Mathematics, University of Colorado, Boulder, Colorado, USA
(2) Centro de Modelamiento Matemático (CNRS UMI 2807), Universidad de Chile, Santiago, Chile
(3) Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado, USA
(*) E-mail: manuel.lladser@colorado.edu

Abstract

The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower-bounds on the α-diversity, i.e. the total number of different species present in the environment, tight bounds on this quantity may be highly uncertain because a small fraction of the environment could be composed of a vast number of different species. To better assess what remains unknown, we propose instead to predict the fraction of the environment that belongs to unsampled classes. Modeling samples as draws with replacement of colored balls from an urn with an unknown composition, and under the sole assumption that there are still undiscovered species, we show that conditionally unbiased predictors and exact prediction intervals (of constant length in logarithmic scale) are possible for the fraction of the environment that belongs to unsampled classes. Our predictions are based on a Poissonization argument, which we have implemented in what we call the Embedding algorithm. In fixed, i.e. non-randomized, sample sizes, the algorithm leads to very accurate predictions on a sub-sample of the original sample.

We quantify the effect of fixed sample sizes on our prediction intervals and test our methods and others found in the literature against simulated environments, which we devise taking into account datasets from a human-gut and -hand microbiota. Our methodology applies to any dataset that can be conceptualized as a sample with replacement from an urn. In particular, it could be applied, for example, to quantify the proportion of all the unseen solutions to a binding site problem in a random RNA pool, or to reassess the surveillance of a certain terrorist group, predicting the conditional probability that it deploys a new tactic in a next attack.

Full paper available at: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0021105
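To make the urn picture concrete, the following minimal Python sketch simulates draws with replacement from an urn and computes the quantity the abstract targets: the fraction of the urn held by colors absent from the sample. The urn composition, the sample size, and all function names are hypothetical choices made for illustration. For comparison it also computes the classical Good-Turing estimate (the share of draws whose color was seen exactly once), which is swapped in here as a simple stand-in; it is not the Embedding algorithm and yields no prediction intervals.

```python
import random
from collections import Counter


def draw_sample(proportions, n, rng):
    """Draw n balls with replacement; colors are indices into proportions."""
    colors = range(len(proportions))
    return rng.choices(colors, weights=proportions, k=n)


def unseen_fraction(proportions, sample):
    """True fraction of the urn held by colors absent from the sample.

    Only computable in simulation, where the urn composition is known.
    """
    seen = set(sample)
    return sum(p for color, p in enumerate(proportions) if color not in seen)


def good_turing_unseen(sample):
    """Good-Turing estimate: share of draws whose color was seen exactly once."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Hypothetical urn (an assumption, not taken from the paper):
# a few dominant colors plus a long tail of rare ones.
rng = random.Random(0)
weights = [1.0 / (i + 1) for i in range(2000)]
total = sum(weights)
proportions = [w / total for w in weights]

sample = draw_sample(proportions, 500, rng)
print("true unseen fraction:", round(unseen_fraction(proportions, sample), 3))
print("Good-Turing estimate:", round(good_turing_unseen(sample), 3))
```

In a real dataset only the sample is observed, so `unseen_fraction` is unavailable and must be predicted; the simulation is useful precisely because it lets a predictor be checked against the truth.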
