This is a modeling tool to compare methods of metagenomic biosurveillance. It puts you in the position of a person designing an early detection system to identify a novel stealth pathogen before it spreads widely.
While we've tried to make the tool realistic, many of the input parameter estimates are very rough. Without actually building a pilot metagenomic monitoring system it's quite possible that important inputs are considerably wrong.
The tool simulates an epidemic that starts with one person and grows exponentially with a specified doubling time. There are one or more sites, each implementing the program in parallel. At each site, a form of sampling and sequencing is performed on a weekly schedule. We model the pathogen as detected once a specific portion of its genome has been observed a minimum number of times.
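For intuition, here is a minimal sketch of that growth assumption. This is an illustration rather than the tool's code, and the population size and doubling time below are made-up numbers:

```python
def cumulative_infections(day: float, doubling_time: float) -> float:
    """Cumulative number of people ever infected, `day` days after patient zero."""
    return 2 ** (day / doubling_time)

population = 1_000_000   # assumed catchment size for this example
doubling_time = 3.0      # days (illustrative)

for day in range(0, 43, 7):  # check once a week for six weeks
    incidence = cumulative_infections(day, doubling_time) / population
    print(f"day {day:2d}: cumulative incidence {incidence:.4%}")
```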
At a technical level, the tool runs many simulations and graphs them
by the observed cumulative incidence at detection.
Each time the tool samples, it uses a Poisson approximation of
binomial sampling to determine how many sick people are in
the sample. Similarly, each time it sequences, it uses a
Poisson approximation of binomial sampling to determine how
many sequencing reads are obtained for the pathogen of
interest. Detection is modeled as happening when a
specified threshold number of sequencing reads that match a
specific portion of the target genome is accumulated.
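As a sketch of that approximation (not the tool's actual code; the sample size, prevalence, read depth, and relative abundance below are made up), Binomial(n, p) is replaced by Poisson(n × p), which is accurate when p is small:

```python
import numpy as np

rng = np.random.default_rng()

def poisson_binomial_approx(n_trials: int, p: float) -> int:
    """Approximate Binomial(n, p) by Poisson(n * p); good when p is small."""
    return int(rng.poisson(n_trials * p))

# How many of the 10,000 people contributing to this week's sample are currently sick?
sick_in_sample = poisson_binomial_approx(10_000, 0.001)

# How many of this run's 1e9 reads match the pathogen of interest?
pathogen_reads = poisson_binomial_approx(1_000_000_000, 1e-8)

print(sick_in_sample, pathogen_reads)
```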
The tool assumes that sequencing reads are moderately unevenly
distributed along the pathogen's genome, following the coverage
distribution we observe for SARS-CoV-2 in unpublished MU data.
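Conceptually, uneven coverage means a read is more likely to land in some parts of the genome than others, so the chance that a given pathogen read falls in the specific flaggable portion can differ from that portion's share of the genome. The sketch below uses a made-up weight vector rather than the actual SARS-CoV-2 coverage profile, and is only meant to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng()

n_bins = 100                       # genome split into coarse bins
# Hypothetical uneven coverage weights; the tool uses an empirical
# SARS-CoV-2 coverage profile instead of this made-up one.
weights = rng.gamma(shape=2.0, scale=1.0, size=n_bins)
weights /= weights.sum()

target_bin = 40        # stands in for the "specific portion" of the genome
pathogen_reads = 25    # pathogen-matching reads obtained this week

# Where along the genome do those reads land?
bins = rng.choice(n_bins, size=pathogen_reads, p=weights)
hits_on_target = int(np.sum(bins == target_bin))
print(f"{hits_on_target} of {pathogen_reads} reads hit the flaggable region")
```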
The tool simulates many times (1,000 by default) and charts the
range of cumulative incidences observed. Any outcomes where the
cumulative incidence is over 30% are marked as a failure by showing a
"0" on the chart. The simulation is not correct for high cumulative
incidence, as it only models the initial exponential stage of the
epidemic.
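A sketch of how such a summary could be produced (illustrative only: the choice of percentiles and the fake outcome data are arbitrary, not what the tool computes):

```python
import numpy as np

def summarize(detection_incidences, failure_threshold=0.30):
    """Replace runs that only detect above the threshold with 0 ("failure"),
    then report a few percentiles of cumulative incidence at detection."""
    outcomes = np.asarray(detection_incidences, dtype=float)
    outcomes = np.where(outcomes > failure_threshold, 0.0, outcomes)
    return {p: float(np.percentile(outcomes, p)) for p in (5, 25, 50, 75, 95)}

# 1,000 made-up simulation outcomes, just to show the shape of the summary
rng = np.random.default_rng(0)
fake_outcomes = rng.lognormal(mean=np.log(0.01), sigma=1.0, size=1_000)
print(summarize(fake_outcomes))
```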
Certain values allow you to specify how much noise to add
to the model. The noise is generated once per simulation.
For the inputs marked "CV" the noise is normally distributed; for ones
marked "CVg" the noise is lognormally
distributed. The higher the CV you set, the more uncertainty
will be introduced into the simulation, generally causing
the low-percentile outputs to be more optimistic and the
high-percentile outputs to be more pessimistic.
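One plausible reading of these inputs, sketched below: "CV" applies a normal multiplier with mean 1 and standard deviation CV, while "CVg" applies a lognormal multiplier with σ of the log equal to CVg, each drawn once per simulation. Treat both readings as assumptions about the tool's behavior, not a specification:

```python
import numpy as np

rng = np.random.default_rng()

def noisy(base, cv=0.0, cv_g=0.0):
    """Apply per-simulation noise to an input value.

    cv:   normal noise, read here as a multiplier drawn from N(1, cv).
    cv_g: lognormal noise, read here as a multiplier exp(N(0, cv_g)).
    Both readings are assumptions about the "CV"/"CVg" inputs, not the
    tool's exact code.
    """
    value = base
    if cv > 0:
        value *= rng.normal(1.0, cv)
    if cv_g > 0:
        value *= rng.lognormal(0.0, cv_g)
    return value

# e.g. a 3-day doubling time with 20% uncertainty, drawn once per simulation
print(noisy(3.0, cv=0.2))
```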
If we're starting with a set of individual-level relative
abundances we multiply them by lognormally distributed noise with
μ=0 and σ=σg. This noise shifts all
provided values in the same direction to represent our uncertainty
about whether these relative abundances are systematically too high
or low. For example, if in one of the 1,000 simulations our
lognormal draw gives us a noise value of 0.7 and the provided
values are 1e-5 and 1e-6, for that simulation each time someone is
sick we'll pick one of 7e-6 and 7e-7.
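The same example as a sketch in code (the σ value is illustrative):

```python
import numpy as np

rng = np.random.default_rng()

# Provided individual-level relative abundances, treated as equally likely.
relative_abundances = np.array([1e-5, 1e-6])
sigma_g = 0.5   # standard deviation of the log (illustrative value)

# One lognormal draw per simulation shifts all provided values together.
noise = rng.lognormal(mean=0.0, sigma=sigma_g)   # e.g. 0.7
noisy_abundances = relative_abundances * noise   # e.g. [7e-6, 7e-7]

# Each time someone contributing to the sample is sick, pick one of them.
per_person = rng.choice(noisy_abundances)
print(noise, noisy_abundances, per_person)
```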
On the other hand, if we're starting with an imported
RAi(1%) distribution we draw from the distribution once
for each simulation. For example, if in one of the 1,000 simulations
our RAi(1%) draw gives us 1e-7, then in a week when 1% of people
became infected we'll draw the number of sequencing
reads that match the pathogen from a Poisson distribution with a mean
of the total number of sequencing reads times 1e-7.
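Continuing that example in code (the read depth is made up, and scaling the expected relative abundance linearly with weekly incidence is an assumption, reflecting how we read the RAi(1%) framing):

```python
import numpy as np

rng = np.random.default_rng()

ra_i_1pct = 1e-7          # this simulation's RAi(1%) draw
weekly_incidence = 0.01   # fraction of people infected in the last week
total_reads = 1_000_000_000

# Expected pathogen reads; assumes expected relative abundance scales
# linearly with weekly incidence (an assumption, not the tool's spec).
expected = total_reads * ra_i_1pct * (weekly_incidence / 0.01)
pathogen_reads = rng.poisson(expected)
print(expected, pathogen_reads)
```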
You can see the actual implementation of the tool by reading the simulate_one function in the source of this page or on GitHub.
Relative abundance estimates refer to the fraction of sequencing
reads that would come from the modeled virus if everyone
contributing to the sample were currently infected. We estimated shedding for
SARS-CoV-2 and Influenza A using different sources:
We use the RAi(1%) distribution from our Grimm et al. 2023 work, assuming sampling and sequencing similar to Rothman et al. 2021.
We use the RAi(1%) distributions we calculated in Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data, based on applying the methods from Grimm et al. 2023 to unpublished MU and UCI sequencing data.
We use the individual-level relative abundances we collected
in Investigating
the Sensitivity of Pooled Swab Sampling for Pathogen Early Detection,
treating each outcome as equally likely.
There are several places where this tool leaves out considerations we know are significant:
Real epidemics don't grow perfectly exponentially. Early on
individual infection events can have a large contribution,
such as with superspreader events. This tool assumes simple
exponential growth.
Real epidemics progress faster in some locations and slower in
others. If you're sampling at multiple sites you should expect the
epidemic to be farther ahead in some places than others. This tool
assumes the epidemic grows equally quickly at each site, which
underestimates the value of running multiple sites.
Real epidemics fall below exponential growth, even if they're
spreading unnoticed, once they start to run out of susceptible
people. This tool only simulates the early portion of the curve,
where exponential growth is a good approximation. In practice this is
a minimal limitation, because we're really only concerned with
scenarios that would let us flag a pandemic before too many people
had been infected.
For very substantial efforts, with very deep sequencing
across many sites, a real detection system would start to see
economies of scale. We don't model this, and instead assume
linear cost per site.
Please treat this tool as a robot that will scribble on the back of an envelope for you, not a precise estimator!
Scenario

Pathogen and Sampling

Doubling time (days), with CV.

Genome length (bp).
Relative abundance, with σ: If you ran metagenomic sequencing on a sample that was from just a single infected person, what fraction of sequencing reads would be from the pathogen? This could be a single number to model all infected people as contributing an equal amount, or multiple space-separated numbers to represent a set of equally likely contribution amounts. For σ you specify the standard deviation of the log, giving per-simulation noise representing our uncertainty about whether the true relative abundance would be higher or lower.

Directly specify relative abundance of flaggable reads? Normally we discount the fraction of reads that are flaggable by considering the genome length, read length, and fragment quality. If you're modeling these outside of this tool, however, and just want to specify this fraction directly, check this box and provide your fraction in the "relative abundance" box above.

Shedding duration (days), with σ: Once someone is infected, how many days do they shed for? This is a simplified model where we assume people shed equally throughout their infection. For σ, specify the standard deviation of the log.

Low-quality fragments? When sequencing wastewater the nucleic acid fragments tend to be pretty torn up by the hostile environment, making it difficult to get much value from long-read sequencing. If this box is checked we assume that sequencing reads won't be longer than 170bp even if you're using a sequencing machine capable of producing longer ones when given good material.

Sample population (people): Each time you go out and sample people, how many people are contributing?

Sampling schedule.

Cost per sample (dollars).
Sequencing

Sequencing run depth (reads): Each time you run your sequencer, how many reads does it produce?

Read length (bp): How long are the reads your sequencer generates? Note that actual read lengths will be shorter if the sample is low quality; see "low-quality fragments?" above.

Cost per sequencing run (dollars).

Sequencing schedule.

Repeat frequency (weeks).

Processing delay (days): How long does it take from collecting a sample until you have the bioinformatic results? Remember to include lab time to prepare the sample for sequencing, waiting for your sample to get a turn on the sequencer, time on the sequencer, and bioinformatic processing.
Global

Minimum Observations: This model simulates detection as happening when a particular portion of the genome has been observed some number of times. Using a higher number here corresponds to requiring more certainty before raising an alarm.

Sites: In this model you can parallelize across multiple sites, each of which works exactly the same way. As a rough approximation, running twice as many sites will cost twice as much and detect at half the cumulative incidence. Note that we don't model the most obvious benefit of multiple sites, where each one has a chance of being a location where the epidemic is farther ahead.

Overhead (percentage): Assume costs are higher than the inputs specified above by this fixed percentage.

Annual Cost (dollars).
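The tool reports an annual cost derived from the inputs above. The exact formula isn't spelled out here, but a back-of-envelope version consistent with the linear-cost-per-site assumption might look like the sketch below; all parameter names, the schedule handling, and the example numbers are assumptions:

```python
def annual_cost(sites, cost_per_sample, cost_per_run,
                samples_per_year, runs_per_year, overhead_pct):
    """Hypothetical back-of-envelope cost model: linear in the number of
    sites, plus a fixed overhead percentage. Not the tool's exact formula."""
    per_site = cost_per_sample * samples_per_year + cost_per_run * runs_per_year
    return sites * per_site * (1 + overhead_pct / 100)

# e.g. 2 sites, weekly sampling and sequencing, 15% overhead (made-up numbers)
print(annual_cost(sites=2, cost_per_sample=100, cost_per_run=3_000,
                  samples_per_year=52, runs_per_year=52, overhead_pct=15))
```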
Simulation

Simulations: We will run the simulation the specified number of times, and display the results in a chart below.

Simulation Label: Use a descriptive label here, which will appear in your chart and on the outcomes table.
Cumulative Incidence at Detection: when the system raises the alarm, what fraction of people do we estimate will have ever been infected? Lower is better, since it represents identifying the pandemic earlier and having more time to respond.