Sample Simulator

This is a modeling tool to compare methods of metagenomic biosurveillance. It puts you in the position of a person designing an early detection system to identify a novel stealth pathogen before it spreads widely.

While we've tried to make the tool realistic, many of the input parameter estimates are very rough. Without actually building a pilot metagenomic monitoring system it's quite possible that important inputs are considerably wrong.

The tool simulates an epidemic that starts with one person and grows exponentially with a specified doubling time. There are one or more sites, each implementing the program in parallel. At each site, a form of sampling and sequencing is performed on a weekly schedule. We model the pathogen as detected once a specific portion of its genome has been observed a minimum number of times.
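
As a rough illustration of the growth assumption (a sketch, not the tool's own code), cumulative incidence under pure exponential growth can be written directly in terms of the doubling time. The `population` argument is a hypothetical denominator introduced here so that a count of infections can be expressed as a fraction.

```python
import math


def cumulative_incidence(day, doubling_time_days, population):
    """Fraction of the population ever infected by `day`, starting from one case on day 0."""
    infected = 2.0 ** (day / doubling_time_days)
    return min(infected / population, 1.0)


def days_until_incidence(target, doubling_time_days, population):
    """Days for cumulative incidence to reach `target` under the same assumption."""
    return doubling_time_days * math.log2(target * population)
```

For example, with a 3-day doubling time in a population of 1,000, `cumulative_incidence(6, 3, 1000)` is 0.004: two doublings take one case to four.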

At a technical level, the tool runs many simulations and graphs them by the observed cumulative incidence at detection.
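
One way to picture the summary step is to sort the per-simulation outcomes and read off percentiles. The sketch below assumes a hypothetical set of percentiles and applies the 30% cumulative-incidence failure cutoff that the tool describes; the tool's own charting may differ.

```python
def summarize(outcomes, percentiles=(25, 50, 75, 90), failure_threshold=0.30):
    """Map each percentile to the cumulative incidence at detection.

    `outcomes` holds one cumulative incidence per simulation. A percentile
    whose value exceeds `failure_threshold` is reported as None (a failure).
    """
    ordered = sorted(outcomes)
    summary = {}
    for p in percentiles:
        # Integer arithmetic avoids floating-point surprises in the index.
        index = min(len(ordered) - 1, p * len(ordered) // 100)
        value = ordered[index]
        summary[p] = None if value > failure_threshold else value
    return summary
```

With outcomes `[0.001, 0.002, 0.003, 0.004, 0.5]`, the 50th percentile is 0.003, while the 90th lands on 0.5 and is reported as a failure.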

Each time the tool samples, it uses a Poisson approximation of binomial sampling to determine how many sick people are in the sample. Similarly, each time it sequences, it uses a Poisson approximation of binomial sampling to determine how many sequencing reads are obtained for the pathogen of interest. Detection is modeled as happening when a specified threshold number of sequencing reads that match a specific portion of the target genome is accumulated.
The tool assumes that sequencing reads are moderately unevenly distributed along the pathogen's genome, following the coverage distribution we observe for SARS-CoV-2 in unpublished MU data.
The tool simulates many times (1,000 by default) and charts the range of cumulative incidences observed. Any outcomes where the cumulative incidence is over 30% are marked as a failure by showing a "0" on the chart. The simulation is not correct for high cumulative incidence, as it only models the initial exponential stage of the epidemic.
Certain values allow you to specify how much noise to add to the model. The noise is generated once per simulation. For the inputs marked "CV" the noise is normally distributed; for ones marked "CV_g" the noise is lognormally distributed. The higher the CV you set, the more uncertainty will be introduced into the simulation, generally causing the low-percentile outputs to be more optimistic and the high-percentile outputs to be more pessimistic.
If we're starting with a set of individual-level relative abundances we multiply them by lognormally distributed noise with μ=0 and σ=σ_g. This noise shifts all provided values in the same direction to represent our uncertainty about whether these relative abundances are systematically too high or low. For example, if in one of the 1,000 simulations our lognormal draw gives us a noise value of 0.7 and the provided values are 1e-5 and 1e-6, for that simulation each time someone is sick we'll pick one of 7e-6 and 7e-7.
On the other hand, if we're starting with an imported RA_i(1%) distribution we draw from the distribution once for each simulation. For example, if in one of the 1,000 simulations our RA_i(1%) draw gives us 1e-7, then when 1% of people became infected in the last week we'll draw the number of sequencing reads that match the pathogen from a Poisson distribution with a mean of the total number of sequencing reads times 1e-7.
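
The sampling and noise steps above can be sketched as follows. This is a simplified illustration, not the tool's simulate_one function: all function names are hypothetical, the pooling rule (sum of sick people's abundances divided by pool size) and the linear scaling around 1% incidence are assumptions, and the Poisson sampler is a standard-library stand-in.

```python
import math
import random


def poisson(mean, rng):
    """Draw from Poisson(mean) using only the standard library.

    Knuth's multiplication method for small means; a rounded normal
    approximation for large means (a shortcut chosen for this sketch).
    """
    if mean <= 0:
        return 0
    if mean > 30:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def shared_ra_noise(relative_abundances, sigma_g, rng):
    """Scale every provided value by ONE lognormal draw (mu=0, sigma=sigma_g)."""
    factor = rng.lognormvariate(0.0, sigma_g)
    return [ra * factor for ra in relative_abundances]


def reads_from_individual_ras(n_sampled, prevalence, noisy_ras, total_reads, rng):
    """Pathogen reads for one pooled sample built from individual-level abundances."""
    # Poisson approximation of binomial sampling: how many sampled people are sick.
    n_sick = poisson(n_sampled * prevalence, rng)
    # Assumption: each person contributes equally, so the pooled abundance is
    # the sum of the sick people's abundances divided by the pool size.
    pooled_ra = sum(rng.choice(noisy_ras) for _ in range(n_sick)) / n_sampled
    return poisson(total_reads * pooled_ra, rng)


def reads_from_ra_1pct(ra_1pct, weekly_incidence, total_reads, rng):
    """Pathogen reads when starting from an RA_i(1%) draw."""
    # Assumption: abundance scales linearly with weekly incidence around 1%.
    return poisson(total_reads * ra_1pct * weekly_incidence / 0.01, rng)
```

In this sketch `shared_ra_noise` would be called once per simulation, and the two `reads_from_...` functions once per sequenced sample, matching the "once per simulation" noise described above.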

You can see the actual implementation of the tool by reading the simulate_one function in the source of this page or on GitHub.

Relative abundance estimates refer to the fraction of sequencing reads that would come from the modeled virus if everyone contributing to the sample was currently infected. We estimated shedding for SARS-CoV-2 and Influenza A using different sources:

SARS-CoV-2 in Municipal Wastewater: We use the RA_i(1%) distribution from Grimm et al. 2023, assuming sampling and sequencing similar to Rothman et al. 2021.
Influenza in Municipal Wastewater: We use the distributions we calculated in Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data for RA_i(1%), based on applying the methods from Grimm et al. 2023 to unpublished MU and UCI sequencing data.
Nasal/Throat Swabs: We use the individual-level relative abundances we collected in Investigating the Sensitivity of Pooled Swab Sampling for Pathogen Early Detection, treating each outcome as equally likely.

There are several main places where this tool does not include considerations we know are significant:

Real epidemics don't grow perfectly exponentially. Early on, individual infection events, such as superspreader events, can have a large effect. This tool assumes simple exponential growth.
Real epidemics progress faster in some locations and slower in others. If you're sampling at multiple sites, you should see the epidemic farther ahead in some places than others. This tool assumes the epidemic grows equally quickly at each site, which underestimates the value of running multiple sites.
Real epidemics fall below exponential growth, even if they're spreading unnoticed, once an appreciable fraction of people are no longer susceptible. This tool only simulates the early portion of the curve, where exponential growth is a good approximation. In practice this is a minor limitation, because we're really only concerned with scenarios that would let us flag a pandemic before too many people had been infected.
For very substantial efforts, with very deep sequencing across many sites, a real detection system would start to see economies of scale. We don't model this, and instead assume linear cost per site.

Please treat this tool as a robot that will scribble on the back of an envelope for you, not a precise estimator!

To use the tool, choose a base scenario, modify any parameters you're interested in, and click "Run". Sharing the full URL lets anyone who loads it see (a) your last set of parameters and (b) any saved runs.

Scenario

Covid Wastewater $1M/y Covid Nasal Nanopore $1M/y
Flu Wastewater $1M/y Flu Nasal Nanopore $1M/y
Custom

Pathogen and Sampling

Doubling time (days):

CV:

Genome length (bp):

SARS-CoV-2 in Wastewater with Rothman et al. 2021 Sequencing (source)
Flu A in Wastewater with MU Sequencing (source)
Flu B in Wastewater with MU Sequencing (source)
Flu A in Wastewater with UCI Sequencing (source)
Flu B in Wastewater with UCI Sequencing (source)
SARS-CoV-2 in Nasal and/or Throat Swabs (source)
Flu A in Nasal and/or Throat Swabs (source)
Custom

Relative abundance:

σ:

If you ran metagenomic sequencing on a sample that was from just a single infected person, what fraction of sequencing reads would be from the pathogen?

This could be a single number to model all infected people as contributing an equal amount, or multiple space-separated numbers to represent a set of equally likely contribution amounts. For σ you specify the standard deviation of the log, giving per-simulation noise representing our uncertainty about whether the true relative abundance would be higher or lower.

Directly specify relative abundance of flaggable reads?

Normally we discount the fraction of reads that are flaggable by considering the genome length, read length, and fragment quality. If you're modeling these outside of this tool, however, and just want to specify this fraction directly, check this box and provide your fraction in the "relative abundance" box above.

Shedding duration (days):

σ:

Once someone is infected, how many days do they shed for? This is a simplified model where we assume people shed equally throughout their infection. For σ, specify the standard deviation of the log.
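
Under the two simplifications stated here (pure exponential growth and equal shedding throughout infection), the share of everyone infected so far who is still shedding has a simple closed form. The sketch below is a worked illustration with a hypothetical function name, and it ignores the per-simulation noise on the duration.

```python
def shedding_fraction(shedding_days, doubling_time_days):
    """Share of all infections to date that are still shedding.

    People infected more than `shedding_days` ago have stopped shedding, and
    there were 2 ** (-shedding_days / doubling_time_days) times as many
    infections back then, so the remainder is still shedding. This holds once
    the epidemic is older than the shedding duration.
    """
    return 1.0 - 2.0 ** (-shedding_days / doubling_time_days)
```

For instance, if people shed for 6 days and the epidemic doubles every 3 days, three quarters of everyone infected so far is currently shedding.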

Low-quality fragments?

When sequencing wastewater the nucleic acid fragments tend to be pretty torn up by the hostile environment, making it difficult to get much value from long-read sequencing. If this box is checked we assume that sequencing reads won't be longer than 170bp even if you're using a sequencing machine capable of producing longer ones when given good material.

Sample population (people):

Each time you go out and sample people, how many people are contributing?

Sampling schedule:

Cost per sample (dollars):

Sequencing

NovaSeq X 25B Lane	Element Aviti 2x150
NovaSeq 6000 S4	NovaSeq 6000 SP 2x150	NovaSeq 6000 SP 2x250
Nanopore MinION	Custom

Sequencing run depth (reads):

Each time you run your sequencer, how many reads does it produce?

Read length (bp):

How long are the reads your sequencer generates? Note that actual reads will be shorter if the sample is low quality; see "low-quality fragments?" above.
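
The interaction between read length and the low-quality fragments checkbox amounts to a cap. A minimal sketch, with hypothetical names and the 170bp limit taken from the description above:

```python
LOW_QUALITY_MAX_BP = 170  # cap described under the low-quality fragments option


def effective_read_length(read_length_bp, low_quality_fragments):
    """Read length actually achieved, given the sequencer and the sample quality."""
    if low_quality_fragments:
        return min(read_length_bp, LOW_QUALITY_MAX_BP)
    return read_length_bp
```

So a long-read sequencer configured for 1,000bp reads would be treated as producing 170bp reads on degraded material, while a 150bp short-read run is unaffected.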

Cost per sequencing run (dollars):

Sequencing schedule:

Repeat frequency (weeks):

Simulate sequencing only every N weeks. Positive integers only.

Processing delay (days):

How long does it take from collecting a sample until you have the bioinformatic results? Remember to include lab time to prepare the sample for sequencing, waiting for your sample to get a turn on the sequencer, time on the sequencer, and bioinformatic processing.
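
To get a feel for why the delay matters, note that under exponential growth the epidemic keeps doubling while the sample is being processed. The following sketch (hypothetical function name, growth assumption only) converts incidence at sample collection into incidence at the time results arrive.

```python
def incidence_at_results(incidence_at_sampling, delay_days, doubling_time_days):
    """Cumulative incidence once results are available, assuming growth continues
    exponentially during the processing delay."""
    return incidence_at_sampling * 2.0 ** (delay_days / doubling_time_days)
```

With a 3-day doubling time, a 6-day delay means incidence is four times higher when you get the result than when you collected the sample.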

Global

This model simulates detection as happening when a particular portion of the genome has been observed in a minimum number of samples, and a minimum number of times total.

Minimum Samples:

How many separate samples need to match the relevant genome portion?

Minimum Reads:

How many total reads, across all samples, need to match the relevant genome portion?
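
The two thresholds combine into a single detection rule, which could be sketched like this. It assumes a sample "matches" when it has at least one matching read; the function name and that interpretation are illustrative rather than taken from the tool's source.

```python
def detection_index(matching_reads_per_sample, min_samples, min_reads):
    """Index of the first sample at which both thresholds are met, or None.

    `matching_reads_per_sample` lists, in collection order, how many reads in
    each sample matched the relevant genome portion.
    """
    samples_with_match = 0
    total_reads = 0
    for index, reads in enumerate(matching_reads_per_sample):
        if reads > 0:
            samples_with_match += 1
        total_reads += reads
        if samples_with_match >= min_samples and total_reads >= min_reads:
            return index
    return None
```

For example, `detection_index([0, 1, 0, 3], 2, 2)` returns 3: the second sample supplies a match but only the fourth brings the count of matching samples to two.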

Sites:

In this model you can parallelize across multiple sites, each of which works exactly the same way. As a rough approximation, running twice as many sites will cost twice as much and detect at half the cumulative incidence.

Note that we don't model the most obvious benefit of multiple sites, where each one has a chance of being a location where the epidemic is farther ahead.

Overhead (percentage):

Assume costs are higher than the inputs specified above by this fixed percentage.

Annual Cost (dollars):
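
A back-of-the-envelope version of the cost arithmetic, assuming costs are linear in sites, a 52-week year, and overhead applied as a flat percentage on top. The parameter names and the exact formula are assumptions for illustration; the tool's own calculation may differ. The second function encodes the rough rule of thumb given under "Sites".

```python
def annual_cost(cost_per_sample, samples_per_week, cost_per_run, runs_per_week,
                sites, overhead_pct):
    """Yearly cost in dollars across all sites, including overhead."""
    weekly_per_site = cost_per_sample * samples_per_week + cost_per_run * runs_per_week
    return weekly_per_site * 52 * sites * (1 + overhead_pct / 100)


def rough_incidence_with_sites(single_site_incidence, sites):
    """Rule of thumb: doubling the sites roughly halves incidence at detection."""
    return single_site_incidence / sites
```

For example, 7 samples a week at $100 each plus one $1,000 sequencing run a week, at 2 sites with 50% overhead, comes to $265,200 per year.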

Simulation

Simulations:

We will run the simulation the specified number of times, and display the results in a chart below.

Simulation Label:

Use a descriptive label here, which will appear in your chart and on the outcomes table.

Sampling and Sequencing Simulator