Sampling and Sequencing Simulator

This is a modeling tool to compare methods of metagenomic biosurveillance. It puts you in the position of a person designing an early detection system to identify a novel stealth pathogen before it spreads widely.

While we've tried to make the tool realistic, many of the input parameter estimates are very rough. Until someone actually builds a pilot metagenomic monitoring system, it's quite possible that important inputs are considerably wrong.

The tool simulates an epidemic that starts with one person and grows exponentially with a specified doubling time. There are one or more sites, each implementing the program in parallel. At each site, a form of sampling and sequencing is performed on a weekly schedule. We model the pathogen as detected once a specific portion of its genome has been observed a minimum number of times.
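As a minimal sketch of the growth model described above, assuming cumulative infections follow 2^(day / doubling time) from a single case on day 0. The function names and the population parameter are illustrative and are not taken from the tool's source.

```python
def cumulative_infections(day: float, doubling_time_days: float) -> float:
    """Cumulative infections on `day`, starting from one person on day 0."""
    return 2.0 ** (day / doubling_time_days)


def cumulative_incidence(day: float, doubling_time_days: float, population: int) -> float:
    """Fraction of a hypothetical population ever infected by `day`, capped at 1."""
    return min(1.0, cumulative_infections(day, doubling_time_days) / population)
```

With a three-day doubling time, for example, day 6 corresponds to four cumulative infections.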

At a technical level, the tool runs many simulations and graphs them by the observed cumulative incidence at detection.


You can see the actual implementation of the tool by reading the simulate_one function in the source of this page or on GitHub.

Relative abundance estimates refer to the fraction of sequencing reads that would come from the modeled virus if everyone contributing to the sample were currently infected. We estimated shedding for SARS-CoV-2 and Influenza A using different sources:

SARS-CoV-2 in Municipal Wastewater:
We started with Grimm et al. 2023's mean weekly RA_i(1%) estimate of 1.3e-7 for SARS-CoV-2 in municipal wastewater, which assumed sampling and sequencing similar to Rothman et al. 2021. We converted that incidence-based estimate to a prevalence-based RA_p(1%) by scaling by 7/5, under the assumption that people might shed the virus for five days. Finally, we scaled it by 100x, going from 1% of people infected to everyone infected, to get the relative abundance contribution of an individual infected person.
SARS-CoV-2 in Nasal/Throat Swabs:
We took relative abundance values from Lu et al. 2021's table S1, which provides sequencing results for SARS-CoV-2 from sixteen metagenomic samples collected from COVID-19 patients hospitalized in China (blog post). We treated each value as equally likely, and the simulator selects one at random each time it models an infected person.
Influenza A in Municipal Wastewater:
We followed the same approach detailed above for SARS-CoV-2 in municipal wastewater, except that for the sequencing data we used unpublished data collected by the NAO and an NAO partner in the 2023-2024 flu season. When linked to CDC data on Influenza A using the same Grimm et al. 2023 approach, this gave a mean RA_i(1%) of 3.2e-8.
Influenza A Nasal/Throat Swabs:
We took relative abundance values from Lewandowski et al. 2020 Figure 4 (n=39). As with SARS-CoV-2 Nasal/Throat above, we treat each outcome as equally likely. Note, however, that this isn't an ideal source: the authors chose which samples to sequence by trying to cover their full observed Ct range, so the values are not a random sample of infected people.
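The wastewater conversion described above is simple arithmetic, and can be sketched as follows. The function name is hypothetical, and applying the same five-day shedding assumption to Influenza A is our reading of "the same approach", not something the text states explicitly.

```python
def per_person_relative_abundance(ra_i_1pct: float, shedding_days: float = 5.0) -> float:
    """Convert a weekly-incidence RA_i(1%) into a per-infected-person relative abundance."""
    # Incidence-based to prevalence-based: scale by 7 / shedding duration.
    ra_p_1pct = ra_i_1pct * 7.0 / shedding_days
    # From 1% of contributors infected to 100% infected.
    return ra_p_1pct * 100.0


sars_cov_2_wastewater = per_person_relative_abundance(1.3e-7)
influenza_a_wastewater = per_person_relative_abundance(3.2e-8)
```

This gives about 1.8e-5 for SARS-CoV-2 and, under the same five-day assumption, about 4.5e-6 for Influenza A.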

There are several main places where this tool does not include considerations we know are significant:

more...

Please treat this tool as a robot that will scribble on the back of an envelope for you, not a precise estimator!


To use the tool, choose a base scenario, modify any parameters you're interested in, and click "Run". Sharing the full URL lets anyone who loads it see (a) your last set of parameters and (b) any saved runs.
Scenario
Pathogen
Doubling time (days): CV:
Genome length (bp):
Sampling
Relative abundance: σ:
If you ran metagenomic sequencing on a sample that was from just a single infected person, what fraction of sequencing reads would be from the pathogen?
This could be a single number to model all infected people as contributing an equal amount, or multiple space-separated numbers to represent a set of equally likely contribution amounts. For σ you specify the standard deviation of the log, giving per-simulation noise representing our uncertainty about whether the true relative abundance would be higher or lower.
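A minimal sketch of how this input could be interpreted, assuming the noise is log-normal with σ measured in natural-log units and drawn once per simulation. All names here are hypothetical; the tool's simulate_one function is the authority on what actually happens.

```python
import math
import random


def parse_relative_abundances(text: str) -> list[float]:
    """Split a space-separated input such as '1e-7 2e-7' into numbers."""
    return [float(token) for token in text.split()]


def simulation_noise_factor(sigma: float, rng: random.Random) -> float:
    """One multiplicative noise factor per simulation (assumed natural-log sigma)."""
    return math.exp(rng.gauss(0.0, sigma))


def sample_person_relative_abundance(
    values: list[float], noise_factor: float, rng: random.Random
) -> float:
    """Pick one equally likely value for an infected person, then apply the noise."""
    return rng.choice(values) * noise_factor
```

With σ set to 0 the noise factor is exactly 1, so every infected person contributes one of the listed values unchanged.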
Directly specify relative abundance of flaggable reads?
Normally we discount the fraction of reads that are flaggable by considering the genome length, read length, and fragment quality. If you're modeling these outside of this tool, however, and just want to specify this fraction directly, check this box and provide your fraction in the "relative abundance" box above.
Shedding duration (days): σ:
Once someone is infected, how many days do they shed for? This is a simplified model where we assume people shed equally throughout their infection. For σ, specify the standard deviation of the log.
Low-quality fragments?
When sequencing wastewater, the nucleic acid fragments tend to be pretty torn up by the hostile environment, making it difficult to get much value from long-read sequencing. If this box is checked we assume that sequencing reads won't be longer than 120 bp, even if you're using a sequencing machine capable of producing longer ones when given good material.
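The cap described above amounts to taking the shorter of the sequencer's read length and 120 bp. A small sketch, with a hypothetical function name:

```python
LOW_QUALITY_MAX_READ_LENGTH_BP = 120  # cap stated in the help text above


def effective_read_length(read_length_bp: int, low_quality_fragments: bool) -> int:
    """Read length actually achieved, given the quality of the sampled material."""
    if low_quality_fragments:
        return min(read_length_bp, LOW_QUALITY_MAX_READ_LENGTH_BP)
    return read_length_bp
```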
Sample population (people):
Each time you go out and sample people, how many people are contributing?
Sampling schedule:
MTWRFSU
Cost per sample (dollars):
Sequencing
Sequencing run depth (reads):
Each time you run your sequencer, how many reads does it produce?
Read length (bp):
How long are the reads your sequencer generates? Note that actual reads will be shorter if the sample is low quality; see "low-quality fragments?" above.
Cost per sequencing run (dollars):
Sequencing schedule:
MTWRFSU
Processing delay (days):
How long does it take from collecting a sample until you have the bioinformatic results? Remember to include lab time to prepare the sample for sequencing, waiting for your sample to get a turn on the sequencer, time on the sequencer, and bioinformatic processing.
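To get a feel for why this delay matters, here is a rough sketch under the exponential-growth assumption: the epidemic keeps doubling while you wait for results. The function name is illustrative and this is not the tool's own code.

```python
def growth_during_delay(delay_days: float, doubling_time_days: float) -> float:
    """Factor by which cumulative infections grow between sampling and results."""
    return 2.0 ** (delay_days / doubling_time_days)
```

For example, with a three-day doubling time, a six-day processing delay means the epidemic is four times larger by the time results arrive than it was when the sample was collected.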
Global
Minimum Observations:
This model simulates detection as happening when a particular portion of the genome has been observed some number of times. Using a higher number here corresponds to requiring more certainty before raising an alarm.
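The threshold rule above can be sketched as a running total of observations. This assumes observations simply accumulate across sequencing runs; the names are hypothetical.

```python
from itertools import accumulate
from typing import Iterable, Optional


def first_detection_index(
    observations_per_run: Iterable[int], min_observations: int
) -> Optional[int]:
    """Index of the first run at which cumulative observations reach the threshold."""
    for index, total in enumerate(accumulate(observations_per_run)):
        if total >= min_observations:
            return index
    return None  # threshold never reached
```

Raising min_observations can only push the returned index later, which is the cost of requiring more certainty.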
Sites:
In this model you can parallelize across multiple sites, each of which works exactly the same way. As a rough approximation, running twice as many sites will cost twice as much and detect at half the cumulative incidence.

Note that we don't model the most obvious benefit of multiple sites, where each one has a chance of being a location where the epidemic is farther ahead.

Overhead (percentage):
Assume costs are higher than the inputs specified above by this fixed percentage.
Annual Cost (dollars):
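One plausible way to combine the cost inputs above is sketched below. It assumes a 52-week year, that every site runs the same schedule, and that overhead is applied to the total; the tool's own calculation may differ in the details, and the function name is hypothetical.

```python
def annual_cost(
    samples_per_week: int,
    cost_per_sample: float,
    sequencing_runs_per_week: int,
    cost_per_run: float,
    sites: int,
    overhead_percent: float,
    weeks_per_year: float = 52.0,  # assumption
) -> float:
    """Rough yearly cost in dollars across all sites, including overhead."""
    weekly_per_site = (
        samples_per_week * cost_per_sample
        + sequencing_runs_per_week * cost_per_run
    )
    base = weekly_per_site * weeks_per_year * sites
    return base * (1.0 + overhead_percent / 100.0)
```

Consistent with the note on sites above, doubling the number of sites doubles the result.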
Simulation
Simulations:
We will run the simulation the specified number of times, and display the results in a chart below.
Simulation Label:
Use a descriptive label here, which will appear in your chart and on the outcomes table.