Sampling and Sequencing Simulator

This is a modeling tool to compare methods of metagenomic biosurveillance. It puts you in the position of a person designing an early detection system to identify a novel stealth pathogen before it spreads widely.

While we've tried to make the tool realistic, many of the input parameter estimates are very rough. Until a pilot metagenomic monitoring system is actually built, it's quite possible that important inputs are considerably wrong.

The tool simulates an epidemic that starts with one person and grows exponentially with a specified doubling time. There are one or more sites, each implementing the program in parallel. At each site, a form of sampling and sequencing is performed on a weekly schedule. We model the pathogen as detected once a specific portion of its genome has been observed a minimum number of times.
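As a rough illustration of the loop described above, here is a minimal Python sketch. It is not the tool's simulate_one function: the names (simulate_one_sketch, draw_poisson, reads_per_sample, population) are hypothetical, it models a single site, it treats everyone ever infected as currently shedding, it counts all pathogen reads rather than reads covering one specific portion of the genome, and it assumes read counts are Poisson-distributed.

```python
import math
import random


def draw_poisson(expected, rng):
    """Draw a Poisson-distributed count using only the standard library."""
    if expected <= 0:
        return 0
    if expected > 30:
        # Normal approximation for large means; avoids underflow in exp(-expected).
        return max(0, round(rng.gauss(expected, math.sqrt(expected))))
    threshold = math.exp(-expected)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate_one_sketch(doubling_time_days, population, reads_per_sample,
                        relative_abundance, min_observations, rng,
                        max_days=730):
    """Return cumulative incidence at detection, or None if never detected."""
    daily_growth = 2 ** (1 / doubling_time_days)
    cumulative_infections = 1.0  # the epidemic starts with one person
    observations = 0
    for day in range(max_days):
        cumulative_incidence = min(cumulative_infections / population, 1.0)
        if day % 7 == 0:  # weekly sampling and sequencing (assumed same day)
            expected = reads_per_sample * relative_abundance * cumulative_incidence
            observations += draw_poisson(expected, rng)
            if observations >= min_observations:
                return cumulative_incidence
        cumulative_infections *= daily_growth
    return None


# Illustrative inputs only; none of these values come from the tool.
result = simulate_one_sketch(
    doubling_time_days=3, population=1_000_000, reads_per_sample=1e9,
    relative_abundance=1e-7, min_observations=2, rng=random.Random(0))
print(result)
```

The returned value is the fraction of the population ever infected at the moment the observation threshold is crossed, which is the quantity the tool graphs.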

At a technical level, the tool runs many simulations and graphs them by the observed cumulative incidence at detection.
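One plausible way to summarize many runs is to sort the cumulative incidence at detection across simulations and read off percentiles. The sketch below assumes a hypothetical simulate_one callable that takes a random number generator and returns either a cumulative incidence or None (never detected); the percentile choices are illustrative, not the tool's.

```python
import random


def summarize_runs(simulate_one, n_simulations, seed=0):
    """Run simulate_one repeatedly and summarize detection outcomes."""
    rng = random.Random(seed)
    outcomes = [simulate_one(rng) for _ in range(n_simulations)]
    detected = sorted(ci for ci in outcomes if ci is not None)
    percentiles = {}
    for p in (25, 50, 75, 90):
        if detected:
            # Simple rank-based percentile using integer arithmetic.
            index = min(len(detected) - 1, (p * len(detected)) // 100)
            percentiles[p] = detected[index]
    return {"detected": len(detected), "percentiles": percentiles}


# Stand-in simulation for demonstration: a made-up log-normal outcome.
def toy_simulation(rng):
    return min(1.0, rng.lognormvariate(-4.6, 0.5))


print(summarize_runs(toy_simulation, 1000))
```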


You can see the actual implementation of the tool by reading the simulate_one function in the source of this page or on GitHub.

Relative abundance estimates refer to the fraction of sequencing reads that would come from the modeled virus if everyone contributing to the sample were currently infected. We estimated shedding for SARS-CoV-2 and Influenza A using different sources:

SARS-CoV-2 in Municipal Wastewater:

We use the distribution for RAi(1%) from Grimm et al. 2023, assuming sampling and sequencing similar to Rothman et al. 2021.

Influenza in Municipal Wastewater:

We use the distributions we calculated in Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data for RAi(1%), based on applying the methods from Grimm et al. 2023 to unpublished MU and UCI sequencing data.

Nasal/Throat Swabs:

We use the individual-level relative abundances we collected in Investigating the Sensitivity of Pooled Swab Sampling for Pathogen Early Detection, treating each outcome as equally likely.

There are several main places where this tool does not include considerations we know are significant:

more...

Please treat this tool as a robot that will scribble on the back of an envelope for you, not a precise estimator!


To use the tool, choose a base scenario, modify any parameters you're interested in, and click "Run". Sharing the full URL lets anyone who loads it see (a) your last set of parameters and (b) any saved runs.
Scenario


Pathogen and Sampling
Doubling time (days): CV:
Genome length (bp):
(source)

Relative abundance: σ:
If you ran metagenomic sequencing on a sample that was from just a single infected person, what fraction of sequencing reads would be from the pathogen?
This could be a single number to model all infected people as contributing an equal amount, or multiple space-separated numbers to represent a set of equally likely contribution amounts. For σ you specify the standard deviation of the log, giving per-simulation noise representing our uncertainty about whether the true relative abundance would be higher or lower.
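To make the input format concrete, here is a small sketch of how such a field could be interpreted. The function names are hypothetical, and two details are assumptions rather than facts about the tool: that "log" means the natural log, and that the noise is centered on zero in log space (so the median multiplier is 1).

```python
import math
import random


def parse_relative_abundances(text):
    """Turn a space-separated field like '1e-7 3e-6' into a list of floats."""
    return [float(token) for token in text.split()]


def draw_noise_factor(sigma, rng):
    """One multiplier per simulation: log(factor) ~ Normal(0, sigma)."""
    return math.exp(rng.gauss(0.0, sigma))


def draw_relative_abundance(values, noise_factor, rng):
    """Pick one of the equally likely values and apply the run's noise."""
    return rng.choice(values) * noise_factor


rng = random.Random(0)
values = parse_relative_abundances("1e-7 3e-6")
factor = draw_noise_factor(0.5, rng)  # drawn once per simulation
print(draw_relative_abundance(values, factor, rng))
```

With σ set to 0 the multiplier is exactly 1, so every simulation uses the entered values unchanged.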
Directly specify relative abundance of flaggable reads?
Normally we discount the fraction of reads that are flaggable by considering the genome length, read length, and fragment quality. If you're modeling these outside of this tool, however, and just want to specify this fraction directly, check this box and provide your fraction in the "relative abundance" box above.
Shedding duration (days): σ:
Once someone is infected, how many days do they shed for? This is a simplified model where we assume people shed equally throughout their infection. For σ, specify the standard deviation of the log.
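Under the stated assumptions (steady exponential growth and equal shedding for a fixed number of days), the share of ever-infected people who are still shedding follows directly: if cumulative infections grow as 2^(t/T), the people infected more than D days ago number 2^((t-D)/T), so the fraction still shedding is 1 - 2^(-D/T). The sketch below just evaluates that expression; the function name is hypothetical, it assumes the epidemic has been running for at least D days, and it is not necessarily how the tool computes it.

```python
def fraction_currently_shedding(shedding_duration_days, doubling_time_days):
    """Share of cumulative infections still shedding, under steady growth."""
    return 1 - 2 ** (-shedding_duration_days / doubling_time_days)


# Shedding for one doubling time: half of everyone ever infected is shedding.
print(fraction_currently_shedding(7, 7))
```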
Low-quality fragments?
When sequencing wastewater the nucleic acid fragments tend to be pretty torn up by the hostile environment, making it difficult to get much value from long-read sequencing. If this box is checked we assume that sequencing reads won't be longer than 170bp even if you're using a sequencing machine capable of producing longer ones when given good material.
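The cap described above amounts to taking the smaller of the sequencer's read length and 170bp. A minimal sketch, with hypothetical names:

```python
LOW_QUALITY_MAX_READ_LENGTH_BP = 170  # cap stated in the text above


def effective_read_length(read_length_bp, low_quality_fragments):
    """Read length actually achieved, given the low-quality checkbox."""
    if low_quality_fragments:
        return min(read_length_bp, LOW_QUALITY_MAX_READ_LENGTH_BP)
    return read_length_bp


print(effective_read_length(300, True))
print(effective_read_length(300, False))
```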
Sample population (people):
Each time you go out and sample people, how many people are contributing?
Sampling schedule:
MTWRFSU
Cost per sample (dollars):
Sequencing
Sequencing run depth (reads):
Each time you run your sequencer, how many reads does it produce?
Read length (bp):
How long are the reads your sequencer generates? Note that actual reads will be shorter if the sample is low quality; see "low-quality fragments?" above.
Cost per sequencing run (dollars):
Sequencing schedule:
MTWRFSU
Processing delay (days):
How long does it take from collecting a sample until you have the bioinformatic results? Remember to include lab time to prepare the sample for sequencing, waiting for your sample to get a turn on the sequencer, time on the sequencer, and bioinformatic processing.
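To get a feel for why this delay matters: while results are pending, an exponentially growing epidemic keeps doubling. The sketch below computes the growth factor over the delay; the function name is hypothetical, and it assumes growth stays exponential for the whole delay.

```python
def growth_during_delay(processing_delay_days, doubling_time_days):
    """Factor by which cumulative incidence grows while results are pending."""
    return 2 ** (processing_delay_days / doubling_time_days)


# A 6-day delay with a 3-day doubling time is two doublings.
print(growth_during_delay(6, 3))
```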
Global
Minimum Observations:
This model simulates detection as happening when a particular portion of the genome has been observed some number of times. Using a higher number here corresponds to requiring more certainty before raising an alarm.
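As a rough guide to how this threshold interacts with sequencing depth, suppose the number of observations so far is Poisson-distributed with some expected value (an assumption made for illustration, not a statement about the tool's internals). The chance of having reached the threshold is then one minus the probability of every smaller count. The function name is hypothetical.

```python
import math


def probability_at_least(min_observations, expected_observations):
    """P(X >= min_observations) for X ~ Poisson(expected_observations)."""
    below_threshold = sum(
        math.exp(-expected_observations)
        * expected_observations ** i / math.factorial(i)
        for i in range(min_observations)
    )
    return 1.0 - below_threshold


# With 2 expected observations, requiring 2 is noticeably harder than 1.
print(probability_at_least(1, 2.0))
print(probability_at_least(2, 2.0))
```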
Sites:
In this model you can parallelize across multiple sites, each of which works exactly the same way. As a rough approximation, running twice as many sites will cost twice as much and detect at half the cumulative incidence.

Note that we don't model the most obvious benefit of multiple sites, where each one has a chance of being a location where the epidemic is farther ahead.

Overhead (percentage):
Assume costs are higher than the inputs specified above by this fixed percentage.
Annual Cost (dollars):
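One way the annual cost could be assembled from the inputs above is sketched below. The formula is an assumption, not taken from the tool: it counts one sample or run per selected day in the MTWRFSU schedules, uses 52 weeks per year, scales linearly with sites, and applies overhead as a percentage markup. The function name is hypothetical.

```python
def annual_cost(sampling_days, cost_per_sample, sequencing_days,
                cost_per_run, sites, overhead_percent, weeks_per_year=52):
    """Estimated yearly cost in dollars under the assumptions stated above."""
    weekly = (len(sampling_days) * cost_per_sample
              + len(sequencing_days) * cost_per_run)
    return weekly * weeks_per_year * sites * (1 + overhead_percent / 100)


# Sample Mon/Wed/Fri at $100 each, sequence Thursdays at $5000, 2 sites, 50% overhead.
print(annual_cost("MWF", 100, "R", 5000, sites=2, overhead_percent=50))
```

Doubling the number of sites doubles the result, matching the rough approximation given for multiple sites.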
Simulation
Simulations:
We will run the simulation the specified number of times, and display the results in a chart below.
Simulation Label
Use a descriptive label here, which will appear in your chart and on the outcomes table.