This is a modeling tool to compare methods of metagenomic biosurveillance. It puts you in the position of a person designing an early detection system to identify a novel stealth pathogen before it spreads widely.

While we've tried to make the tool realistic, many of the input parameter estimates are very rough. Until someone actually builds a pilot metagenomic monitoring system, it's quite possible that important inputs are considerably wrong.

The tool simulates an epidemic that starts with one person and grows exponentially with a specified doubling time. There are one or more sites, each implementing the program in parallel. At each site, a form of sampling and sequencing is performed on a weekly schedule. We model the pathogen as detected once a specific portion of its genome has been observed a minimum number of times.

At a technical level, the tool runs many simulations and graphs them by the observed cumulative incidence at detection.

Each time the tool samples, it uses a Poisson approximation of binomial sampling to determine how many sick people are in the sample. Similarly, each time it sequences, it uses a Poisson approximation of binomial sampling to determine how many sequencing reads are obtained for the pathogen of interest. Detection is modeled as happening when a specified threshold number of sequencing reads that match a specific portion of the target genome is accumulated.
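The two Poisson draws described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's `simulate_one`: the function names, the `initial_incidence` parameter, and the hand-rolled Poisson sampler are all hypothetical. It also simplifies by treating everyone ever infected as currently shedding and by counting every pathogen read as an observation of the target portion of the genome.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate using only the standard library."""
    if lam <= 0:
        return 0
    if lam > 30:
        # Normal approximation for large means, clamped at zero.
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    # Knuth's multiplication method for small means.
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_and_sequence(incidence, sample_population, run_depth,
                        relative_abundance, rng):
    """One sampling event followed by one sequencing run (hypothetical helper)."""
    # Poisson approximation to Binomial(sample_population, incidence).
    infected = poisson(sample_population * incidence, rng)
    # Expected share of reads from the pathogen, given who is in the sample.
    pathogen_fraction = relative_abundance * infected / sample_population
    # Poisson approximation to Binomial(run_depth, pathogen_fraction).
    return poisson(run_depth * pathogen_fraction, rng)


def simulate_detection(doubling_time, initial_incidence, sample_population,
                       run_depth, relative_abundance, min_observations, rng):
    """Return the incidence at detection, or None if it passes 30% first."""
    day, observations = 0, 0
    while True:
        # Simple exponential growth from an assumed starting incidence.
        incidence = initial_incidence * 2 ** (day / doubling_time)
        if incidence > 0.3:
            return None
        observations += sample_and_sequence(
            incidence, sample_population, run_depth, relative_abundance, rng)
        if observations >= min_observations:
            return incidence
        day += 7  # weekly sampling and sequencing
```

Calling `simulate_detection` repeatedly with a shared `random.Random` instance gives a spread of outcomes of the kind the tool charts.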

The tool assumes that sequencing reads are equally likely to come from any part of the pathogen's genome. It simulates many times (1,000 by default) and charts the range of cumulative incidences observed. Any outcomes where the cumulative incidence is over 30% are marked as a failure by showing a "0" on the chart. The simulation is not correct for high cumulative incidence, as it only models the initial exponential stage of the epidemic.
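As a rough illustration of summarizing many runs, the sketch below computes nearest-rank percentiles over a list of simulated cumulative incidences and flags anything over 30% as a failure. The function name, the choice of percentiles, and the use of `None` for a failure (the tool shows a "0" on the chart) are assumptions made for this example.

```python
import math

FAILURE_THRESHOLD = 0.30  # above this the exponential model no longer applies


def percentile_summary(incidences, percentiles=(10, 50, 90)):
    """Map each percentile to a cumulative incidence, or None for a failure."""
    ordered = sorted(incidences)
    summary = {}
    for p in percentiles:
        # Nearest-rank percentile: smallest value with at least p% at or below it.
        index = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        value = ordered[index]
        summary[p] = None if value > FAILURE_THRESHOLD else value
    return summary
```

Low percentiles correspond to lucky runs that detect early; high percentiles correspond to unlucky runs, which are the ones most likely to be flagged as failures.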

Certain values allow you to specify how much noise to add to the model. The noise is generated once per simulation. For the inputs marked "CV" the noise is normally distributed; for ones marked "CV_g" the noise is lognormally distributed. The higher the CV you set, the more uncertainty will be introduced into the simulation, generally causing the low-percentile outputs to be more optimistic and the high-percentile outputs to be more pessimistic.

For the relative abundance amounts, we add lognormally distributed noise whose logarithm has a mean of zero (that is, a geometric mean of one) and a standard deviation of σ_g. This noise shifts all provided values in the same direction. For example, if in one of the 1,000 simulations our lognormal draw gives us a noise value of 0.7 and the provided values are 1e-5 and 1e-6, for that simulation we'll use 7e-6 and 7e-7 instead.
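The shared relative abundance noise can be sketched as a single lognormal draw per simulation that multiplies every provided value. The function names here are hypothetical, and the sketch assumes σ is the standard deviation of the log, as the input form describes.

```python
import random


def draw_shared_noise(sigma, rng):
    """One multiplicative noise factor per simulation (log has mean 0, sd sigma)."""
    return rng.lognormvariate(0.0, sigma)


def scale_values(values, noise):
    """Shift every provided relative abundance in the same direction."""
    return [value * noise for value in values]


# The worked example from the text: a draw of 0.7 applied to 1e-5 and 1e-6
# gives approximately 7e-6 and 7e-7.
scaled = scale_values([1e-5, 1e-6], 0.7)
```

With `sigma` set to zero the draw is always 1, so the provided values are used unchanged.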

You can see the actual implementation of the tool by reading the `simulate_one` function in the source of this page or on GitHub.

Relative abundance estimates refer to the fraction of sequencing reads that would come from the modeled virus if everyone contributing to the sample was currently infected. We estimated shedding for SARS-CoV-2 and Influenza A using different sources:

- SARS-CoV-2 in Municipal Wastewater:
  - We started with Grimm et al. 2023's mean weekly `RA_i(1%)` estimate of 1.3e-7 for SARS-CoV-2 in municipal wastewater, which assumed sampling and sequencing similar to Rothman et al. 2021. We converted that estimate to an `RA_p(1%)` by scaling by 7/5, under the assumption that people might shed the virus for five days. Finally, we scaled it by 100x to get the relative abundance contribution of an individual infected person.
- SARS-CoV-2 in Nasal/Throat Swabs:
  - We took relative abundance values from Lu et al. 2021's table S1, which provides sequencing results for SARS-CoV-2 from sixteen metagenomic samples collected from COVID-19 patients hospitalized in China (blog post). We treated each value as equally likely, and the simulator selects one at random each time it models an infected person.
- Influenza A in Municipal Wastewater:
  - We followed the same approach we detail above for SARS-CoV-2 in municipal wastewater, except that for the sequencing data we used unpublished data collected by the NAO and an NAO partner in the 2023-2024 flu season. When linked to CDC data on Influenza A using the same Grimm et al. 2023 approach, this gave a mean `RA_i(1%)` of 3.2e-8.
- Influenza A in Nasal/Throat Swabs:
  - We took relative abundance values from Lewandowski et al. 2020 Figure 4 (n=39). As with the SARS-CoV-2 nasal/throat swabs above, we treat each outcome as equally likely. Note, however, that this isn't an ideal source: the authors chose which samples to sequence by trying to cover their full observed Ct range.
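The wastewater conversion described above is simple arithmetic, sketched here with a hypothetical function name. It assumes the 7/5 and 100x factors apply in the same way to Influenza A, which the text implies by saying the same approach was followed but does not spell out.

```python
def per_person_relative_abundance(ra_i_1pct, shedding_days=5, days_per_week=7):
    """Convert a weekly-incidence RA_i(1%) into a per-infected-person value."""
    # Incidence-based to prevalence-based: scale by 7/5 for five days of shedding.
    ra_p_1pct = ra_i_1pct * days_per_week / shedding_days
    # From 1% of people infected to a single infected person: scale by 100x.
    return ra_p_1pct * 100


sars_cov_2 = per_person_relative_abundance(1.3e-7)   # about 1.8e-5
influenza_a = per_person_relative_abundance(3.2e-8)  # about 4.5e-6
```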

There are several main places where this tool does not include considerations we know are significant:

Real epidemics don't grow perfectly exponentially. Early on, individual infection events such as superspreader events can have a large effect on the trajectory. This tool assumes simple exponential growth.

Real epidemics progress faster in some locations and slower in others. If you're sampling at multiple sites you should see an epidemic be farther ahead in some places than others. This tool assumes the epidemic grows equally quickly at each site, which underestimates the value of running multiple sites.

Real epidemics fall below exponential growth, even if they're spreading unnoticed, once an appreciable fraction of susceptible people have been infected. This tool only simulates the early portion of the curve, where exponential growth is a good approximation. In practice this is a minimal limitation, because we're really only concerned with scenarios that would let us flag a pandemic before too many people had been infected.

For very substantial efforts, with very deep sequencing across many sites, a real detection system would start to see economies of scale. We don't model this, and instead assume linear cost per site.

Please treat this tool as a robot that will scribble on the back of an envelope for you, not a precise estimator!

To use the tool, choose a base scenario, modify any parameters you're interested in, and click "Run". Sharing the full URL lets anyone who loads it see (a) your last set of parameters and (b) any saved runs.

Scenario

Pathogen

- Doubling time (days), with CV
- Genome length (bp)

Sampling

- Relative abundance, with σ:
  - If you ran metagenomic sequencing on a sample that was from just a single infected person, what fraction of sequencing reads would be from the pathogen? This could be a single number to model all infected people as contributing an equal amount, or multiple space-separated numbers to represent a set of equally likely contribution amounts. For σ you specify the standard deviation of the log, giving per-simulation noise representing our uncertainty about whether the true relative abundance would be higher or lower.
- Shedding duration (days), with σ:
  - Once someone is infected, how many days do they shed for? This is a simplified model where we assume people shed equally throughout their infection. For σ, specify the standard deviation of the log.
- Low-quality fragments?
  - When sequencing wastewater the nucleic acid fragments tend to be pretty torn up by the hostile environment, making it difficult to get much value from long-read sequencing. If this box is checked we assume that sequencing reads won't be longer than 120bp even if you're using a sequencing machine capable of producing longer ones when given good material.
- Sample population (people):
  - Each time you go out and sample people, how many people are contributing?
- Sampling schedule
- Cost per sample (dollars)
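The effect of the low-quality fragments option on read length can be sketched as a simple cap. The function name is hypothetical; the 120bp limit is the one stated in the option's description.

```python
LOW_QUALITY_MAX_READ_BP = 120  # cap stated for the low-quality fragments option


def effective_read_length(read_length_bp, low_quality_fragments):
    """Read length actually usable, given the sequencer and sample quality."""
    if low_quality_fragments:
        return min(read_length_bp, LOW_QUALITY_MAX_READ_BP)
    return read_length_bp
```

Under this sketch a sequencer configured for reads shorter than 120bp is unaffected by the checkbox; only longer reads are truncated.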

Sequencing

- Sequencing run depth (reads):
  - Each time you run your sequencer, how many reads does it produce?
- Read length (bp):
  - How long are the reads your sequencer generates? Note that actual read lengths will be shorter if the sample is low quality; see "Low-quality fragments?" above.
- Cost per sequencing run (dollars)
- Sequencing schedule
- Processing delay (days):
  - How long does it take from collecting a sample until you have the bioinformatic results? Remember to include lab time to prepare the sample for sequencing, waiting for your sample to get a turn on the sequencer, time on the sequencer, and bioinformatic processing.
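To get a feel for why processing delay matters, the sketch below estimates how much an exponentially growing epidemic expands while a sample is being processed. This is an illustration of the general reasoning under the tool's exponential-growth assumption; the function name is hypothetical, and how the tool itself accounts for the delay is not described here.

```python
def growth_during_delay(delay_days, doubling_time_days):
    """Factor by which cumulative incidence grows while results are pending."""
    return 2 ** (delay_days / doubling_time_days)


# A delay equal to the doubling time doubles the incidence before the alarm;
# a delay of two doubling times quadruples it.
one_doubling = growth_during_delay(7, 7)
two_doublings = growth_during_delay(14, 7)
```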

Global

- Minimum Observations:
  - This model simulates detection as happening when a particular portion of the genome has been observed some number of times. Using a higher number here corresponds to requiring more certainty before raising an alarm.
- Sites:
  - In this model you can parallelize across multiple sites, each of which works exactly the same way. As a rough approximation, running twice as many sites will cost twice as much and detect at half the cumulative incidence. Note that we don't model the most obvious benefit of multiple sites, where each one has a chance of being a location where the epidemic is farther ahead.
- Overhead (percentage):
  - Assume costs are higher than the inputs specified above by this fixed percentage.
- Annual Cost (dollars)
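One way the annual cost could follow from the inputs above is sketched here. The formula is an assumption based on the stated linear cost per site and fixed-percentage overhead; the function name and the `samples_per_year` and `runs_per_year` parameters are hypothetical stand-ins for whatever the sampling and sequencing schedules imply.

```python
def annual_cost(sites, samples_per_year, cost_per_sample,
                runs_per_year, cost_per_run, overhead_percent):
    """Estimated yearly cost in dollars, linear in the number of sites."""
    per_site = samples_per_year * cost_per_sample + runs_per_year * cost_per_run
    return sites * per_site * (1 + overhead_percent / 100)


# Two sites, weekly sampling at $500 and weekly sequencing at $5,000,
# with 50% overhead: 2 * (26,000 + 260,000) * 1.5 dollars.
example_cost = annual_cost(2, 52, 500, 52, 5_000, 50)
```

Under this sketch doubling the number of sites doubles the cost, matching the rough approximation given for the Sites input.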

Simulation

- Simulations:
  - We will run the simulation the specified number of times, and display the results in a chart below.
- Simulation Label:
  - Use a descriptive label here, which will appear in your chart and on the outcomes table.


Cumulative Incidence at Detection: when the system raises the alarm, what fraction of people do we estimate will have ever been infected? Lower is better, since it represents identifying the pandemic earlier and having more time to respond.