NAO Cost Estimate – Summary
Background
The goal of this project was to build a model that allows us to:
- Estimate the cost of sampling and sequencing required to run an effective NAO.
- Calculate the sequencing depth necessary to detect a virus by the time it reaches a target cumulative incidence.
- Identify which parameters are most important to pin down and/or optimize in determining the viability of an NAO.
We had previously done a very simple version of this for the P2RA project. Here, we wanted to formalize the approach and include more details.
Previous documents:
- NAO Cost Estimate Outline
- NAO Cost Estimate MVP
- NAO Cost Estimate – Optimizing the sampling interval
- NAO Cost Estimate – Adding noise
The model
Using the framework developed in NAO Cost Estimate Outline, our model has the following components. Unless otherwise noted, see NAO Cost Estimate MVP for details.
Epidemic
The prevalence of the virus grows exponentially and deterministically in a single population. The fractions of people currently infectious and currently shedding are assumed equal and grow as $p(t) = p(0)\,e^{rt}$, where $r$ is the epidemic growth rate.
The cumulative incidence (as a fraction of the population) in this model is proportional to the prevalence and likewise grows as $e^{rt}$.
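As a minimal numerical sketch of this model (the symbols p0, r, and the prevalence-to-cumulative-incidence factor k are illustrative choices, not notation fixed by the posts above):

```python
import numpy as np

# Minimal sketch of the deterministic epidemic model: exponential growth of
# prevalence, with cumulative incidence proportional to prevalence.
# p0, r, and k are illustrative values, not estimates from the linked posts.
def prevalence(t, p0=1e-6, r=0.1):
    """Fraction of the population currently infectious/shedding at time t (days)."""
    return p0 * np.exp(r * t)

def cumulative_incidence(t, p0=1e-6, r=0.1, k=1.5):
    """Cumulative incidence, proportional to prevalence during exponential growth."""
    return k * prevalence(t, p0, r)

t = np.arange(0, 91)
print(prevalence(t)[-1], cumulative_incidence(t)[-1])
```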
Data collection
We collect samples from a single sampling site at regular intervals, spaced a fixed time apart.
We also consider a delay between when each sample is collected and when its sequencing data are available for analysis.
Read counts
We considered three different models of the number of reads in each sample from the epidemic virus:
- A deterministic model, where the number of reads in a sample at time t is proportional to the sequencing depth, the prevalence, and the P2RA factor that converts between prevalence and relative abundance.
- A stochastic model that accounts for Poisson counting noise and variation in the latent relative abundance. In this model, the number of reads is a random variable drawn from a Poisson-gamma mixture with the same mean as in the deterministic model and an inverse overdispersion parameter. A large value of this parameter means that the relative abundance is well predicted by our deterministic model, whereas a small value means that there is a lot of excess variation beyond what comes automatically from having a finite read depth.
- A stochastic model where we sequence a pooled sample from a small number of individuals. This allows us to consider the effect of sampling a small number of, e.g., nasal swabs rather than wastewater.
See NAO Cost Estimate – Adding noise for details of the stochastic models.
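For illustration, a small sketch of the first two read-count models; depth, p2ra, and phi are stand-in names for the sequencing depth, the P2RA factor, and the inverse overdispersion, and the values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_reads(depth, p2ra, prevalence):
    """Deterministic model: expected number of viral reads in one sample."""
    return depth * p2ra * prevalence

def sample_reads(depth, p2ra, prevalence, phi):
    """Stochastic model: Poisson-gamma mixture with the same mean.

    Large phi: counts are close to Poisson around the deterministic mean.
    Small phi: large excess variation in the latent relative abundance.
    """
    mean = expected_reads(depth, p2ra, prevalence)
    latent = rng.gamma(shape=phi, scale=mean / phi)  # latent relative-abundance noise
    return rng.poisson(latent)

mu = expected_reads(depth=1e9, p2ra=1e-6, prevalence=1e-3)   # mean of 1 read
draws = [sample_reads(1e9, 1e-6, 1e-3, phi=0.5) for _ in range(10)]
print(mu, draws)
```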
Detection
We model detection based on the cumulative number of viral reads over all samples. When this number reaches a threshold value, we count the virus as detected.
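A minimal sketch of this detection rule under the deterministic read-count model, with illustrative parameter values:

```python
import numpy as np

def detection_time(threshold, depth_per_sample, p2ra, p0, r, interval, horizon=365):
    """First sampling time at which cumulative expected viral reads cross the threshold."""
    cumulative, t = 0.0, interval
    while t <= horizon:
        prevalence = p0 * np.exp(r * t)
        cumulative += depth_per_sample * p2ra * prevalence  # expected reads in this sample
        if cumulative >= threshold:
            return t
        t += interval
    return None  # not detected within the horizon

print(detection_time(threshold=20, depth_per_sample=1e9, p2ra=1e-5,
                     p0=1e-7, r=0.1, interval=7))
```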
Costs
We considered two components of cost:
- The per-read cost of sequencing
- The per-sample cost of collection and processing
See NAO Cost Estimate – Optimizing the sampling interval for details.
Key results
Sequencing effort required in a deterministic model
In NAO Cost Estimate MVP, we found the sampled depth per unit time required to detect a virus by the time it reaches a target cumulative incidence.
The first two terms in that expression are equivalent to the result from the P2RA model, using the conversion between prevalence and incidence implied by our exponential growth model.
The third term (in parentheses in that expression) is an adjustment factor for collecting samples at discrete intervals rather than continuously. It reflects two competing effects:
- the delay between when the virus is theoretically detectable and when the next sample is taken, and
- the benefit of taking a grab sample late in the sampling interval when the prevalence is higher.
This term can be Taylor expanded in the product of the growth rate and the sampling interval; when samples are frequent relative to the epidemic doubling time, it is close to one.
The final term is the cost of the delay between sample collection and data analysis: the required depth grows exponentially with the length of this delay.
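A sketch of this multiplicative structure, where base_depth stands in for the first two terms and the interval and delay factors use assumed functional forms chosen to match the qualitative description above (they are not copied from the MVP post):

```python
import numpy as np

def required_depth_per_day(base_depth, r, interval, delay):
    """Assumed illustrative structure: baseline times interval and delay adjustments."""
    interval_factor = (np.exp(r * interval) - 1) / (r * interval)  # assumed form; -> 1 as interval -> 0
    delay_factor = np.exp(r * delay)                               # delay penalty grows exponentially
    return base_depth * interval_factor * delay_factor

print(required_depth_per_day(base_depth=1e8, r=0.1, interval=7, delay=5))
```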
Optimal sampling interval
In NAO Cost Estimate – Optimizing the sampling interval, we found the sampling interval that minimizes the total cost of sequencing plus sample collection and processing.
When sampling optimally, these costs are balanced: at the optimal interval, the marginal increase in sequencing cost from lengthening the interval equals the marginal savings in per-sample collection and processing costs.
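A numerical sketch of this trade-off, reusing the assumed forms from the sketch above and placeholder per-read and per-sample costs:

```python
import numpy as np

def total_cost_per_day(interval, base_depth=1e8, r=0.1, delay=5,
                       cost_per_read=1e-5, cost_per_sample=500.0):
    """Sequencing cost (grows with longer intervals) plus per-sample costs (shrink)."""
    depth = base_depth * (np.exp(r * interval) - 1) / (r * interval) * np.exp(r * delay)
    return depth * cost_per_read + cost_per_sample / interval

intervals = np.linspace(0.5, 30, 300)          # candidate intervals, in days
costs = [total_cost_per_day(dt) for dt in intervals]
print(f"approximately optimal interval: {intervals[int(np.argmin(costs))]:.1f} days")
```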
Additional sequencing required to ensure a high probability of detection
In NAO Cost Estimate – Adding noise, we change our detection criterion from requiring the expected number of reads to reach the threshold to requiring that the observed number of reads reaches the threshold with high probability (we use 95%).
We find that a key parameter is the inverse overdispersion of the read counts: it determines which noise regime we are in and therefore how much additional sequencing is needed beyond the deterministic requirement.
Numerical exploration of these regimes suggests that we expect to need 1.5–3 times more sequencing than the deterministic model predicts to detect with 95% probability by the target cumulative incidence.
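An illustrative simulation of this effect; it treats the cumulative read count as a single negative binomial draw, which simplifies the model in the noise post, and all parameter values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def detection_prob(depth_multiplier, expected_reads, phi, threshold=10, n_sims=50_000):
    """Probability of reaching the read threshold by the target cumulative incidence."""
    mean = depth_multiplier * expected_reads
    latent = rng.gamma(shape=phi, scale=mean / phi, size=n_sims)  # latent noise
    reads = rng.poisson(latent)                                   # Poisson counting noise
    return (reads >= threshold).mean()

for m in [1.0, 1.5, 2.0, 3.0]:
    print(m, detection_prob(m, expected_reads=10.0, phi=10.0))
```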
Small pool noise
In the Appendix to the noise post, we showed that the effect of pooling a small number of samples is controlled by the expected number of infected individuals in the pool: when the pool size times the prevalence is small, many pools contain no infected individuals, which adds substantial noise to the read counts.
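A sketch of this effect, assuming for illustration that viral reads scale with the number of infected individuals in the pool:

```python
import numpy as np

rng = np.random.default_rng(2)

def pooled_reads(pool_size, prevalence, reads_per_infected, n_sims=100_000):
    """Simulated viral read counts when sequencing a pool of pool_size individuals."""
    infected = rng.binomial(pool_size, prevalence, size=n_sims)    # infected per pool
    return rng.poisson(infected * reads_per_infected / pool_size)  # reads per pool

for n in [10, 100, 10_000]:
    reads = pooled_reads(n, prevalence=0.01, reads_per_infected=200.0)
    print(n, reads.mean(), reads.std())  # same mean, but much noisier for small pools
```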
Discussion
- Nothing in our analysis here changes the intuition that the P2RA factor, which converts between prevalence and relative abundance, is very important for cost, especially because it appears to vary over several orders of magnitude across viruses and studies.
- The sampling interval is not expected to be very important for cost, as long as it is short relative to the epidemic doubling time. The cost of the delay from a longer interval is partially offset by the benefit of sampling later, when the prevalence is higher.
- In contrast, the delay between sample collection and data analysis could matter a lot because it does not have a corresponding benefit: the required depth grows exponentially with the length of the delay.
- We have sometimes considered the potential benefit of noise in the read count distribution: noisier distributions sometimes let us detect something while it is still too rare to detect on average. However, our analysis here shows that if our goal is to detect by the target cumulative incidence with high probability, noise is unambiguously bad and could increase our required depth several times over.
- We currently do not have any estimate of the inverse overdispersion of read counts relative to Poisson. We should try to measure it empirically in our sequence data (e.g., from replicate samples; a simple estimator is sketched below).
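One simple way to do this, sketched below, is a method-of-moments fit to replicate samples that share the same expected count (my illustration, not an existing NAO analysis):

```python
import numpy as np

def estimate_inverse_overdispersion(counts):
    """Method-of-moments estimate of the inverse overdispersion from replicate counts.

    For a Poisson-gamma mixture: var = mean + mean**2 / phi.
    """
    counts = np.asarray(counts, dtype=float)
    mean, var = counts.mean(), counts.var(ddof=1)
    excess = var - mean
    if excess <= 0:
        return np.inf  # no detectable overdispersion beyond Poisson
    return mean ** 2 / excess

print(estimate_inverse_overdispersion([3, 0, 7, 1, 12, 2]))  # toy replicate counts
```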
Potential extensions
- We could turn this analysis into a “plausibility map”: given a system design (budget or sequencing depth, sampling interval, processing delay, etc.), what ranges of growth rates and P2RA factors could we detect reliably by a target cumulative incidence?
- We could extend the model to consider multiple sampling sites.
- The current epidemic model is completely deterministic. It would be good to check whether adding randomness changes our conclusions. (I suspect it won’t in a single-population model, but may matter for multiple sampling sites.)
- We could consider a more sophisticated detection model than just cumulative reads. For example, we could analyze a toy model of EGD.
- We could explore the noise distribution of real data, trying to measure the inverse overdispersion and whether the latent noise is mostly independent or correlated between samples.