Computational Approaches to Pathogen Detection


While this post is my perspective and not an official post of my employer's, it also draws on a lot of collaborative work with others at the Nucleic Acid Observatory (NAO).

One of the future scenarios I'm most worried about is someone creating a "stealth" pandemic. Imagine a future HIV that first infects a large number of people with minimal side effects and only shows its nasty side after it has spread very widely. This is not something we're prepared for today: current detection approaches (symptom reporting, a doctor noticing an unusual pattern) require visible effects.

Over the last year, with my colleagues at the NAO, I've been exploring one promising method of identifying this sort of pandemic. The overall idea is:

  1. Collect some sort of biological material from a lot of people on an ongoing basis, for example by sampling sewage.

  2. Use metagenomic sequencing to learn what nucleic acids are in these samples.

  3. Run novel-pathogen detection algorithms on the sequencing data.

  4. When you find something sufficiently concerning, follow up with tests for the specific thing you've found.

While there are important open questions in all four of these, I've been most focused on the third: once you have metagenomic sequencing data, what do you do?

I see four main approaches. You can look for sequences that are:

  • Dangerous

  • Modified

  • New

  • Growing


Dangerous

There are some genetic sequences that code for dangerous things that we should not normally see in our samples. If you see a series of base pairs that is unique to smallpox, that's very concerning! The main downside of this approach, if you want to extend it beyond smallpox etc., is that you need to make a list of non-obvious dangerous things, which is in itself a dangerous thing to do: what if your list is stolen and it points people to sequences they wouldn't have thought to try using?
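To make the basic matching step concrete, here is a minimal sketch in Python, assuming the watchlist is stored as hashed k-mers and each read is checked one k-mer at a time. The function names, the k-mer length, and the example sequences are all hypothetical. Plain hashing like this does not solve the stolen-list problem (short k-mers can be brute-forced), and the sketch ignores reverse complements and sequencing errors.

```python
import hashlib

K = 8  # toy k-mer length (assumption); a real system would likely use longer k-mers


def kmers(seq, k=K):
    """Yield every length-k substring of seq, uppercased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digest(kmer):
    """Hash a k-mer so the stored watchlist is not plain sequence."""
    return hashlib.sha256(kmer.encode("ascii")).hexdigest()


def build_watchlist(flagged_seqs, k=K):
    """Set of hashed k-mers drawn from every flagged sequence."""
    return {digest(km) for s in flagged_seqs for km in kmers(s, k)}


def screen_read(read, watchlist, k=K):
    """Return the 0-based offsets in read whose k-mer is on the watchlist."""
    return [i for i, km in enumerate(kmers(read, k)) if digest(km) in watchlist]


# Made-up "flagged" sequence, for illustration only.
watchlist = build_watchlist(["ACGTTGCAAGGCT"])
print(screen_read("TTTT" + "GTTGCAAGG" + "AAAA", watchlist))  # [4, 5]
print(screen_read("TTTTTTTTTTTTTTTT", watchlist))              # []
```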

This is similar to another problem: how do you check if people are synthesizing dangerous sequences without risking a list of all the things that shouldn't be synthesized? SecureDNA has been working on this problem, with an encrypted database with a distributed key system that allows flagging sequences without it being practical to get a list of all flagged sequences (paper).

There are some blockers to using SecureDNA for this today, since it was designed for slightly different constraints, but I think they are all surmountable and I'm hoping to implement a SecureDNA-based metagenomic sequencing screening system at some point in the next year.

An alternative and somewhat longer-term approach here would be to use tools that are able to estimate the function of novel sequences to extend this to sequences that aren't closely derived from existing ones. I'm less enthusiastic about this: not only could this work end up increasing risk by improving humanity's ability to judge how dangerous a novel sequence is, it's not clear to me that this approach is likely to catch things the other methods wouldn't.


Modified

In engineering a new virus for a stealth pandemic, the easiest way would likely be to begin with an existing virus. If we see a sequencing read where part matches a known viral genome and part does not (a "chimera"), one potential explanation is that the read comes from a genetically engineered virus.
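As a toy illustration of the "part matches, part does not" idea, the sketch below marks which bases of a read are covered by k-mers from a reference and flags reads that have both a long covered stretch and a long uncovered one. This is a sketch under stated assumptions, not a description of any production pipeline: the function names, thresholds, and sequences are made up, and a real system would more likely use an aligner against a large reference database.

```python
def kmer_set(seq, k):
    """All distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def covered_mask(read, ref_kmers, k):
    """True for each base of read that lies inside at least one reference k-mer."""
    mask = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in ref_kmers:
            for j in range(i, i + k):
                mask[j] = True
    return mask


def longest_run(mask, value):
    """Length of the longest consecutive run of `value` in mask."""
    best = cur = 0
    for m in mask:
        cur = cur + 1 if m == value else 0
        best = max(best, cur)
    return best


def looks_chimeric(read, ref_kmers, k=6, min_known=10, min_unknown=10):
    """Flag a read with both a long matching stretch and a long non-matching one."""
    mask = covered_mask(read, ref_kmers, k)
    return (longest_run(mask, True) >= min_known
            and longest_run(mask, False) >= min_unknown)


# Made-up reference and reads, for illustration only.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCAT"
ref_kmers = kmer_set(reference, 6)
print(looks_chimeric(reference[:20] + "T" * 15, ref_kmers))  # True
print(looks_chimeric(reference[3:30], ref_kmers))            # False
```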

But this is not the only reason this approach could flag a read. For example, it could come from:

  • Lack of knowledge. Perhaps a virus has a lot of variation, much more than is reflected in the databases you are using to define "normal". It will look like you have found a novel virus when it's just an incomplete database. And, of course, the database will always be incomplete: viruses are always evolving. Still, solving this seems practical: handling these initial false positives requires expanding our knowledge of the variety of existing viruses, but that is something many virologists are deeply interested in.

  • Sequencing: perhaps some of the biological processing you do prior to (or during) sequencing can attach unrelated fragments. When you see a chimera how do you know whether that existed in the sample you originally collected vs if it was created accidentally in the lab? On the other hand, you can (a) compare the fraction of chimeras in different sequencing approaches and pick ones where this is rare and (b) pay more attention to cases where you've seen the same chimera multiple times.

  • Biological chimerism: bacteria will occasionally incorporate viral sequences. This method would flag this as genetic engineering even if it was a natural and unconcerning process. As long as this is rare enough, however, we can deal with this by surfacing such reads to a biologist who figures out how concerned to be and what next steps make sense.
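The idea in (b) above, paying more attention to a chimera seen multiple times, can be sketched as counting how many distinct reads support each junction. The sketch assumes some earlier step has already produced a breakpoint position for each candidate read. The names, the flank size, and the choice to merge the two strands via a canonical reverse-complement key are illustrative assumptions.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def junction_key(read, breakpoint, flank=6):
    """Strand-independent sequence spanning the breakpoint.

    Returns None when the breakpoint is too close to either end of the read.
    """
    if breakpoint < flank or breakpoint + flank > len(read):
        return None
    window = read[breakpoint - flank:breakpoint + flank].upper()
    return min(window, reverse_complement(window))


def recurrent_junctions(observations, min_reads=2, flank=6):
    """observations: iterable of (read_id, read_sequence, breakpoint).

    Returns {junction_key: number of distinct supporting reads} for junctions
    supported by at least min_reads reads.
    """
    support = defaultdict(set)
    for read_id, read, bp in observations:
        key = junction_key(read, bp, flank)
        if key is not None:
            support[key].add(read_id)
    return {key: len(ids) for key, ids in support.items() if len(ids) >= min_reads}
```

Note that counting distinct reads does not by itself rule out lab-created chimeras, since duplicates of one artifact can also show up more than once. Requiring support across separate samples or sequencing runs would be a natural extension.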

This is the main approach I've been working on lately, trying to get the false positive rate down.


New

If we understood what "normal" looked like well enough, then we could flag anything new for investigation. This is a serious research project: if you take data from a sewage sample and run it through basic tooling, it's common to have 50% of reads unclassified. Making progress here will require, among other things, much better tooling (and maybe algorithms) for metagenomic assembly: I'm not aware of anything that could efficiently integrate trillions of bases a week into an assembly graph.
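At toy scale, "flag anything new" can be sketched as keeping a set of every k-mer seen so far and reporting what fraction of each new sample's distinct k-mers has never appeared before. This is only meant to show the shape of the computation: the names and parameters are hypothetical, and an in-memory set of k-mers would not scale to the data volumes discussed here.

```python
def kmers(seq, k):
    """All distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novelty_by_sample(samples, k=5):
    """For each sample, in order, the fraction of its distinct k-mers
    that were not seen in any earlier sample."""
    seen = set()
    fractions = []
    for reads in samples:
        current = set()
        for read in reads:
            current |= kmers(read, k)
        new = current - seen
        fractions.append(len(new) / len(current) if current else 0.0)
        seen |= current
    return fractions


# Made-up reads: the third sample introduces one unseen k-mer (TTTTT).
samples = [["ACGTACG"], ["ACGTACG"], ["ACGTACG", "TTTTTTT"]]
print(novelty_by_sample(samples))  # [1.0, 0.0, 0.25]
```

If this fraction stays high as more samples accumulate, novelty alone is a weak signal, since most "new" k-mers would just be rare background.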

Ryan Teo, a first-year graduate student with Nicole Wheeler at the University of Birmingham, has started his thesis in this area, which I'm really excited to see. Lenni Justen, another first-year graduate student, with Kevin Esvelt, is also exploring this area as part of his work with the NAO. I'd be excited to see more work, however, and if you're working on this or interested in working on it but blocked by not having access to enough metagenomic sequencing data, please get in touch!


Growing

It may turn out that our samples are deeply complex: potentially, as you sequence, the rate of seeing new things falls off very slowly. If it falls off slowly enough, then you will keep seeing "new" things that are just so rare that you haven't happened to see them before. I am quite unsure how likely this is, and I expect it varies by sample type (sewage is likely much more complex than, say, blood), but it seems possible. An approach that's robust to this is that instead of flagging something just for being new, you could flag it based on its growth pattern: first you've never seen it, then you see it once, then you start seeing it more often, then you start seeing it many times per sample. In theory a new pandemic should begin with approximately exponential spread, since with few people already infected the number of new infections should be proportional to the number of infectious people.

At the NAO we've been calling this "exponential growth detection" (EGD). We worked on this some in 2022, but have put it on hold until we have a deep enough timeseries dataset to work with.
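One simple way to test a single sequence for this kind of growth pattern is a least-squares line through the log of its relative abundance over time. If new infections are proportional to current infections, abundance should look roughly like exp(r * t), which is a straight line with slope r on a log scale. The sketch below is a generic log-linear fit under that assumption, not the NAO's EGD implementation. The function name, the pseudocount, and the example numbers are all made up.

```python
import math


def fit_exponential(days, counts, totals, pseudocount=0.5):
    """Least-squares fit of log(relative abundance) = a + r * day.

    days:   sampling times, in days
    counts: reads matching the sequence in each sample
    totals: total reads in each sample
    The pseudocount keeps log() defined when a count is zero.
    Returns (r, doubling_time_days); doubling time is math.inf when r <= 0.
    """
    ys = [math.log((c + pseudocount) / t) for c, t in zip(counts, totals)]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    r = sxy / sxx
    doubling = math.log(2) / r if r > 0 else math.inf
    return r, doubling


# Made-up weekly counts for one sequence, out of a million reads per sample.
days = [0, 7, 14, 21, 28]
counts = [0, 1, 3, 5, 12]
totals = [1_000_000] * 5
r, doubling = fit_exponential(days, counts, totals)
print(f"growth rate {r:.3f}/day, doubling time {doubling:.1f} days")
```

A real detector would also need to account for noise in very low counts and for the number of sequences being tested at once. A single fitted slope is only the starting point.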

These approaches can also be combined: if a sequence originally comes to your attention because it's chimeric but you're not sure how seriously to take it, you could look at the growth pattern of its components. Or, while you can detect growing things with a genome-free approach simply by looking for increasing k-mers, the kind of "thoroughly understand the metagenome" work that I described above as an approach for identifying new things can also be used to make a much more sensitive tool that detects growing things.
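The genome-free version, looking for increasing k-mers, can be sketched as counting k-mers in each time-ordered sample and keeping those whose relative abundance never falls and ends higher than it started. The names, thresholds, and reads are hypothetical, and the sketch ignores reverse complements, sequencing errors, and sampling noise.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def growing_kmers(samples, k=5, min_final_count=3):
    """samples: list of read lists, ordered oldest to newest.

    Returns the sorted k-mers that appear at least min_final_count times in
    the newest sample and whose relative abundance is non-decreasing across
    samples and higher at the end than at the start.
    """
    per_sample = [kmer_counts(reads, k) for reads in samples]
    totals = [sum(c.values()) or 1 for c in per_sample]
    flagged = []
    for kmer, final in per_sample[-1].items():
        if final < min_final_count:
            continue
        freqs = [c[kmer] / t for c, t in zip(per_sample, totals)]
        non_decreasing = all(a <= b for a, b in zip(freqs, freqs[1:]))
        if non_decreasing and freqs[-1] > freqs[0]:
            flagged.append(kmer)
    return sorted(flagged)


# Made-up data: a constant background plus a read that shows up 0, 1, 2, 4 times.
background = ["ACGTACGTAC"] * 5
samples = [background + ["GGGGGGG"] * n for n in (0, 1, 2, 4)]
print(growing_kmers(samples))  # ['GGGGG']
```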

In terms of prioritization, I'm enthusiastic about work on all of these, and would like to see them progress in parallel. The approaches of detecting dangerous and modified sequences require less scientific progress and should work on amounts of data that are achievable with philanthropic funding. De novo protein design is getting more capable and more accessible, however, which allows creation of pathogens those two methods don't catch. We will need approaches that don't depend on matching known things, which is where detecting new and/or growing sequences comes in. Those two methods will require a lot more data, enough that unless sequencing goes through another round of the kind of massive cost improvement we saw in 2008-2011 we're talking about large-scale government-funded projects. Advances in detection methods make it more likely that we'll be able to make the case for these larger projects, and reduce the risk that the detection ability might lag infrastructure creation.
