In previous entries, I presented the results of my Nextflow workflow on recent BMC RNA-seq data. To validate and calibrate my Bowtie2/Kraken2 pipeline for identifying human-viral reads, I manually BLASTed putative viral reads against NCBI’s nt database using their online BLAST tool, then assessed the results by eye to determine if each read likely came from a human virus. This approach was effective, but slow and manual, and thus hard to scale to other datasets I wanted to analyze with a higher relative abundance of viral sequences. I thus investigated options for automating the process to generate custom tabular BLAST results on the command line, which could then be read and processed programmatically.
Among the several options I investigated, the most promising were (1) Elastic-BLAST, NCBI’s own cloud-based BLAST tool, and (2) running regular command-line blast with the -remote option enabled to query NCBI’s online databases. I don’t currently have the AWS permissions needed to run Elastic-BLAST on our AWS service, and blastn -remote initially seemed too slow to be useful when run on an EC2 instance. However, I discovered that the latter option appears to run much more quickly when run on my local machine, making this a much more attractive option, at least until I’m able to run Elastic-BLAST.
To evaluate the potential for this approach to semi-automate the process of HV read validation, I ran BLASTN on the 227 putative HV reads from my previous BMC entry, using the following command:
To begin with, I first needed to identify a list of taxids I could use to designate a read as human-viral. To do this, I took the list of HV taxids generated by my Nextflow workflow and expanded it to include any missing descendent taxids using the NCBI taxid tree structure. I also generated another list of taxids that included any taxid descended from any of these taxids or the overall virus taxid (10239); this latter approach decreases the risk of false negatives at the cost of potentially increasing the risk of false positives from phage sequences. The two approaches generated lists of 28,105 and 234,499 taxids, respectively.
I next imported the BLAST alignment results and performed exploratory data analysis. Following visual inspection, it appeared that the following process would generate good results:
If both the forward and reverse reads from a read pair align to the same viral taxid (given the specified filters on query coverage, percent identity, etc), classify that read pair as human viral.
If only one of the reads in a read pair aligns to any given viral taxid, flag the read for manual inspection.
If neither read aligns to a viral taxid, classify that read pair as non-human-viral.
Using the inferred list of human-virus taxids results in high precision and specificity, but low sensitivity, missing many reads flagged as human-viral by manual inspection. Conversely, using a list of all viral taxids as the reference achieves perfect sensitivity and near-perfect precision and specificity, with only one false positive result, resulting in an F1 score of over 99%.
That one false positive turned out to be a sequence with a high-identity but low-query-coverage human-viral match (to human enterovirus 71) which was excluded during my previous manual assessment; in my opinion, it’s unclear whether this should really be classed as a false positive. If we want to class this sequence as non-human-viral, increasing the query coverage threshold from 30% to 35% successfully excludes it without losing any true positives, resulting in perfect precision relative to manual assignments:
In summary, using command-line blastn -remote, combined with tabular processing in R against a reference class of all viral taxids, works well as a substitute for manual inspection of online BLAST results for validating human-viral read assignments. It’s too slow and failure-prone to be added to the pipeline as a replacement or supplement to the Bowtie/Kraken automated approach, but will be my go-to approach in future for manual validation and refinement of human-viral assignment criteria – at least until I’m able to try out Elastic-BLAST.
Source Code
---title: "Automating BLAST validation of human viral read assignment"subtitle: "Experiments with BLASTN remote mode"author: "Will Bradshaw"date: 2024-01-30format: html: code-fold: true code-tools: true code-link: true df-print: pagededitor: visualtitle-block-banner: black---```{r}#| label: load-packages#| include: falselibrary(tidyverse)library(cowplot)library(patchwork)library(fastqcr)source("../scripts/aux_plot-theme.R")theme_kit <- theme_base +theme(aspect.ratio =1,axis.text.x =element_text(hjust =1, angle =45),axis.title.x =element_blank())```In previous entries, I presented the results of my [Nextflow workflow](https://github.com/naobservatory/mgs-workflow) on recent BMC RNA-seq data. To validate and calibrate my Bowtie2/Kraken2 pipeline for identifying human-viral reads, I manually BLASTed putative viral reads against NCBI's nt database using their online BLAST tool, then assessed the results by eye to determine if each read likely came from a human virus. This approach was effective, but slow and manual, and thus hard to scale to other datasets I wanted to analyze with a higher relative abundance of viral sequences. I thus investigated options for automating the process to generate custom tabular BLAST results on the command line, which could then be read and processed programmatically.Among the several options I investigated, the most promising were (1) [Elastic-BLAST](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/), NCBI's own cloud-based BLAST tool, and (2) running regular command-line blast with the `-remote` option enabled to query NCBI's online databases. I don't currently have the AWS permissions needed to run Elastic-BLAST on our AWS service, and `blastn -remote` initially seemed too slow to be useful when run on an EC2 instance. However, I discovered that the latter option appears to run much more quickly when run on my local machine, making this a much more attractive option, at least until I'm able to run Elastic-BLAST.To evaluate the potential for this approach to semi-automate the process of HV read validation, I ran BLASTN on the 227 putative HV reads from my previous BMC entry, using the following command:``` blastn -query <input_path> -out <output_path> -db nt -remote -perc_identity 60 -max_hsps 5 -num_alignments 50 -qcov_hsp_perc 30 -outfmt "6 qseqid sseqid sgi staxid qlen evalue bitscore qcovs length pident mismatch gapopen sstrand qstart qend sstart send"```To begin with, I first needed to identify a list of taxids I could use to designate a read as human-viral. To do this, I took the list of HV taxids generated by my Nextflow workflow and expanded it to include any missing descendent taxids using the NCBI taxid tree structure. I also generated another list of taxids that included any taxid descended from any of these taxids or the overall virus taxid (10239); this latter approach decreases the risk of false negatives at the cost of potentially increasing the risk of false positives from phage sequences. The two approaches generated lists of 28,105 and 234,499 taxids, respectively.```{r}#| warning: false# Set file pathsdata_dir <-"../data/2024-01-30_blast/"reads_db_path <-file.path(data_dir, "bmc-hv-putative.tsv")blast_results_path <-file.path(data_dir, "bmc-hv-putative.blast")hv_taxids_path <-file.path(data_dir, "human-virus-taxids-all.txt")tax_nodes_path <-file.path(data_dir, "nodes.dmp.gz")# Import files for viral taxid searchhv_taxids <-read_tsv(hv_taxids_path, show_col_types =FALSE, col_names ="taxid")tax_nodes <-read_delim(tax_nodes_path, delim ="\t|\t", show_col_types =FALSE, col_names =FALSE) %>%select(X1:X3) %>%rename(child_taxid = X1, parent_taxid = X2, rank = X3)# Define taxid search functionexpand_taxids <-function(taxids_in, nodes){ taxids_out <- taxids_in taxids_new <-filter(nodes, parent_taxid %in% taxids_out, !child_taxid %in% taxids_out) %>%pull(child_taxid) %>% sortwhile (length(taxids_new) >0){ taxids_out <-c(taxids_out, taxids_new) %>% sort taxids_new <-filter(nodes, parent_taxid %in% taxids_out, !child_taxid %in% taxids_out) %>%pull(child_taxid) %>% sort }return(taxids_out)}v_taxids_1 <-expand_taxids(hv_taxids$taxid, tax_nodes)v_taxids_2 <-expand_taxids(c(10239, hv_taxids$taxid), tax_nodes)```I next imported the BLAST alignment results and performed exploratory data analysis. Following visual inspection, it appeared that the following process would generate good results:1. If both the forward and reverse reads from a read pair align to the same viral taxid (given the specified filters on query coverage, percent identity, etc), classify that read pair as human viral.2. If only one of the reads in a read pair aligns to any given viral taxid, flag the read for manual inspection.3. If neither read aligns to a viral taxid, classify that read pair as non-human-viral.```{r}# Prepare and import BLAST resultsreads_db <-read_tsv(reads_db_path, show_col_types =FALSE)blast_cols <-c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")blast_results <-read_tsv(blast_results_path, show_col_types =FALSE, col_names = blast_cols,col_types =cols(.default="c"))# Filter BLAST results by read IDreads_db_seqs <- reads_db %>%mutate(read_id =paste(sample, seq_num, sep="_")) %>%pull(read_id) %>%c(paste(., "1", sep="_"), paste(., "2", sep="_"))blast_results_filtered <- blast_results %>%filter(qseqid %in% reads_db_seqs)# Summarize BLAST results by read pair and subject taxidblast_results_paired <- blast_results_filtered %>%separate(qseqid, c("sample", "seq_num", "read_pair"), "_") %>%group_by(sample, seq_num, read_pair, staxid) %>%filter(bitscore ==max(bitscore)) %>%filter(length ==max(length)) %>%filter(row_number() ==1) %>%group_by(sample, seq_num, staxid) %>%mutate(bitscore =as.numeric(bitscore), seq_num =as.numeric(seq_num)) %>%summarize(bitscore_max =max(bitscore), bitscore_min =min(bitscore), n_reads =n(), .groups ="drop")# Assign viral status to BLAST resultsblast_results_hviral <- blast_results_paired %>%mutate(viral_1 = staxid %in% v_taxids_1,viral_2 = staxid %in% v_taxids_2) %>%mutate(viral_1_full = viral_1 & n_reads ==2,viral_2_full = viral_2 & n_reads ==2)# Assign putative viral status to read pairsblast_results_out <- blast_results_hviral %>%group_by(sample, seq_num) %>%summarize(viral_status_1 =ifelse(any(viral_1_full), "TRUE", ifelse(any(viral_1), "INSPECT", "FALSE")),viral_status_2 =ifelse(any(viral_2_full), "TRUE", ifelse(any(viral_2), "INSPECT", "FALSE")),.groups ="drop") %>%inner_join(reads_db %>%select(sample, seq_num, hv_blast), by=c("sample", "seq_num"))# Assume manual inspection produces same results as previousblast_results_inspected <- blast_results_out %>%mutate(viral_status_1 =ifelse(viral_status_1 =="INSPECT", hv_blast, as.logical(viral_status_1)),viral_status_2 =ifelse(viral_status_2 =="INSPECT", hv_blast, as.logical(viral_status_2)))# Evaluate performance vs fully manual inspectionblast_status <- blast_results_inspected %>%mutate(pos_tru_1 = viral_status_1 & hv_blast,pos_fls_1 = viral_status_1 &!hv_blast,neg_tru_1 =!viral_status_1 &!hv_blast,neg_fls_1 =!viral_status_1 & hv_blast,pos_tru_2 = viral_status_2 & hv_blast,pos_fls_2 = viral_status_2 &!hv_blast,neg_tru_2 =!viral_status_2 &!hv_blast,neg_fls_2 =!viral_status_2 & hv_blast)blast_performance_1 <- blast_status %>%summarize(n_pos_tru =sum(pos_tru_1),n_pos_fls =sum(pos_fls_1),n_neg_tru =sum(neg_tru_1),n_neg_fls =sum(neg_fls_1),precision = n_pos_tru / (n_pos_tru + n_pos_fls),sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),specificity = n_neg_tru / (n_neg_tru + n_pos_fls),f1 =2* precision * sensitivity / (precision + sensitivity)) %>%mutate(ref_taxids ="Expanded human viral")blast_performance_2 <- blast_status %>%summarize(n_pos_tru =sum(pos_tru_2),n_pos_fls =sum(pos_fls_2),n_neg_tru =sum(neg_tru_2),n_neg_fls =sum(neg_fls_2),precision = n_pos_tru / (n_pos_tru + n_pos_fls),sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),specificity = n_neg_tru / (n_neg_tru + n_pos_fls),f1 =2* precision * sensitivity / (precision + sensitivity)) %>%mutate(ref_taxids ="All viral")blast_performance <-bind_rows(blast_performance_1, blast_performance_2) %>%select(ref_taxids, n_pos_tru:f1)blast_performance```Using the inferred list of human-virus taxids results in high precision and specificity, but low sensitivity, missing many reads flagged as human-viral by manual inspection. Conversely, using a list of all viral taxids as the reference achieves perfect sensitivity and near-perfect precision and specificity, with only one false positive result, resulting in an F1 score of over 99%.That one false positive turned out to be a sequence with a high-identity but low-query-coverage human-viral match (to human enterovirus 71) which was excluded during my previous manual assessment; in my opinion, it's unclear whether this should really be classed as a false positive. If we want to class this sequence as non-human-viral, increasing the query coverage threshold from 30% to 35% successfully excludes it without losing any true positives, resulting in perfect precision relative to manual assignments:```{r}# Filter BLAST results by query coverageblast_results_qcov <- blast_results_filtered %>%filter(as.numeric(qcovs) >=35)# Summarize BLAST results by read pair and subject taxidblast_results_qcov_paired <- blast_results_qcov %>%separate(qseqid, c("sample", "seq_num", "read_pair"), "_") %>%group_by(sample, seq_num, read_pair, staxid) %>%filter(bitscore ==max(bitscore)) %>%filter(length ==max(length)) %>%filter(row_number() ==1) %>%group_by(sample, seq_num, staxid) %>%mutate(bitscore =as.numeric(bitscore), seq_num =as.numeric(seq_num)) %>%summarize(bitscore_max =max(bitscore), bitscore_min =min(bitscore), n_reads =n(), .groups ="drop")# Assign viral status to BLAST resultsblast_results_qcov_hviral <- blast_results_qcov_paired %>%mutate(viral_1 = staxid %in% v_taxids_1,viral_2 = staxid %in% v_taxids_2) %>%mutate(viral_1_full = viral_1 & n_reads ==2,viral_2_full = viral_2 & n_reads ==2)# Assign putative viral status to read pairsblast_results_qcov_out <- blast_results_qcov_hviral %>%group_by(sample, seq_num) %>%summarize(viral_status_1 =ifelse(any(viral_1_full), "TRUE", ifelse(any(viral_1), "INSPECT", "FALSE")),viral_status_2 =ifelse(any(viral_2_full), "TRUE", ifelse(any(viral_2), "INSPECT", "FALSE")),.groups ="drop") %>%full_join(reads_db %>%select(sample, seq_num, hv_blast), by=c("sample", "seq_num")) %>%mutate(viral_status_1 =replace_na(viral_status_1, "FALSE"),viral_status_2 =replace_na(viral_status_2, "FALSE"))# Assume manual inspection produces same results as previousblast_results_qcov_inspected <- blast_results_qcov_out %>%mutate(viral_status_1 =ifelse(viral_status_1 =="INSPECT", hv_blast, as.logical(viral_status_1)),viral_status_2 =ifelse(viral_status_2 =="INSPECT", hv_blast, as.logical(viral_status_2)))# Evaluate performance vs fully manual inspectionblast_status_qcov <- blast_results_qcov_inspected %>%mutate(pos_tru_1 = viral_status_1 & hv_blast,pos_fls_1 = viral_status_1 &!hv_blast,neg_tru_1 =!viral_status_1 &!hv_blast,neg_fls_1 =!viral_status_1 & hv_blast,pos_tru_2 = viral_status_2 & hv_blast,pos_fls_2 = viral_status_2 &!hv_blast,neg_tru_2 =!viral_status_2 &!hv_blast,neg_fls_2 =!viral_status_2 & hv_blast)blast_performance_qcov_1 <- blast_status_qcov %>%summarize(n_pos_tru =sum(pos_tru_1),n_pos_fls =sum(pos_fls_1),n_neg_tru =sum(neg_tru_1),n_neg_fls =sum(neg_fls_1),precision = n_pos_tru / (n_pos_tru + n_pos_fls),sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),specificity = n_neg_tru / (n_neg_tru + n_pos_fls),f1 =2* precision * sensitivity / (precision + sensitivity)) %>%mutate(ref_taxids ="Expanded human viral")blast_performance_qcov_2 <- blast_status_qcov %>%summarize(n_pos_tru =sum(pos_tru_2),n_pos_fls =sum(pos_fls_2),n_neg_tru =sum(neg_tru_2),n_neg_fls =sum(neg_fls_2),precision = n_pos_tru / (n_pos_tru + n_pos_fls),sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),specificity = n_neg_tru / (n_neg_tru + n_pos_fls),f1 =2* precision * sensitivity / (precision + sensitivity)) %>%mutate(ref_taxids ="All viral")blast_performance_qcov <-bind_rows(blast_performance_qcov_1, blast_performance_qcov_2) %>%select(ref_taxids, n_pos_tru:f1)blast_performance_qcov```In summary, using command-line `blastn -remote,` combined with tabular processing in R against a reference class of all viral taxids, works well as a substitute for manual inspection of online BLAST results for validating human-viral read assignments. It's too slow and failure-prone to be added to the pipeline as a replacement or supplement to the Bowtie/Kraken automated approach, but will be my go-to approach in future for manual validation and refinement of human-viral assignment criteria – at least until I'm able to try out Elastic-BLAST.