Author

Harmon Bhasin

Published

July 22, 2024

This is another potential study of this series. In this post, I analyze Thijssen 2023, a dataset with 20 pooled samples from 100 healthy individuals in Iran.

This notebook uses the MGS Workflow v2.2.1 (note that this is old).

1 The raw data

1.1 About

This dataset comprises 20 samples, each derived from plasma pools of 5 people from Iran. In total, 100 healthy individuals contributed to these pools. For each pooled sample, a combined DNA and RNA library preparation was performed, resulting in a single sequencing output that captures both nucleic acid types. This approach provides comprehensive genetic information but precludes separate analysis of DNA and RNA from individual samples.

Code

# Import libraries and extract metadata from sample names
libraries <- read_csv(libraries_path, show_col_types = FALSE)
meta_data <- read_csv('/Users/harmonbhasin/work/securebio/nao-harmon/thijssen2023/preprocessing/SraRunTable.txt') %>%
  select('Library Name', Run) %>% rename(sample = Run, library = 'Library Name')
libraries <- libraries %>%
  left_join(meta_data, by = 'sample') %>%
  select(sample, library=library.y) %>% 
  filter(library != 'PCL21') %>%
  mutate(library = factor(library, levels = c(paste0('PCL', 1:20))))
libraries

1.1.1 Sample + library preparation

Paraphrased from the paper:

Study was conducted from 2017-2018 in the Boushehr Province, Iran. Participants were recruited in one of the five transfusion clinics located in Boushehr. Seven milliliters of blood were collected from each individual during blood transfusion or donation. Immediately after collection, plasma was separated from the samples and stored at −70 °C.

Initially, the samples were centrifuged and 100 µL of the supernatant was pooled with five samples per pool. The pooled plasma samples were subjected to an adapted version of the NetoVIR protocol for viral particle enrichment and metagenomic sequencing (more information can be found here). Pooled plasma samples were centrifuged for 3 min at 17,000× g and filtered through 0.8 µm polyether sulphone filters. To remove free-floating nucleic acids, the filtered samples were subjected to a nuclease treatment with a cocktail of 1 µL micrococcal nuclease and 2 µL benzonase for 2 h at 37 °C. Both viral DNA and RNA were extracted and randomly amplified (including primary step of reverse transcription) with the Whole Transcriptome Amplification 2 kit for 20 cycles.

The amplification product was purified with the MSB SPIN PCRAPACE kit and prepared for sequencing by using the Nextera XT kit. DNA products were quantified with the Qubit fluorometer, and the High-Sensitivity DNA kit for the Bioanalyzer 2100 was used to determine the average library fragment size. Samples were pooled in equimolar ratios, and paired-end sequencing (2 × 150 bp) was performed on a NextSeq 500 Illumina platform with an average of 10 million reads per sample.

1.2 Quality control metrics

In total, these 20 samples contained ~116M read pairs. The samples had 1.3M - 11.1M (mean ~5.8M) read pairs each.

The number of reads looks pretty good, although a few samples have a much lower number of reads. The total number of base pairs also looks reasonable, matching the trends of the number of reads. The duplication rate is low. Adapter content is quite high for nextera-transposase-sequence (library contamination).

As we’d expect, we see higher quality Q scores near the beginning of the read and a gradual decline towards the end of the read, however all positions seem to have a Q score of 35 which means that our reads are ~99.97% accurate. When looking at the Q score over all sequences, we can see a sharp peak around 35, which corresponds to the previous plot, indicating high read quality.

Code

# Prepare data
basic_stats_raw_metrics <- basic_stats_raw %>%
  select(library,
         `# Read pairs` = n_read_pairs,
         `Total base pairs\n(approx)` = n_bases_approx,
         `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
  pivot_longer(-library, names_to = "metric", values_to = "value") %>%
  mutate(metric = fct_inorder(metric)) 

# Set up plot templates

g_basic <- ggplot(basic_stats_raw_metrics, aes(x = library, y = value)) +
  geom_col(position = "dodge") +
  scale_x_discrete() +
  scale_y_continuous(expand = c(0, 0)) +
  expand_limits(y = c(0, 100)) +
  facet_grid(metric ~ ., scales = "free", space = "free_x", switch = "y") +
  theme_kit + theme(
    axis.title.y = element_blank(),
    strip.text.y = element_text(face = "plain")
  )
g_basic

Code

# Set up plotting templates
g_qual_raw <- ggplot(mapping=aes(linetype=read_pair, group=interaction(sample,read_pair))) + 
  scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters_raw <- g_qual_raw + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
  scale_y_continuous(name="% Adapters", limits=c(0,NA),
                     breaks = seq(0,100,10), expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,500,20), expand=c(0,0)) +
  facet_grid(.~adapter)
g_adapters_raw

Code

# Visualize quality
g_quality_base_raw <- g_qual_raw +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,500,20), expand=c(0,0))
g_quality_base_raw

Code

g_quality_seq_raw <- g_qual_raw +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0))
g_quality_seq_raw

2 Preprocessing

2.1 Summary

The average fraction of reads at each stage in the preprocessing pipeline is shown in the following table. Adapter trimming & filtering doesn’t lose too many reads. Deduplication loses us about 30% of reads on average, then ribodepletion only loses as about 0.7% on average.

Code

# TODO: Group by pool size as well
# Count read losses
n_reads_rel <- basic_stats %>% 
  select(sample, stage, percent_duplicates, n_read_pairs) %>%
  group_by(sample) %>% 
  arrange(sample, stage) %>%
  mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),
         p_reads_lost = 1 - p_reads_retained,
         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
         p_reads_lost_abs = 1-p_reads_retained_abs,
         p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs))
n_reads_rel_display <- n_reads_rel %>% 
  rename(Stage=stage) %>% 
  group_by(Stage) %>% 
  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% 
  filter(Stage != "raw_concat") %>%
  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
n_reads_rel_display

2.2 Quality control metrics

Adapter trimming gets rid of nextera_transposase_sequenece and poly_g, although there is still a bit of poly_a adapter contamination (overivew of fastqc adapters). Not sure how to interpret the poly_a contamination, but it’s not that high so I’m not going to worry about it for now. Q score remain the same during read cleaning when looking at the positions, with the end of the read actually improving in score. Q scores across all sequences look pretty much the same throughout cleaning.

Code

g_qual <- ggplot(mapping=aes(linetype=read_pair, group=interaction(sample,read_pair))) + 
  scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters <- g_qual + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
  scale_y_continuous(name="% Adapters", limits=c(0,20),
                     breaks = seq(0,50,10), expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~adapter)
g_adapters

Code

# Visualize quality
g_quality_base <- g_qual +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~.)
g_quality_base

Code

g_quality_seq <- g_qual +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
  facet_grid(stage~., scales = "free_y")
g_quality_seq

These plots below show the trends from above in each sample. All samples tend to follow similar trends for deduplication, with a large decrease in read length post adapter trimming. Ribosomal reads were quite low, near 1% for every sample.

Code

g_stage_trace <- ggplot(basic_stats, aes(x=stage, group=sample)) +
  theme_kit

# Plot reads over preprocessing
g_reads_stages <- g_stage_trace +
  geom_line(aes(y=n_read_pairs)) +
  scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
g_reads_stages

Code

# Plot relative read losses during preprocessing
g_reads_rel <- ggplot(n_reads_rel, 
                      aes(x=stage, group=sample)) +
  geom_line(aes(y=p_reads_lost_abs_marginal)) +
  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
                     labels = function(x) x*100) +
  theme_kit
g_reads_rel

Code

stage_dup <- basic_stats %>% group_by(stage) %>% 
  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
            dmean=mean(percent_duplicates), .groups = "drop")

g_dup_stages <- g_stage_trace +
  geom_line(aes(y=percent_duplicates)) +
  scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
g_dup_stages

Code

g_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) +
  scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
g_readlen_stages

Code

# Calculate reads lost during ribodepletion (approximation for % ribosomal reads)
reads_ribo <- n_reads_rel %>% 
  filter(stage %in% c("dedup", "ribo_secondary")) %>% 
  group_by(sample) %>% 
  summarize(p_reads_ribo=1-n_read_pairs[2]/n_read_pairs[1], .groups = "drop") %>%
  inner_join(libraries, by = 'sample')
reads_ribo_summ <- reads_ribo %>%
  group_by(sample) %>%
  summarize(min=min(p_reads_ribo), max=max(p_reads_ribo),
            mean=mean(p_reads_ribo), .groups = "drop") %>%
  inner_join(libraries, by = 'sample')
g_reads_ribo <- ggplot(reads_ribo, 
                       aes(x=library, y=p_reads_ribo)) +
  geom_point() + 
  scale_y_continuous(name="Approx % ribosomal reads", limits=c(0,1),
                     breaks=seq(0,1,0.2), expand=c(0,0), labels = function(y) y*100)+
  theme_kit
g_reads_ribo

3 Taxonomic composition

3.1 High-level composition

To assess the high-level composition of the reads, I ran the ribodepleted files through Kraken2 and summarized the results with Bracken.

The groups listed below were created by Will:

Filtered (removed during cleaning)
Duplicate (removed during deduplication)
Ribosomal (removed during ribodepletion)
Unassigned (non-ribosomal reads that were not assigned to any taxon by Kraken/Bracken)
Bacterial (non-ribosomal reads assigned to the Bacteria domain by Kraken/Bracken)
Archaeal (non-ribosomal reads assigned to the Archaea domain by Kraken/Bracken)
Viral (non-ribosomal reads assigned to the Viruses domain by Kraken/Bracken)
Human (non-ribosomal reads assigned to the Eukarya domain by Kraken/Bracken)

Code

# Prepare plotting templates
g_comp_base <- ggplot(mapping=aes(x=library, y=p_reads, fill=classification)) +
  scale_x_discrete(name="Plasma pool") +
  theme_kit + 
  theme(plot.title = element_text(hjust=0, face="plain", size=rel(1.5)))
scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
                                   expand = c(0,0), labels = function(y) y*100)
geom_comp <- purrr::partial(geom_col, position = "stack", width = 1)

# Define a color palette for the classification
classification_colors <- brewer.pal(8, "Accent")
names(classification_colors) <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned", "Bacterial", "Archaeal", "Viral", "Human")
scale_fill_classification <- function() {
  scale_fill_manual(values = classification_colors, name = "Classification")
}

# Plot overall composition
g_comp <- g_comp_base + geom_comp(data = comp) +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_classification() + 
  ggtitle("Read composition (all reads, all groups)")
g_comp

Code

g_comp_assigned <- g_comp_base + 
  geom_comp(data = comp_assigned) +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_classification() + 
  ggtitle("Read composition (assigned reads, all groups)")
g_comp_assigned

3.2 Total viral content

Total viral fraction average \(2.36 \times 10^{-4}\) across samples. As a fraction of assigned (rather than total) reads, this jumped to \(3.19 \times 10^{-3}\):

Code

p_reads_viral_all <- comp %>% filter(classification == "Viral") %>%
  mutate(read_group = "All reads")
p_reads_viral_assigned <- comp_assigned %>% filter(classification == "Viral") %>%
  mutate(read_group = "Classified reads")
p_reads_viral <- bind_rows(p_reads_viral_all, p_reads_viral_assigned)

# Plot
g_viral <- ggplot(p_reads_viral, aes(x=library, y=p_reads, color = read_group, group = library)) +
  geom_point(size = 5) +
  geom_line(color = 'black') +
  scale_x_discrete(name="Plasma pool") +
  scale_y_log10(name="Viral read fraction", labels = label_log(digits=2), limits = c(1e-6, 1e-2)) +
  theme_kit + 
  coord_flip()

g_viral

3.3 Taxonomic composition of viruses

The two dominant viruses we see are Anellovirdae and Phycodnaviridae. The threshold for the label “other” are the set of families that make up less than 5% composition in all samples.

Code

major_threshold <- 0.05

# Identify major viral families
viral_families_major_tab <- viral_families %>% 
  group_by(name, taxid) %>%
  summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
  filter(p_reads_viral_max >= major_threshold)
viral_families_major_list <- viral_families_major_tab %>% pull(name)
viral_families_major <- viral_families %>% 
  filter(name %in% viral_families_major_list) %>%
  select(name, taxid, sample, p_reads_viral)
viral_families_minor <- viral_families_major %>% 
  group_by(sample) %>%
  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
  mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
  select(name, taxid, sample, p_reads_viral)
viral_families_display <- viral_families_major %>% 
  bind_rows(viral_families_minor) %>%
  arrange(desc(p_reads_viral)) %>% 
  mutate(name = factor(name, levels=c(viral_families_major_list, "Other"))) %>%
  rename(p_reads = p_reads_viral, classification=name) %>%
  inner_join(libraries, by='sample')

# Plot
g_families <- g_comp_base + 
  geom_comp(data=viral_families_display) +
  scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_brewer(palette = 'Accent')
g_families

4 Human-infecting virus reads

4.1 Overall relative abundance

I calculated the relative abundance of human-infecting viruses in two ways:

First, as the total number of deduplicated human-virus reads in each sample, divided by the number of raw reads (“All reads”). This results in a viral fraction of \(1.64 \times 10^{-4}\) across all samples
Second, as a fraction of preprocessed (cleaned, deduplicated, computationally ribodepleted) reads (“Preprocessed reads”). This results in a viral fraction of \(2.42 \times 10^{-4}\) across all samples.

Code

# Get raw read counts
read_counts_raw <- filter(basic_stats_raw) %>%
  select(sample, n_reads_raw = n_read_pairs)
read_counts_preproc <- basic_stats %>% filter(stage == "ribo_initial") %>%
  select(sample, n_reads_preproc = n_read_pairs)

# Get HV read counts
read_counts_hv <- mrg_hv %>% filter(hv_status) %>% 
  group_by(sample) %>% 
  count(name="n_reads_hv")
read_counts <- read_counts_raw %>%
  left_join(read_counts_hv, by=c("sample")) %>%
  mutate(n_reads_hv = replace_na(n_reads_hv, 0)) %>%
  left_join(read_counts_preproc, by=c("sample")) %>%
  inner_join(libraries, by=c("sample")) %>%
  select(sample, n_reads_raw, n_reads_preproc, n_reads_hv) %>%
  mutate(n_samples = 1,
         p_reads_total = n_reads_hv/n_reads_raw,
         p_reads_preproc = n_reads_hv/n_reads_preproc)
read_counts_long <- read_counts %>%
  pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>%
  mutate(read_group = ifelse(read_group == "p_reads_total", "All reads", "Preprocessed reads"))

# Combine for display
read_counts_agg <- read_counts %>%
  mutate(p_reads_total = n_reads_hv/n_reads_raw,
         p_reads_preproc = n_reads_hv/n_reads_preproc) %>%
  inner_join(libraries, by=c("sample"))
read_counts_agg_long <- read_counts_agg %>%
  pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>%
  mutate(read_group = ifelse(read_group == "p_reads_total", "All reads (HV)", "Preprocessed reads (HV)")) 

# Visualize
g_read_counts <- ggplot(read_counts_agg_long, aes(x=library, y=p_reads, color = read_group, group = library)) +
  geom_point(size = 5) +
  geom_line(color = 'black') +
  scale_y_log10(name = "Unique human-viral read fraction", labels = label_log(digits=2), limits = c(1e-6, 1e-2)) +
  theme_kit + 
  coord_flip()
g_read_counts

4.2 Overall taxonomy and composition

Composition of HV reads was changed from when looking at all viral reads. The two dominant viruses we see are Anellovirdae and Microviridae (bacteriophage). The threshold for the label “other” are the set of families that make up less than 5% composition in all samples.

Code

threshold_major_family <- 0.05

# Count reads for each human-viral family
hv_family_counts <- hv_reads_family %>% 
  group_by(sample, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  group_by(sample) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))

# Identify high-ranking families and group others
hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% 
  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
hv_family_counts_major <- hv_family_counts %>%
  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
  group_by(sample, name_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
            .groups="drop") %>%
  mutate(name_display = factor(name_display, 
                               levels = c(hv_family_major_tab$name, "Other")))
hv_family_counts_display <- hv_family_counts_major %>%
  rename(p_reads = p_reads_hv, classification = name_display) %>%
  inner_join(libraries, by = 'sample')

# Plot
g_hv_family <- g_comp_base + 
  geom_col(data=hv_family_counts_display, position = "stack", width=1) +
  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_brewer(palette = 'Accent', name = "Viral family") +
  labs(title="Family composition of human-viral reads") +
  guides(fill=guide_legend(ncol=4)) +
  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
g_hv_family

Code

# Get most prominent families for text
hv_family_collate <- hv_family_counts %>%
  group_by(name, taxid) %>% 
  summarize(n_reads_tot = sum(n_reads_hv),
            p_reads_max = max(p_reads_hv), .groups="drop") %>% 
  arrange(desc(n_reads_tot))
hv_family_collate %>%
 select(name,taxid, n_reads_tot) %>%
 rename(
  'family' = 'name',
  '# of total reads' = 'n_reads_tot',
)

4.3 Analyzing specific families

We now investigate the composition of specific families that had more than 5 viral reads. In investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 1% of human-viral reads:

4.3.1 Anelloviridae (Number of reads: 12,804)

Code

# Anelloviridae
plot_viral_family_histogram(taxid_chosen=687329)

Code

# Anelloviridae
plot_viral_family_composition(taxid_chosen=687329, threshold_major_species = 0.01)

4.3.2 Microviridae (Number of reads: 5,194)

Code

# Microviridae
plot_viral_family_histogram(taxid_chosen=10841)

Code

# Microviridae
plot_viral_family_composition(taxid_chosen=10841, threshold_major_species = 0.01)

4.3.3 Sedreoviridae (Number of reads: 278)

Code

# Sedreoviridae
plot_viral_family_histogram(taxid_chosen=2946186)

Code

# Sedreoviridae
plot_viral_family_composition(taxid_chosen=2946186, threshold_major_species = 0.01)

4.3.4 Parvoviridae (Number of reads: 266)

Code

# Parvoviridae
plot_viral_family_histogram(taxid_chosen=10780)

Code

# Parvoviridae
plot_viral_family_composition(taxid_chosen=10780, threshold_major_species = 0.01)

4.3.5 Adenoviridae (Number of reads: 104)

Code

# Adenoviridae
plot_viral_family_histogram(taxid_chosen=10508)

None of these reads made up more than 1% of the total reads of any sample.

4.3.6 Polyomaviridae (Number of reads: 95)

Code

# Polyomaviridae
plot_viral_family_histogram(taxid_chosen=151341)

None of these reads made up more than 1% of the total reads of any sample.

4.3.7 Papillomaviridae (Number of reads: 16)

Code

# Polyomaviridae
plot_viral_family_histogram(taxid_chosen=151340)

None of these reads made up more than 1% of the total reads of any sample.

4.4 Relative abundance of pathogenic viruses of interest

Each dot represents a sample, colored by viral family. The x-axis shows the relative abundance of human-infecting viruses, and the y-axis shows the species.

Code

species_family <- result %>% select(species, family) %>% rename('name' = 'species')

play <- hv_species_counts %>% 
  ungroup() %>%
  inner_join(libraries, by = 'sample') %>%
  inner_join(species_family, by = 'name') %>%
  mutate(name = ifelse(name == "Severe acute respiratory syndrome-related coronavirus", "SARS-related coronavirus", name)) %>%
  filter(family %in% c("Sedoreoviridae", "Parvoviridae", "Adenoviridae", "Papillomaviridae", "Polyomaviridae", "Coronaviridae", "Caliciviridae", "Picornaviridae")) 

adjusted_play <- play %>% 
  group_by(name) %>%
  mutate(virus_prevalence_num = n_distinct(sample)/n_distinct(libraries),
         total_reads_hv = sum(n_reads_hv)) %>%
  ungroup() %>%
  mutate(name = fct_reorder(name, virus_prevalence_num, .desc=TRUE)) %>% 
  select(name, ra_reads_hv, family, virus_prevalence_num, total_reads_hv, n_reads_hv)

pal <- c(brewer.pal(8, 'Dark2'),brewer.pal(8, 'Accent'))

#, labels = label_log(digits=2)

ra_dot <- ggplot(adjusted_play, aes(x = ra_reads_hv, y=name)) +
  geom_quasirandom(orientation = 'y', aes(color = family)) +
  scale_color_manual(values = pal) + 
    scale_x_log10(
    "Relative abundance human virus reads",
    labels=label_log(digits=3)
  ) +
  labs(y =  "",
       color = 'Viral family') + 
  guides(color = guide_legend(ncol=2)) +
  theme_light() + 
    theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 12),
    axis.ticks.y = element_blank(),axis.line = element_line(colour = "black"),
    axis.title.x = element_text(size = 14),    
    legend.text = element_text(size = 10),
    legend.title = element_text(size = 10, face="bold"),
    #legend.position = 'bottom',
    legend.position = c(1, 1),  # Move legend to top right
    legend.justification = c(1, 1),  # Align legend to top right
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank())
ra_dot

Code

#  geom_text(aes(label = total_reads_hv, y = name), 
#            x = 1e-9, hjust = 0, vjust = 0.5, size = 3, 
#            check_overlap = TRUE)

I can then take all of the viruses that we found and look up what they’re responsible for and whether they’re dangerous.

Virus Name	Common Name	Pathogenic Potential
Rotavirus A	Rotavirus	High, causes severe diarrhea in children
Enterovirus C	Includes poliovirus and some coxsackieviruses	High, can cause various diseases including polio
Severe acute respiratory syndrome-related coronavirus	SARS coronavirus	High, causes severe respiratory illness
Sapporo virus	Sapporo virus or Sapovirus	Moderate, causes gastroenteritis
Human mastadenovirus F	Adenovirus F	Moderate, can cause gastroenteritis
Alphapolyomavirus quintihominis	Not widely known	Low, but some polyomaviruses can cause disease in immunocompromised individuals
Alphapapillomavirus 4	Human papillomavirus (HPV) type 4	Low to moderate, can cause warts

None of these viruses are too suprising to see other than the Severe acute respiratory syndrome-related coronavirus, which when BLASTED, comes up as SARS-CoV-2. We believe that this is due to contamination with the COVID-19 pandemic since these samples were taken pre-pandemic.

4.5 Relative abundance assuming perfect human read removal

Assuming we’re able to perfectly remove all human reads, the average relative abundance of known human infecting virus is \(1.87 \times 10^{-4}\).

5 Conclusion

There were some interesting takeways from this analysis:

Contamination with SARS-Cov-2 seems to be a common occurence and should be accounted for when interpretting these results.
Gasterointestinal diseases are picked up in plasma.
Sexually transmitted infections are picked up in plasma.
We pick up a decent amount of baceriophage (i.e. Microviridae).
A lot of the Microviridae and Anelloviridae are largely uncharacterized (denoted by the family name followed by the abbreviation “sp.”).

6 Appendix

6.1 Human-infecting virus families, genera, and species

To get a good overview of families, genera, and species, we can look at a Sankey plot where the magnitude of relative abundance, averaged over all samples, is shown in parentheses.

Code

# Function to create links
create_links <- function(data) {
  family_to_genus <- data %>%
    filter(!is.na(genus)) %>%
    group_by(family, genus) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = family, target = genus)
  
  genus_to_species <- data %>%
    group_by(genus, species) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = genus, target = species)

  family_to_species <- data %>%
    filter(is.na(genus)) %>%
    group_by(family, species) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = family, target = species)

  bind_rows(family_to_genus, genus_to_species, family_to_species) %>%
    filter(!is.na(source))
}

# Function to create nodes
create_nodes <- function(links) {
  data.frame(
    name = c(links$source, links$target) %>% unique()
  )
}

# Function to prepare data for Sankey diagram
prepare_sankey_data <- function(links, nodes) {
  links$IDsource <- match(links$source, nodes$name) - 1
  links$IDtarget <- match(links$target, nodes$name) - 1
  list(links = links, nodes = nodes)
}

# Function to create Sankey plot
create_sankey_plot <- function(sankey_data) {
  sankeyNetwork(
    Links = sankey_data$links, 
    Nodes = sankey_data$nodes,
    Source = "IDsource", 
    Target = "IDtarget",
    Value = "value", 
    NodeID = "name",
    sinksRight = TRUE,
    nodeWidth = 25,
    fontSize = 14,
  )
}

save_sankey_as_png <- function(sankey_plot, width = 1000, height = 800) {
  # Save the plot as an HTML file
  saveWidget(sankey_plot, sprintf('%s/sankey.html',data_dir))
}

format_scientific <- function(x, digits=2) {
  sapply(x, function(val) {
    if (is.na(val) || abs(val) < 1e-15) {
      return("0")
    } else {
      exponent <- floor(log10(abs(val)))
      coef <- val / 10^exponent
      #return(sprintf("%.1f × 10^%d", round(coef, digits), exponent))
      # May or may not be smart, just keeping magnitude
      return(sprintf("10^%d", exponent))
    }
  })
}

data <- result %>% 
  mutate(across(c(genus_n_reads_tot, genus_ra_reads_tot), ~replace_na(., 0)),
         genus = ifelse(is.na(genus), "Unknown Genus", genus)) %>%
  mutate(
  species = paste0(species, sprintf(' (%s)', format_scientific(species_ra_reads_tot))),
  genus = paste0(genus, sprintf(' (%s)', format_scientific(genus_ra_reads_tot))),
  family = paste0(family, sprintf(' (%s)', format_scientific(family_ra_reads_tot)))
)
links <- as.data.frame(create_links(data))
nodes <- create_nodes(links)
sankey_data <- prepare_sankey_data(links, nodes)
sankey <- create_sankey_plot(sankey_data)

sankey

---
title: "Workflow of Thijssen et al. (2023)"
subtitle: "Pooled plasma from Iran (DNA + RNA)"
author: "Harmon Bhasin"
date: 2024-07-22
format:
  html:
    toc: true # table of contents
    toc-title: "Table of contents" # table of contents title
    number-sections: true # number sections
    number-depth: 3 # number depth to show in table of contents
    toc-location: right # table of contents location
    page-layout: full # full page layout
    code-fold: true # Keep option to fold code (i.e. show or hide it; default: hide)
    code-tools: true # Code menu in the header of your document that provides various tools for readers to interact with the source code
    code-link: true # Enables hyper-linking of functions within code blocks to their online documentation
    df-print: paged # print data frame
    fig-format: svg
    other-links:
      - text: Paper
        href: https://doi.org/10.3390/v15071425
      - text: Data
        href: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA983534
    code-links:
      - text: Code for this post
        icon: file-code
        href: https://github.com/naobservatory/harmons-public-notebook/blob/main/notebooks/2024-07-22-thijssen.qmd
editor: 
  visual: true
  render-on-save: true
comments:
  hypothesis: true # hypothesis
execute: 
  freeze: auto
  cache: true
title-block-banner: "#de2d26"
---

quarto-executable-code-5450563D

```r
#| label: load-packages
#| include: false
 
library(tidyverse)
library(cowplot)
library(patchwork)
library(fastqcr)
library(RColorBrewer)
library(networkD3)
library(readxl)
library(htmlwidgets)
library(ggbeeswarm)
library(latex2exp)
library(languageserver)
library(scales)
library(ggpubr)

head_dir <- "/Users/harmonbhasin/work/securebio/"
source(sprintf("%s/sampling-strategies/scripts/aux_plot-theme.R", head_dir))
image_dir <- sprintf('%s/lab_meeting/thijssen', head_dir)
theme_base <- theme_base + theme(aspect.ratio = NULL)
theme_kit <- theme_base + theme(
  axis.text.x = element_text(hjust = 1, angle = 45),
  axis.title.x = element_blank(),
)
tnl <- theme(legend.position = "none")
```

quarto-executable-code-5450563D

```r
#| label: set-up-paths
#| include: false

# Data input paths
data_dir <- "/Users/harmonbhasin/work/securebio/nao-harmon/thijssen2023/analysis/"
input_dir <- file.path(data_dir, "input")
results_dir <- file.path(data_dir, "results")
qc_dir <- file.path(results_dir, "qc")
hv_dir <- file.path(results_dir, "hv")
libraries_path <- file.path(input_dir, "libraries.csv")
basic_stats_path <- file.path(qc_dir, "qc_basic_stats.tsv.gz")
adapter_stats_path <- file.path(qc_dir, "qc_adapter_stats.tsv.gz")
quality_base_stats_path <- file.path(qc_dir, "qc_quality_base_stats.tsv.gz")
quality_seq_stats_path <- file.path(qc_dir, "qc_quality_sequence_stats.tsv.gz")
```

This is another potential study of [this series](https://data.securebio.org/harmons-public-notebook/notebooks/2024-07-08_cebria-mendoza.html). In this post, I analyze [Thijssen 2023](https://doi.org/10.3390/v15071425), a dataset with 20 pooled samples from 100 healthy individuals in Iran.

*This notebook uses the MGS Workflow v2.2.1 (note that this is old).*

# The raw data

## About

[This dataset](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA983534) comprises 20 samples, each derived from plasma pools of 5 people from Iran. In total, 100 healthy individuals contributed to these pools. For each pooled sample, a combined DNA and RNA library preparation was performed, resulting in a single sequencing output that captures both nucleic acid types. This approach provides comprehensive genetic information but precludes separate analysis of DNA and RNA from individual samples.
quarto-executable-code-5450563D

```r
#| warning: false
#| label: load-libraries

# Import libraries and extract metadata from sample names
libraries <- read_csv(libraries_path, show_col_types = FALSE)
meta_data <- read_csv('/Users/harmonbhasin/work/securebio/nao-harmon/thijssen2023/preprocessing/SraRunTable.txt') %>%
  select('Library Name', Run) %>% rename(sample = Run, library = 'Library Name')
libraries <- libraries %>%
  left_join(meta_data, by = 'sample') %>%
  select(sample, library=library.y) %>% 
  filter(library != 'PCL21') %>%
  mutate(library = factor(library, levels = c(paste0('PCL', 1:20))))
libraries

```

### Sample + library preparation 

Paraphrased from the [paper](https://www.mdpi.com/1999-4915/15/7/1425):

> Study was conducted from 2017-2018 in the Boushehr Province, Iran. Participants were recruited in one of the five transfusion clinics located in Boushehr. Seven milliliters of blood were collected from each individual during blood transfusion or donation. Immediately after collection, plasma was separated from the samples and stored at −70 °C. 
> 
> Initially, the samples were centrifuged and 100 µL of the supernatant was pooled with five samples per pool. The pooled plasma samples were subjected to an adapted version of the NetoVIR protocol for viral particle enrichment and metagenomic sequencing ([more information can be found here](https://www.mdpi.com/1999-4915/15/7/1425#B20-viruses-15-01425)). Pooled plasma samples were centrifuged for 3 min at 17,000× g and filtered through 0.8 µm polyether sulphone filters. To remove free-floating nucleic acids, the filtered samples were subjected to a nuclease treatment with a cocktail of 1 µL micrococcal nuclease and 2 µL benzonase for 2 h at 37 °C. Both viral DNA and RNA were extracted and randomly amplified (including primary step of reverse transcription) with the Whole Transcriptome Amplification 2 kit for 20 cycles. 
> 
> The amplification product was purified with the MSB SPIN PCRAPACE kit and prepared for sequencing by using the Nextera XT kit. DNA products were quantified with the Qubit fluorometer, and the High-Sensitivity DNA kit for the Bioanalyzer 2100 was used to determine the average library fragment size. Samples were pooled in equimolar ratios, and paired-end sequencing (2 × 150 bp) was performed on a NextSeq 500 Illumina platform with an average of 10 million reads per sample.

## Quality control metrics

quarto-executable-code-5450563D

```r
#| label: read-qc-data
#| include: false

# Import QC data
stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
basic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         sample = fct_inorder(sample))

adapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))

quality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))
  
quality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))
  
  
 # Filter to raw data
 basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
 adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat")
 quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat")
 quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat")
 
 
 # Get key values for readout
raw_read_counts <- basic_stats_raw %>% 
  summarize(rmin = min(n_read_pairs), 
            rmax=max(n_read_pairs),
            rmean=mean(n_read_pairs), 
            rtot = sum(n_read_pairs),
            btot = sum(n_bases_approx),
            dmin = min(percent_duplicates), 
            dmax=max(percent_duplicates),
            dmean=mean(percent_duplicates), 
            .groups = "drop")
```

In total, these 20 samples contained ~116M read pairs. The samples had 1.3M - 11.1M (mean ~5.8M) read pairs each. 

The number of reads looks pretty good, although a few samples have a much lower number of reads. The total number of base pairs also looks reasonable, matching the trends of the number of reads. The duplication rate is low. Adapter content is quite high for nextera-transposase-sequence ([library contamination](https://www.biostars.org/p/119774/)).

As we'd expect, we see higher quality Q scores near the beginning of the read and a gradual decline towards the end of the read, however all positions seem to have a Q score of 35 which means that our reads are \~99.97% accurate. When looking at the Q score over all sequences, we can see a sharp peak around 35, which corresponds to the previous plot, indicating high read quality. 

quarto-executable-code-5450563D

```r
#| fig-width: 9
#| warning: false
#| label: plot-basic-stats

# Prepare data
basic_stats_raw_metrics <- basic_stats_raw %>%
  select(library,
         `# Read pairs` = n_read_pairs,
         `Total base pairs\n(approx)` = n_bases_approx,
         `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
  pivot_longer(-library, names_to = "metric", values_to = "value") %>%
  mutate(metric = fct_inorder(metric)) 

# Set up plot templates

g_basic <- ggplot(basic_stats_raw_metrics, aes(x = library, y = value)) +
  geom_col(position = "dodge") +
  scale_x_discrete() +
  scale_y_continuous(expand = c(0, 0)) +
  expand_limits(y = c(0, 100)) +
  facet_grid(metric ~ ., scales = "free", space = "free_x", switch = "y") +
  theme_kit + theme(
    axis.title.y = element_blank(),
    strip.text.y = element_text(face = "plain")
  )
g_basic
```

quarto-executable-code-5450563D

```r
#| label: plot-raw-quality
#| fig-width: 8

# Set up plotting templates
g_qual_raw <- ggplot(mapping=aes(linetype=read_pair, group=interaction(sample,read_pair))) + 
  scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters_raw <- g_qual_raw + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
  scale_y_continuous(name="% Adapters", limits=c(0,NA),
                     breaks = seq(0,100,10), expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,500,20), expand=c(0,0)) +
  facet_grid(.~adapter)
g_adapters_raw

# Visualize quality
g_quality_base_raw <- g_qual_raw +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,500,20), expand=c(0,0))
g_quality_base_raw

g_quality_seq_raw <- g_qual_raw +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0))
g_quality_seq_raw
```

# Preprocessing

## Summary

The average fraction of reads at each stage in the preprocessing pipeline is shown in the following table. Adapter trimming & filtering doesn't lose too many reads. Deduplication loses us about 30% of reads on average, then ribodepletion only loses as about 0.7% on average.

quarto-executable-code-5450563D

```r
#| label: preproc-table

# TODO: Group by pool size as well
# Count read losses
n_reads_rel <- basic_stats %>% 
  select(sample, stage, percent_duplicates, n_read_pairs) %>%
  group_by(sample) %>% 
  arrange(sample, stage) %>%
  mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),
         p_reads_lost = 1 - p_reads_retained,
         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
         p_reads_lost_abs = 1-p_reads_retained_abs,
         p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs))
n_reads_rel_display <- n_reads_rel %>% 
  rename(Stage=stage) %>% 
  group_by(Stage) %>% 
  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% 
  filter(Stage != "raw_concat") %>%
  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
n_reads_rel_display
```

## Quality control metrics

Adapter trimming gets rid of nextera_transposase_sequenece and poly_g, although there is still a bit of poly_a adapter contamination ([overivew of fastqc adapters](https://github.com/s-andrews/FastQC/blob/master/Configuration/adapter_list.txt)). Not sure how to interpret the poly_a contamination, but it's not that high so I'm not going to worry about it for now. Q score remain the same during read cleaning when looking at the positions, with the end of the read actually improving in score. Q scores across all sequences look pretty much the same throughout cleaning.

quarto-executable-code-5450563D

```r
#| warning: false
#| label: plot-quality
#| fig-height: 8
#| fig-width: 10

g_qual <- ggplot(mapping=aes(linetype=read_pair, group=interaction(sample,read_pair))) + 
  scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters <- g_qual + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
  scale_y_continuous(name="% Adapters", limits=c(0,20),
                     breaks = seq(0,50,10), expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~adapter)
g_adapters

# Visualize quality
g_quality_base <- g_qual +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~.)
g_quality_base

g_quality_seq <- g_qual +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
  facet_grid(stage~., scales = "free_y")
g_quality_seq
```

These plots below show the trends from above in each sample. All samples tend to follow similar trends for deduplication, with a large decrease in read length post adapter trimming. Ribosomal reads were quite low, near 1% for every sample.

quarto-executable-code-5450563D

```r
#| label: preproc-figures
#| warning: false
#| fig-height: 3
#| fig-width: 6

g_stage_trace <- ggplot(basic_stats, aes(x=stage, group=sample)) +
  theme_kit

# Plot reads over preprocessing
g_reads_stages <- g_stage_trace +
  geom_line(aes(y=n_read_pairs)) +
  scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
g_reads_stages

# Plot relative read losses during preprocessing
g_reads_rel <- ggplot(n_reads_rel, 
                      aes(x=stage, group=sample)) +
  geom_line(aes(y=p_reads_lost_abs_marginal)) +
  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
                     labels = function(x) x*100) +
  theme_kit
g_reads_rel
```

quarto-executable-code-5450563D

```r
#| label: preproc-dedup
#| fig-width: 10

stage_dup <- basic_stats %>% group_by(stage) %>% 
  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
            dmean=mean(percent_duplicates), .groups = "drop")

g_dup_stages <- g_stage_trace +
  geom_line(aes(y=percent_duplicates)) +
  scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
g_dup_stages

g_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) +
  scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
g_readlen_stages
```

quarto-executable-code-5450563D

```r
#| label: ribo-frac
#| fig-width: 10
# Calculate reads lost during ribodepletion (approximation for % ribosomal reads)
reads_ribo <- n_reads_rel %>% 
  filter(stage %in% c("dedup", "ribo_secondary")) %>% 
  group_by(sample) %>% 
  summarize(p_reads_ribo=1-n_read_pairs[2]/n_read_pairs[1], .groups = "drop") %>%
  inner_join(libraries, by = 'sample')
reads_ribo_summ <- reads_ribo %>%
  group_by(sample) %>%
  summarize(min=min(p_reads_ribo), max=max(p_reads_ribo),
            mean=mean(p_reads_ribo), .groups = "drop") %>%
  inner_join(libraries, by = 'sample')
g_reads_ribo <- ggplot(reads_ribo, 
                       aes(x=library, y=p_reads_ribo)) +
  geom_point() + 
  scale_y_continuous(name="Approx % ribosomal reads", limits=c(0,1),
                     breaks=seq(0,1,0.2), expand=c(0,0), labels = function(y) y*100)+
  theme_kit
g_reads_ribo
```

# Taxonomic composition

## High-level composition

To assess the high-level composition of the reads, I ran the ribodepleted files through Kraken2 and summarized the results with Bracken.

The groups listed below were created by Will:

* Filtered (removed during cleaning)
* Duplicate (removed during deduplication)
* Ribosomal (removed during ribodepletion)
* Unassigned (non-ribosomal reads that were not assigned to any taxon by Kraken/Bracken)
* Bacterial (non-ribosomal reads assigned to the Bacteria domain by Kraken/Bracken)
* Archaeal (non-ribosomal reads assigned to the Archaea domain by Kraken/Bracken)
* Viral (non-ribosomal reads assigned to the Viruses domain by Kraken/Bracken)
* Human (non-ribosomal reads assigned to the Eukarya domain by Kraken/Bracken)


quarto-executable-code-5450563D

```r
#| label: prepare-composition
#| include: false

classifications <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned",
                     "Bacterial", "Archaeal", "Viral", "Human")

# Import composition data
tax_final_dir <- file.path(results_dir, "taxonomy_final")
comp_path <- file.path(tax_final_dir, "taxonomic_composition.tsv.gz")
comp <- read_tsv(comp_path) %>% left_join(libraries) %>% filter(library != "PCL21")
comp_minor <- comp %>% 
  filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))
comp_assigned <- comp %>%
  filter(! classification %in% c("Filtered", "Duplicate", 
                                 "Ribosomal", "Unassigned")) %>%
  group_by(sample) %>%
  mutate(p_reads = n_reads/sum(n_reads))
comp_assigned_minor <- comp_assigned %>% 
  filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))

# Summarize composition
read_comp_summ <- comp %>% 
  group_by(classification) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
  mutate(n_reads = replace_na(n_reads,0),
         p_reads = n_reads/sum(n_reads),
         pc_reads = p_reads*100) %>%
  mutate(classification = factor(classification, levels=classifications)) %>%
  select(classification, n_reads, pc_reads) %>%
  rename(`# of reads` = n_reads, "% of reads" = pc_reads) %>%
  mutate(`% of reads` = sprintf("%.2f", `% of reads`))
read_comp_summ

```

quarto-executable-code-5450563D

```r
#| label: plot-composition-all
#| fig-height: 7
#| fig-width: 8

# Prepare plotting templates
g_comp_base <- ggplot(mapping=aes(x=library, y=p_reads, fill=classification)) +
  scale_x_discrete(name="Plasma pool") +
  theme_kit + 
  theme(plot.title = element_text(hjust=0, face="plain", size=rel(1.5)))
scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
                                   expand = c(0,0), labels = function(y) y*100)
geom_comp <- purrr::partial(geom_col, position = "stack", width = 1)

# Define a color palette for the classification
classification_colors <- brewer.pal(8, "Accent")
names(classification_colors) <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned", "Bacterial", "Archaeal", "Viral", "Human")
scale_fill_classification <- function() {
  scale_fill_manual(values = classification_colors, name = "Classification")
}

# Plot overall composition
g_comp <- g_comp_base + geom_comp(data = comp) +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_classification() + 
  ggtitle("Read composition (all reads, all groups)")
g_comp

g_comp_assigned <- g_comp_base + 
  geom_comp(data = comp_assigned) +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_classification() + 
  ggtitle("Read composition (assigned reads, all groups)")
g_comp_assigned

```

## Total viral content

Total viral fraction average $2.36 \times 10^{-4}$ across samples. As a fraction of assigned (rather than total) reads, this jumped to $3.19 \times 10^{-3}$:

quarto-executable-code-5450563D

```r
#| label: p-viral
p_reads_viral_all <- comp %>% filter(classification == "Viral") %>%
  mutate(read_group = "All reads")
p_reads_viral_assigned <- comp_assigned %>% filter(classification == "Viral") %>%
  mutate(read_group = "Classified reads")
p_reads_viral <- bind_rows(p_reads_viral_all, p_reads_viral_assigned)

# Plot
g_viral <- ggplot(p_reads_viral, aes(x=library, y=p_reads, color = read_group, group = library)) +
  geom_point(size = 5) +
  geom_line(color = 'black') +
  scale_x_discrete(name="Plasma pool") +
  scale_y_log10(name="Viral read fraction", labels = label_log(digits=2), limits = c(1e-6, 1e-2)) +
  theme_kit + 
  coord_flip()

g_viral

```

## Taxonomic composition of viruses

The two dominant viruses we see are Anellovirdae and Phycodnaviridae. The threshold for the label "other" are the set of families that make up less than 5% composition in all samples.

quarto-executable-code-5450563D

```r
#| label: extract-viral-taxa
#| include: false

# Get viral taxonomy
viral_taxa_path <- file.path(data_dir, "total-virus-db.tsv.gz")
viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)

# Get Kraken reports
reports_path <- file.path(tax_final_dir, "kraken_reports.tsv.gz")
reports <- read_tsv(reports_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% arrange(sample)

# Filter to viral taxa
kraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>%
  group_by(sample) %>%
  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
kraken_reports_viral_cleaned <- kraken_reports_viral %>%
  select(-pc_reads_total, -n_reads_direct, -contains("minimizers")) %>%
  select(name, taxid, p_reads_viral, n_reads_clade, everything()) %>% ungroup

viral_classes <- kraken_reports_viral_cleaned %>% filter(rank == "C")
viral_families <- kraken_reports_viral_cleaned %>% filter(rank == "F")

```

quarto-executable-code-5450563D

```r
#| label: viral-family-composition
#| fig-width: 10

major_threshold <- 0.05

# Identify major viral families
viral_families_major_tab <- viral_families %>% 
  group_by(name, taxid) %>%
  summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
  filter(p_reads_viral_max >= major_threshold)
viral_families_major_list <- viral_families_major_tab %>% pull(name)
viral_families_major <- viral_families %>% 
  filter(name %in% viral_families_major_list) %>%
  select(name, taxid, sample, p_reads_viral)
viral_families_minor <- viral_families_major %>% 
  group_by(sample) %>%
  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
  mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
  select(name, taxid, sample, p_reads_viral)
viral_families_display <- viral_families_major %>% 
  bind_rows(viral_families_minor) %>%
  arrange(desc(p_reads_viral)) %>% 
  mutate(name = factor(name, levels=c(viral_families_major_list, "Other"))) %>%
  rename(p_reads = p_reads_viral, classification=name) %>%
  inner_join(libraries, by='sample')

# Plot
g_families <- g_comp_base + 
  geom_comp(data=viral_families_display) +
  scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_brewer(palette = 'Accent')
g_families
```

# Human-infecting virus reads

## Overall relative abundance

I calculated the relative abundance of human-infecting viruses in two ways:

-   First, as the total number of deduplicated human-virus reads in each sample, divided by the number of raw reads ("All reads"). This results in a viral fraction of $1.64 \times 10^{-4}$ across all samples

-   Second, as a fraction of preprocessed (cleaned, deduplicated, computationally ribodepleted) reads ("Preprocessed reads"). This results in a viral fraction of $2.42 \times 10^{-4}$ across all samples.

quarto-executable-code-5450563D

```r
#| label: prepare-hv
#| include: false

# Import and format reads
hv_reads_path <- file.path(hv_dir, "hv_hits_putative_collapsed.tsv.gz")
mrg_hv <- read_tsv(hv_reads_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% arrange(sample) %>%
  mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV assignment",
                               "No Kraken2 assignment")) %>%
  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev),
         highscore = adj_score_max >= 20,
         hv_status = assigned_hv | highscore) %>%
  rename(taxid_all = taxid, taxid = taxid_best)
```

quarto-executable-code-5450563D

```r
#| label: count-hv-reads
#| fig-width: 10
#| warning: false
# Get raw read counts
read_counts_raw <- filter(basic_stats_raw) %>%
  select(sample, n_reads_raw = n_read_pairs)
read_counts_preproc <- basic_stats %>% filter(stage == "ribo_initial") %>%
  select(sample, n_reads_preproc = n_read_pairs)

# Get HV read counts
read_counts_hv <- mrg_hv %>% filter(hv_status) %>% 
  group_by(sample) %>% 
  count(name="n_reads_hv")
read_counts <- read_counts_raw %>%
  left_join(read_counts_hv, by=c("sample")) %>%
  mutate(n_reads_hv = replace_na(n_reads_hv, 0)) %>%
  left_join(read_counts_preproc, by=c("sample")) %>%
  inner_join(libraries, by=c("sample")) %>%
  select(sample, n_reads_raw, n_reads_preproc, n_reads_hv) %>%
  mutate(n_samples = 1,
         p_reads_total = n_reads_hv/n_reads_raw,
         p_reads_preproc = n_reads_hv/n_reads_preproc)
read_counts_long <- read_counts %>%
  pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>%
  mutate(read_group = ifelse(read_group == "p_reads_total", "All reads", "Preprocessed reads"))

# Combine for display
read_counts_agg <- read_counts %>%
  mutate(p_reads_total = n_reads_hv/n_reads_raw,
         p_reads_preproc = n_reads_hv/n_reads_preproc) %>%
  inner_join(libraries, by=c("sample"))
read_counts_agg_long <- read_counts_agg %>%
  pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>%
  mutate(read_group = ifelse(read_group == "p_reads_total", "All reads (HV)", "Preprocessed reads (HV)")) 

# Visualize
g_read_counts <- ggplot(read_counts_agg_long, aes(x=library, y=p_reads, color = read_group, group = library)) +
  geom_point(size = 5) +
  geom_line(color = 'black') +
  scale_y_log10(name = "Unique human-viral read fraction", labels = label_log(digits=2), limits = c(1e-6, 1e-2)) +
  theme_kit + 
  coord_flip()
g_read_counts

```

quarto-executable-code-5450563D

```r
#| label: compare-virus-to-hv
#| fig-width: 10
#| include: false
combine_viruses <- rbind(p_reads_viral  %>% group_by(read_group) %>% summarize(mean=mean(p_reads)) %>% mutate(type = "Viral reads"), read_counts_agg_long  %>% group_by(read_group) %>% summarize(mean=mean(p_reads))%>% mutate(type = "HV reads"))
# Format the mean values in scientific notation
combine_viruses <- combine_viruses %>%
  mutate(mean_formatted = sprintf("%.2e", mean))

combine_viruses %>% select(type, read_group, mean=mean_formatted)

#g_read_counts +
#  geom_point(data = p_reads_viral, aes(x = library, y = p_reads, color = read_group), size = 5) +
#  geom_line(data = p_reads_viral, aes(x = library, y = p_reads, group = library), color = 'grey') +
#  scale_color_brewer(palette = 'Accent')

#ggarrange(g_viral, g_read_counts, ncol=2)
```

## Overall taxonomy and composition

Composition of HV reads was changed from when looking at all viral reads. The two dominant viruses we see are Anellovirdae and Microviridae (bacteriophage). The threshold for the label "other" are the set of families that make up less than 5% composition in all samples.

quarto-executable-code-5450563D

```r
#| label: raise-hv-taxa
#| include: false

# Filter samples and add viral taxa information
samples_keep <- read_counts %>% filter(n_reads_hv > 5) %>% pull(sample)
mrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by="taxid") 

# Discover viral species & genera for HV reads
raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){
  # Get higher ranks than search rank
  ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
  rank_match <- which.max(ranks == out_rank)
  high_ranks <- ranks[rank_match:length(ranks)]
  # Merge read DB and taxid DB
  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%
    left_join(taxid_db, by="taxid")
  # Extract sequences that are already at appropriate rank
  reads_rank <- filter(reads, rank == out_rank)
  # Drop sequences at a higher rank and return unclassified sequences
  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...
    # Promote read taxids and re-merge with taxid DB, then re-classify and filter
    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%
      select(-parent_taxid, -rank, -name) %>%
      left_join(taxid_db, by="taxid")
    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%
      bind_rows(reads_rank)
    reads_norank <- reads_remaining %>%
      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
  }
  # Finally, extract and append reads that were excluded during the process
  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)
  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%
    select(-parent_taxid, -rank, -name) %>%
    left_join(taxid_db, by="taxid")
  return(reads_out)
}
hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species")
hv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, "genus")
hv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, "family")
```

quarto-executable-code-5450563D

```r
#| label: hv-family
#| fig-width: 10
threshold_major_family <- 0.05

# Count reads for each human-viral family
hv_family_counts <- hv_reads_family %>% 
  group_by(sample, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  group_by(sample) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))

# Identify high-ranking families and group others
hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% 
  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
hv_family_counts_major <- hv_family_counts %>%
  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
  group_by(sample, name_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
            .groups="drop") %>%
  mutate(name_display = factor(name_display, 
                               levels = c(hv_family_major_tab$name, "Other")))
hv_family_counts_display <- hv_family_counts_major %>%
  rename(p_reads = p_reads_hv, classification = name_display) %>%
  inner_join(libraries, by = 'sample')

# Plot
g_hv_family <- g_comp_base + 
  geom_col(data=hv_family_counts_display, position = "stack", width=1) +
  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_brewer(palette = 'Accent', name = "Viral family") +
  labs(title="Family composition of human-viral reads") +
  guides(fill=guide_legend(ncol=4)) +
  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
g_hv_family

# Get most prominent families for text
hv_family_collate <- hv_family_counts %>%
  group_by(name, taxid) %>% 
  summarize(n_reads_tot = sum(n_reads_hv),
            p_reads_max = max(p_reads_hv), .groups="drop") %>% 
  arrange(desc(n_reads_tot))
hv_family_collate %>%
 select(name,taxid, n_reads_tot) %>%
 rename(
  'family' = 'name',
  '# of total reads' = 'n_reads_tot',
)
```

## Analyzing specific families

We now investigate the composition of specific families that had more than 5 viral reads. In investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 1% of human-viral reads:

quarto-executable-code-5450563D

```r
#| label: plot-viral-family-scripts
#| include: false

plot_viral_family_composition <- function(taxid_chosen, threshold_major_species = 0.1) {
  # Get set of chosen family reads
  family_samples <- hv_family_counts %>% filter(taxid == taxid_chosen) %>%
    filter(p_reads_hv >= 0.1) %>%
    pull(sample)
  family_ids <- hv_reads_family %>% 
    filter(taxid == taxid_chosen, sample %in% family_samples) %>%
    pull(seq_id)

  # Count reads for each species in the chosen family
  species_counts <- hv_reads_species %>%
    filter(seq_id %in% family_ids) %>%
    group_by(sample, name, taxid) %>%
    count(name = "n_reads_hv") %>%
    group_by(sample) %>%
    mutate(p_reads_family = n_reads_hv/sum(n_reads_hv))

  # Identify high-ranking species and group others
  species_major_tab <- species_counts %>% group_by(name) %>% 
    filter(p_reads_family == max(p_reads_family)) %>% 
    filter(row_number() == 1) %>%
    arrange(desc(p_reads_family)) %>% 
    filter(p_reads_family > threshold_major_species)
  species_counts_major <- species_counts %>%
    mutate(name_display = ifelse(name %in% species_major_tab$name, 
                                 name, "Other")) %>%
    group_by(sample, name_display) %>%
    summarize(n_reads_family = sum(n_reads_hv),
              p_reads_family = sum(p_reads_family), 
              .groups="drop") %>%
    mutate(name_display = factor(name_display, 
                                 levels = c(species_major_tab$name, "Other")))
  species_counts_display <- species_counts_major %>%
    rename(p_reads = p_reads_family, classification = name_display) %>%
    right_join(libraries, by = c('sample'))
    #mutate(p_reads = coalesce(p_reads, 0),
    #       classification = coalesce(classification, species_major_tab$name[1])) %>%
    #group_by(location) %>%
    #mutate(samples_with_reads = sum(p_reads > 0),
    #  label = paste0(location, " (", sprintf("%.1f%%", 100 * samples_with_reads / total_samples), ")")) %>%
    #ungroup()

  # Get family name for the title
  family_name <- viral_taxa %>% 
    filter(taxid == taxid_chosen) %>% 
    pull(name)

  species_count <- species_counts_display %>% select(classification) %>% distinct() %>% pull() %>% length()
  if (species_count <= 8) {
    pal <- brewer.pal(name = "Accent", n = species_count)
  } else {
    pal <- c(brewer.pal(name = "Paired", n = 12), brewer.pal(name = "Set3", n =12), brewer.pal(name = "Set1", n = 9))[1:species_count]
  }

  # Plot
  g_species <- g_comp_base + 
    geom_col(data=species_counts_display, position = "stack") +
    scale_y_continuous(name=paste0("% ", family_name, " reads"), limits=c(0,1.01), 
                       breaks = seq(0,1,0.2),
                       expand=c(0,0), labels = function(y) y*100) +
    scale_fill_manual(values = pal, name = "Viral species") +
    labs(title=paste0("Species composition of ", family_name, " reads")) +
    guides(fill=guide_legend(ncol=3)) +
    theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
  return(g_species)
}

plot_viral_family_histogram <- function(taxid_chosen) {
  # Filter data for the specified virus family
  family_data <- hv_family_counts %>%
    inner_join(libraries, by = c('sample')) %>% 
    filter(taxid == taxid_chosen)
  
  # Get family name for the title
  family_name <- viral_taxa %>% 
    filter(taxid == taxid_chosen) %>% 
    pull(name)
  
  # Create the histogram
  g_histogram <- ggplot(family_data, aes(x = library, y = n_reads_hv)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = scales::comma(n_reads_hv)), vjust = -0.5, size = 3) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_x_discrete(limits = libraries %>% pull(library)) +
    labs(
      title = paste("Distribution of", family_name, "Reads by Location"),
      x = "Location",
      y = "Number of reads"
    ) +
    scale_y_continuous(labels = scales::comma) +
    expand_limits(y = max(family_data$n_reads_hv) * 1.1)+
    theme_kit  # Expand y-axis to make room for labels
  return(g_histogram)
}


```

### Anelloviridae (Number of reads: 12,804)
quarto-executable-code-5450563D

```r
#| label: plot-anelloviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Anelloviridae
plot_viral_family_histogram(taxid_chosen=687329)
```
quarto-executable-code-5450563D

```r
#| label: plot-anelloviridae-composition
#| fig-width: 10
#| fig-height: 7.5
#| warning: false
# Anelloviridae
plot_viral_family_composition(taxid_chosen=687329, threshold_major_species = 0.01)
```

### Microviridae (Number of reads: 5,194)
quarto-executable-code-5450563D

```r
#| label: plot-microviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Microviridae
plot_viral_family_histogram(taxid_chosen=10841)
```
quarto-executable-code-5450563D

```r
#| label: plot-microviridae-composition
#| fig-width: 10
#| fig-height: 7.5
#| warning: false
# Microviridae
plot_viral_family_composition(taxid_chosen=10841, threshold_major_species = 0.01)
```

### Sedreoviridae (Number of reads: 278)
quarto-executable-code-5450563D

```r
#| label: plot-sedreoviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Sedreoviridae
plot_viral_family_histogram(taxid_chosen=2946186)
```
quarto-executable-code-5450563D

```r
#| label: plot-sedreoviridae-composition
#| fig-width: 10
#| fig-height: 7.5
#| warning: false
# Sedreoviridae
plot_viral_family_composition(taxid_chosen=2946186, threshold_major_species = 0.01)
```

### Parvoviridae (Number of reads: 266) 
quarto-executable-code-5450563D

```r
#| label: plot-parvoviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Parvoviridae
plot_viral_family_histogram(taxid_chosen=10780)
```

quarto-executable-code-5450563D

```r
#| label: plot-parvoviridae-composition
#| fig-width: 10
#| fig-height: 7.5
#| warning: false
# Parvoviridae
plot_viral_family_composition(taxid_chosen=10780, threshold_major_species = 0.01)
```

### Adenoviridae (Number of reads: 104) 
quarto-executable-code-5450563D

```r
#| label: plot-adenoviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Adenoviridae
plot_viral_family_histogram(taxid_chosen=10508)
```

None of these reads made up more than 1% of the total reads of any sample.

### Polyomaviridae (Number of reads: 95) 
quarto-executable-code-5450563D

```r
#| label: plot-polyomaviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Polyomaviridae
plot_viral_family_histogram(taxid_chosen=151341)
```

None of these reads made up more than 1% of the total reads of any sample.

### Papillomaviridae (Number of reads: 16) 
quarto-executable-code-5450563D

```r
#| label: plot-papillomaviridae-histogram
#| fig-width: 10
#| fig-height: 2.5

# Polyomaviridae
plot_viral_family_histogram(taxid_chosen=151340)
```

None of these reads made up more than 1% of the total reads of any sample.

## Relative abundance of pathogenic viruses of interest

quarto-executable-code-5450563D

```r
#| label: hv-composition-exclusion
#| include: false

total <- basic_stats_raw %>% select(sample, n_read_pairs)

hv_family_counts <- hv_reads_family %>% 
  group_by(sample, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  inner_join(total, by='sample') %>%
  group_by(sample) %>%
  mutate(ra_reads_hv = n_reads_hv/n_read_pairs)

hv_family_collate <- hv_family_counts %>%
  group_by(name, taxid) %>% 
  summarize(n_reads_tot = sum(n_reads_hv),
            ra_reads_tot = mean(ra_reads_hv), .groups="drop") %>% 
  arrange(desc(n_reads_tot))

hv_family_collate <- hv_family_collate %>% 
rename(
'family' = 'name',
'family_n_reads_tot' = 'n_reads_tot',
'family_ra_reads_tot' = 'ra_reads_tot',
)

hv_genus_counts <- hv_reads_genus %>% 
  group_by(sample, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  inner_join(total, by='sample') %>%
  group_by(sample) %>%
  mutate(ra_reads_hv = n_reads_hv/n_read_pairs)

hv_genus_collate <- hv_genus_counts %>%
  group_by(name, taxid) %>% 
  summarize(n_reads_tot = sum(n_reads_hv),
            ra_reads_tot = mean(ra_reads_hv), .groups="drop") %>% 
  arrange(desc(n_reads_tot))

hv_genus_collate <- hv_genus_collate %>% 
rename(
'genus' = 'name',
'genus_n_reads_tot' = 'n_reads_tot',
'genus_ra_reads_tot' = 'ra_reads_tot',
)

hv_species_counts <- hv_reads_species %>% 
  group_by(sample, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  inner_join(total, by='sample') %>%
  group_by(sample) %>%
  mutate(ra_reads_hv = n_reads_hv/n_read_pairs)

hv_species_collate <- hv_species_counts %>%
  group_by(name, taxid) %>% 
  summarize(n_reads_tot = sum(n_reads_hv),
            ra_reads_tot = mean(ra_reads_hv), .groups="drop") %>% 
  arrange(desc(n_reads_tot))

hv_species_collate <- hv_species_collate %>% 
rename(
'species' = 'name',
'species_n_reads_tot' = 'n_reads_tot',
'species_ra_reads_tot' = 'ra_reads_tot',
)

```

quarto-executable-code-5450563D

```r
#| label: hv-filter-exclusion
#| include: false

# Assuming the data is already loaded into tibbles:
#hv_family_collate
#hv_genus_collate
#hv_species_collate
#viral_taxa

# Function to find parent taxa recursively
find_parent_taxa <- function(taxid, target_ranks, taxa_data) {
 current_taxid <- taxid
  result <- setNames(vector("list", length(target_ranks)), target_ranks)
  while (TRUE) {
    row <- taxa_data %>% filter(taxid == current_taxid) %>% slice(1)
    if (nrow(row) == 0) break  # If no matching taxid is found, break the loop
    if (row$rank %in% target_ranks) {
      result[[row$rank]] <- list(taxid = row$taxid, name = row$name)
      if (all(!sapply(result, is.null))) {
        break  # If all target ranks are found, break the loop
      }
    }
    if (row$taxid == row$parent_taxid) {
      break  # Avoid infinite loop at root
    }
    current_taxid <- row$parent_taxid
  }
  # Replace NULL with NA for any unfound ranks
  result <- lapply(result, function(x) if (is.null(x)) list(taxid = NA, name = NA) else x)
  return(result)
}

# Find all relevant parent taxa for each species
result <- hv_species_collate %>%
  mutate(
    parent_info = map(taxid, ~find_parent_taxa(.x, c("species","subgenus", "genus", "subfamily", "family"), viral_taxa)),
    genus_taxid = map_dbl(parent_info, ~.x$genus$taxid),
    genus = map_chr(parent_info, ~.x$genus$name),
    family_taxid = map_dbl(parent_info, ~.x$family$taxid),
    family = map_chr(parent_info, ~.x$family$name)
  ) %>%
  select(-parent_info)

# Join with genus counts
result <- result %>%
  left_join(
    hv_genus_collate %>% select(taxid, genus_n_reads_tot, genus_ra_reads_tot),
    by = c("genus_taxid" = "taxid")
  )

# Join with family counts
result <- result %>%
  left_join(
    hv_family_collate %>% select(taxid, family_n_reads_tot, family_ra_reads_tot),
    by = c("family_taxid" = "taxid")
  )

# Clean up
result <- result %>%
  select(-ends_with("_taxid"))

#result
```

Each dot represents a sample, colored by viral family. The x-axis shows the relative abundance of human-infecting viruses, and the y-axis shows the species.
quarto-executable-code-5450563D

```r
#| label: plot-pathogenic-viruses
#| fig-width: 10
#| fig-height: 10
#| warning: false

species_family <- result %>% select(species, family) %>% rename('name' = 'species')

play <- hv_species_counts %>% 
  ungroup() %>%
  inner_join(libraries, by = 'sample') %>%
  inner_join(species_family, by = 'name') %>%
  mutate(name = ifelse(name == "Severe acute respiratory syndrome-related coronavirus", "SARS-related coronavirus", name)) %>%
  filter(family %in% c("Sedoreoviridae", "Parvoviridae", "Adenoviridae", "Papillomaviridae", "Polyomaviridae", "Coronaviridae", "Caliciviridae", "Picornaviridae")) 

adjusted_play <- play %>% 
  group_by(name) %>%
  mutate(virus_prevalence_num = n_distinct(sample)/n_distinct(libraries),
         total_reads_hv = sum(n_reads_hv)) %>%
  ungroup() %>%
  mutate(name = fct_reorder(name, virus_prevalence_num, .desc=TRUE)) %>% 
  select(name, ra_reads_hv, family, virus_prevalence_num, total_reads_hv, n_reads_hv)

pal <- c(brewer.pal(8, 'Dark2'),brewer.pal(8, 'Accent'))

#, labels = label_log(digits=2)

ra_dot <- ggplot(adjusted_play, aes(x = ra_reads_hv, y=name)) +
  geom_quasirandom(orientation = 'y', aes(color = family)) +
  scale_color_manual(values = pal) + 
    scale_x_log10(
    "Relative abundance human virus reads",
    labels=label_log(digits=3)
  ) +
  labs(y =  "",
       color = 'Viral family') + 
  guides(color = guide_legend(ncol=2)) +
  theme_light() + 
    theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 12),
    axis.ticks.y = element_blank(),axis.line = element_line(colour = "black"),
    axis.title.x = element_text(size = 14),    
    legend.text = element_text(size = 10),
    legend.title = element_text(size = 10, face="bold"),
    #legend.position = 'bottom',
    legend.position = c(1, 1),  # Move legend to top right
    legend.justification = c(1, 1),  # Align legend to top right
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank())
ra_dot
#  geom_text(aes(label = total_reads_hv, y = name), 
#            x = 1e-9, hjust = 0, vjust = 0.5, size = 3, 
#            check_overlap = TRUE)
```

I can then take all of the viruses that we found and look up what they're responsible for and whether they're dangerous.

| Virus Name                                            | Common Name                                   | Pathogenic Potential                                                            |
|-------------------------------------------------------|-----------------------------------------------|---------------------------------------------------------------------------------|
| Rotavirus A                                           | Rotavirus                                     | High, causes severe diarrhea in children                                        |
| Enterovirus C                                         | Includes poliovirus and some coxsackieviruses | High, can cause various diseases including polio                                |
| Severe acute respiratory syndrome-related coronavirus | SARS coronavirus                              | High, causes severe respiratory illness                                         |
| Sapporo virus                                         | Sapporo virus or Sapovirus                    | Moderate, causes gastroenteritis                                                |
| Human mastadenovirus F                                | Adenovirus F                                  | Moderate, can cause gastroenteritis                                             |
| Alphapolyomavirus quintihominis                       | Not widely known                              | Low, but some polyomaviruses can cause disease in immunocompromised individuals |
| Alphapapillomavirus 4                                 | Human papillomavirus (HPV) type 4             | Low to moderate, can cause warts                                                |

None of these viruses are too suprising to see other than the Severe acute respiratory syndrome-related coronavirus, which when BLASTED, comes up as SARS-CoV-2. We believe that this is due to contamination with the COVID-19 pandemic since these samples were taken pre-pandemic.

## Relative abundance assuming perfect human read removal
quarto-executable-code-5450563D

```r
#| label: perfect-hv-removal
#| include: false


human_reads <- comp_assigned %>% filter(classification %in% c("Human")) %>% select(sample, human_reads=n_reads)

all <- hv_family_counts %>% 
  inner_join(human_reads, by='sample') %>%
  group_by(sample) %>%
  summarize(sum_reads=sum(n_reads_hv), n_read_pairs=first(n_read_pairs) - first(human_reads)) %>%
  mutate(ra = sum_reads/n_read_pairs) %>% 
  pull(ra) %>% mean()

all
```

Assuming we're able to perfectly remove all human reads, the average relative abundance of known human infecting virus is $1.87 \times 10^{-4}$.

# Conclusion
There were some interesting takeways from this analysis:

1. Contamination with SARS-Cov-2 seems to be a common occurence and should be accounted for when interpretting these results.
2. Gasterointestinal diseases are picked up in plasma.
3. Sexually transmitted infections are picked up in plasma.
4. We pick up a decent amount of baceriophage (i.e. Microviridae).
5. A lot of the Microviridae and Anelloviridae are largely uncharacterized (denoted by the family name followed by the abbreviation "sp.").

# Appendix

## Human-infecting virus families, genera, and species

To get a good overview of families, genera, and species, we can look at a Sankey plot where the magnitude of relative abundance, averaged over all samples, is shown in parentheses.

quarto-executable-code-5450563D

```r
#| label: hv-sankey
#| fig-height: 22.5
#| fig-width: 10
#| warning: false

# Function to create links
create_links <- function(data) {
  family_to_genus <- data %>%
    filter(!is.na(genus)) %>%
    group_by(family, genus) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = family, target = genus)
  
  genus_to_species <- data %>%
    group_by(genus, species) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = genus, target = species)

  family_to_species <- data %>%
    filter(is.na(genus)) %>%
    group_by(family, species) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = family, target = species)

  bind_rows(family_to_genus, genus_to_species, family_to_species) %>%
    filter(!is.na(source))
}

# Function to create nodes
create_nodes <- function(links) {
  data.frame(
    name = c(links$source, links$target) %>% unique()
  )
}

# Function to prepare data for Sankey diagram
prepare_sankey_data <- function(links, nodes) {
  links$IDsource <- match(links$source, nodes$name) - 1
  links$IDtarget <- match(links$target, nodes$name) - 1
  list(links = links, nodes = nodes)
}

# Function to create Sankey plot
create_sankey_plot <- function(sankey_data) {
  sankeyNetwork(
    Links = sankey_data$links, 
    Nodes = sankey_data$nodes,
    Source = "IDsource", 
    Target = "IDtarget",
    Value = "value", 
    NodeID = "name",
    sinksRight = TRUE,
    nodeWidth = 25,
    fontSize = 14,
  )
}

save_sankey_as_png <- function(sankey_plot, width = 1000, height = 800) {
  # Save the plot as an HTML file
  saveWidget(sankey_plot, sprintf('%s/sankey.html',data_dir))
}

format_scientific <- function(x, digits=2) {
  sapply(x, function(val) {
    if (is.na(val) || abs(val) < 1e-15) {
      return("0")
    } else {
      exponent <- floor(log10(abs(val)))
      coef <- val / 10^exponent
      #return(sprintf("%.1f × 10^%d", round(coef, digits), exponent))
      # May or may not be smart, just keeping magnitude
      return(sprintf("10^%d", exponent))
    }
  })
}

data <- result %>% 
  mutate(across(c(genus_n_reads_tot, genus_ra_reads_tot), ~replace_na(., 0)),
         genus = ifelse(is.na(genus), "Unknown Genus", genus)) %>%
  mutate(
  species = paste0(species, sprintf(' (%s)', format_scientific(species_ra_reads_tot))),
  genus = paste0(genus, sprintf(' (%s)', format_scientific(genus_ra_reads_tot))),
  family = paste0(family, sprintf(' (%s)', format_scientific(family_ra_reads_tot)))
)
links <- as.data.frame(create_links(data))
nodes <- create_nodes(links)
sankey_data <- prepare_sankey_data(links, nodes)
sankey <- create_sankey_plot(sankey_data)

sankey
```