This one of the studies that we hope to discuss in our third blog post, which will cover the metagenomic analysis of whole blood/plasma. In this notebook, I analyze Mengyi 2023, a dataset from China with 201 pools of plasma, where each pool contains 160 samples (we don’t know whether 1 sample = 1 donation) from 7 different locations, for a total of 10,720 samples.

I’d like to thank Lenni for giving me feedback on the notebook, Will for providing me with a boilerplate rmarkdown file, and Simon for giving me feedback on my figures. This notebook uses the MGS Workflow v2.2.1 (note that this is old).

1 Raw data

1.1 About

This dataset from China has 201 pools of plasma, where each pool contains 160 samples from 7 different locations between 2012-2018, for a total of 10,720 samples. This paper did not discuss the number of individuals that contributed to the samples, but we’re going to attempt to contact the authors to get this information (this shouldn’t hold up any further analysis, but would be good to have this information). They did DNA-sequencing for each pool, with Illumina HiSeq 4500, producing 2x150 bp reads.

Code

# Import libraries and extract metadata from sample names
libraries <- read_csv(libraries_path, show_col_types = FALSE)
metadata <- read_tsv(sprintf('%s/metadata.tsv', data_dir), show_col_types = FALSE) %>%
  select('sample'=`Library`, `library`=`Sample title`, 'Geographic location') %>%
  mutate('location' = str_split(`Geographic location`, ":", simplify = TRUE)[, ncol(str_split(`Geographic location`, ":", simplify = TRUE))])
libraries <- libraries %>%
  left_join(metadata, by = 'sample') %>%
  select(sample, library=library.y, location) %>%
  mutate(library = factor(library))

libraries %>% group_by(location) %>% summarize(n_pools = n())

1.1.1 Sample + library preparation

The following excerpt from the paper describes the sample and library preparation process. It’s crucial to note that while blood samples were initially collected from volunteers, these samples were converted to plasma through ultracentrifugation prior to sequencing:

From January 1, 2012, to December 31, 2018, a total of 10,720 blood samples of 10 ml each were randomly selected from voluntary blood donors in 7 regions. The blood samples taken from various places were mixed in units of 160 (each 100 μl) for ultracentrifugation (32,000 rpm, 120 min, maximum centrifugal radius of 91.9 mm). Afterward, we rinsed and resuspended the precipitate with 500 μl PBS.

The pooled suspensions were subjected to extraction of total DNA using QIAamp® DNA Blood mini Kit (QIAGEN Cat. NO.160019269, Frankfurt, Germany), DNA concentration was measured by Equalbit® 1 × dsDNA HS Assay Kit (Vazyme Cat. NO. 7E302K9, Nanjing, China).

The metagenomic library was constructed using KAPA HyperPlus Kit (KAPA Cat. NO. 0000097583, Boston, USA) with dual-indexed Adapters (KAPA Cat. NO. 0000093370, Boston, USA), the DNA was fragmented to 250 bp approximately by the enzyme at 37 °C for 20 min, after end repair and A-tailing, adapter ligation, post-ligation cleanup, library amplification, and post-amplification cleanup, the library was constructed.

Agilent 2100 Bioanalyzer (Agilent Technologies, Beijing, China) was used for library quality control, and qualified DNA library was sent to the Novogene company to sequence in HiSeq 4500.

1.2 Quality control metrics

These 201 samples contained 3.4B read pairs. The samples had 8.2M - 30.1M (mean 17.1M) read pairs each. The number of read pairs and total bases look relatively evenly distributed across the locations. The duplication rate is also quite low, around ~10%. Adapter content is low. As we’d expect, we see higher quality Q scores near the beginning of the read and a gradual decline towards the end of the read, however all positions seem to have a Q score of 35 which means that our reads are ~99.97% accurate. When looking at the Q score over all sequences, we can see a sharp peak after 35, which closely follows the previous plot, indicating high read quality.

Code

# Prepare data
basic_stats_raw_metrics <- basic_stats_raw %>%
  select(library,
         location,
         `# Read pairs` = n_read_pairs,
         `Total base pairs\n(approx)` = n_bases_approx,
         `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
  pivot_longer(-(library:location), names_to = "metric", values_to = "value") %>%
  mutate(metric = fct_inorder(metric)) 

# Set up plot templates

g_basic <- ggplot(basic_stats_raw_metrics, aes(x = library, y = value, fill=location)) +
  geom_col(position = "dodge") +
  scale_x_discrete() +
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(expand = c(0, 0)) +
  expand_limits(y = c(0, 100)) +
  facet_grid(metric ~ ., scales = "free", space = "free_x", switch = "y") +
  theme_kit + theme(
    axis.title.y = element_blank(),
    strip.text.y = element_text(face = "plain"),
    axis.text.x = element_blank()
  )
g_basic

Code

# Set up plotting templates
g_qual_raw <- ggplot(mapping=aes(color = location, linetype=read_pair, group=interaction(sample,read_pair))) + 
  scale_linetype_discrete(name = "Read Pair") +
  scale_color_brewer(palette = "Dark2") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters_raw <- g_qual_raw + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
  scale_y_continuous(name="% Adapters", limits=c(0,5),
                     expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,500,20), expand=c(0,0)) +
  facet_grid(.~adapter)
g_adapters_raw

Code

# Visualize quality
g_quality_base_raw <- g_qual_raw +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,500,20), expand=c(0,0))
g_quality_base_raw

Code

g_quality_seq_raw <- g_qual_raw +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0))
g_quality_seq_raw

2 Preprocessing

2.1 Summary

The average fraction of reads at each stage in the preprocessing pipeline is shown in the following table. Reads lost during trimming and filtering approximately matches what we’d expect based on the raw adapter content. Deduplication loses us about 10% of reads which matched the amount of estimated duplicated reads by QC. Low ribodepletion is observed which makes sense because they only sequenced DNA.

Code

# TODO: Group by pool size as well
# Count read losses
n_reads_rel <- basic_stats %>% 
  select(sample,location, stage, percent_duplicates, n_read_pairs) %>%
  group_by(sample) %>% 
  arrange(sample, stage) %>%
  mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),
         p_reads_lost = 1 - p_reads_retained,
         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
         p_reads_lost_abs = 1-p_reads_retained_abs,
         p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs))
n_reads_rel_display <- n_reads_rel %>% 
  rename(Stage=stage) %>% 
  group_by(Stage) %>% 
  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% 
  filter(Stage != "raw_concat") %>%
  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
n_reads_rel_display

2.2 Quality control metrics

Trimming and cleaning gets rid of the Illumina adapter as well as polyg, and we see a decrease in polya. Q score remain the same during read cleaning when looking at the positions, with the end of the read actually improving in score. Q scores across all sequences look pretty much the same throughout cleaning.

Code

g_qual <- ggplot(mapping=aes(color = location, linetype=read_pair, group=interaction(sample,read_pair))) + 
  scale_linetype_discrete(name = "Read Pair") +
  scale_color_brewer(palette = "Dark2") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters <- g_qual + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
  scale_y_continuous(name="% Adapters", limits=c(0,5),
                     expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~adapter)
g_adapters

Code

# Visualize quality
g_quality_base <- g_qual +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~.)
g_quality_base

Code

g_quality_seq <- g_qual +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
  facet_grid(stage~., scales = "free_y")
g_quality_seq

Number of read pairs look reasonable, same with a large portion being lost during deduplication. At the end of deduplication it seems that only 5% of reads are duplicates which doesn’t seem too bad. Mean read length stays around 150 bp which is pretty good.

Code

g_stage_trace <- ggplot(basic_stats, aes(x=stage, group=sample, color = location)) +
  theme_kit +
  scale_color_brewer(palette = "Dark2")
# Plot reads over preprocessing
g_reads_stages <- g_stage_trace +
  geom_line(aes(y=n_read_pairs)) +
  scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
g_reads_stages

Code

# Plot relative read losses during preprocessing
g_reads_rel <- ggplot(n_reads_rel, 
                      aes(x=stage, group=sample, color = location)) +
  geom_line(aes(y=p_reads_lost_abs_marginal)) +
  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
                     labels = function(x) x*100) +
  scale_color_brewer(palette = "Dark2") +
  theme_kit
g_reads_rel

Code

stage_dup <- basic_stats %>% group_by(stage) %>% 
  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
            dmean=mean(percent_duplicates), .groups = "drop")

g_dup_stages <- g_stage_trace +
  geom_line(aes(y=percent_duplicates)) +
  scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
g_dup_stages

Code

g_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) +
  scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
g_readlen_stages

3 Taxonomic composition

3.1 High-level composition

To assess the high-level composition of the reads, I ran the files through Kraken2 and summarized the results with Bracken.

The groups listed below were created by Will:

Filtered (removed during cleaning)
Duplicate (removed during deduplication)
Ribosomal (removed during ribodepletion)
Unassigned (non-ribosomal reads that were not assigned to any taxon by Kraken/Bracken)
Bacterial (non-ribosomal reads assigned to the Bacteria domain by Kraken/Bracken)
Archaeal (non-ribosomal reads assigned to the Archaea domain by Kraken/Bracken)
Viral (non-ribosomal reads assigned to the Viruses domain by Kraken/Bracken)
Human (non-ribosomal reads assigned to the Eukarya domain by Kraken/Bracken)

Human reads make up a large proportion of all reads at 82% total composition, whereas viruses only account for 0.01% of all reads.

Code

classifications <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned",
                     "Bacterial", "Archaeal", "Viral", "Human")

# Import composition data
tax_final_dir <- file.path(results_dir, "taxonomy_final")
comp <- read_tsv(sprintf("%s/taxonomic_composition.tsv.gz", tax_final_dir)) %>% 
  inner_join(libraries, by='sample')

comp_minor <- comp %>% 
  filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))
comp_assigned <- comp %>%
  filter(! classification %in% c("Filtered", "Duplicate", 
                                 "Ribosomal", "Unassigned")) %>%
  group_by(sample) %>%
  mutate(p_reads = n_reads/sum(n_reads))
comp_assigned_minor <- comp_assigned %>% 
  filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))

# Summarize composition
read_comp_summ <- comp %>% 
  group_by(classification) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
  mutate(n_reads = replace_na(n_reads,0),
         p_reads = n_reads/sum(n_reads),
         pc_reads = p_reads*100) %>%
  mutate(classification = factor(classification, levels=classifications)) %>%
  select(classification, n_reads, pc_reads) %>%
  rename(`# of reads` = n_reads, "% of reads" = pc_reads) %>%
  mutate(`% of reads` = sprintf("%.2f", `% of reads`))
read_comp_summ

Code

# Prepare plotting templates
g_comp_base <- ggplot(mapping=aes(x=library, y=p_reads, fill=classification)) +
  scale_x_discrete(name="Plasma pool") +
  theme_kit + 
  theme(plot.title = element_text(hjust=0, face="plain", size=rel(1.5)),
        axis.text.x = element_blank())

scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
                                   expand = c(0,0), labels = function(y) y*100)
geom_comp <- purrr::partial(geom_col, position = "stack", width = 1)

# Define a color palette for the classification
classification_colors <- brewer.pal(8, "Accent")
names(classification_colors) <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned", "Bacterial", "Archaeal", "Viral", "Human")
scale_fill_classification <- function() {
  scale_fill_manual(values = classification_colors, name = "Classification")
}

# Plot overall composition
g_comp <- g_comp_base + geom_comp(data = comp) +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_classification() + 
  ggtitle("Read composition (all reads, all groups)")
g_comp

Code

# Create a faceted plot of composition by location
#g_comp_by_location <- g_comp_base +
#  geom_comp(data = comp) +
#  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
#  scale_fill_classification() +
#  facet_wrap(~ location, scales = "free_x", ncol = 2) +
#  theme(axis.text.x = element_blank(),  # Remove x-axis labels
#        axis.ticks.x = element_blank()) +  # Remove x-axis ticks
#  ggtitle("Read composition (all reads, all groups)")

# Display the plot
#g_comp_by_location

# Repeat for classified reads only
g_comp_assigned <- g_comp_base + 
  geom_comp(data = comp_assigned) +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_classification() + 
  ggtitle("Read composition (assigned reads, all groups)")
g_comp_assigned

Code

# Plot composition of minor components
#g_comp_minor <- g_comp_base + 
#  geom_comp(data=comp_minor) +
#  scale_y_pc_reads() +
#  scale_fill_classification() + 
#  ggtitle("Read composition (all reads, minor groups)")
#g_comp_minor
#g_comp_assigned_minor <- g_comp_base + 
#  geom_comp(data=comp_assigned_minor) +
#  scale_y_pc_reads() +
#  scale_fill_classification() + 
#  ggtitle("Read composition (assigned reads, minor groups)")
#g_comp_assigned_minor

3.2 Total viral content

Total viral fraction average $1.03 \times 10^{-4}$ across samples. As a fraction of assigned (rather than total) reads, this jumped to $1.28 \times 10^{-4}$. The increase in viral fraction from assigned reads to total reads was constant across all samples, however, we can see that the viral fractions differed by location. Note that one of the pools from “Heilongjiang,Mudanjiang” had no viral reads.

Code

#p_reads_viral_all <- comp %>% filter(classification == "Viral") %>%
#  mutate(read_group = "All reads")
p_reads_viral_assigned <- comp_assigned %>% filter(classification == "Viral") %>%
  mutate(read_group = "Classified reads")
#p_reads_viral <- bind_rows(p_reads_viral_all, p_reads_viral_assigned)

location_viral <- comp_assigned %>% filter(classification == "Viral" ) %>% group_by(location) 

# Plot
# g_viral <- ggplot(p_reads_viral, aes(x=p_reads, y=library, color=location, shape=read_group)) +
#   geom_point() +
#   scale_y_discrete(name="Plasma pool") +
#   scale_x_log10(name="Viral read fraction") +
#   scale_color_brewer(palette = "Dark2") +
#   #facet_grid(.~read_group, scales = "free") +
#   guides(color=guide_legend(nrow=2), shape=guide_legend(nrow=2),
#          linetype=guide_legend(nrow=2)) +
#   theme_kit

#ggplot(location_viral, aes(x=p_reads, y=location, color = location)) +
#  geom_quasirandom(size=2) +
#  scale_x_log10(name="Viral read fraction", labels = label_log(digits=2)) +
#  scale_y_discrete(name="Location") +
#  scale_color_brewer(palette = "Dark2") +
#  theme_light() + 
#    theme(
#    axis.text.y = element_text(size = 8),
#    axis.text.x = element_text(size = 14),
#    axis.ticks.y = element_blank(),
#    axis.line = element_line(colour = "black"),
#    axis.title.x = element_text(size = 15),    
#    axis.title.y = element_text(size = 15),  
#    legend.text = element_text(size = 13),
#    legend.title = element_text(size = 16),
#    legend.position = c(1, 1),  # Move legend to top right
##    legend.justification = c(1, 1),  # Align legend to top right
#    panel.grid.minor = element_blank(),
#    panel.border = element_blank(),
#    panel.background = element_blank())

# Calculate the number of samples per location
location_counts <- location_viral %>%
  group_by(location) %>%
  summarise(count = n()) %>%
  mutate(label = paste0(location, " (n=", count, ")"))

# Create a named vector for easy mapping
location_labels <- setNames(location_counts$label, location_counts$location)

g_viral <- ggplot(location_viral, aes(x=location, y=p_reads, fill=location)) +
  geom_violin(trim=FALSE) +
  geom_boxplot(width=0.1, fill="white", color="black", outlier.shape=NA) +
  geom_jitter(width=0.1, size=0.5, alpha=0.5) +
  scale_y_log10(name="Viral read fraction", labels = label_log(digits=2)) +
  scale_x_discrete(name="Location", labels = location_labels) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title="Total viral fraction based on assigned reads over all locations") +
  theme_light() + 
  theme(
    axis.text.y = element_text(size = 12),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
    axis.title.x = element_text(size = 15),    
    axis.title.y = element_text(size = 15),  
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()
  ) +
  theme_kit +
  coord_flip()  # This flips the coordinates to make the plot horizontal

g_viral

3.3 Taxonomic composition of viruses

The threshold for the label “other” are the set of families that make up less than 20% composition in all samples (the only reason I did this was because there were too many matches). We can see that other makes up a big portion of the viral composition. Outside of the “other” label, Hepadnaviridae makes up the largest portion of the viral composition.

Code

major_threshold <- 0.20

# Identify major viral families
viral_families_major_tab <- viral_families %>% 
  group_by(name, taxid) %>%
  summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
  filter(p_reads_viral_max >= major_threshold)
viral_families_major_list <- viral_families_major_tab %>% pull(name)
viral_families_major <- viral_families %>% 
  filter(name %in% viral_families_major_list) %>%
  select(name, taxid, sample, p_reads_viral)
viral_families_minor <- viral_families_major %>% 
  group_by(sample) %>%
  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
  mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
  select(name, taxid, sample, p_reads_viral)
viral_families_display <- viral_families_major %>% 
  bind_rows(viral_families_minor) %>%
  arrange(desc(p_reads_viral)) %>% 
  mutate(name = factor(name, levels=c(viral_families_major_list, "Other"))) %>%
  rename(p_reads = p_reads_viral, classification=name) %>%
  inner_join(libraries, by='sample')

# Create a custom color palette with up to 20 colors
custom_palette <- c(
  brewer.pal(12, "Paired"),
  brewer.pal(12, "Set3")
)

n_classifications <- length(unique(viral_families_display$classification)) - 1

custom_palette_with_black <- c(
  custom_palette[1:n_classifications],
  "black"
)

# Create a new color palette with black for "Other"
custom_palette_with_black <- c(
  custom_palette[1:n_classifications],
  "black"
)

# Plot
g_families <- g_comp_base + 
  geom_comp(data=viral_families_display) +
  scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values = custom_palette_with_black) +
  facet_wrap(~ location, scales = "free_x", ncol = 2) 
g_families

4 Human-infecting virus reads

4.1 Overall relative abundance

I calculated the relative abundance of human-infecting viruses in two ways:

First, as the total number of deduplicated human-virus reads in each sample, divided by the number of raw reads (“All reads”). This results in a viral fraction of $9.84 \times 10^{-5}$ across all samples.
Second, as a fraction of preprocessed (cleaned, deduplicated, computationally ribodepleted) reads (“Preprocessed reads”). This results in a viral fraction of $1.18 \times 10^{-4}$ across all samples.

Note that 10 of the pools had no viral reads: 5 from “Heilongjiang, Mudanjiang”, 4 from “Jiangsu, Nanjing”, and 1 from “Xinjiang, Urmqi”.

Code

# Get raw read counts
read_counts_raw <- filter(basic_stats_raw) %>%
  select(sample, n_reads_raw = n_read_pairs)
read_counts_preproc <- basic_stats %>% filter(stage == "ribo_initial") %>%
  select(sample, n_reads_preproc = n_read_pairs)

# Get HV read counts
read_counts_hv <- mrg_hv %>% filter(hv_status) %>% 
  group_by(sample) %>% 
  count(name="n_reads_hv")
read_counts <- read_counts_raw %>%
  left_join(read_counts_hv, by=c("sample")) %>%
  mutate(n_reads_hv = replace_na(n_reads_hv, 0)) %>%
  left_join(read_counts_preproc, by=c("sample")) %>%
  inner_join(libraries, by=c("sample")) %>%
  select(sample, n_reads_raw, n_reads_preproc, n_reads_hv) %>%
  mutate(n_samples = 1,
         p_reads_total = n_reads_hv/n_reads_raw,
         p_reads_preproc = n_reads_hv/n_reads_preproc)
read_counts_long <- read_counts %>%
  pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>%
  mutate(read_group = ifelse(read_group == "p_reads_total", "All reads", "Preprocessed reads"))

# Combine for display
read_counts_agg <- read_counts %>%
  mutate(p_reads_total = n_reads_hv/n_reads_raw,
         p_reads_preproc = n_reads_hv/n_reads_preproc) %>%
  inner_join(libraries, by=c("sample"))
read_counts_agg_long <- read_counts_agg %>%
  pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>%
  mutate(read_group = ifelse(read_group == "p_reads_total", "All reads", "Preprocessed reads")) 

# Visualize
#g_read_counts <- ggplot(read_counts_agg_long, aes(x=library, y=p_reads)) +
#  geom_point() +
#  scale_y_log10(name = "Unique human-viral read fraction") +
#  facet_grid(.~read_group, scales = "free") +
#  theme_kit
#g_read_counts

#g_viral <- ggplot(read_counts_agg_long, aes(x=p_reads, y=library, color=location, shape=read_group)) +
#  geom_point() +
#  scale_y_discrete(name="Plasma pool") +
#  scale_x_log10(name="Viral read fraction") +
#  scale_color_brewer(palette = "Dark2") +
#  facet_grid(.~read_group, scales = "free") +
#  guides(color=guide_legend(nrow=2), shape=guide_legend(nrow=2),
#         linetype=guide_legend(nrow=2)) +
#  theme_kit
#g_viral

hv_location <- read_counts_agg_long %>% filter(read_group == "Preprocessed reads" ) %>% group_by(location) 

hv_location_counts <- hv_location %>%
  group_by(location) %>%
  summarise(count = n()) %>%
  mutate(label = paste0(location, " (n=", count, ")"))

  # Create a named vector for easy mapping
location_labels <- setNames(hv_location_counts$label, hv_location_counts$location)

# Create a named vector for easy mapping
# Plot
# g_viral <- ggplot(p_reads_viral, aes(x=p_reads, y=library, color=location, shape=read_group)) +
#   geom_point() +
#   scale_y_discrete(name="Plasma pool") +
#   scale_x_log10(name="Viral read fraction") +
#   scale_color_brewer(palette = "Dark2") +
#   #facet_grid(.~read_group, scales = "free") +
#   guides(color=guide_legend(nrow=2), shape=guide_legend(nrow=2),
#          linetype=guide_legend(nrow=2)) +
#   theme_kit

#ggplot(location_viral, aes(x=p_reads, y=location, color = location)) +
#  geom_quasirandom(size=2) +
#  scale_x_log10(name="Viral read fraction", labels = label_log(digits=2)) +
#  scale_y_discrete(name="Location") +
#  scale_color_brewer(palette = "Dark2") +
#  theme_light() + 
#    theme(
#    axis.text.y = element_text(size = 8),
#    axis.text.x = element_text(size = 14),
#    axis.ticks.y = element_blank(),
#    axis.line = element_line(colour = "black"),
#    axis.title.x = element_text(size = 15),    
#    axis.title.y = element_text(size = 15),  
#    legend.text = element_text(size = 13),
#    legend.title = element_text(size = 16),
#    legend.position = c(1, 1),  # Move legend to top right
##    legend.justification = c(1, 1),  # Align legend to top right
#    panel.grid.minor = element_blank(),
#    panel.border = element_blank(),
#    panel.background = element_blank())

g_viral <- ggplot(hv_location, aes(x=location, y=p_reads, fill=location)) +
  geom_violin(trim=FALSE) +
  geom_boxplot(width=0.1, fill="white", color="black", outlier.shape=NA) +
  geom_jitter(width=0.1, size=0.5, alpha=0.5) +
  scale_y_log10(name="Viral read fraction", labels = label_log(digits=2)) +
  scale_x_discrete(name="Location", labels = location_labels) +
  scale_fill_brewer(palette = "Dark2") +
  theme_light() + 
  theme(
    axis.text.y = element_text(size = 12),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
    axis.title.x = element_text(size = 15),    
    axis.title.y = element_text(size = 15),  
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()
  ) +
  theme_kit +
  coord_flip()  # This flips the coordinates to make the plot horizontal

g_viral

4.2 Overall taxonomy and composition

The two dominant viruses we see are Anellovirdae and Hepadnaviridae, with Parvoviridae . The threshold for the label “other” are the set of families that make up less than 5% composition in all samples.

Code

threshold_major_family <- 0.05

# Count reads for each human-viral family
hv_family_counts <- hv_reads_family %>% 
  group_by(sample, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  group_by(sample) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))

# Identify high-ranking families and group others
hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% 
  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
hv_family_counts_major <- hv_family_counts %>%
  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
  group_by(sample, name_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
            .groups="drop") %>%
  mutate(name_display = factor(name_display, 
                               levels = c(hv_family_major_tab$name, "Other")))
hv_family_counts_display <- hv_family_counts_major %>%
  rename(p_reads = p_reads_hv, classification = name_display) %>%
  inner_join(libraries, by = 'sample')

# Plot
g_hv_family <- g_comp_base + 
  geom_col(data=hv_family_counts_display, position = "stack", width=1) +
  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_brewer(palette = "Set1", name = "Viral family") +

  labs(title="Family composition of human-viral reads") +
  guides(fill=guide_legend(ncol=4)) +
  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain")) +  
  facet_wrap(~ location, scales = "free_x", ncol = 2) 
g_hv_family

Code

# Get most prominent families for text
hv_family_collate <- hv_family_counts %>%
  group_by(name, taxid) %>% 
  summarize(n_reads_tot = sum(n_reads_hv),
            p_reads_max = max(p_reads_hv), .groups="drop") %>% 
  arrange(desc(n_reads_tot))
display_hv_family_collate <- hv_family_collate %>%
  select(name, n_reads_tot) %>%
  rename(`Family` = name, `Total number of reads` = n_reads_tot)
display_hv_family_collate

4.3 Analyzing specific families

We now investigate the composition of specific families that had more than 5 viral reads. In investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 10% of human-viral reads:

4.3.1 Hepadnaviridae (Number of reads: 339,427)

Code

# Hepadnaviridae
plot_viral_family_histogram(taxid_chosen=10404)

Code

# Hepadnaviridae
plot_viral_family_composition(taxid_chosen=10404, threshold_major_species = 0.1)

4.3.2 Parvoviridae (Number of reads: 17,401)

Code

# Parvoviridae
plot_viral_family_histogram(taxid_chosen=10780)

Code

# Parvoviridae
plot_viral_family_composition(taxid_chosen=10780, threshold_major_species = 0.1)

4.3.3 Anelloviridae (Number of reads: 3001)

Code

# Anelloviridae
plot_viral_family_histogram(taxid_chosen=687329)

Code

# Anelloviridae
plot_viral_family_composition(taxid_chosen=687329, threshold_major_species = 0.1)

4.3.4 Orthoherpesviridae (Number of reads: 298)

Code

# Orthoherpesviridae
plot_viral_family_histogram(taxid_chosen=3044472)

Code

# Orthoherpesviridae
plot_viral_family_composition(taxid_chosen=3044472, threshold_major_species = 0.1)

4.3.5 Retroviridae (Number of reads: 115)

Code

# Retroviridae
plot_viral_family_histogram(taxid_chosen=11632)

Code

# Retroviridae
plot_viral_family_composition(taxid_chosen=11632, threshold_major_species = 0.1)

4.3.6 Microviridae (Number of reads: 61)

Code

# Microviridae
plot_viral_family_histogram(taxid_chosen=10841)

Code

# Microviridae
plot_viral_family_composition(taxid_chosen=10841, threshold_major_species = 0.1)

4.3.7 Adenoviridae (Number of reads: 28)

Code

# Adenoviridae
plot_viral_family_histogram(taxid_chosen=10508)

Code

# Adenoviridae
plot_viral_family_composition(taxid_chosen=10508, threshold_major_species = 0.1)

4.4 Relative abundance of pathogenic viruses of interest

Each dot represents a sample, colored by viral family. The x-axis shows the relative abundance of human-infecting viruses, and the y-axis shows the species.

Code

species_family <- result %>% select(species, family) %>% rename('name' = 'species')

play <- hv_species_counts %>% 
  ungroup() %>%
  inner_join(libraries, by = 'sample') %>%
  inner_join(species_family, by = 'name') %>%
  mutate(name = ifelse(name == "Severe acute respiratory syndrome-related coronavirus", "SARS-related coronavirus", name)) %>%
  filter(!(family %in% c("Anelloviridae", NA, "Microviridae", "Rhabdoviridae"))) 

adjusted_play <- play %>% 
  group_by(name) %>%
  mutate(virus_prevalence_num = n_distinct(sample)/n_distinct(libraries),
         total_reads_hv = sum(n_reads_hv)) %>%
  ungroup() %>%
  mutate(name = fct_reorder(name, virus_prevalence_num, .desc=TRUE)) %>% 
  select(name, ra_reads_hv, family, location, virus_prevalence_num, total_reads_hv, n_reads_hv)

pal <- c(brewer.pal(8, 'Dark2'),brewer.pal(8, 'Accent'))

#, labels = label_log(digits=2)

ra_dot <- ggplot(adjusted_play, aes(x = ra_reads_hv, y=name)) +
  geom_quasirandom(orientation = 'y', aes(color = family)) +
  scale_color_manual(values = pal) + 
    scale_x_log10(
    "Relative abundance human virus reads",
    labels=label_log(digits=3)
  ) +
  labs(y =  "",
       color = 'Viral family') + 
  guides(color = guide_legend(ncol=2)) +
  theme_light() + 
    theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 12),
    axis.ticks.y = element_blank(),axis.line = element_line(colour = "black"),
    axis.title.x = element_text(size = 14),    
    legend.text = element_text(size = 10),
    legend.title = element_text(size = 10, face="bold"),
    legend.position = 'bottom',
    #legend.position = c(1, 1),  # Move legend to top right
    #legend.justification = c(1, 1),  # Align legend to top right
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank())
ra_dot

Code

#  geom_text(aes(label = total_reads_hv, y = name), 
#            x = 1e-9, hjust = 0, vjust = 0.5, size = 3, 
#            check_overlap = TRUE)

We can ask Claude to analyze the pathogenicity of these viruses along with their popular names.

Scientific Name	Popular/Well-known Name	Pathogenicity to Humans
Human immunodeficiency virus 1	HIV-1	Very High
SARS-related coronavirus	SARS-CoV	High
Hepatitis B virus	HBV	High
Hepacivirus hominis	Hepatitis C virus (HCV)	High
Simplexvirus humanalpha1	Herpes simplex virus 1 (HSV-1)	Moderate to High
Cytomegalovirus humanbeta5	Human cytomegalovirus (HCMV)	Moderate to High
Lymphocryptovirus humangamma4	Epstein-Barr virus (EBV)	Moderate
Human mastadenovirus C	Adenovirus C	Moderate
Human mastadenovirus F	Adenovirus F	Moderate
Roseolovirus humanbeta6a	Human herpesvirus 6A (HHV-6A)	Low to Moderate
Roseolovirus humanbeta6b	Human herpesvirus 6B (HHV-6B)	Low to Moderate
Roseolovirus humanbeta7	Human herpesvirus 7 (HHV-7)	Low to Moderate
Rhadinovirus humangamma8	Kaposi’s sarcoma-associated herpesvirus (KSHV)	Low to Moderate
Molluscum contagiosum virus	MCV	Low
Erythroparvovirus primate1	Parvovirus B19	Low
Human erythrovirus V9	Erythrovirus V9	Low
Alphapapillomavirus 4	Human papillomavirus 4 (HPV-4)	Low
Betapolyomavirus hominis	Human polyomavirus	Low
Dependoparvovirus primate1	Adeno-associated virus (AAV)	Very Low
Murine leukemia virus	MLV	Very Low (not a human pathogen)
Murine leukemia-related retroviruses	MLV-related viruses	Very Low (not typical human pathogens)

We get a variety of pathogenic viruses, none of which are a suprise to see in plasma (other than SARS-CoV which we determined was contamination on twist).

4.5 Relative abundance assuming perfect human read removal

Assuming we’re able to perfectly remove all human reads, the average relative abundance of known human infecting virus is $5.37 \times 10^{-4}$.

5 Conclusion

There were some interesting takeways from this analysis:

Human read fraction remains quite high in plasma if special sample preparation/library preparation is not taken (see here).
There is a lot of Anelloviridae in blood, including that which has not been classified by species.
Pathogenic viruses such as HIV and HBV, and even latent viruses like Herpes are picked up by plasma MGS
Respiratory viruses are also picked up by plasma MGS (SARS), but not as much as other viruses.

We do note that we’ve analyzed other datasets and it seems that having a large catchment is crucial, as we’d expect, for picking up these more rare, pathogenic viruses. Using this data, we can also look at China’s HIV/HBV reported incidence to do a P2RA/swab sampling type analysis. We plan to do more analysis on this data, but wanted to share some preliminary results.

6 Appendix

6.1 Human-infecting virus families, genera, and species

To get a good overview of families, genera, and species, we can look at a Sankey plot where the magnitude of relative abundance, averaged over all samples, is shown in parentheses.

Code

# Function to create links
create_links <- function(data) {
  family_to_genus <- data %>%
    filter(!is.na(genus)) %>%
    group_by(family, genus) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = family, target = genus)
  
  genus_to_species <- data %>%
    group_by(genus, species) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = genus, target = species)

  family_to_species <- data %>%
    filter(is.na(genus)) %>%
    group_by(family, species) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(source = family, target = species)

  bind_rows(family_to_genus, genus_to_species, family_to_species) %>%
    filter(!is.na(source))
}

# Function to create nodes
create_nodes <- function(links) {
  data.frame(
    name = c(links$source, links$target) %>% unique()
  )
}

# Function to prepare data for Sankey diagram
prepare_sankey_data <- function(links, nodes) {
  links$IDsource <- match(links$source, nodes$name) - 1
  links$IDtarget <- match(links$target, nodes$name) - 1
  list(links = links, nodes = nodes)
}

# Function to create Sankey plot
create_sankey_plot <- function(sankey_data) {
  sankeyNetwork(
    Links = sankey_data$links, 
    Nodes = sankey_data$nodes,
    Source = "IDsource", 
    Target = "IDtarget",
    Value = "value", 
    NodeID = "name",
    sinksRight = TRUE,
    nodeWidth = 25,
    fontSize = 14,
  )
}

save_sankey_as_png <- function(sankey_plot, width = 1000, height = 800) {
  # Save the plot as an HTML file
  saveWidget(sankey_plot, sprintf('%s/sankey.html',data_dir))
}

format_scientific <- function(x, digits=2) {
  sapply(x, function(val) {
    if (is.na(val) || abs(val) < 1e-15) {
      return("0")
    } else {
      exponent <- floor(log10(abs(val)))
      coef <- val / 10^exponent
      #return(sprintf("%.1f × 10^%d", round(coef, digits), exponent))
      # May or may not be smart, just keeping magnitude
      return(sprintf("10^%d", exponent))
    }
  })
}

data <- result %>% 
  mutate(across(c(genus_n_reads_tot, genus_ra_reads_tot), ~replace_na(., 0)),
         genus = ifelse(is.na(genus), "Unknown Genus", genus)) %>%
  mutate(
  species = paste0(species, sprintf(' (%s)', format_scientific(species_ra_reads_tot))),
  genus = paste0(genus, sprintf(' (%s)', format_scientific(genus_ra_reads_tot))),
  family = paste0(family, sprintf(' (%s)', format_scientific(family_ra_reads_tot)))
)
links <- as.data.frame(create_links(data))
nodes <- create_nodes(links)
sankey_data <- prepare_sankey_data(links, nodes)
sankey <- create_sankey_plot(sankey_data)

sankey

--- title: "Workflow of Mengyi et al. (2023)" subtitle: "Pooled plasma from China (DNA)" author: "Harmon Bhasin" date: 2024-07-23 format: html: toc: true # table of contents toc-title: "Table of contents" # table of contents title number-sections: true # number sections number-depth: 3 # number depth to show in table of contents toc-location: right # table of contents location page-layout: full # full page layout code-fold: true # Keep option to fold code (i.e. show or hide it; default: hide) code-tools: true # Code menu in the header of your document that provides various tools for readers to interact with the source code code-link: true # Enables hyper-linking of functions within code blocks to their online documentation df-print: paged # print data frame fig-format: svg other-links: - text: Paper href: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10372899/ - text: Data href: https://bigd.big.ac.cn/gsa/browse/CRA006191 code-links: - text: Code for this post icon: file-code href: https://github.com/naobservatory/harmons-public-notebook/blob/main/notebooks/2024-07-23-mengyi.qmd editor: visual: true # visual editor render-on-save: true # render on save comments: hypothesis: true # hypothesis execute: freeze: auto cache: true title-block-banner: "#de2d26" --- ```{r} #| label: load-packages #| include: false library(tidyverse) library(grDevices) library(svglite) library(cowplot) library(patchwork) library(fastqcr) library(RColorBrewer) library(networkD3) library(readxl) library(htmlwidgets) library(ggbeeswarm) library(latex2exp) library(forcats) library(scales) library(ggpubr) library(ggbeeswarm) source("/Users/harmonbhasin/work/securebio/sampling-strategies/scripts/aux_plot-theme.R") theme_base <- theme_base + theme(aspect.ratio = NULL) theme_kit <- theme_base + theme( axis.text.x = element_text(hjust = 1, angle = 45), axis.title.x = element_blank(), ) tnl <- theme(legend.position = "none") ``` ```{r} #| warning: false #| label: set-up-paths #| include: false # Data input paths data_dir <- "/Users/harmonbhasin/work/securebio/nao-harmon/mengyi2023/analysis/" input_dir <- file.path(data_dir, "input") results_dir <- file.path(data_dir, "results") qc_dir <- file.path(results_dir, "qc") hv_dir <- file.path(results_dir, "hv") libraries_path <- file.path(input_dir, "libraries.csv") basic_stats_path <- file.path(qc_dir, "qc_basic_stats.tsv.gz") adapter_stats_path <- file.path(qc_dir, "qc_adapter_stats.tsv.gz") quality_base_stats_path <- file.path(qc_dir, "qc_quality_base_stats.tsv.gz") quality_seq_stats_path <- file.path(qc_dir, "qc_quality_sequence_stats.tsv.gz") ``` This one of the studies that we hope to discuss in [our third blog post](https://naobservatory.org/blog/exploring-blood-biosurveillance-part1), which will cover the metagenomic analysis of whole blood/plasma. In this notebook, I analyze [Mengyi 2023](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10372899/), a dataset from China with 201 pools of **plasma**, where each pool contains 160 samples (we don't know whether 1 sample = 1 donation) from 7 different locations, for a total of 10,720 samples. *I'd like to thank Lenni for giving me feedback on the notebook, Will for providing me with a boilerplate rmarkdown file, and Simon for giving me feedback on my figures. This notebook uses the MGS Workflow v2.2.1 (note that this is old).* # Raw data ## About [This dataset](https://bigd.big.ac.cn/gsa/browse/CRA006191) from China has 201 pools of plasma, where each pool contains 160 samples from 7 different locations between 2012-2018, for a total of 10,720 samples. This paper did not discuss the number of individuals that contributed to the samples, but we're going to attempt to contact the authors to get this information (this shouldn't hold up any further analysis, but would be good to have this information). They did DNA-sequencing for each pool, with Illumina HiSeq 4500, producing 2x150 bp reads. ```{r} #| label: load-libraries # Import libraries and extract metadata from sample names libraries <- read_csv(libraries_path, show_col_types = FALSE) metadata <- read_tsv(sprintf('%s/metadata.tsv', data_dir), show_col_types = FALSE) %>% select('sample'=`Library`, `library`=`Sample title`, 'Geographic location') %>% mutate('location' = str_split(`Geographic location`, ":", simplify = TRUE)[, ncol(str_split(`Geographic location`, ":", simplify = TRUE))]) libraries <- libraries %>% left_join(metadata, by = 'sample') %>% select(sample, library=library.y, location) %>% mutate(library = factor(library)) libraries %>% group_by(location) %>% summarize(n_pools = n()) ``` ### Sample + library preparation The following excerpt from the paper describes the sample and library preparation process. It's crucial to note that while blood samples were initially collected from volunteers, these samples were converted to plasma through ultracentrifugation prior to sequencing: > From January 1, 2012, to December 31, 2018, a total of 10,720 blood samples of 10 ml each were randomly selected from voluntary blood donors in 7 regions. The blood samples taken from various places were mixed in units of 160 (each 100 μl) for ultracentrifugation (32,000 rpm, 120 min, maximum centrifugal radius of 91.9 mm). Afterward, we rinsed and resuspended the precipitate with 500 μl PBS. > > The pooled suspensions were subjected to extraction of total DNA using QIAamp® DNA Blood mini Kit (QIAGEN Cat. NO.160019269, Frankfurt, Germany), DNA concentration was measured by Equalbit® 1 × dsDNA HS Assay Kit (Vazyme Cat. NO. 7E302K9, Nanjing, China). > > The metagenomic library was constructed using KAPA HyperPlus Kit (KAPA Cat. NO. 0000097583, Boston, USA) with dual-indexed Adapters (KAPA Cat. NO. 0000093370, Boston, USA), the DNA was fragmented to 250 bp approximately by the enzyme at 37 °C for 20 min, after end repair and A-tailing, adapter ligation, post-ligation cleanup, library amplification, and post-amplification cleanup, the library was constructed. > > Agilent 2100 Bioanalyzer (Agilent Technologies, Beijing, China) was used for library quality control, and qualified DNA library was sent to the Novogene company to sequence in HiSeq 4500. ## Quality control metrics ```{r} #| warning: false #| label: read-qc-data #| include: false # Import QC data stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary") basic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>% inner_join(libraries, by="sample") %>% arrange(sample) %>% mutate(stage = factor(stage, levels = stages), sample = fct_inorder(sample)) adapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>% inner_join(libraries, by="sample") %>% arrange(sample) %>% mutate(stage = factor(stage, levels = stages), read_pair = fct_inorder(as.character(read_pair))) quality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>% inner_join(libraries, by="sample") %>% arrange(sample) %>% mutate(stage = factor(stage, levels = stages), read_pair = fct_inorder(as.character(read_pair))) quality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>% inner_join(libraries, by="sample") %>% arrange(sample) %>% mutate(stage = factor(stage, levels = stages), read_pair = fct_inorder(as.character(read_pair))) # Filter to raw data basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat") adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat") quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat") quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat") # Get key values for readout raw_read_counts <- basic_stats_raw %>% summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs), rmean=mean(n_read_pairs), rtot = sum(n_read_pairs), btot = sum(n_bases_approx), dmin = min(percent_duplicates), dmax=max(percent_duplicates), dmean=mean(percent_duplicates), .groups = "drop") ``` These 201 samples contained 3.4B read pairs. The samples had 8.2M - 30.1M (mean 17.1M) read pairs each. The number of read pairs and total bases look relatively evenly distributed across the locations. The duplication rate is also quite low, around ~10%. Adapter content is low. As we'd expect, we see higher quality Q scores near the beginning of the read and a gradual decline towards the end of the read, however all positions seem to have a Q score of 35 which means that our reads are \~99.97% accurate. When looking at the Q score over all sequences, we can see a sharp peak after 35, which closely follows the previous plot, indicating high read quality. ```{r} #| fig-width: 15 #| fig-height: 10 #| warning: false #| label: plot-basic-stats # Prepare data basic_stats_raw_metrics <- basic_stats_raw %>% select(library, location, `# Read pairs` = n_read_pairs, `Total base pairs\n(approx)` = n_bases_approx, `% Duplicates\n(FASTQC)` = percent_duplicates) %>% pivot_longer(-(library:location), names_to = "metric", values_to = "value") %>% mutate(metric = fct_inorder(metric)) # Set up plot templates g_basic <- ggplot(basic_stats_raw_metrics, aes(x = library, y = value, fill=location)) + geom_col(position = "dodge") + scale_x_discrete() + scale_fill_brewer(palette = "Dark2") + scale_y_continuous(expand = c(0, 0)) + expand_limits(y = c(0, 100)) + facet_grid(metric ~ ., scales = "free", space = "free_x", switch = "y") + theme_kit + theme( axis.title.y = element_blank(), strip.text.y = element_text(face = "plain"), axis.text.x = element_blank() ) g_basic ``` ```{r} #| label: plot-raw-quality #| fig-width: 15 #| fig-height: 10 # Set up plotting templates g_qual_raw <- ggplot(mapping=aes(color = location, linetype=read_pair, group=interaction(sample,read_pair))) + scale_linetype_discrete(name = "Read Pair") + scale_color_brewer(palette = "Dark2") + guides(color=guide_legend(nrow=2,byrow=TRUE), linetype = guide_legend(nrow=2,byrow=TRUE)) + theme_base # Visualize adapters g_adapters_raw <- g_qual_raw + geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) + scale_y_continuous(name="% Adapters", limits=c(0,5), expand=c(0,0)) + scale_x_continuous(name="Position", limits=c(0,NA), breaks=seq(0,500,20), expand=c(0,0)) + facet_grid(.~adapter) g_adapters_raw # Visualize quality g_quality_base_raw <- g_qual_raw + geom_hline(yintercept=25, linetype="dashed", color="red") + geom_hline(yintercept=30, linetype="dashed", color="red") + geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) + scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) + scale_x_continuous(name="Position", limits=c(0,NA), breaks=seq(0,500,20), expand=c(0,0)) g_quality_base_raw g_quality_seq_raw <- g_qual_raw + geom_vline(xintercept=25, linetype="dashed", color="red") + geom_vline(xintercept=30, linetype="dashed", color="red") + geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) + scale_x_continuous(name="Mean Phred score", expand=c(0,0)) + scale_y_continuous(name="# Sequences", expand=c(0,0)) g_quality_seq_raw ``` # Preprocessing ## Summary The average fraction of reads at each stage in the preprocessing pipeline is shown in the following table. Reads lost during trimming and filtering approximately matches what we'd expect based on the raw adapter content. Deduplication loses us about 10% of reads which matched the amount of estimated duplicated reads by QC. Low ribodepletion is observed which makes sense because they only sequenced DNA. ```{r} #| label: preproc-table # TODO: Group by pool size as well # Count read losses n_reads_rel <- basic_stats %>% select(sample,location, stage, percent_duplicates, n_read_pairs) %>% group_by(sample) %>% arrange(sample, stage) %>% mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs), p_reads_lost = 1 - p_reads_retained, p_reads_retained_abs = n_read_pairs / n_read_pairs[1], p_reads_lost_abs = 1-p_reads_retained_abs, p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs)) n_reads_rel_display <- n_reads_rel %>% rename(Stage=stage) %>% group_by(Stage) %>% summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"), `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% filter(Stage != "raw_concat") %>% mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion"))) n_reads_rel_display ``` ## Quality control metrics Trimming and cleaning gets rid of the Illumina adapter as well as polyg, and we see a decrease in polya. Q score remain the same during read cleaning when looking at the positions, with the end of the read actually improving in score. Q scores across all sequences look pretty much the same throughout cleaning. ```{r} #| warning: false #| label: plot-quality #| fig-width: 15 #| fig-height: 10 g_qual <- ggplot(mapping=aes(color = location, linetype=read_pair, group=interaction(sample,read_pair))) + scale_linetype_discrete(name = "Read Pair") + scale_color_brewer(palette = "Dark2") + guides(color=guide_legend(nrow=2,byrow=TRUE), linetype = guide_legend(nrow=2,byrow=TRUE)) + theme_base # Visualize adapters g_adapters <- g_qual + geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) + scale_y_continuous(name="% Adapters", limits=c(0,5), expand=c(0,0)) + scale_x_continuous(name="Position", limits=c(0,NA), breaks=seq(0,140,20), expand=c(0,0)) + facet_grid(stage~adapter) g_adapters # Visualize quality g_quality_base <- g_qual + geom_hline(yintercept=25, linetype="dashed", color="red") + geom_hline(yintercept=30, linetype="dashed", color="red") + geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) + scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) + scale_x_continuous(name="Position", limits=c(0,NA), breaks=seq(0,140,20), expand=c(0,0)) + facet_grid(stage~.) g_quality_base g_quality_seq <- g_qual + geom_vline(xintercept=25, linetype="dashed", color="red") + geom_vline(xintercept=30, linetype="dashed", color="red") + geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) + scale_x_continuous(name="Mean Phred score", expand=c(0,0)) + scale_y_continuous(name="# Sequences", expand=c(0,0)) + facet_grid(stage~., scales = "free_y") g_quality_seq ``` Number of read pairs look reasonable, same with a large portion being lost during deduplication. At the end of deduplication it seems that only 5% of reads are duplicates which doesn't seem too bad. Mean read length stays around 150 bp which is pretty good. ```{r} #| label: preproc-figures #| warning: false #| fig-width: 15 #| fig-height: 10 g_stage_trace <- ggplot(basic_stats, aes(x=stage, group=sample, color = location)) + theme_kit + scale_color_brewer(palette = "Dark2") # Plot reads over preprocessing g_reads_stages <- g_stage_trace + geom_line(aes(y=n_read_pairs)) + scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA)) g_reads_stages # Plot relative read losses during preprocessing g_reads_rel <- ggplot(n_reads_rel, aes(x=stage, group=sample, color = location)) + geom_line(aes(y=p_reads_lost_abs_marginal)) + scale_y_continuous("% Total Reads Lost", expand=c(0,0), labels = function(x) x*100) + scale_color_brewer(palette = "Dark2") + theme_kit g_reads_rel ``` ```{r} #| label: preproc-dedup #| fig-width: 15 #| fig-height: 10 stage_dup <- basic_stats %>% group_by(stage) %>% summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates), dmean=mean(percent_duplicates), .groups = "drop") g_dup_stages <- g_stage_trace + geom_line(aes(y=percent_duplicates)) + scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0)) g_dup_stages ``` ```{r} #| label: preproc-mean-readlen #| fig-width: 15 #| fig-height: 10 #| fig-cap: "Reminder that this data is 2x150 bp." #| cap-location: margin g_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) + scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA)) g_readlen_stages ``` # Taxonomic composition ## High-level composition To assess the high-level composition of the reads, I ran the files through Kraken2 and summarized the results with Bracken. The groups listed below were created by Will: * Filtered (removed during cleaning) * Duplicate (removed during deduplication) * Ribosomal (removed during ribodepletion) * Unassigned (non-ribosomal reads that were not assigned to any taxon by Kraken/Bracken) * Bacterial (non-ribosomal reads assigned to the Bacteria domain by Kraken/Bracken) * Archaeal (non-ribosomal reads assigned to the Archaea domain by Kraken/Bracken) * Viral (non-ribosomal reads assigned to the Viruses domain by Kraken/Bracken) * Human (non-ribosomal reads assigned to the Eukarya domain by Kraken/Bracken) Human reads make up a large proportion of all reads at 82% total composition, whereas viruses only account for 0.01% of all reads. ```{r} #| label: prepare-composition #| warning: false classifications <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned", "Bacterial", "Archaeal", "Viral", "Human") # Import composition data tax_final_dir <- file.path(results_dir, "taxonomy_final") comp <- read_tsv(sprintf("%s/taxonomic_composition.tsv.gz", tax_final_dir)) %>% inner_join(libraries, by='sample') comp_minor <- comp %>% filter(classification %in% c("Archaeal", "Viral", "Human", "Other")) comp_assigned <- comp %>% filter(! classification %in% c("Filtered", "Duplicate", "Ribosomal", "Unassigned")) %>% group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads)) comp_assigned_minor <- comp_assigned %>% filter(classification %in% c("Archaeal", "Viral", "Human", "Other")) # Summarize composition read_comp_summ <- comp %>% group_by(classification) %>% summarize(n_reads = sum(n_reads), .groups = "drop_last") %>% mutate(n_reads = replace_na(n_reads,0), p_reads = n_reads/sum(n_reads), pc_reads = p_reads*100) %>% mutate(classification = factor(classification, levels=classifications)) %>% select(classification, n_reads, pc_reads) %>% rename(`# of reads` = n_reads, "% of reads" = pc_reads) %>% mutate(`% of reads` = sprintf("%.2f", `% of reads`)) read_comp_summ ``` ```{r} #| label: plot-composition-all #| fig-width: 15 #| fig-height: 10 # Prepare plotting templates g_comp_base <- ggplot(mapping=aes(x=library, y=p_reads, fill=classification)) + scale_x_discrete(name="Plasma pool") + theme_kit + theme(plot.title = element_text(hjust=0, face="plain", size=rel(1.5)), axis.text.x = element_blank()) scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads", expand = c(0,0), labels = function(y) y*100) geom_comp <- purrr::partial(geom_col, position = "stack", width = 1) # Define a color palette for the classification classification_colors <- brewer.pal(8, "Accent") names(classification_colors) <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned", "Bacterial", "Archaeal", "Viral", "Human") scale_fill_classification <- function() { scale_fill_manual(values = classification_colors, name = "Classification") } # Plot overall composition g_comp <- g_comp_base + geom_comp(data = comp) + scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) + scale_fill_classification() + ggtitle("Read composition (all reads, all groups)") g_comp # Create a faceted plot of composition by location #g_comp_by_location <- g_comp_base + # geom_comp(data = comp) + # scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) + # scale_fill_classification() + # facet_wrap(~ location, scales = "free_x", ncol = 2) + # theme(axis.text.x = element_blank(), # Remove x-axis labels # axis.ticks.x = element_blank()) + # Remove x-axis ticks # ggtitle("Read composition (all reads, all groups)") # Display the plot #g_comp_by_location # Repeat for classified reads only g_comp_assigned <- g_comp_base + geom_comp(data = comp_assigned) + scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) + scale_fill_classification() + ggtitle("Read composition (assigned reads, all groups)") g_comp_assigned # Plot composition of minor components #g_comp_minor <- g_comp_base + # geom_comp(data=comp_minor) + # scale_y_pc_reads() + # scale_fill_classification() + # ggtitle("Read composition (all reads, minor groups)") #g_comp_minor #g_comp_assigned_minor <- g_comp_base + # geom_comp(data=comp_assigned_minor) + # scale_y_pc_reads() + # scale_fill_classification() + # ggtitle("Read composition (assigned reads, minor groups)") #g_comp_assigned_minor ``` ## Total viral content Total viral fraction average $1.03 \times 10^{-4}$ across samples. As a fraction of assigned (rather than total) reads, this jumped to $1.28 \times 10^{-4}$. The increase in viral fraction from assigned reads to total reads was constant across all samples, however, we can see that the viral fractions differed by location. Note that one of the pools from "Heilongjiang,Mudanjiang" had no viral reads. ```{r} #| label: p-viral-scatter #| fig-width: 10 #| fig-height: 10 #| warning: false #p_reads_viral_all <- comp %>% filter(classification == "Viral") %>% # mutate(read_group = "All reads") p_reads_viral_assigned <- comp_assigned %>% filter(classification == "Viral") %>% mutate(read_group = "Classified reads") #p_reads_viral <- bind_rows(p_reads_viral_all, p_reads_viral_assigned) location_viral <- comp_assigned %>% filter(classification == "Viral" ) %>% group_by(location) # Plot # g_viral <- ggplot(p_reads_viral, aes(x=p_reads, y=library, color=location, shape=read_group)) + # geom_point() + # scale_y_discrete(name="Plasma pool") + # scale_x_log10(name="Viral read fraction") + # scale_color_brewer(palette = "Dark2") + # #facet_grid(.~read_group, scales = "free") + # guides(color=guide_legend(nrow=2), shape=guide_legend(nrow=2), # linetype=guide_legend(nrow=2)) + # theme_kit #ggplot(location_viral, aes(x=p_reads, y=location, color = location)) + # geom_quasirandom(size=2) + # scale_x_log10(name="Viral read fraction", labels = label_log(digits=2)) + # scale_y_discrete(name="Location") + # scale_color_brewer(palette = "Dark2") + # theme_light() + # theme( # axis.text.y = element_text(size = 8), # axis.text.x = element_text(size = 14), # axis.ticks.y = element_blank(), # axis.line = element_line(colour = "black"), # axis.title.x = element_text(size = 15), # axis.title.y = element_text(size = 15), # legend.text = element_text(size = 13), # legend.title = element_text(size = 16), # legend.position = c(1, 1), # Move legend to top right ## legend.justification = c(1, 1), # Align legend to top right # panel.grid.minor = element_blank(), # panel.border = element_blank(), # panel.background = element_blank()) # Calculate the number of samples per location location_counts <- location_viral %>% group_by(location) %>% summarise(count = n()) %>% mutate(label = paste0(location, " (n=", count, ")")) # Create a named vector for easy mapping location_labels <- setNames(location_counts$label, location_counts$location) g_viral <- ggplot(location_viral, aes(x=location, y=p_reads, fill=location)) + geom_violin(trim=FALSE) + geom_boxplot(width=0.1, fill="white", color="black", outlier.shape=NA) + geom_jitter(width=0.1, size=0.5, alpha=0.5) + scale_y_log10(name="Viral read fraction", labels = label_log(digits=2)) + scale_x_discrete(name="Location", labels = location_labels) + scale_fill_brewer(palette = "Dark2") + labs(title="Total viral fraction based on assigned reads over all locations") + theme_light() + theme( axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12, angle = 45, hjust = 1), axis.title.x = element_text(size = 15), axis.title.y = element_text(size = 15), legend.position = "none", panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank() ) + theme_kit + coord_flip() # This flips the coordinates to make the plot horizontal g_viral ``` ## Taxonomic composition of viruses The threshold for the label "other" are the set of families that make up less than 20% composition in all samples (the only reason I did this was because there were too many matches). We can see that other makes up a big portion of the viral composition. Outside of the "other" label, Hepadnaviridae makes up the largest portion of the viral composition. ```{r} #| label: extract-viral-taxa #| include: false # Get viral taxonomy viral_taxa_path <- file.path(data_dir, "total-virus-db.tsv.gz") viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE) # Get Kraken reports reports_path <- file.path(tax_final_dir, "kraken_reports.tsv.gz") reports <- read_tsv(reports_path, show_col_types = FALSE) %>% inner_join(libraries, by="sample") %>% arrange(sample) # Filter to viral taxa kraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>% group_by(sample) %>% mutate(p_reads_viral = n_reads_clade/n_reads_clade[1]) kraken_reports_viral_cleaned <- kraken_reports_viral %>% select(-pc_reads_total, -n_reads_direct, -contains("minimizers")) %>% select(name, taxid, p_reads_viral, n_reads_clade, everything()) %>% ungroup viral_classes <- kraken_reports_viral_cleaned %>% filter(rank == "C") viral_families <- kraken_reports_viral_cleaned %>% filter(rank == "F") ``` ```{r} #| label: viral-family-composition #| fig-width: 10 #| fig-height: 15 major_threshold <- 0.20 # Identify major viral families viral_families_major_tab <- viral_families %>% group_by(name, taxid) %>% summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>% filter(p_reads_viral_max >= major_threshold) viral_families_major_list <- viral_families_major_tab %>% pull(name) viral_families_major <- viral_families %>% filter(name %in% viral_families_major_list) %>% select(name, taxid, sample, p_reads_viral) viral_families_minor <- viral_families_major %>% group_by(sample) %>% summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>% mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>% select(name, taxid, sample, p_reads_viral) viral_families_display <- viral_families_major %>% bind_rows(viral_families_minor) %>% arrange(desc(p_reads_viral)) %>% mutate(name = factor(name, levels=c(viral_families_major_list, "Other"))) %>% rename(p_reads = p_reads_viral, classification=name) %>% inner_join(libraries, by='sample') # Create a custom color palette with up to 20 colors custom_palette <- c( brewer.pal(12, "Paired"), brewer.pal(12, "Set3") ) n_classifications <- length(unique(viral_families_display$classification)) - 1 custom_palette_with_black <- c( custom_palette[1:n_classifications], "black" ) # Create a new color palette with black for "Other" custom_palette_with_black <- c( custom_palette[1:n_classifications], "black" ) # Plot g_families <- g_comp_base + geom_comp(data=viral_families_display) + scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), breaks = seq(0,1,0.2), expand=c(0,0), labels = function(y) y*100) + scale_fill_manual(values = custom_palette_with_black) + facet_wrap(~ location, scales = "free_x", ncol = 2) g_families ``` # Human-infecting virus reads ## Overall relative abundance I calculated the relative abundance of human-infecting viruses in two ways: - First, as the total number of deduplicated human-virus reads in each sample, divided by the number of raw reads ("All reads"). This results in a viral fraction of $9.84 \times 10^{-5}$ across all samples. - Second, as a fraction of preprocessed (cleaned, deduplicated, computationally ribodepleted) reads ("Preprocessed reads"). This results in a viral fraction of $1.18 \times 10^{-4}$ across all samples. Note that 10 of the pools had no viral reads: 5 from "Heilongjiang, Mudanjiang", 4 from "Jiangsu, Nanjing", and 1 from "Xinjiang, Urmqi". ```{r} #| label: prepare-hv #| include: false # Import and format reads hv_reads_path <- file.path(hv_dir, "hv_hits_putative_collapsed.tsv.gz") mrg_hv <- read_tsv(hv_reads_path, show_col_types = FALSE) %>% inner_join(libraries, by="sample") %>% arrange(sample) %>% mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV assignment", "No Kraken2 assignment")) %>% mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev), highscore = adj_score_max >= 20, hv_status = assigned_hv | highscore) %>% rename(taxid_all = taxid, taxid = taxid_best) ``` ```{r} #| label: count-hv-reads #| fig-width: 10 #| fig-height: 10 #| warning: false # Get raw read counts read_counts_raw <- filter(basic_stats_raw) %>% select(sample, n_reads_raw = n_read_pairs) read_counts_preproc <- basic_stats %>% filter(stage == "ribo_initial") %>% select(sample, n_reads_preproc = n_read_pairs) # Get HV read counts read_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% count(name="n_reads_hv") read_counts <- read_counts_raw %>% left_join(read_counts_hv, by=c("sample")) %>% mutate(n_reads_hv = replace_na(n_reads_hv, 0)) %>% left_join(read_counts_preproc, by=c("sample")) %>% inner_join(libraries, by=c("sample")) %>% select(sample, n_reads_raw, n_reads_preproc, n_reads_hv) %>% mutate(n_samples = 1, p_reads_total = n_reads_hv/n_reads_raw, p_reads_preproc = n_reads_hv/n_reads_preproc) read_counts_long <- read_counts %>% pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>% mutate(read_group = ifelse(read_group == "p_reads_total", "All reads", "Preprocessed reads")) # Combine for display read_counts_agg <- read_counts %>% mutate(p_reads_total = n_reads_hv/n_reads_raw, p_reads_preproc = n_reads_hv/n_reads_preproc) %>% inner_join(libraries, by=c("sample")) read_counts_agg_long <- read_counts_agg %>% pivot_longer(starts_with("p_reads"), names_to="read_group", values_to="p_reads") %>% mutate(read_group = ifelse(read_group == "p_reads_total", "All reads", "Preprocessed reads")) # Visualize #g_read_counts <- ggplot(read_counts_agg_long, aes(x=library, y=p_reads)) + # geom_point() + # scale_y_log10(name = "Unique human-viral read fraction") + # facet_grid(.~read_group, scales = "free") + # theme_kit #g_read_counts #g_viral <- ggplot(read_counts_agg_long, aes(x=p_reads, y=library, color=location, shape=read_group)) + # geom_point() + # scale_y_discrete(name="Plasma pool") + # scale_x_log10(name="Viral read fraction") + # scale_color_brewer(palette = "Dark2") + # facet_grid(.~read_group, scales = "free") + # guides(color=guide_legend(nrow=2), shape=guide_legend(nrow=2), # linetype=guide_legend(nrow=2)) + # theme_kit #g_viral hv_location <- read_counts_agg_long %>% filter(read_group == "Preprocessed reads" ) %>% group_by(location) hv_location_counts <- hv_location %>% group_by(location) %>% summarise(count = n()) %>% mutate(label = paste0(location, " (n=", count, ")")) # Create a named vector for easy mapping location_labels <- setNames(hv_location_counts$label, hv_location_counts$location) # Create a named vector for easy mapping # Plot # g_viral <- ggplot(p_reads_viral, aes(x=p_reads, y=library, color=location, shape=read_group)) + # geom_point() + # scale_y_discrete(name="Plasma pool") + # scale_x_log10(name="Viral read fraction") + # scale_color_brewer(palette = "Dark2") + # #facet_grid(.~read_group, scales = "free") + # guides(color=guide_legend(nrow=2), shape=guide_legend(nrow=2), # linetype=guide_legend(nrow=2)) + # theme_kit #ggplot(location_viral, aes(x=p_reads, y=location, color = location)) + # geom_quasirandom(size=2) + # scale_x_log10(name="Viral read fraction", labels = label_log(digits=2)) + # scale_y_discrete(name="Location") + # scale_color_brewer(palette = "Dark2") + # theme_light() + # theme( # axis.text.y = element_text(size = 8), # axis.text.x = element_text(size = 14), # axis.ticks.y = element_blank(), # axis.line = element_line(colour = "black"), # axis.title.x = element_text(size = 15), # axis.title.y = element_text(size = 15), # legend.text = element_text(size = 13), # legend.title = element_text(size = 16), # legend.position = c(1, 1), # Move legend to top right ## legend.justification = c(1, 1), # Align legend to top right # panel.grid.minor = element_blank(), # panel.border = element_blank(), # panel.background = element_blank()) g_viral <- ggplot(hv_location, aes(x=location, y=p_reads, fill=location)) + geom_violin(trim=FALSE) + geom_boxplot(width=0.1, fill="white", color="black", outlier.shape=NA) + geom_jitter(width=0.1, size=0.5, alpha=0.5) + scale_y_log10(name="Viral read fraction", labels = label_log(digits=2)) + scale_x_discrete(name="Location", labels = location_labels) + scale_fill_brewer(palette = "Dark2") + theme_light() + theme( axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12, angle = 45, hjust = 1), axis.title.x = element_text(size = 15), axis.title.y = element_text(size = 15), legend.position = "none", panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank() ) + theme_kit + coord_flip() # This flips the coordinates to make the plot horizontal g_viral ``` ## Overall taxonomy and composition The two dominant viruses we see are Anellovirdae and Hepadnaviridae, with Parvoviridae . The threshold for the label "other" are the set of families that make up less than 5% composition in all samples. ```{r} #| label: raise-hv-taxa #| include: false # Filter samples and add viral taxa information #samples_keep <- read_counts %>% filter(n_reads_hv > 1) %>% pull(sample) samples_keep <- read_counts %>% pull(sample) mrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by="taxid") # Discover viral species & genera for HV reads raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){ # Get higher ranks than search rank ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom") rank_match <- which.max(ranks == out_rank) high_ranks <- ranks[rank_match:length(ranks)] # Merge read DB and taxid DB reads <- read_db %>% select(-parent_taxid, -rank, -name) %>% left_join(taxid_db, by="taxid") # Extract sequences that are already at appropriate rank reads_rank <- filter(reads, rank == out_rank) # Drop sequences at a higher rank and return unclassified sequences reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid)) while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences... # Promote read taxids and re-merge with taxid DB, then re-classify and filter reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>% select(-parent_taxid, -rank, -name) %>% left_join(taxid_db, by="taxid") reads_rank <- reads_remaining %>% filter(rank == out_rank) %>% bind_rows(reads_rank) reads_norank <- reads_remaining %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid)) } # Finally, extract and append reads that were excluded during the process reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id) reads_out <- reads_rank %>% bind_rows(reads_dropped) %>% select(-parent_taxid, -rank, -name) %>% left_join(taxid_db, by="taxid") return(reads_out) } hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species") hv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, "genus") hv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, "family") ``` ```{r} #| label: hv-family #| fig-width: 15 #| fig-height: 15 threshold_major_family <- 0.05 # Count reads for each human-viral family hv_family_counts <- hv_reads_family %>% group_by(sample, name, taxid) %>% count(name = "n_reads_hv") %>% group_by(sample) %>% mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv)) # Identify high-ranking families and group others hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>% arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family) hv_family_counts_major <- hv_family_counts %>% mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>% group_by(sample, name_display) %>% summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), .groups="drop") %>% mutate(name_display = factor(name_display, levels = c(hv_family_major_tab$name, "Other"))) hv_family_counts_display <- hv_family_counts_major %>% rename(p_reads = p_reads_hv, classification = name_display) %>% inner_join(libraries, by = 'sample') # Plot g_hv_family <- g_comp_base + geom_col(data=hv_family_counts_display, position = "stack", width=1) + scale_y_continuous(name="% HV Reads", limits=c(0,1.01), breaks = seq(0,1,0.2), expand=c(0,0), labels = function(y) y*100) + scale_fill_brewer(palette = "Set1", name = "Viral family") + labs(title="Family composition of human-viral reads") + guides(fill=guide_legend(ncol=4)) + theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain")) + facet_wrap(~ location, scales = "free_x", ncol = 2) g_hv_family # Get most prominent families for text hv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% summarize(n_reads_tot = sum(n_reads_hv), p_reads_max = max(p_reads_hv), .groups="drop") %>% arrange(desc(n_reads_tot)) display_hv_family_collate <- hv_family_collate %>% select(name, n_reads_tot) %>% rename(`Family` = name, `Total number of reads` = n_reads_tot) display_hv_family_collate ``` ## Analyzing specific families We now investigate the composition of specific families that had more than 5 viral reads. In investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 10% of human-viral reads: ```{r} #| label: plot-viral-family-scripts #| include: false library_with_total <- libraries %>% group_by(location) %>% mutate(total_samples = n_distinct(sample)) %>% ungroup() plot_viral_family_composition <- function(taxid_chosen, threshold_major_species = 0.1) { # Get set of chosen family reads family_samples <- hv_family_counts %>% filter(taxid == taxid_chosen) %>% filter(p_reads_hv >= 0.1) %>% pull(sample) family_ids <- hv_reads_family %>% filter(taxid == taxid_chosen, sample %in% family_samples) %>% pull(seq_id) # Count reads for each species in the chosen family species_counts <- hv_reads_species %>% filter(seq_id %in% family_ids) %>% group_by(sample, location, name, taxid) %>% count(name = "n_reads_hv") %>% group_by(sample, location) %>% mutate(p_reads_family = n_reads_hv/sum(n_reads_hv)) # Identify high-ranking species and group others species_major_tab <- species_counts %>% group_by(name) %>% filter(p_reads_family == max(p_reads_family)) %>% filter(row_number() == 1) %>% arrange(desc(p_reads_family)) %>% filter(p_reads_family > threshold_major_species) species_counts_major <- species_counts %>% mutate(name_display = ifelse(name %in% species_major_tab$name, name, "Other")) %>% group_by(sample, location, name_display) %>% summarize(n_reads_family = sum(n_reads_hv), p_reads_family = sum(p_reads_family), .groups="drop") %>% mutate(name_display = factor(name_display, levels = c(species_major_tab$name, "Other"))) species_counts_display <- species_counts_major %>% rename(p_reads = p_reads_family, classification = name_display) %>% right_join(library_with_total, by = c('sample', 'location')) %>% #mutate(p_reads = coalesce(p_reads, 0), # classification = coalesce(classification, species_major_tab$name[1])) %>% mutate(location = factor(location)) #group_by(location) %>% #mutate(samples_with_reads = sum(p_reads > 0), # label = paste0(location, " (", sprintf("%.1f%%", 100 * samples_with_reads / total_samples), ")")) %>% #ungroup() # Get family name for the title family_name <- viral_taxa %>% filter(taxid == taxid_chosen) %>% pull(name) species_count <- species_counts_display %>% select(classification) %>% distinct() %>% pull() %>% length() if (species_count <= 8) { pal <- brewer.pal(name = "Accent", n = species_count) } else { pal <- c(brewer.pal(name = "Paired", n = 12), brewer.pal(name = "Set3", n =12), brewer.pal(name = "Set1", n = 9))[1:species_count] } # Plot g_species <- g_comp_base + geom_col(data=species_counts_display, position = "stack") + scale_y_continuous(name=paste0("% ", family_name, " reads"), limits=c(0,1.01), breaks = seq(0,1,0.2), expand=c(0,0), labels = function(y) y*100) + scale_fill_manual(values = pal, name = "Viral species") + labs(title=paste0("Species composition of ", family_name, " reads")) + facet_wrap(~ location, scales = "free_x", ncol = 2) + guides(fill=guide_legend(ncol=3)) + theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain")) return(g_species) } plot_viral_family_histogram <- function(taxid_chosen) { # Filter data for the specified virus family family_data <- hv_family_counts %>% right_join(library_with_total, by = c('sample')) %>% ungroup() %>% mutate(n_reads_hv = coalesce(n_reads_hv, 0)) %>% filter(taxid == taxid_chosen) %>% group_by(location) %>% summarize(n_reads_hv = sum(n_reads_hv), .groups="drop") # Get family name for the title family_name <- viral_taxa %>% filter(taxid == taxid_chosen) %>% pull(name) # Create the histogram g_histogram <- ggplot(family_data, aes(x = location, y = n_reads_hv)) + geom_bar(stat = "identity") + geom_text(aes(label = scales::comma(n_reads_hv)), vjust = -0.5, size = 3) + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs( title = paste("Distribution of", family_name, "Reads by Location"), x = "Location", y = "Number of reads" ) + scale_y_continuous(labels = scales::comma) + expand_limits(y = max(family_data$n_reads_hv) * 1.1)+ theme_kit # Expand y-axis to make room for labels return(g_histogram) } ``` ### Hepadnaviridae (Number of reads: 339,427) ```{r} #| label: plot-hepadnaviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Hepadnaviridae plot_viral_family_histogram(taxid_chosen=10404) ``` ```{r} #| label: plot-hepadnaviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false # Hepadnaviridae plot_viral_family_composition(taxid_chosen=10404, threshold_major_species = 0.1) ``` ### Parvoviridae (Number of reads: 17,401) ```{r} #| label: plot-parvoviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Parvoviridae plot_viral_family_histogram(taxid_chosen=10780) ``` ```{r} #| label: plot-parvoviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false # Parvoviridae plot_viral_family_composition(taxid_chosen=10780, threshold_major_species = 0.1) ``` ### Anelloviridae (Number of reads: 3001) ```{r} #| label: plot-anelloviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Anelloviridae plot_viral_family_histogram(taxid_chosen=687329) ``` ```{r} #| label: plot-anelloviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false # Anelloviridae plot_viral_family_composition(taxid_chosen=687329, threshold_major_species = 0.1) ``` ### Orthoherpesviridae (Number of reads: 298) ```{r} #| label: plot-orthoherpesviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Orthoherpesviridae plot_viral_family_histogram(taxid_chosen=3044472) ``` ```{r} #| label: plot-orthoherpesviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false # Orthoherpesviridae plot_viral_family_composition(taxid_chosen=3044472, threshold_major_species = 0.1) ``` ### Retroviridae (Number of reads: 115) ```{r} #| label: plot-retroviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Retroviridae plot_viral_family_histogram(taxid_chosen=11632) ``` ```{r} #| label: plot-retroviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false # Retroviridae plot_viral_family_composition(taxid_chosen=11632, threshold_major_species = 0.1) ``` ### Microviridae (Number of reads: 61) ```{r} #| label: plot-microviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Microviridae plot_viral_family_histogram(taxid_chosen=10841) ``` ```{r} #| label: plot-microviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false # Microviridae plot_viral_family_composition(taxid_chosen=10841, threshold_major_species = 0.1) ``` ### Adenoviridae (Number of reads: 28) ```{r} #| label: plot-adenoviridae-histogram #| fig-width: 10 #| fig-height: 2.5 # Adenoviridae plot_viral_family_histogram(taxid_chosen=10508) ``` ```{r} #| label: plot-adenoviridae-composition #| fig-width: 10 #| fig-height: 7.5 #| warning: false #| # Adenoviridae plot_viral_family_composition(taxid_chosen=10508, threshold_major_species = 0.1) ``` ## Relative abundance of pathogenic viruses of interest ```{r} #| label: hv-composition-exclusion #| include: false total <- basic_stats_raw %>% select(sample, n_read_pairs) hv_family_counts <- hv_reads_family %>% group_by(sample, name, taxid) %>% count(name = "n_reads_hv") %>% inner_join(total, by='sample') %>% group_by(sample) %>% mutate(ra_reads_hv = n_reads_hv/n_read_pairs) hv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% summarize(n_reads_tot = sum(n_reads_hv), ra_reads_tot = mean(ra_reads_hv), .groups="drop") %>% arrange(desc(n_reads_tot)) hv_family_collate <- hv_family_collate %>% rename( 'family' = 'name', 'family_n_reads_tot' = 'n_reads_tot', 'family_ra_reads_tot' = 'ra_reads_tot', ) hv_genus_counts <- hv_reads_genus %>% group_by(sample, name, taxid) %>% count(name = "n_reads_hv") %>% inner_join(total, by='sample') %>% group_by(sample) %>% mutate(ra_reads_hv = n_reads_hv/n_read_pairs) hv_genus_collate <- hv_genus_counts %>% group_by(name, taxid) %>% summarize(n_reads_tot = sum(n_reads_hv), ra_reads_tot = mean(ra_reads_hv), .groups="drop") %>% arrange(desc(n_reads_tot)) hv_genus_collate <- hv_genus_collate %>% rename( 'genus' = 'name', 'genus_n_reads_tot' = 'n_reads_tot', 'genus_ra_reads_tot' = 'ra_reads_tot', ) hv_species_counts <- hv_reads_species %>% group_by(sample, name, taxid) %>% count(name = "n_reads_hv") %>% inner_join(total, by='sample') %>% group_by(sample) %>% mutate(ra_reads_hv = n_reads_hv/n_read_pairs) hv_species_collate <- hv_species_counts %>% group_by(name, taxid) %>% summarize(n_reads_tot = sum(n_reads_hv), ra_reads_tot = mean(ra_reads_hv), .groups="drop") %>% arrange(desc(n_reads_tot)) hv_species_collate <- hv_species_collate %>% rename( 'species' = 'name', 'species_n_reads_tot' = 'n_reads_tot', 'species_ra_reads_tot' = 'ra_reads_tot', ) ``` ```{r} #| label: hv-filter-exclusion #| include: false # Assuming the data is already loaded into tibbles: #hv_family_collate #hv_genus_collate #hv_species_collate #viral_taxa # Function to find parent taxa recursively find_parent_taxa <- function(taxid, target_ranks, taxa_data) { current_taxid <- taxid result <- setNames(vector("list", length(target_ranks)), target_ranks) while (TRUE) { row <- taxa_data %>% filter(taxid == current_taxid) %>% slice(1) if (nrow(row) == 0) break # If no matching taxid is found, break the loop if (row$rank %in% target_ranks) { result[[row$rank]] <- list(taxid = row$taxid, name = row$name) if (all(!sapply(result, is.null))) { break # If all target ranks are found, break the loop } } if (row$taxid == row$parent_taxid) { break # Avoid infinite loop at root } current_taxid <- row$parent_taxid } # Replace NULL with NA for any unfound ranks result <- lapply(result, function(x) if (is.null(x)) list(taxid = NA, name = NA) else x) return(result) } # Find all relevant parent taxa for each species result <- hv_species_collate %>% mutate( parent_info = map(taxid, ~find_parent_taxa(.x, c("species","subgenus", "genus", "subfamily", "family"), viral_taxa)), genus_taxid = map_dbl(parent_info, ~.x$genus$taxid), genus = map_chr(parent_info, ~.x$genus$name), family_taxid = map_dbl(parent_info, ~.x$family$taxid), family = map_chr(parent_info, ~.x$family$name) ) %>% select(-parent_info) # Join with genus counts result <- result %>% left_join( hv_genus_collate %>% select(taxid, genus_n_reads_tot, genus_ra_reads_tot), by = c("genus_taxid" = "taxid") ) # Join with family counts result <- result %>% left_join( hv_family_collate %>% select(taxid, family_n_reads_tot, family_ra_reads_tot), by = c("family_taxid" = "taxid") ) # Clean up result <- result %>% select(-ends_with("_taxid")) #result ``` Each dot represents a sample, colored by viral family. The x-axis shows the relative abundance of human-infecting viruses, and the y-axis shows the species. ```{r} #| label: plot-pathogenic-viruses #| fig-width: 10 #| fig-height: 10 #| warning: false species_family <- result %>% select(species, family) %>% rename('name' = 'species') play <- hv_species_counts %>% ungroup() %>% inner_join(libraries, by = 'sample') %>% inner_join(species_family, by = 'name') %>% mutate(name = ifelse(name == "Severe acute respiratory syndrome-related coronavirus", "SARS-related coronavirus", name)) %>% filter(!(family %in% c("Anelloviridae", NA, "Microviridae", "Rhabdoviridae"))) adjusted_play <- play %>% group_by(name) %>% mutate(virus_prevalence_num = n_distinct(sample)/n_distinct(libraries), total_reads_hv = sum(n_reads_hv)) %>% ungroup() %>% mutate(name = fct_reorder(name, virus_prevalence_num, .desc=TRUE)) %>% select(name, ra_reads_hv, family, location, virus_prevalence_num, total_reads_hv, n_reads_hv) pal <- c(brewer.pal(8, 'Dark2'),brewer.pal(8, 'Accent')) #, labels = label_log(digits=2) ra_dot <- ggplot(adjusted_play, aes(x = ra_reads_hv, y=name)) + geom_quasirandom(orientation = 'y', aes(color = family)) + scale_color_manual(values = pal) + scale_x_log10( "Relative abundance human virus reads", labels=label_log(digits=3) ) + labs(y = "", color = 'Viral family') + guides(color = guide_legend(ncol=2)) + theme_light() + theme( axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12), axis.ticks.y = element_blank(),axis.line = element_line(colour = "black"), axis.title.x = element_text(size = 14), legend.text = element_text(size = 10), legend.title = element_text(size = 10, face="bold"), legend.position = 'bottom', #legend.position = c(1, 1), # Move legend to top right #legend.justification = c(1, 1), # Align legend to top right panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()) ra_dot # geom_text(aes(label = total_reads_hv, y = name), # x = 1e-9, hjust = 0, vjust = 0.5, size = 3, # check_overlap = TRUE) ``` We can ask Claude to analyze the pathogenicity of these viruses along with their popular names. | Scientific Name | Popular/Well-known Name | Pathogenicity to Humans | |-----------------|-------------------------|-------------------------| | Human immunodeficiency virus 1 | HIV-1 | Very High | | SARS-related coronavirus | SARS-CoV | High | | Hepatitis B virus | HBV | High | | Hepacivirus hominis | Hepatitis C virus (HCV) | High | | Simplexvirus humanalpha1 | Herpes simplex virus 1 (HSV-1) | Moderate to High | | Cytomegalovirus humanbeta5 | Human cytomegalovirus (HCMV) | Moderate to High | | Lymphocryptovirus humangamma4 | Epstein-Barr virus (EBV) | Moderate | | Human mastadenovirus C | Adenovirus C | Moderate | | Human mastadenovirus F | Adenovirus F | Moderate | | Roseolovirus humanbeta6a | Human herpesvirus 6A (HHV-6A) | Low to Moderate | | Roseolovirus humanbeta6b | Human herpesvirus 6B (HHV-6B) | Low to Moderate | | Roseolovirus humanbeta7 | Human herpesvirus 7 (HHV-7) | Low to Moderate | | Rhadinovirus humangamma8 | Kaposi's sarcoma-associated herpesvirus (KSHV) | Low to Moderate | | Molluscum contagiosum virus | MCV | Low | | Erythroparvovirus primate1 | Parvovirus B19 | Low | | Human erythrovirus V9 | Erythrovirus V9 | Low | | Alphapapillomavirus 4 | Human papillomavirus 4 (HPV-4) | Low | | Betapolyomavirus hominis | Human polyomavirus | Low | | Dependoparvovirus primate1 | Adeno-associated virus (AAV) | Very Low | | Murine leukemia virus | MLV | Very Low (not a human pathogen) | | Murine leukemia-related retroviruses | MLV-related viruses | Very Low (not typical human pathogens) | We get a variety of pathogenic viruses, none of which are a suprise to see in plasma (other than SARS-CoV which we determined was contamination [on twist](https://twist.com/a/197793/ch/619193/t/6323010/)). ## Relative abundance assuming perfect human read removal ```{r} #| label: perfect-hv-removal #| include: false human_reads <- comp_assigned %>% filter(classification %in% c("Human")) %>% select(sample, human_reads=n_reads) all <- hv_family_counts %>% inner_join(human_reads, by='sample') %>% group_by(sample) %>% summarize(sum_reads=sum(n_reads_hv), n_read_pairs=first(n_read_pairs) - first(human_reads)) %>% mutate(ra = sum_reads/n_read_pairs) %>% pull(ra) %>% mean() all ``` Assuming we're able to perfectly remove all human reads, the average relative abundance of known human infecting virus is $5.37 \times 10^{-4}$. # Conclusion There were some interesting takeways from this analysis: 1. Human read fraction remains quite high in plasma if special sample preparation/library preparation is not taken (see [here](https://pubmed.ncbi.nlm.nih.gov/33767340/)). 2. There is a lot of Anelloviridae in blood, including that which has not been classified by species. 3. Pathogenic viruses such as HIV and HBV, and even latent viruses like Herpes are picked up by plasma MGS 4. Respiratory viruses are also picked up by plasma MGS (SARS), but not as much as other viruses. *We do note that we've analyzed other datasets and it seems that having a large catchment is crucial, as we'd expect, for picking up these more rare, pathogenic viruses*. Using this data, we can also look at China's HIV/HBV reported incidence to do a P2RA/swab sampling type analysis. We plan to do more analysis on this data, but wanted to share some preliminary results. # Appendix ## Human-infecting virus families, genera, and species To get a good overview of families, genera, and species, we can look at a Sankey plot where the magnitude of relative abundance, averaged over all samples, is shown in parentheses. ```{r} #| label: hv-sankey #| fig-width: 10 #| fig-height: 22.5 #| warning: false # Function to create links create_links <- function(data) { family_to_genus <- data %>% filter(!is.na(genus)) %>% group_by(family, genus) %>% summarise(value = n(), .groups = "drop") %>% mutate(source = family, target = genus) genus_to_species <- data %>% group_by(genus, species) %>% summarise(value = n(), .groups = "drop") %>% mutate(source = genus, target = species) family_to_species <- data %>% filter(is.na(genus)) %>% group_by(family, species) %>% summarise(value = n(), .groups = "drop") %>% mutate(source = family, target = species) bind_rows(family_to_genus, genus_to_species, family_to_species) %>% filter(!is.na(source)) } # Function to create nodes create_nodes <- function(links) { data.frame( name = c(links$source, links$target) %>% unique() ) } # Function to prepare data for Sankey diagram prepare_sankey_data <- function(links, nodes) { links$IDsource <- match(links$source, nodes$name) - 1 links$IDtarget <- match(links$target, nodes$name) - 1 list(links = links, nodes = nodes) } # Function to create Sankey plot create_sankey_plot <- function(sankey_data) { sankeyNetwork( Links = sankey_data$links, Nodes = sankey_data$nodes, Source = "IDsource", Target = "IDtarget", Value = "value", NodeID = "name", sinksRight = TRUE, nodeWidth = 25, fontSize = 14, ) } save_sankey_as_png <- function(sankey_plot, width = 1000, height = 800) { # Save the plot as an HTML file saveWidget(sankey_plot, sprintf('%s/sankey.html',data_dir)) } format_scientific <- function(x, digits=2) { sapply(x, function(val) { if (is.na(val) || abs(val) < 1e-15) { return("0") } else { exponent <- floor(log10(abs(val))) coef <- val / 10^exponent #return(sprintf("%.1f × 10^%d", round(coef, digits), exponent)) # May or may not be smart, just keeping magnitude return(sprintf("10^%d", exponent)) } }) } data <- result %>% mutate(across(c(genus_n_reads_tot, genus_ra_reads_tot), ~replace_na(., 0)), genus = ifelse(is.na(genus), "Unknown Genus", genus)) %>% mutate( species = paste0(species, sprintf(' (%s)', format_scientific(species_ra_reads_tot))), genus = paste0(genus, sprintf(' (%s)', format_scientific(genus_ra_reads_tot))), family = paste0(family, sprintf(' (%s)', format_scientific(family_ra_reads_tot))) ) links <- as.data.frame(create_links(data)) nodes <- create_nodes(links) sankey_data <- prepare_sankey_data(links, nodes) sankey <- create_sankey_plot(sankey_data) sankey ``` ```{r} #| label: hv-species-dot #| fig-height: 10 #| fig-width: 15 #| warning: false #| include: false species_family <- result %>% select(species, family) %>% rename('name' = 'species') play <- hv_species_counts %>% ungroup() %>% inner_join(libraries, by = 'sample') %>% inner_join(species_family, by = 'name') %>% mutate(name = ifelse(name == "Severe acute respiratory syndrome-related coronavirus", "SARS-related coronavirus", name)) adjusted_play <- play %>% group_by(name) %>% mutate(virus_prevalence_num = n_distinct(sample)/n_distinct(libraries), total_reads_hv = sum(n_reads_hv)) %>% ungroup() %>% mutate(name = fct_reorder(name, virus_prevalence_num, .desc=TRUE)) %>% select(name, ra_reads_hv, family, location, virus_prevalence_num, total_reads_hv, n_reads_hv) pal <- c(brewer.pal(8, 'Dark2'),brewer.pal(8, 'Accent')) #, labels = label_log(digits=2) ra_dot <- ggplot(adjusted_play, aes(x = ra_reads_hv, y=name)) + geom_quasirandom(orientation = 'y', aes(color = family)) + scale_color_manual(values = pal) + scale_x_log10( "Relative abundance human virus reads", labels=label_log(digits=3) ) + labs(y = "", color = 'Viral family') + guides(color = guide_legend(ncol=2)) + theme_light() + theme( axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12), axis.ticks.y = element_blank(),axis.line = element_line(colour = "black"), axis.title.x = element_text(size = 14), legend.text = element_text(size = 10), legend.title = element_text(size = 10, face="bold"), legend.position = 'bottom', #legend.position = c(1, 1), # Move legend to top right #legend.justification = c(1, 1), # Align legend to top right panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()) # geom_text(aes(label = total_reads_hv, y = name), # x = 1e-9, hjust = 0, vjust = 0.5, size = 3, # check_overlap = TRUE) prev_play <- adjusted_play %>% group_by(name, location) %>% reframe(location_of_reads = n_reads_hv/total_reads_hv, total_reads_hv = first(total_reads_hv)) prevalence_bar <- ggplot(prev_play, aes(x = location_of_reads, y = name, fill = location)) + geom_bar(stat = "identity", position = "stack") + scale_fill_brewer(palette = "Accent") + scale_x_continuous( "Relative composition of locations with reads", labels = scales::percent_format(scale = 100, accuracy = 1), limits = c(0, 1) ) + labs(y = '', fill = 'Location') + guides(fill = guide_legend(ncol=1)) + theme_light() + theme( axis.text.y = element_blank(), axis.text.x = element_text(size = 12), axis.ticks.y = element_blank(), axis.line = element_line(colour = "black"), axis.title.x = element_text(size = 14), axis.title.y = element_blank(), legend.text = element_text(size = 10), legend.title = element_text(size = 10, face = "bold"), legend.position = 'bottom', #legend.position = c(1, 1), # Move legend to top right #legend.justification = c(1, 1), # Align legend to top right panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()) + geom_text(aes(label = total_reads_hv, y = name), x = 1e-9, hjust = 0, vjust = 0.5, size = 3, check_overlap = TRUE, fontface = "bold") # Combine the plots using ggarrange combined_plot <- ggarrange(ra_dot, prevalence_bar, ncol = 2, widths = c(2, 1), common.legend = FALSE, align = "h") combined_plot ```