Testing the ONT version of mgs-workflow’s PROFILE subworkflow

Author

Simon Grimm

Published

December 14, 2024

I’m adapting mgs-workflow to take in ONT sequencing data. Here I’m checking if: i) the output of the ONT PROFILE workflow looks as expected.

TO EDIT: Looking at Kingdom-level composition in both Kraken and Bracken results, the single-read output looks as expected, i.e., similar to the paired-end output. The paired-end run looks the same as when using the original run workflow, so addition of single-read functionality doesn’t impede mgs-workflow runs on paired-end data.

Introduction

Code
test_dir = "~/code/simons-notebook/posts/2024-12-14-ont-taxonomy-eval/mgs-results/test-ont-ww"

test_output_dir = os.path.join(test_dir, "output")

test_results_dir = os.path.join(test_output_dir, "results")

test_bracken_path = os.path.join(test_results_dir, "bracken_reports_merged.tsv.gz")

test_kraken_path = os.path.join(test_results_dir, "kraken_reports_merged.tsv.gz")

Assessing Kraken output files

First off, we can check if the high-level taxonomy statistics are the same between the single-read, paired-end, and default test runs. For this we use the kraken_reports_merged.tsv.gz files, which gives a detailed taxonomic breakdown.

Only looking at the kingdom-level, the single-read and paired-end read results look nearly identical, which is what we’d expect. As seen in the previous notebook, in the single-read run more total reads survive FASTP cleaning, leading to a higher number of reads that end up getting classified.

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Calculate percentages
cols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)',
                     'Viruses (Ribosomal)', 'Viruses (Non-ribosomal)',
                     'Archaea (Ribosomal)', 'Archaea (Non-ribosomal)',
                     'Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)',
                     'Unclassified (Ribosomal)', 'Unclassified (Non-ribosomal)']

# Create the horizontal stacked bar plot
fig, ax = plt.subplots(figsize=(10, 3), dpi=300)  #


df = test_kraken_plot_df

df_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) * 100

left = np.zeros(len(df))
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for i, column in enumerate(cols_to_normalize):
    ax.barh(df['sample'], df_pct[column], left=left,
            label=column.capitalize(), color=colors[i])
    left += df_pct[column]
for i, row in df.iterrows():
    ax.text(101, i, f'{int(row["Total Reads"]):,}',
            va='center', ha='left')

# Customize the plot
# ax.set_yticks(rotation=0)  # No rotation needed for horizontal bars

# ax.set_title(title)
ax.set_xlim(0, 100)


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Add some padding to the right for the legend
ax.text(101, 0.5, 'Number of\nreads', va='center', ha='left')
ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)
plt.tight_layout()
/var/folders/gm/txqg8t5d57z34sqfjcvpcjj00000gn/T/ipykernel_17207/482444413.py:42: UserWarning:

Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
Figure 2.1: Kingdom-level read distribution (Kraken)

Assessing Bracken output files

The other output of the PROFILE subworkflow is the bracken_reports_merged.tsv.gz file, which gives Bracken-summarized and corrected kingdom-level counts, based on the Kraken output. Again, single-read results look very similar to paired-end results. The only difference is a slightly higher share of non-ribosomal bacterial reads in the single-read run. I am uncertain to why that is, maybe the reads that FASTP removed in the paired-end run were enriched for non-ribosomal bacterial reads?

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Calculate percentages
cols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)',
                     'Viruses (Ribosomal)', 'Viruses (Non-ribosomal)',
                     'Archaea (Ribosomal)', 'Archaea (Non-ribosomal)',
                     'Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)']

# Create the horizontal stacked bar plot
fig, ax = plt.subplots(figsize=(10, 3), dpi=300)

# Calculate percentages
df_pct = test_bracken_plot_df[cols_to_normalize].div(test_bracken_plot_df[cols_to_normalize].sum(axis=1), axis=0) * 100

# Plot stacked bars horizontally
left = np.zeros(len(test_bracken_plot_df))
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for i, column in enumerate(cols_to_normalize):
    ax.barh(test_bracken_plot_df['sample'], df_pct[column], left=left,
            label=column.capitalize(), color=colors[i])
    left += df_pct[column]

# Add total read counts as text
for i, row in test_bracken_plot_df.iterrows():
    ax.text(101, i, f'{int(row["Total Assigned Reads"]):,}',
            va='center', ha='left')

ax.set_xlabel('Percentage (%)')
ax.set_xlim(0, 100)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.text(101, len(test_bracken_plot_df)-1, 'Number of\nAssigned\nreads', va='center', ha='left')
ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.2), loc='center', ncol=3)
plt.tight_layout()
Figure 3.1: Kingdom-level read distribution (Bracken)

Next steps

Overall, the single-read version of the PROFILE workflow seems to work as expected. The next step is to adapt the HV subworkflow.