Testing the single-read version of mgs-workflow’s PROFILE subworkflow

Author

Simon Grimm

Published

November 7, 2024

I’m adapting mgs-workflow to take in single-read sequencing data. Here I’m checking if: i) the output of the single-read PROFILE workflow looks as expected, and ii) if the output of the paired-end version is the same as the output of the test run on dev.

As in my previous post, the dataset the workflows are run on is based on Project Runway samples, with library prep performed at the BMC followed by Element AVITI sequencing. Will previously analyzed this test dataset here, using an early version of mgs-workflow.

Looking at Kingdom-level composition in both Kraken and Bracken results, the single-read output looks as expected, i.e., similar to the paired-end output. The paired-end run looks the same as when using the original run workflow, so addition of single-read functionality doesn’t impede mgs-workflow runs on paired-end data.

I will amend the HV subworkflow of mgs-workflow next.

Introduction

The single-end dataset simply consists of the forward reads of our usual test dataset. Additionally, I work with the paired-end test dataset. Both of these are run through run_dev_se.nf, a shorter version of run.nf that only includes RAW, CLEAN, and PROFILE. run_dev_se.nf contains a read_type parameter, which is set to single_end for the single-end run, and paired_end for the paired-end run. Finally, the paired-end run is compared to the output of a normal test run using run.nf.

Code
single_read_dir = "~/code/simons-notebook/posts/2024-10-28-mgs-taxonomy-eval/mgs-results/test_single_read"
paired_read_dir = "~/code/simons-notebook/posts/2024-10-28-mgs-taxonomy-eval/mgs-results/test_paired_end"
test_dir = "~/code/simons-notebook/posts/2024-10-28-mgs-taxonomy-eval/mgs-results/test"

se_output_dir = os.path.join(single_read_dir, "output")
pe_output_dir = os.path.join(paired_read_dir, "output")
test_output_dir = os.path.join(test_dir, "output")

se_results_dir = os.path.join(se_output_dir, "results")
pe_results_dir = os.path.join(pe_output_dir, "results")
test_results_dir = os.path.join(test_output_dir, "results")

se_bracken_path = os.path.join(se_results_dir, "bracken_reports_merged.tsv.gz")
pe_bracken_path = os.path.join(pe_results_dir, "bracken_reports_merged.tsv.gz")
test_bracken_path = os.path.join(test_results_dir, "bracken_reports_merged.tsv.gz")

se_kraken_path = os.path.join(se_results_dir, "kraken_reports_merged.tsv.gz")
pe_kraken_path = os.path.join(pe_results_dir, "kraken_reports_merged.tsv.gz")
test_kraken_path = os.path.join(test_results_dir, "kraken_reports_merged.tsv.gz")

Assessing Kraken output files

First off, we can check if the high-level taxonomy statistics are the same between the single-read, paired-end, and default test runs. For this we use the kraken_reports_merged.tsv.gz files, which gives a detailed taxonomic breakdown.

Only looking at the kingdom-level, the single-read and paired-end read results look nearly identical, which is what we’d expect. As seen in the previous notebook, in the single-read run more total reads survive FASTP cleaning, leading to a higher number of reads that end up getting classified.

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Calculate percentages
cols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)',
                     'Viruses (Ribosomal)', 'Viruses (Non-ribosomal)',
                     'Archaea (Ribosomal)', 'Archaea (Non-ribosomal)',
                     'Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)',
                     'Unclassified (Ribosomal)', 'Unclassified (Non-ribosomal)']

# Create the horizontal stacked bar plot
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 8), dpi=300)  #

# Plot stacked bars horizontally
for ax, df, title in zip([ax1, ax2, ax3], [se_kraken_plot_df, pe_kraken_plot_df, test_kraken_plot_df], ["single-end", "paired-end", "normal run"]):
    df_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) * 100

    left = np.zeros(len(df))
    colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
    for i, column in enumerate(cols_to_normalize):
        ax.barh(df['sample'], df_pct[column], left=left,
                label=column.capitalize(), color=colors[i])
        left += df_pct[column]
    for i, row in df.iterrows():
        ax.text(101, i, f'{int(row["Total Reads"]):,}',
                va='center', ha='left')

    # Customize the plot
    # ax.set_yticks(rotation=0)  # No rotation needed for horizontal bars

    ax.set_title(title)
    ax.set_xlim(0, 100)

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Add some padding to the right for the legend
ax3.set_xlabel('Percentage (%)')
ax1.text(101, 9, 'Number of\nreads', va='center', ha='left')
ax3.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)
plt.tight_layout()
Figure 2.1: Kingdom-level read distribution (Kraken)

Assessing Bracken output files

The other output of the PROFILE subworkflow is the bracken_reports_merged.tsv.gz file, which gives Bracken-summarized and corrected kingdom-level counts, based on the Kraken output. Again, single-read results look very similar to paired-end results. The only difference is a slightly higher share of non-ribosomal bacterial reads in the single-read run. I am uncertain to why that is, maybe the reads that FASTP removed in the paired-end run were enriched for non-ribosomal bacterial reads?

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Calculate percentages
cols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)',
                     'Viruses (Ribosomal)', 'Viruses (Non-ribosomal)',
                     'Archaea (Ribosomal)', 'Archaea (Non-ribosomal)',
                     'Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)']

# Create the horizontal stacked bar plot
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 8), dpi=300)  #

# Plot stacked bars horizontally
for ax, df, title in zip([ax1, ax2, ax3], [se_bracken_plot_df, pe_bracken_plot_df, test_bracken_plot_df], ["single-end", "paired-end", "normal run"]):
    df_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) * 100

    left = np.zeros(len(df))
    colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
    for i, column in enumerate(cols_to_normalize):
        ax.barh(df['sample'], df_pct[column], left=left,
                label=column.capitalize(), color=colors[i])
        left += df_pct[column]
    for i, row in df.iterrows():
        ax.text(101, i, f'{int(row["Total Assigned Reads"]):,}',
                va='center', ha='left')

    # Customize the plot
    # ax.set_yticks(rotation=0)  # No rotation needed for horizontal bars

    ax.set_title(title)
    ax.set_xlim(0, 100)

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Add some padding to the right for the legend
ax3.set_xlabel('Percentage (%)')
ax1.text(101, 9, 'Number of\nAssigned\nreads', va='center', ha='left')
ax3.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)
plt.tight_layout()
Figure 3.1: Kingdom-level read distribution (Bracken)

Next steps

Overall, the single-read version of the PROFILE workflow seems to work as expected. The next step is to adapt the HV subworkflow.