I’m adapting mgs-workflow to take in ONT sequencing data. Here I’m checking if: i) the output of the ONT PROFILE workflow looks as expected.
TO EDIT: Looking at Kingdom-level composition in both Kraken and Bracken results, the single-read output looks as expected, i.e., similar to the paired-end output. The paired-end run looks the same as when using the original run workflow, so addition of single-read functionality doesn’t impede mgs-workflow runs on paired-end data.
First off, we can check if the high-level taxonomy statistics are the same between the single-read, paired-end, and default test runs. For this we use the kraken_reports_merged.tsv.gz files, which gives a detailed taxonomic breakdown.
Only looking at the kingdom-level, the single-read and paired-end read results look nearly identical, which is what we’d expect. As seen in the previous notebook, in the single-read run more total reads survive FASTP cleaning, leading to a higher number of reads that end up getting classified.
Code
import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)','Unclassified (Ribosomal)', 'Unclassified (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300) #df = test_kraken_plot_dfdf_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) *100left = np.zeros(len(df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]for i, row in df.iterrows(): ax.text(101, i, f'{int(row["Total Reads"]):,}', va='center', ha='left')# Customize the plot# ax.set_yticks(rotation=0) # No rotation needed for horizontal bars# ax.set_title(title)ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)# Add some padding to the right for the legendax.text(101, 0.5, 'Number of\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)plt.tight_layout()
/var/folders/gm/txqg8t5d57z34sqfjcvpcjj00000gn/T/ipykernel_17207/482444413.py:42: UserWarning:
Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
Figure 2.1: Kingdom-level read distribution (Kraken)
Assessing Bracken output files
The other output of the PROFILE subworkflow is the bracken_reports_merged.tsv.gz file, which gives Bracken-summarized and corrected kingdom-level counts, based on the Kraken output. Again, single-read results look very similar to paired-end results. The only difference is a slightly higher share of non-ribosomal bacterial reads in the single-read run. I am uncertain to why that is, maybe the reads that FASTP removed in the paired-end run were enriched for non-ribosomal bacterial reads?
Code
import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300)# Calculate percentagesdf_pct = test_bracken_plot_df[cols_to_normalize].div(test_bracken_plot_df[cols_to_normalize].sum(axis=1), axis=0) *100# Plot stacked bars horizontallyleft = np.zeros(len(test_bracken_plot_df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(test_bracken_plot_df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]# Add total read counts as textfor i, row in test_bracken_plot_df.iterrows(): ax.text(101, i, f'{int(row["Total Assigned Reads"]):,}', va='center', ha='left')ax.set_xlabel('Percentage (%)')ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)ax.text(101, len(test_bracken_plot_df)-1, 'Number of\nAssigned\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.2), loc='center', ncol=3)plt.tight_layout()
Figure 3.1: Kingdom-level read distribution (Bracken)
Next steps
Overall, the single-read version of the PROFILE workflow seems to work as expected. The next step is to adapt the HV subworkflow.
Source Code
---title: "Testing the ONT version of mgs-workflow's PROFILE subworkflow"author: "Simon Grimm"date: 2024-12-14format: html: code-fold: true code-tools: true code-link: true df-print: paged toc: true toc-depth: 2 cap-location: bottom fig-format: svg crossref: fig-title: Figure fig-prefix: Figure chapters: truejupyter: venvtitle-block-banner: "#5cb2a0"---I'm adapting mgs-workflow to take in ONT sequencing data. Here I'm checking if: i) the output of the ONT PROFILE workflow looks as expected.TO EDIT: Looking at Kingdom-level composition in both Kraken and Bracken results, the single-read output looks as expected, i.e., similar to the paired-end output. The paired-end run looks the same as when using the original run workflow, so addition of single-read functionality doesn't impede mgs-workflow runs on paired-end data. - Show unclassified reads# Introduction```{python}#| label: load-packages#| include: falseimport osimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as sns``````{python}#| label: data-pathstest_dir ="~/code/simons-notebook/posts/2024-12-14-ont-taxonomy-eval/mgs-results/test-ont-ww"test_output_dir = os.path.join(test_dir, "output")test_results_dir = os.path.join(test_output_dir, "results")test_bracken_path = os.path.join(test_results_dir, "bracken_reports_merged.tsv.gz")test_kraken_path = os.path.join(test_results_dir, "kraken_reports_merged.tsv.gz")```# Assessing Kraken output filesFirst off, we can check if the high-level taxonomy statistics are the same between the single-read, paired-end, and default test runs. For this we use the `kraken_reports_merged.tsv.gz` files, which gives a detailed taxonomic breakdown.Only looking at the kingdom-level, the single-read and paired-end read results look nearly identical, which is what we'd expect. As seen in the previous notebook, in the single-read run more total reads survive FASTP cleaning, leading to a higher number of reads that end up getting classified.```{python}#| label: load-kraken-output#| include: false#| echo: falsetest_kraken = pd.read_csv(test_kraken_path, sep='\t')# print(se_kraken.head())# print(pe_kraken.head())print(test_kraken.head())``````{python}#| label: generate-table#| include: falsedata = []tax_ids = {"Bacteria": 2,"Viruses": 10239,"Archaea": 2157,"Eukaryota": 2759,"Unclassified": 0,}test_kraken_df = pd.read_csv(test_kraken_path, sep='\t')def process_df(df): data = []for sample, sample_data in df.groupby("sample"):# Process ribosomal reads ribo_df = sample_data[sample_data["ribosomal"] ==True] n_reads_bacteria_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Bacteria"]]["n_reads_clade"].sum() n_reads_virus_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Viruses"]]["n_reads_clade"].sum() n_reads_archea_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Archaea"]]["n_reads_clade"].sum() n_reads_eukaryota_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Eukaryota"]]["n_reads_clade"].sum() n_reads_unclassified_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Unclassified"]]["n_reads_clade"].sum()# Process non-ribosomal reads nonribo_df = sample_data[sample_data["ribosomal"] ==False] n_reads_bacteria_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Bacteria"]]["n_reads_clade"].sum() n_reads_virus_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Viruses"]]["n_reads_clade"].sum() n_reads_archea_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Archaea"]]["n_reads_clade"].sum() n_reads_eukaryota_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Eukaryota"]]["n_reads_clade"].sum() n_reads_unclassified_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Unclassified"]]["n_reads_clade"].sum() total_reads = ( n_reads_bacteria_ribo + n_reads_bacteria_nonribo + n_reads_virus_ribo + n_reads_virus_nonribo + n_reads_archea_ribo + n_reads_archea_nonribo + n_reads_eukaryota_ribo + n_reads_eukaryota_nonribo + n_reads_unclassified_ribo + n_reads_unclassified_nonribo ) data.append([ sample, n_reads_bacteria_ribo, n_reads_bacteria_nonribo, n_reads_virus_ribo, n_reads_virus_nonribo, n_reads_archea_ribo, n_reads_archea_nonribo, n_reads_eukaryota_ribo, n_reads_eukaryota_nonribo, n_reads_unclassified_ribo, n_reads_unclassified_nonribo, total_reads ])return pd.DataFrame(data, columns=["sample","Bacteria (Ribosomal)", "Bacteria (Non-ribosomal)","Viruses (Ribosomal)", "Viruses (Non-ribosomal)","Archaea (Ribosomal)", "Archaea (Non-ribosomal)","Eukaryota (Ribosomal)", "Eukaryota (Non-ribosomal)","Unclassified (Ribosomal)", "Unclassified (Non-ribosomal)","Total Reads" ])test_kraken_plot_df = process_df(test_kraken_df)test_kraken_plot_df``````{python}#| label: fig-kraken-stats#| fig-cap: Kingdom-level read distribution (Kraken)import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)','Unclassified (Ribosomal)', 'Unclassified (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300) #df = test_kraken_plot_dfdf_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) *100left = np.zeros(len(df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]for i, row in df.iterrows(): ax.text(101, i, f'{int(row["Total Reads"]):,}', va='center', ha='left')# Customize the plot# ax.set_yticks(rotation=0) # No rotation needed for horizontal bars# ax.set_title(title)ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)# Add some padding to the right for the legendax.text(101, 0.5, 'Number of\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)plt.tight_layout()```# Assessing Bracken output filesThe other output of the PROFILE subworkflow is the `bracken_reports_merged.tsv.gz` file, which gives Bracken-summarized and corrected kingdom-level counts, based on the Kraken output. Again, single-read results look very similar to paired-end results. The only difference is a slightly higher share of non-ribosomal bacterial reads in the single-read run. I am uncertain to why that is, maybe the reads that FASTP removed in the paired-end run were enriched for non-ribosomal bacterial reads?```{python}#| label: load-bracken-output#| include: falsetest_bracken = pd.read_csv(test_bracken_path, sep='\t')``````{python}#| label: generate-bracken-table#| include: falsedef process_df(df): data = []for sample, sample_data in df.groupby("sample"):# Process ribosomal reads ribo_df = sample_data[sample_data["ribosomal"] ==True] n_reads_bacteria_ribo = ribo_df[ribo_df["name"] =="Bacteria"]["kraken_assigned_reads"].sum() n_reads_virus_ribo = ribo_df[ribo_df["name"] =="Viruses"]["kraken_assigned_reads"].sum() n_reads_archea_ribo = ribo_df[ribo_df["name"] =="Archaea"]["kraken_assigned_reads"].sum() n_reads_eukaryota_ribo = ribo_df[ribo_df["name"] =="Eukaryota"]["kraken_assigned_reads"].sum()# Process non-ribosomal reads nonribo_df = sample_data[sample_data["ribosomal"] ==False] n_reads_bacteria_nonribo = nonribo_df[nonribo_df["name"] =="Bacteria"]["kraken_assigned_reads"].sum() n_reads_virus_nonribo = nonribo_df[nonribo_df["name"] =="Viruses"]["kraken_assigned_reads"].sum() n_reads_archea_nonribo = nonribo_df[nonribo_df["name"] =="Archaea"]["kraken_assigned_reads"].sum() n_reads_eukaryota_nonribo = nonribo_df[nonribo_df["name"] =="Eukaryota"]["kraken_assigned_reads"].sum() total_assigned_reads = ( n_reads_bacteria_ribo + n_reads_bacteria_nonribo + n_reads_virus_ribo + n_reads_virus_nonribo + n_reads_archea_ribo + n_reads_archea_nonribo + n_reads_eukaryota_ribo + n_reads_eukaryota_nonribo ) data.append([ sample, n_reads_bacteria_ribo, n_reads_bacteria_nonribo, n_reads_virus_ribo, n_reads_virus_nonribo, n_reads_archea_ribo, n_reads_archea_nonribo, n_reads_eukaryota_ribo, n_reads_eukaryota_nonribo, total_assigned_reads ])return pd.DataFrame(data, columns=["sample","Bacteria (Ribosomal)", "Bacteria (Non-ribosomal)","Viruses (Ribosomal)", "Viruses (Non-ribosomal)","Archaea (Ribosomal)", "Archaea (Non-ribosomal)","Eukaryota (Ribosomal)", "Eukaryota (Non-ribosomal)","Total Assigned Reads" ])test_bracken_plot_df = process_df(test_bracken)``````{python}#| label: fig-bracken-stats#| fig-cap: Kingdom-level read distribution (Bracken)import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300)# Calculate percentagesdf_pct = test_bracken_plot_df[cols_to_normalize].div(test_bracken_plot_df[cols_to_normalize].sum(axis=1), axis=0) *100# Plot stacked bars horizontallyleft = np.zeros(len(test_bracken_plot_df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(test_bracken_plot_df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]# Add total read counts as textfor i, row in test_bracken_plot_df.iterrows(): ax.text(101, i, f'{int(row["Total Assigned Reads"]):,}', va='center', ha='left')ax.set_xlabel('Percentage (%)')ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)ax.text(101, len(test_bracken_plot_df)-1, 'Number of\nAssigned\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.2), loc='center', ncol=3)plt.tight_layout()```# Next stepsOverall, the single-read version of the PROFILE workflow seems to work as expected. The next step is to adapt the HV subworkflow.