I’m adapting mgs-workflow to take in ONT sequencing data. Here I’m checking if: i) the output of the ONT PROFILE workflow looks as expected.
TO EDIT: Looking at Kingdom-level composition in both Kraken and Bracken results, the single-read output looks as expected, i.e., similar to the paired-end output. The paired-end run looks the same as when using the original run workflow, so addition of single-read functionality doesn’t impede mgs-workflow runs on paired-end data.
First off, we can check if the high-level taxonomy statistics are the same between the single-read, paired-end, and default test runs. For this we use the kraken_reports_merged.tsv.gz files, which gives a detailed taxonomic breakdown.
Only looking at the kingdom-level, the single-read and paired-end read results look nearly identical, which is what we’d expect. As seen in the previous notebook, in the single-read run more total reads survive FASTP cleaning, leading to a higher number of reads that end up getting classified.
Code
import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)','Unclassified (Ribosomal)', 'Unclassified (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300) #df = test_kraken_plot_dfdf_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) *100left = np.zeros(len(df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]for i, row in df.iterrows(): ax.text(101, i, f'{int(row["Total Reads"]):,}', va='center', ha='left')# Customize the plot# ax.set_yticks(rotation=0) # No rotation needed for horizontal bars# ax.set_title(title)ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)# Add some padding to the right for the legendax.text(101, 0.5, 'Number of\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)plt.tight_layout()
/var/folders/gm/txqg8t5d57z34sqfjcvpcjj00000gn/T/ipykernel_17207/482444413.py:42: UserWarning:
Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
Assessing Bracken output files
The other output of the PROFILE subworkflow is the bracken_reports_merged.tsv.gz file, which gives Bracken-summarized and corrected kingdom-level counts, based on the Kraken output. Again, single-read results look very similar to paired-end results. The only difference is a slightly higher share of non-ribosomal bacterial reads in the single-read run. I am uncertain to why that is, maybe the reads that FASTP removed in the paired-end run were enriched for non-ribosomal bacterial reads?
Code
import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300)# Calculate percentagesdf_pct = test_bracken_plot_df[cols_to_normalize].div(test_bracken_plot_df[cols_to_normalize].sum(axis=1), axis=0) *100# Plot stacked bars horizontallyleft = np.zeros(len(test_bracken_plot_df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(test_bracken_plot_df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]# Add total read counts as textfor i, row in test_bracken_plot_df.iterrows(): ax.text(101, i, f'{int(row["Total Assigned Reads"]):,}', va='center', ha='left')ax.set_xlabel('Percentage (%)')ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)ax.text(101, len(test_bracken_plot_df)-1, 'Number of\nAssigned\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.2), loc='center', ncol=3)plt.tight_layout()
Next steps
Overall, the single-read version of the PROFILE workflow seems to work as expected. The next step is to adapt the HV subworkflow.
Source Code
---title: "Testing the ONT version of mgs-workflow's PROFILE subworkflow"author: "Simon Grimm"date: 2024-12-14format: html: code-fold: true code-tools: true code-link: true df-print: paged toc: true toc-depth: 2 cap-location: bottom fig-format: svg crossref: fig-title: Figure fig-prefix: Figure chapters: truejupyter: venvtitle-block-banner: "#5cb2a0"---I'm adapting mgs-workflow to take in ONT sequencing data. Here I'm checking if: i) the output of the ONT PROFILE workflow looks as expected.TO EDIT: Looking at Kingdom-level composition in both Kraken and Bracken results, the single-read output looks as expected, i.e., similar to the paired-end output. The paired-end run looks the same as when using the original run workflow, so addition of single-read functionality doesn't impede mgs-workflow runs on paired-end data. - Show unclassified reads# Introduction```{python}#| label: load-packages#| include: falseimport osimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as sns``````{python}#| label: data-pathstest_dir ="~/code/simons-notebook/posts/2024-12-14-ont-taxonomy-eval/mgs-results/test-ont-ww"test_output_dir = os.path.join(test_dir, "output")test_results_dir = os.path.join(test_output_dir, "results")test_bracken_path = os.path.join(test_results_dir, "bracken_reports_merged.tsv.gz")test_kraken_path = os.path.join(test_results_dir, "kraken_reports_merged.tsv.gz")```# Assessing Kraken output filesFirst off, we can check if the high-level taxonomy statistics are the same between the single-read, paired-end, and default test runs. For this we use the `kraken_reports_merged.tsv.gz` files, which gives a detailed taxonomic breakdown.Only looking at the kingdom-level, the single-read and paired-end read results look nearly identical, which is what we'd expect. As seen in the previous notebook, in the single-read run more total reads survive FASTP cleaning, leading to a higher number of reads that end up getting classified.```{python}#| label: load-kraken-output#| include: false#| echo: falsetest_kraken = pd.read_csv(test_kraken_path, sep='\t')# print(se_kraken.head())# print(pe_kraken.head())print(test_kraken.head())``````{python}#| label: generate-table#| include: falsedata = []tax_ids = {"Bacteria": 2,"Viruses": 10239,"Archaea": 2157,"Eukaryota": 2759,"Unclassified": 0,}test_kraken_df = pd.read_csv(test_kraken_path, sep='\t')def process_df(df): data = []for sample, sample_data in df.groupby("sample"):# Process ribosomal reads ribo_df = sample_data[sample_data["ribosomal"] ==True] n_reads_bacteria_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Bacteria"]]["n_reads_clade"].sum() n_reads_virus_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Viruses"]]["n_reads_clade"].sum() n_reads_archea_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Archaea"]]["n_reads_clade"].sum() n_reads_eukaryota_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Eukaryota"]]["n_reads_clade"].sum() n_reads_unclassified_ribo = ribo_df[ribo_df["taxid"] == tax_ids["Unclassified"]]["n_reads_clade"].sum()# Process non-ribosomal reads nonribo_df = sample_data[sample_data["ribosomal"] ==False] n_reads_bacteria_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Bacteria"]]["n_reads_clade"].sum() n_reads_virus_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Viruses"]]["n_reads_clade"].sum() n_reads_archea_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Archaea"]]["n_reads_clade"].sum() n_reads_eukaryota_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Eukaryota"]]["n_reads_clade"].sum() n_reads_unclassified_nonribo = nonribo_df[nonribo_df["taxid"] == tax_ids["Unclassified"]]["n_reads_clade"].sum() total_reads = ( n_reads_bacteria_ribo + n_reads_bacteria_nonribo + n_reads_virus_ribo + n_reads_virus_nonribo + n_reads_archea_ribo + n_reads_archea_nonribo + n_reads_eukaryota_ribo + n_reads_eukaryota_nonribo + n_reads_unclassified_ribo + n_reads_unclassified_nonribo ) data.append([ sample, n_reads_bacteria_ribo, n_reads_bacteria_nonribo, n_reads_virus_ribo, n_reads_virus_nonribo, n_reads_archea_ribo, n_reads_archea_nonribo, n_reads_eukaryota_ribo, n_reads_eukaryota_nonribo, n_reads_unclassified_ribo, n_reads_unclassified_nonribo, total_reads ])return pd.DataFrame(data, columns=["sample","Bacteria (Ribosomal)", "Bacteria (Non-ribosomal)","Viruses (Ribosomal)", "Viruses (Non-ribosomal)","Archaea (Ribosomal)", "Archaea (Non-ribosomal)","Eukaryota (Ribosomal)", "Eukaryota (Non-ribosomal)","Unclassified (Ribosomal)", "Unclassified (Non-ribosomal)","Total Reads" ])test_kraken_plot_df = process_df(test_kraken_df)test_kraken_plot_df``````{python}#| label: fig-kraken-stats#| fig-cap: Kingdom-level read distribution (Kraken)import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)','Unclassified (Ribosomal)', 'Unclassified (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300) #df = test_kraken_plot_dfdf_pct = df[cols_to_normalize].div(df[cols_to_normalize].sum(axis=1), axis=0) *100left = np.zeros(len(df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]for i, row in df.iterrows(): ax.text(101, i, f'{int(row["Total Reads"]):,}', va='center', ha='left')# Customize the plot# ax.set_yticks(rotation=0) # No rotation needed for horizontal bars# ax.set_title(title)ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)# Add some padding to the right for the legendax.text(101, 0.5, 'Number of\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.8), loc='center', ncol=3)plt.tight_layout()```# Assessing Bracken output filesThe other output of the PROFILE subworkflow is the `bracken_reports_merged.tsv.gz` file, which gives Bracken-summarized and corrected kingdom-level counts, based on the Kraken output. Again, single-read results look very similar to paired-end results. The only difference is a slightly higher share of non-ribosomal bacterial reads in the single-read run. I am uncertain to why that is, maybe the reads that FASTP removed in the paired-end run were enriched for non-ribosomal bacterial reads?```{python}#| label: load-bracken-output#| include: falsetest_bracken = pd.read_csv(test_bracken_path, sep='\t')``````{python}#| label: generate-bracken-table#| include: falsedef process_df(df): data = []for sample, sample_data in df.groupby("sample"):# Process ribosomal reads ribo_df = sample_data[sample_data["ribosomal"] ==True] n_reads_bacteria_ribo = ribo_df[ribo_df["name"] =="Bacteria"]["kraken_assigned_reads"].sum() n_reads_virus_ribo = ribo_df[ribo_df["name"] =="Viruses"]["kraken_assigned_reads"].sum() n_reads_archea_ribo = ribo_df[ribo_df["name"] =="Archaea"]["kraken_assigned_reads"].sum() n_reads_eukaryota_ribo = ribo_df[ribo_df["name"] =="Eukaryota"]["kraken_assigned_reads"].sum()# Process non-ribosomal reads nonribo_df = sample_data[sample_data["ribosomal"] ==False] n_reads_bacteria_nonribo = nonribo_df[nonribo_df["name"] =="Bacteria"]["kraken_assigned_reads"].sum() n_reads_virus_nonribo = nonribo_df[nonribo_df["name"] =="Viruses"]["kraken_assigned_reads"].sum() n_reads_archea_nonribo = nonribo_df[nonribo_df["name"] =="Archaea"]["kraken_assigned_reads"].sum() n_reads_eukaryota_nonribo = nonribo_df[nonribo_df["name"] =="Eukaryota"]["kraken_assigned_reads"].sum() total_assigned_reads = ( n_reads_bacteria_ribo + n_reads_bacteria_nonribo + n_reads_virus_ribo + n_reads_virus_nonribo + n_reads_archea_ribo + n_reads_archea_nonribo + n_reads_eukaryota_ribo + n_reads_eukaryota_nonribo ) data.append([ sample, n_reads_bacteria_ribo, n_reads_bacteria_nonribo, n_reads_virus_ribo, n_reads_virus_nonribo, n_reads_archea_ribo, n_reads_archea_nonribo, n_reads_eukaryota_ribo, n_reads_eukaryota_nonribo, total_assigned_reads ])return pd.DataFrame(data, columns=["sample","Bacteria (Ribosomal)", "Bacteria (Non-ribosomal)","Viruses (Ribosomal)", "Viruses (Non-ribosomal)","Archaea (Ribosomal)", "Archaea (Non-ribosomal)","Eukaryota (Ribosomal)", "Eukaryota (Non-ribosomal)","Total Assigned Reads" ])test_bracken_plot_df = process_df(test_bracken)``````{python}#| label: fig-bracken-stats#| fig-cap: Kingdom-level read distribution (Bracken)import pandas as pdimport matplotlib.pyplot as pltimport numpy as np# Calculate percentagescols_to_normalize = ['Bacteria (Ribosomal)', 'Bacteria (Non-ribosomal)','Viruses (Ribosomal)', 'Viruses (Non-ribosomal)','Archaea (Ribosomal)', 'Archaea (Non-ribosomal)','Eukaryota (Ribosomal)', 'Eukaryota (Non-ribosomal)']# Create the horizontal stacked bar plotfig, ax = plt.subplots(figsize=(10, 3), dpi=300)# Calculate percentagesdf_pct = test_bracken_plot_df[cols_to_normalize].div(test_bracken_plot_df[cols_to_normalize].sum(axis=1), axis=0) *100# Plot stacked bars horizontallyleft = np.zeros(len(test_bracken_plot_df))colors = plt.rcParams['axes.prop_cycle'].by_key()['color']for i, column inenumerate(cols_to_normalize): ax.barh(test_bracken_plot_df['sample'], df_pct[column], left=left, label=column.capitalize(), color=colors[i]) left += df_pct[column]# Add total read counts as textfor i, row in test_bracken_plot_df.iterrows(): ax.text(101, i, f'{int(row["Total Assigned Reads"]):,}', va='center', ha='left')ax.set_xlabel('Percentage (%)')ax.set_xlim(0, 100)ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)ax.text(101, len(test_bracken_plot_df)-1, 'Number of\nAssigned\nreads', va='center', ha='left')ax.legend(title='Taxonomy', bbox_to_anchor=(0.5, -0.2), loc='center', ncol=3)plt.tight_layout()```# Next stepsOverall, the single-read version of the PROFILE workflow seems to work as expected. The next step is to adapt the HV subworkflow.