Testing the ONT version of mgs-workflow’s RAW and CLEAN subworkflows

Author

Simon Grimm

Published

December 9, 2024

After spending some time adapting mgs-workflow to take in single-read sequencing data (see here and here), I’m now working to adapt the pipeline to take in long-read Oxford Nanopore sequencing data.

The dataset used here is based on our sequencing data, generated from swab samples collected on 2024-10-10.I’m starting with the RAW and CLEAN workflows. Outputs look as expected, but I want to compare the results with the default HTML-rendered MultiQC output.

Assessing basic stats for both raw and cleaned reads

Basic statistics look fine. The higher number of duplicates does suggest that we should make sure that deduplication is working properly in the later stages of the pipeline.

Code
test_basic_stats["barcode-div"] = test_basic_stats["sample"].apply(lambda x: f"{x.split('-')[-2]}-div{x.split('-')[-1].replace('div000', '')}")
test_basic_stats_tbl= test_basic_stats[["barcode-div", "percent_gc", "mean_seq_len", "n_read_pairs", "percent_duplicates", "n_bases_approx", "stage"]]
test_basic_stats_tbl = test_basic_stats_tbl[test_basic_stats_tbl["stage"] == "raw_concat"]
test_basic_stats_tbl = test_basic_stats_tbl.drop(columns=["stage"])


# Display the result
test_basic_stats_tbl.sort_values(by="barcode-div").set_index(["barcode-div"])
percent_gc mean_seq_len n_read_pairs percent_duplicates n_bases_approx
barcode-div
05-div0 40 3570.319471 6733 38.378138 23800000
05-div1 39 3559.099024 7483 40.037418 26300000
05-div2 39 3488.091615 7990 40.287860 27500000
05-div3 40 3401.577239 7671 38.247947 25700000
05-div4 39 3434.920247 6959 37.232361 23600000
05-div5 40 3428.614773 6661 36.465996 22500000
05-div6 40 3593.581991 2421 24.452705 8600000
Table 1.1: Summary statistics for raw ONT read

Adapter contamination stats

A lot of reads show polya tails. I will check in with Vanessa to see if this level of contamination is to be expected.

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))
sns.lineplot(data=test_adapter_stats, x='position', y='pc_adapters', hue='stage', ax=ax,units="sample", style="adapter", estimator=None, legend=True)

# Set common properties for both subplots
ax.set_xlabel('Position')
ax.set_ylabel('% Adapters')
ax.grid(True, linestyle='--', alpha=0.7)

# Set titles for each subplot
ax.set_title('Adapter contamination along reads')
# Remove top and right spines for both subplots
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.tight_layout()
Figure 2.1: Adapter contamination along reads

Comparing base quality stats

Phred scores along the read look good! We have long, clean reads that reach up to 5000bp at quality scores of 35 and above.

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))


sns.lineplot(data=test_quality_base_stats, x='position', y='mean_phred_score', hue='stage', units="sample", ax=ax,estimator=None, legend=True)

# ax.set_title('Mean phred scores along reads')

# Add horizontal lines at Phred scores 10, 20, 30, 40
ax.grid(True, linestyle='--', alpha=0.7)

# for phred in [10, 20, 30, 40]:
    # ax.axhline(y=phred, color='gray', linestyle='--', alpha=0.5, linewidth=1, zorder=-2)


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 3.1: Mean phred scores along the read

Comparing sequence quality stats

As expected given Figure 3.1, the number of sequences with a decent average Phred score looks good. Filtlong does succesfully remove low-quality sequences, though the average quality score we are currently using is probably too low.

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))
sns.lineplot(data=test_quality_seq_stats, x='mean_phred_score', y='n_sequences', hue='stage', ax=ax,units="sample", estimator=None, legend=True)
ax.grid(True, linestyle='--', alpha=0.7)



# Add horizontal lines at 200, 400, 600
# for n_seq in [200, 400, 600]:
    # ax.axhline(y=n_seq, color='gray', linestyle='--', alpha=0.5, linewidth=1, zorder=-2)


# ax.set_title('Average Phred scores of sequences')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 4.1: Average Phred scores of sequences

Read lengths

After cleaning, read lengths are under 200bp are removed. There are two cleaned samples, that still show data at under 200bp. This could be an artefact of how MultiQC summarises length statistics for plotting, but it’s worth following up here by manually checking the read length data.

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))
sns.lineplot(data=test_read_lengths, x='length', y='n_sequences', hue='stage', ax=ax, units="sample", estimator=None, legend=True)
ax.grid(True, linestyle='--', alpha=0.7)


ax.set_xlim(0, 7500)

ax.set_xlabel('Read length')
ax.set_ylabel('Number of sequences')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 5.1: Read lengths