Quality control of Zephyr 5b

Author

Simon Grimm

Published

December 9, 2024

Here is a quick analysis of the results of the Zephyr 5b run from November 13, 2024. I find very high polyA contamination and a large number of very short, low-quality reads (<500 bp), but good quality scores for those reads that reach the expected length of 4,000 bp.

Assessing basic stats for both raw and cleaned reads

Notably, the GC content and mean sequence length of most sequencing files are fairly low (~20% GC, mean length ~320 bp). We'd expect the mean sequence length to be close to 4 kbp and the GC content to be closer to 40%. Duplication rates are also fairly high, ranging from roughly 63% to 83%.

Code
test_basic_stats["barcode-div"] = test_basic_stats["sample"].apply(lambda x: f"{x.split('-')[-2]}-div{x.split('-')[-1].replace('div000', '')}")
test_basic_stats_tbl= test_basic_stats[["barcode-div", "percent_gc", "mean_seq_len", "n_read_pairs", "percent_duplicates", "n_bases_approx", "stage"]]
test_basic_stats_tbl = test_basic_stats_tbl.sort_values(by="barcode-div")
test_basic_stats_tbl = test_basic_stats_tbl.sort_values(by="stage", ascending=False)
# test_basic_stats_tbl = test_basic_stats_tbl[test_basic_stats_tbl["stage"] == "raw_concat"]
# test_basic_stats_tbl = test_basic_stats_tbl.drop(columns=["stage"])


# Display the result
# test_basic_stats_tbl.sort_values(by="barcode-div")
test_basic_stats_tbl
| barcode-div | percent_gc | mean_seq_len | n_read_pairs | percent_duplicates | n_bases_approx | stage |
|---|---|---|---|---|---|---|
| 12-div1 | 22 | 395.454956 | 7315 | 69.022556 | 2200000.0 | raw_concat |
| 12-div2 | 17 | 288.020629 | 7562 | 70.801375 | 2000000.0 | raw_concat |
| 12-div3 | 20 | 298.447495 | 6666 | 68.616862 | 1800000.0 | raw_concat |
| 12-div4 | 18 | 354.134247 | 4745 | 67.481560 | 1200000.0 | raw_concat |
| 13-div1 | 17 | 247.800969 | 4748 | 63.668913 | 1000000.0 | raw_concat |
| 13-div2 | 23 | 5065.290848 | 4786 | 65.461763 | 1400000.0 | raw_concat |
| 13-div3 | 16 | 318.018220 | 4281 | 63.092735 | 923.4 | raw_concat |
| 13-div4 | 14 | 319.472310 | 3178 | 63.184393 | 699.1 | raw_concat |
| 12-div1 | 24 | 534.104340 | 3479 | 78.787008 | 1800000.0 | cleaned |
| 12-div2 | 19 | 431.134010 | 3746 | 82.728243 | 1600000.0 | cleaned |
| 12-div3 | 23 | 481.664729 | 3096 | 80.006460 | 1500000.0 | cleaned |
| 12-div4 | 20 | 451.063258 | 2308 | 80.199307 | 1000000.0 | cleaned |
| 13-div1 | 18 | 380.833789 | 2196 | 74.908925 | 852.5 | cleaned |
| 13-div2 | 25 | 5130.156676 | 2419 | 75.775114 | 1200000.0 | cleaned |
| 13-div3 | 17 | 351.858591 | 1973 | 75.164724 | 707.1 | cleaned |
| 13-div4 | 16 | 378.543734 | 1532 | 74.738903 | 553.9 | cleaned |
Table 1.1: Summary statistics for raw and cleaned ONT reads
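
As a quick sanity check against these expectations, the sketch below flags samples whose GC content, mean read length, or duplication rate falls outside the values we'd hope for. The thresholds are illustrative assumptions on my part, not pipeline parameters.

Code
# Illustrative QC thresholds (assumptions, not pipeline defaults):
# roughly 40% GC, mean length near 4,000 bp, duplication not far above 60%.
EXPECTED_GC = (30, 50)   # acceptable percent_gc range
MIN_MEAN_LEN = 3000      # minimum acceptable mean_seq_len (bp)
MAX_DUP = 60             # maximum acceptable percent_duplicates

flags = test_basic_stats_tbl.assign(
    gc_ok=test_basic_stats_tbl["percent_gc"].between(*EXPECTED_GC),
    len_ok=test_basic_stats_tbl["mean_seq_len"] >= MIN_MEAN_LEN,
    dup_ok=test_basic_stats_tbl["percent_duplicates"] <= MAX_DUP,
)
# Show only the samples that fail at least one check
flags[~(flags["gc_ok"] & flags["len_ok"] & flags["dup_ok"])]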

Adapter contamination stats

PolyA contamination is very high. The higher contamination levels in cleaned reads may be due to cleaning removing very low-complexity reads that weren't classified as polyA-contaminated: with those reads gone, the polyA-contaminated reads make up a larger share of what remains, so the measured percentage rises.

Code
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(dpi=300, figsize=(10, 4))

# One line per sample; colour encodes stage, line style encodes adapter type
sns.lineplot(data=test_adapter_stats, x='position', y='pc_adapters', hue='stage',
             ax=ax, units="sample", style="adapter", estimator=None, legend=True)

ax.set_xlabel('Position')
ax.set_ylabel('% Adapters')
ax.grid(True, linestyle='--', alpha=0.7)
ax.set_xlim(0, 15000)
ax.set_title('Adapter contamination along reads')

# Remove top and right spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.tight_layout()
Figure 2.1: PolyA/PolyG contamination along reads
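
To make the arithmetic behind that explanation concrete, here is a toy calculation with invented numbers: if cleaning removes low-complexity reads that were never flagged as polyA while keeping the polyA-contaminated ones, the polyA share of the remaining reads goes up even though no polyA reads were added.

Code
# Toy numbers, purely illustrative: 1,000 raw reads, 300 of them
# polyA-contaminated, and 400 other low-complexity reads removed by cleaning.
raw_total, raw_polya = 1000, 300
removed_non_polya = 400

cleaned_total = raw_total - removed_non_polya
cleaned_polya = raw_polya  # the polyA-contaminated reads were not removed

print(f"raw polyA fraction:     {raw_polya / raw_total:.1%}")          # 30.0%
print(f"cleaned polyA fraction: {cleaned_polya / cleaned_total:.1%}")  # 50.0%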

Comparing base quality stats

Most reads have a good average Phred score up to around 6,000 bp, dropping off thereafter. Cleaning increases Phred scores for most samples. The low scores at the start of the reads are likely due to the large number of short, low-quality reads, which only contribute to the first few hundred positions and so drag down the positional mean there.

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))

# One line per sample, coloured by processing stage
sns.lineplot(data=test_quality_base_stats, x='position', y='mean_phred_score',
             hue='stage', units="sample", ax=ax, estimator=None, legend=True)

ax.grid(True, linestyle='--', alpha=0.7)
ax.set_xlim(0, 15000)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 3.1: Mean Phred scores along reads
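
To put a rough number on how much the read starts are dragged down, the sketch below compares the average of the per-position mean Phred scores over the first 500 bp with a 1,000-6,000 bp window for each sample. It assumes test_quality_base_stats has the position, mean_phred_score, sample, and stage columns used in the plot above; the window bounds are arbitrary.

Code
import pandas as pd

# Average the per-position mean Phred scores in two windows per sample and stage.
stats = test_quality_base_stats
early = stats[stats["position"] <= 500]
mid = stats[(stats["position"] >= 1000) & (stats["position"] <= 6000)]

summary = pd.DataFrame({
    "phred_first_500bp": early.groupby(["sample", "stage"])["mean_phred_score"].mean(),
    "phred_1kb_to_6kb": mid.groupby(["sample", "stage"])["mean_phred_score"].mean(),
})
summary["difference"] = summary["phred_1kb_to_6kb"] - summary["phred_first_500bp"]
summary.sort_values("difference", ascending=False)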

Comparing sequence quality stats

The modal mean Phred score for reads is around 24, which isn't great. This could potentially be improved by (i) trimming reads more aggressively once their quality drops below a threshold, and (ii) removing reads that are short and probably very low quality (see the low quality scores over the first few hundred base pair positions in Figure 3.1).

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))

# Distribution of per-read mean Phred scores, one line per sample
sns.lineplot(data=test_quality_seq_stats, x='mean_phred_score', y='n_sequences',
             hue='stage', ax=ax, units="sample", estimator=None, legend=True)
ax.grid(True, linestyle='--', alpha=0.7)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 4.1: Average Phred scores of sequences
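
As a sketch of what those two clean-up steps could look like upstream, the snippet below filters a single FASTQ file by minimum length and minimum mean Phred score. The file path and both thresholds are placeholders, and the mean is a naive arithmetic average of per-base Phred scores rather than anything the pipeline actually does.

Code
import gzip

# Placeholder path and thresholds -- illustrative only.
FASTQ_IN = "reads.fastq.gz"
MIN_LENGTH = 500       # drop the very short reads
MIN_MEAN_PHRED = 15    # drop reads with low average quality

def mean_phred(qual_line):
    """Naive mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in qual_line) / len(qual_line)

kept = dropped = 0
with gzip.open(FASTQ_IN, "rt") as handle:
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            break  # end of file
        header, seq, plus, qual = record
        if len(seq) >= MIN_LENGTH and mean_phred(qual) >= MIN_MEAN_PHRED:
            kept += 1  # a real filter would write the record to an output file here
        else:
            dropped += 1

print(f"kept {kept} reads, dropped {dropped}")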

Read lengths

As observed in the initial summary stats, most reads in the sample are very short (Figure 5.1). Zooming in (Figure 5.2) shows that a small number of reads do reach the expected length of 4,000 bp.

Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))

# Work on a copy of the cleaned-stage reads to avoid pandas' SettingWithCopyWarning
test_read_lengths_cleaned = test_read_lengths[test_read_lengths["stage"] == "cleaned"].copy()

# Barcode 12 received filtration + concentration; barcode 13 filtration only
test_read_lengths_cleaned["Treatment"] = test_read_lengths_cleaned["sample"].apply(
    lambda x: "Filtration + Concentration" if "12" in x else "Filtration"
)

sns.lineplot(data=test_read_lengths_cleaned, x='length', y='n_sequences',
             hue='Treatment', ax=ax, units="sample", estimator=None, legend=True)
ax.grid(True, linestyle='--', alpha=0.7)

ax.set_xlim(0, 6500)
ax.set_ylim(0, 1000)

ax.set_xlabel('Read length')
ax.set_ylabel('Number of sequences')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 5.1: Read lengths (high-level)
Code
fig, ax = plt.subplots(dpi=300, figsize=(10, 4))

# Reuse test_read_lengths_cleaned (with its Treatment column) from the cell above
sns.lineplot(data=test_read_lengths_cleaned, x='length', y='n_sequences',
             hue='Treatment', ax=ax, units="sample", estimator=None, legend=True)
ax.grid(True, linestyle='--', alpha=0.7)

ax.set_xlim(500, 8000)
ax.set_ylim(0, 50)

ax.set_xlabel('Read length')
ax.set_ylabel('Number of sequences')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Figure 5.2: Read lengths (zoomed in)
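
To quantify how many cleaned reads actually get close to the expected length, the snippet below uses the per-length counts from the plots above (the length and n_sequences columns of test_read_lengths_cleaned) to estimate the fraction of reads at or above 3,500 bp per treatment; the cutoff is an arbitrary stand-in for "close to 4,000 bp".

Code
# Fraction of cleaned reads at or above an (arbitrary) 3,500 bp cutoff,
# computed from the per-length histogram used in Figures 5.1 and 5.2.
CUTOFF = 3500

totals = test_read_lengths_cleaned.groupby("Treatment")["n_sequences"].sum()
long_reads = (
    test_read_lengths_cleaned[test_read_lengths_cleaned["length"] >= CUTOFF]
    .groupby("Treatment")["n_sequences"]
    .sum()
)
(long_reads / totals).fillna(0).map("{:.2%}".format)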