How do we assess NGS read quality?

GOALS

1.) Understand NGS data format and quality scores

2.) Perform quality checks on NGS data

3.) Learn about quality assessment in NGS reads

NGS Data and Quality

With Sanger sequencing data, we can easily assess quality by viewing the chromatograms, then trim and edit the sequences by hand.

[Figure: Sanger sequencing chromatograms. You can easily identify high-quality sequence and trim by hand.]

However, NGS produces millions of sequences that can’t be viewed manually. So how can we determine the quality of our NGS reads?

First, let’s look at some NGS data. Many NGS reads are delivered from sequencers in .fastq format. The .fastq format looks like this:

@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################

1.) The first line contains an @, followed by identification information. This first line will vary depending on sequencing platform, sequencing run, and individual sequence information.

2.) The second line contains the actual sequence generated.

Reflection:
Assuming this was an Illumina sequencing run, did this sequence come from a 75, 150, or 300 cycle kit? Why?

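3.) The third line contains a + character, which separates the sequence from its quality information (it is sometimes followed by the same identifier as the first line).

4.) The fourth line contains the read quality information, with one character for each base in the second line. Quality is encoded on the following scale (! being the lowest quality, and ~ being the highest quality):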

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
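To make the scale concrete, here is a minimal sketch of how a quality character can be turned into a number. It assumes the common Phred+33 encoding (the quality score is the character's ASCII code minus 33, which is why ! is the lowest value) and the standard Phred relationship Q = -10 log10(P), where P is the probability that the base call is wrong. The function names are made up for illustration, and older Illumina files may use a different offset.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def decode_quality(quality_line, offset=PHRED_OFFSET):
    """Convert each quality character to its integer Phred score."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


# The two ends of the scale, plus the first five characters of the read above.
print(decode_quality("!~"))      # [0, 93]
print(decode_quality("CCCFF"))   # [34, 34, 34, 37, 37]
print(error_probability(30))     # about 0.001, i.e. 1 error in 1000 calls
```

Under these assumptions, a score of 20 corresponds to a 1 in 100 chance of an incorrect base call, and a score of 30 to 1 in 1000.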

Reflection:
Where are the lowest quality bases located? Why do you think this is?

Now, we could interpret the quality of this single sequence by hand, but NGS produces millions (or even billions!) of these reads, so we can’t visually inspect them all. Instead, we rely on software programs to scan our sequence files and give us easy-to-read quality reports.

Performing QC on NGS reads

There are many quality assessment programs, but one of the most commonly used is FastQC. FastQC was developed in 2010 at the Babraham Institute and has since become one of the most widely used quality assessment tools for NGS. It is even incorporated within Illumina BaseSpace Labs.

The program can be run on your local machine or on an HPC, and it can be installed through Miniconda. Whichever method you choose, FastQC is easy to run. Navigate to the directory containing all of your .fastq or .fastq.gz (gzip-compressed) files and run the command:

fastqc *.fastq*
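If you prefer to launch FastQC from a script, for example to keep the reports in a separate folder, the command can be assembled programmatically. The sketch below only builds the argument list; it does not run anything. The function name, the fastqc_out folder, and the thread count are illustrative assumptions, while --outdir and --threads are FastQC's own options.

```python
from pathlib import Path


def build_fastqc_command(fastq_dir, outdir="fastqc_out", threads=2):
    """Return the fastqc argument list for every .fastq / .fastq.gz file in fastq_dir."""
    files = sorted(
        str(path)
        for path in Path(fastq_dir).iterdir()
        if path.name.endswith((".fastq", ".fastq.gz"))
    )
    # --outdir and --threads are FastQC options; the values are placeholders.
    return ["fastqc", "--outdir", str(outdir), "--threads", str(threads), *files]


# Example (not executed here):
#   cmd = build_fastqc_command(".")
#   Path("fastqc_out").mkdir(exist_ok=True)   # FastQC expects the folder to exist
#   subprocess.run(cmd, check=True)           # requires `import subprocess`
```

Passing the command as a list avoids relying on the shell to expand the *.fastq* wildcard.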

If the program is running correctly, you will see a progress report:

Started analysis of xxxxx_1.fastq
Approx 5% complete for xxxxx_1.fastq
Approx 10% complete for xxxxx_1.fastq
Approx 15% complete for xxxxx_1.fastq
Approx 20% complete for xxxxx_1.fastq
Approx 25% complete for xxxxx_1.fastq
Approx 30% complete for xxxxx_1.fastq
Approx 35% complete for xxxxx_1.fastq
Approx 40% complete for xxxxx_1.fastq
Approx 45% complete for xxxxx_1.fastq
....
Approx 80% complete for xxxxx_2.fastq.gz
Approx 85% complete for xxxxx_2.fastq.gz
Approx 90% complete for xxxxx_2.fastq.gz
Approx 95% complete for xxxxx_2.fastq.gz
Analysis complete for xxxxx_2.fastq.gz

Once the program has finished running, two files will be generated for every sample: an .html report (for example, sample01_fastqc.html) and a .zip archive (sample01_fastqc.zip). The html files can be viewed in any web browser (Tip: on macOS, highlight all samples and hit the space key to easily tab through all files).
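The .zip archive written for each sample also holds a summary.txt file listing the PASS, WARN, or FAIL status of every module as tab-separated columns (status, module name, file name). As a rough sketch, assuming that layout, the statuses can be read without unzipping anything. The function name and the example file name are hypothetical.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC .zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits inside the single folder stored in the archive.
        summary_name = next(
            name for name in archive.namelist() if name.endswith("summary.txt")
        )
        text = archive.read(summary_name).decode("utf-8")

    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line: STATUS <tab> module name <tab> input file name
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


# Example (hypothetical file name):
#   read_fastqc_summary("sample01_fastqc.zip")["Per base sequence quality"]
```

This is handy for a quick scripted check of a few samples. For many samples, a combined report is easier to read.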

This is easy enough, but a single report covering every sample is much more useful for interpreting patterns across many samples. We can use MultiQC (Ewels et al. 2016) for this. This program takes the FastQC outputs for every sample and creates a single interactive report.

To run:

1.) Install locally using PyPI or Bioconda, or use MultiQC within Galaxy.

2.) Navigate to the directory containing your FastQC .zip files (not the .html files)

3.) Run MultiQC

multiqc .

4.) That’s it! MultiQC will generate a multiqc_data directory that contains tab-delimited information, but the easiest way to navigate your output is to open multiqc_report.html.

Understanding QC results for NGS reads

FastQC reports contain the following results about sequence quality:

Basic Statistics

Per base sequence quality

Per sequence quality scores

Per base sequence content

Per sequence GC content

Per base N content

Sequence Length Distribution

Sequence Duplication Levels

Overrepresented sequences

Adapter Content

Kmer Content

Now that we have run our QC programs, we need to be able to interpret the results. Running QC is useless if we don’t understand what the results mean and what they imply for our data. For example, FastQC and MultiQC will bin the results as normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross). What does normal even MEAN?

All of these interpretations begin with understanding your type of data. FastQC’s default thresholds assume a completely random and diverse library. However, our libraries may have inherent biases that violate these assumptions. For example, amplicon libraries will almost always have overrepresented sequences and poor per base sequence content. We will discuss more below.

Let’s understand what each result can tell us about our data and potential quality issues:

BASIC STATISTICS

Information about the input FASTQ file name, file type, and quality encoding. Also reports the total number of sequences, the number of filtered sequences, the sequence length, and the GC content.
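To see where numbers like these come from, here is a small sketch that computes a read count, length range, and overall GC percentage from a list of sequences. It is not FastQC's own code, and the function name, dictionary keys, and toy reads are made up for illustration.

```python
def basic_statistics(sequences):
    """Summarise reads: total count, shortest and longest length, and overall %GC."""
    lengths = [len(seq) for seq in sequences]
    total_bases = sum(lengths)
    gc_bases = sum(
        seq.upper().count("G") + seq.upper().count("C") for seq in sequences
    )
    return {
        "total_sequences": len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "percent_gc": round(100 * gc_bases / total_bases, 1),
    }


reads = ["GATTACA", "GGCC", "ATAT"]  # toy reads, not real data
print(basic_statistics(reads))
# {'total_sequences': 3, 'min_length': 4, 'max_length': 7, 'percent_gc': 40.0}
```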

PER BASE SEQUENCE QUALITY

A box-and-whisker plot indicating the quality (Q-scores) of each base in your reads. The X-axis represents the position of each base (dependent on your read lengths) and the Y-axis indicates the quality score.

The background shading of the plot marks three quality ranges:

1.) Green indicates Q-scores ≥ 28

2.) Orange indicates Q-scores between 20 and 28

3.) Red indicates Q-scores less than 20

For each position, the yellow box spans the 25th to 75th percentile (the interquartile range), and the whiskers represent the 10th and 90th percentiles. The red line indicates the median score, and the blue line represents the mean.
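The values drawn in this plot can be approximated with a few lines of code. The sketch below assumes Phred+33 quality strings and uses a simple nearest-rank percentile, which may differ slightly from FastQC's own calculation. It also does not group positions into bins, as FastQC does for long reads. All names are illustrative.

```python
import math
from statistics import mean, median


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def per_base_quality(quality_lines, offset=33):
    """Summarise Phred scores at each read position across many reads."""
    longest = max(len(q) for q in quality_lines)
    summary = []
    for pos in range(longest):
        # Reads shorter than this position simply do not contribute.
        scores = sorted(ord(q[pos]) - offset for q in quality_lines if pos < len(q))
        summary.append({
            "position": pos + 1,
            "p10": percentile(scores, 10),   # lower whisker
            "q1": percentile(scores, 25),    # bottom of the box
            "median": median(scores),        # red line
            "q3": percentile(scores, 75),    # top of the box
            "p90": percentile(scores, 90),   # upper whisker
            "mean": mean(scores),            # blue line
        })
    return summary


# Three toy quality strings (made up); print one summary row per position.
for row in per_base_quality(["II#", "I5#", "I?"]):
    print(row)
```

Scanning the rows from the first position to the last shows how quality changes along the read, which is the same trend the box-and-whisker plot displays.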