MGA is a quality control tool for contamination screening of high-throughput sequencing data.
MGA screens for contaminants by aligning sequence reads in FASTQ format against a series of reference genomes using Bowtie and against a set of adapter sequences using Exonerate.
To reduce the computational run time, MGA samples reads, taking a subset of 100,000 per sample or lane by default, and trims these to a specified length, typically 36 bases. Trimming ensures consistency of the output mapping and error rates across runs with differing read lengths. The Exonerate alignment against adapters, by contrast, uses the full-length (untrimmed) sequences.
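The sampling and trimming step can be illustrated with a minimal Python sketch. This is not MGA's own implementation: the function names (`read_fastq`, `sample_records`, `trim`) are hypothetical, reservoir sampling is simply one convenient way to draw a fixed-size random subset in a single pass, and the input is assumed to be uncompressed FASTQ with exactly four lines per record. The defaults of 100,000 reads and 36 bases are the values described above.

```python
import io
import random
from typing import Iterable, Iterator, List, Optional, TextIO, Tuple

# (header, sequence, quality); the '+' separator line is not kept.
FastqRecord = Tuple[str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def sample_records(records: Iterable[FastqRecord],
                   n: int = 100_000,
                   seed: Optional[int] = None) -> List[FastqRecord]:
    """Draw up to n records uniformly at random using reservoir sampling.

    Assumption: this is an illustrative technique, not necessarily the
    sampling method MGA itself uses.
    """
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            # Keep the new record with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir


def trim(record: FastqRecord, length: int = 36) -> FastqRecord:
    """Trim sequence and quality strings to the first `length` bases."""
    header, sequence, quality = record
    return header, sequence[:length], quality[:length]


# Small in-memory demonstration: 10 reads of 50 bases each.
demo_fastq = "".join(
    f"@read{i}\n{'ACGTA' * 10}\n+\n{'I' * 50}\n" for i in range(10)
)
sampled = sample_records(read_fastq(io.StringIO(demo_fastq)), n=4, seed=1)
trimmed = [trim(record) for record in sampled]
print(len(trimmed), {len(sequence) for _, sequence, _ in trimmed})
```

In this sketch the trimmed subset would go to the genome alignment, while the untrimmed `sampled` records would be kept for the adapter screen, in line with the description above.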
MGA was developed by Matt Eldridge with James Hadfield from the Genomics Core and is run for all sequencing runs carried out at CRUK-CI as part of the automated primary data processing and QC pipeline. It provides an alignment-based QC report soon after the sequencing has finished with a single clear plot that shows for each lane:
- Yield in terms of numbers of reads
- Proportion of reads mapping to the expected species/genome
- Quality of sequencing in terms of error rates (reflected by boldness or opacity of the coloured bar)
- Possible contamination from other species, including bacteria, viruses and fungi
- Adapter content
The source code and details about installing and deploying MGA are available on GitHub.
A MultiQC plugin for MGA, developed by Richard Bowers, is also available.