MGA is a contamination screen quality control tool for high-throughput sequencing data.
MGA screens for contaminants by aligning sequence reads in FASTQ format against a series of reference genomes using Bowtie and against a set of adapter sequences using Exonerate.
To reduce computational run time, MGA samples reads, taking a subset of 100,000 per sample or lane by default, and trims these to a specified length, typically 36 bases. Trimming keeps the reported mapping and error rates comparable across runs with differing read lengths. Exonerate alignment against adapters uses the full-length sequences.
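The sampling and trimming step can be illustrated with a short Python sketch. This is not MGA's own code: the text above does not say how MGA selects its subset, so the sketch assumes uniform reservoir sampling, and the names read_fastq and sample_and_trim are hypothetical. The defaults of 100,000 reads and 36 bases are taken from the description above.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, quality) for one read
FastqRecord = Tuple[str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input (header, sequence, '+', quality)."""
    it = iter(lines)
    # Pull four lines at a time from the same iterator; an incomplete
    # trailing group (e.g. a final blank line) is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def sample_and_trim(
    records: Iterable[FastqRecord],
    sample_size: int = 100_000,  # default subset size stated in the text
    trim_length: int = 36,       # typical trim length stated in the text
    seed: int = 0,
) -> List[FastqRecord]:
    """Assumed approach: reservoir-sample up to sample_size reads in one pass,
    then trim each sampled read (and its qualities) to trim_length bases."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, record in enumerate(records):
        if i < sample_size:
            reservoir.append(record)
        else:
            # Keep the new record with probability sample_size / (i + 1).
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = record
    return [(h, s[:trim_length], q[:trim_length]) for h, s, q in reservoir]
```

A single-pass sampler is used here because the total number of reads in a lane is not known until the whole file has been read. In this sketch the untrimmed reads would still need to be retained separately for the adapter screen, since that step uses the full-length sequences.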
MGA was developed by Matt Eldridge with James Hadfield from the Genomics Core. It is run on every sequencing run carried out at CRUK-CI as part of the automated primary data processing and QC pipeline. Soon after sequencing has finished, it provides an alignment-based QC report with a single clear plot that shows, for each lane:
Yield in terms of numbers of reads
Proportion of reads mapping to the expected species/genome
Quality of sequencing in terms of error rates (reflected by boldness or opacity of the coloured bar)
Possible contamination from other species, including bacteria, viruses and fungi
The source code and details about installing and deploying MGA are available on GitHub.