Contaminant screen quality control tool for high-throughput sequencing data.
MGA screens for contaminants by aligning sequence reads in FASTQ format against a series of reference genomes using Bowtie and against a set of adapter sequences using Exonerate.
To reduce computational run time, MGA samples reads, taking a subset of 100,000 per sample or lane by default, and trims these to a specified length, typically 36 bases. Trimming ensures consistency of the output mapping and error rates across runs with differing read lengths. Exonerate alignment against adapters uses the full-length sequences.
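The sampling and trimming step can be illustrated with a minimal Python sketch. This is not MGA's own implementation: the function names `parse_fastq` and `sample_and_trim` are hypothetical, reservoir sampling is an assumed choice for drawing a fixed-size random subset in a single pass, and the input is assumed to consist of well-formed four-line FASTQ records. The defaults mirror the figures quoted above (100,000 reads, 36 bases).

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

# (header, sequence, quality) for a single read
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def sample_and_trim(
    records: Iterable[FastqRecord],
    sample_size: int = 100_000,  # default subset size described in the text
    trim_length: int = 36,       # typical trim length described in the text
    seed: Optional[int] = None,
) -> List[FastqRecord]:
    """Reservoir-sample up to sample_size reads, each trimmed to trim_length.

    Illustrative only: an assumed sampling strategy, not MGA's actual code.
    """
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, (header, seq, qual) in enumerate(records):
        # Trim bases and quality scores together so they stay aligned.
        trimmed = (header, seq[:trim_length], qual[:trim_length])
        if i < sample_size:
            reservoir.append(trimmed)
        else:
            # Replace an existing entry with probability sample_size / (i + 1).
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = trimmed
    return reservoir
```

For example, `sample_and_trim(parse_fastq(open("lane1.fastq")))` would return at most 100,000 reads of at most 36 bases each, where `lane1.fastq` is a placeholder file name. The untrimmed reads would still be needed for the adapter search, which the text says uses the full-length sequences.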
MGA was developed by Matt Eldridge with James Hadfield from the Genomics Core and is run for all sequencing runs carried out at CRUK-CI as part of the automated primary data processing and QC pipeline. It provides an alignment-based QC report soon after sequencing has finished, with a single clear plot that shows, for each lane:
- yield in terms of numbers of reads
- proportion of reads mapping to the expected species/genome
- quality of sequencing in terms of error rates (reflected by boldness or opacity of the coloured bar)
- possible contamination from other species, including bacteria, viruses and fungi
- adapter content
The source code and details about installing and deploying MGA are available on GitHub.