crukci-bioinformatics/MGA

Contaminant screen quality control tool for high-throughput sequencing data.

MGA screens for contaminants by aligning sequence reads in FASTQ format against a series of reference genomes using Bowtie and against a set of adapter sequences using Exonerate.

To reduce computational run time, MGA samples reads, taking a subset of 100,000 per sample or lane by default, and trims these to a specified length, typically 36 bases. Trimming keeps the reported mapping and error rates consistent across runs with differing read lengths. Exonerate alignment against adapters uses the full-length sequences.
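The sampling and trimming step can be illustrated with a short Python sketch. This is not MGA's own code: the function names (`parse_fastq`, `sample_reads`, `trim_read`) are hypothetical, and reservoir sampling is assumed here only as one simple way to draw a fixed-size random subset in a single pass. The defaults of 100,000 reads and 36 bases are taken from the description above.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# A read is (header, sequence, qualities); names here are illustrative only.
Read = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from four-line FASTQ records (header, sequence, '+', qualities)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input, or a trailing incomplete record
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def sample_reads(reads: Iterable[Read], n: int = 100_000, seed: int = 0) -> List[Read]:
    """Draw up to n reads uniformly at random in one pass (reservoir sampling).

    Assumption: MGA's actual sampling strategy may differ.
    """
    rng = random.Random(seed)
    reservoir: List[Read] = []
    for i, read in enumerate(reads):
        if i < n:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = read
    return reservoir


def trim_read(read: Read, length: int = 36) -> Read:
    """Keep only the first `length` bases and their quality scores."""
    header, seq, qual = read
    return header, seq[:length], qual[:length]
```

In a workflow like the one described, the sampled reads would be kept at full length for the adapter screen, and the trimmed copies (for example `[trim_read(r) for r in sampled]`) would be passed to the genome aligner.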

MGA was developed by Matt Eldridge with James Hadfield from the Genomics Core. It is run on every sequencing run carried out at CRUK-CI as part of the automated primary data processing and QC pipeline. Soon after sequencing has finished, it provides an alignment-based QC report with a single clear plot that shows, for each lane:

  • yield in terms of numbers of reads
  • proportion of reads mapping to the expected species/genome
  • quality of sequencing in terms of error rates (reflected by boldness or opacity of the coloured bar)
  • indication of how well the
  • contamination from other species, including bacteria, viruses and fungi
  • adapter content