Beyond GWAS: functional analysis of genome variation and inherited risk of cancer

Our research is focussed on how polygenic variation contributes to susceptibility to breast and lung cancers. We are interested in risk prediction, and in mechanisms of risk as a way into risk reduction.

GWAS (genome-wide association studies) identify the hundreds of loci at which genetic variation contributes to disease risk. Some consider these data to be of little value, because the effect at each single locus is small and, with some exceptions, insights into mechanisms of cancer risk have as yet been limited. But in combination across many loci, effects can be large, resulting in stratification of risk within the population, with potential public health implications. Most GWAS variants affect gene regulation, so we decided to use gene regulatory networks as ‘integrators’ to reveal the mechanism by which multiple GWAS variants might combine to alter risk, interacting both with one another and with the environment. Our studies in breast cancer have provided insights not obtained from classical pathway analysis of single loci. These demonstrate the potential of the approach, which we are now developing further and applying to lung cancer.

Breast cancer

We constructed transcription factor-centred gene regulatory networks using the ARACNe algorithm (Califano lab) on gene expression data from the METABRIC study (Carlos Caldas). We mapped onto the networks the genes whose expression was altered by variation at the (then 72) confirmed breast GWAS loci – the eQTLs. The mapping was done by scoring regulon enrichment: identifying which regulons contained more of these genes than would be expected if the genes were distributed across the network by chance. We obtained a striking result (Figure 1; ref 1): a cluster of 36 overlapping regulons showed strong enrichment, and these regulons were centred on the transcription factors ESR1 (the estrogen receptor), FOXA1, GATA3, and others already implicated in somatic changes in breast cancer – and known to be central to the regulation of estrogen signalling. The cluster of 36 regulons could be further split into two groups, related to ER+ and ER- breast cancer, in which the transcription factors had opposing effects on the expression of genes shared between them. The balance of these opposing effects could be seen in the differentiation of normal mammary epithelial cells and, using a metric of ESR1 regulon activity (see below), could be correlated with patient survival and with cellular response to the anti-estrogens used in treatment. The underlying molecular mechanisms have been partly elucidated (refs 2,3). What we discovered was in part already known: that estrogen signalling is central to breast cancer development. But this demonstrated that the approach has validity and can potentially provide insights in other cancers, where such mechanisms are still unknown.
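
As an illustration of the regulon enrichment scoring described above, the sketch below uses a one-sided hypergeometric test, a common choice for asking whether a gene set contains more hits than chance would predict. This is an assumption made for illustration and not necessarily the statistic used in ref 1; the function name and all gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, regulon_size, n_eqtl_genes, n_overlap):
    """One-sided hypergeometric tail probability (illustrative sketch).

    Returns the probability that a regulon of `regulon_size` genes contains
    at least `n_overlap` of the `n_eqtl_genes` eQTL genes, if those genes
    were scattered at random among `universe_size` genes in the network.
    """
    max_overlap = min(regulon_size, n_eqtl_genes)
    total = comb(universe_size, n_eqtl_genes)
    tail = sum(
        comb(regulon_size, k)
        * comb(universe_size - regulon_size, n_eqtl_genes - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes in the network, a 200-gene regulon,
# 100 eQTL genes in total, 8 of which fall inside the regulon.
p = enrichment_p_value(20000, 200, 100, 8)
print(f"enrichment p-value: {p:.3g}")
```

In practice such a test would be repeated for every regulon in the network and the resulting p-values corrected for multiple testing.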

The limitation of any analysis that relies on first identifying eQTLs is that in most studies, cis eQTLs are identified for only about one third of loci, and trans eQTLs for even fewer. Analysis of a single locus with current methods can take a small team two or three years. We are therefore now attempting to bypass the eQTL step altogether and make a direct association between genotype and ‘regulon activity’.

Regulon activity, based on the VIPER algorithm (Califano lab), uses the aggregate expression of all the genes within a given regulon to compare two states – in this case, the risk and wild-type genotypes at a single GWAS locus. We are finding ways to extend this comparison to include the effects on all regulons of genome variation at all GWAS loci, including those below formal genome-wide significance and those with trans effects. Preliminary results suggest that by this means we can identify new clusters of regulons, beyond the 36 regulons described above, that imply mechanisms of risk additional to those involved in estrogen signalling. Because regulon activities act as an integrator of the functional effects of many SNPs, we are also testing whether a metric based on this functional analysis can complement or improve on risk estimates based on the mathematical models for the combined effect of many SNPs that are currently used to construct polygenic risk scores.
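
To make the idea of regulon activity concrete, the sketch below uses a deliberately simplified stand-in: each target gene is z-scored across samples, signed by its assumed direction of regulation, and averaged to give one activity value per sample, and the two genotype groups are then compared with a Welch t-statistic. This is not the VIPER algorithm, which uses a more elaborate enrichment statistic; the function names, gene names and expression values are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def regulon_activity(expression, regulon):
    """Per-sample activity as the signed mean z-score of a regulon's genes.

    expression: {gene: [expression value per sample]}
    regulon:    {gene: +1 if activated by the TF, -1 if repressed}
    """
    n_samples = len(next(iter(expression.values())))
    scores = [0.0] * n_samples
    used = 0
    for gene, sign in regulon.items():
        values = expression.get(gene)
        if values is None:
            continue  # target gene not measured in this data set
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # gene does not vary, so it carries no information
        for i, v in enumerate(values):
            scores[i] += sign * (v - mu) / sd
        used += 1
    if used == 0:
        raise ValueError("no usable target genes for this regulon")
    return [s / used for s in scores]


def welch_t(group_a, group_b):
    """Welch t-statistic for the difference in mean activity (a minus b)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Toy data: four samples, two target genes of one hypothetical regulon.
expression = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [4.0, 3.0, 2.0, 1.0]}
regulon = {"GENE_A": +1, "GENE_B": -1}
genotypes = ["wild-type", "wild-type", "risk", "risk"]  # one label per sample

activity = regulon_activity(expression, regulon)
risk = [a for a, g in zip(activity, genotypes) if g == "risk"]
wild_type = [a for a, g in zip(activity, genotypes) if g == "wild-type"]
print("t-statistic (risk vs wild-type):", welch_t(risk, wild_type))
```

Because the score pools many target genes, a modest but consistent shift across a regulon can be detected even when no single gene would pass an eQTL threshold on its own.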

Lung cancer

15% of smokers develop lung cancer. Is this just bad luck? Our hypothesis is that some smokers may be at higher risk because they make a more cancer-prone response to airway smoke injury, a difference at least in part reflecting genetic background. To study this, we are comparing the smoke injury response between individuals and trying to relate this to genetic background. Our readout of the injury response is gene expression in normal airway epithelium, compared between different states: smokers vs never-smokers and ex-smokers; individuals with and without cancer. We study nasal and bronchial epithelium from the same individuals, because future application of our results to identify healthy individuals at risk within the population will require an easily accessible tissue surrogate for bronchus, and nasal samples may meet that need. Rather than simply compare lists of genes differentially expressed between comparison states, we use the gene regulatory network analysis that we developed for breast cancer. By reducing noise, this gives a stronger signal; and – as with breast cancer – we hope that it will indicate which regulatory processes, among the chaos of smoke injury, are most strongly associated with risk.

We hope to identify which smokers in the healthy population are most at risk, and by what mechanism, for targeting in programmes of intervention. Former smokers are of particular interest, as they account for over half of smoking-related lung cancer, and the obvious intervention of stopping smoking cannot be used. Published data suggest that in some former smokers there may be a persisting ‘echo’ of smoke injury, perhaps as chronic inflammation. If so, linking a specific component of this to risk may provide a target for risk reduction.

We have collected samples and gene expression data from over 400 individuals, including patients with and without lung cancer and healthy volunteers, and analysis is in progress. Preliminary results show differences between smokers with cancer and those without, consistent with those published by others. The regulons with altered activity in cancer patients compared with non-cancer individuals overlap significantly between nasal and bronchial epithelium. The processes involved include DNA repair, oxidative stress response, inflammation and immune response. Important evidence that these differences are indeed related to risk, rather than merely secondary to the presence of cancer, comes from the similarity between the regulons that show altered activity between individuals with and without cancer and the regulons affected by variation at lung cancer GWAS loci.

Figure 1: Mapping of ‘enriched regulons’ on breast cancer network

Transcription factor-centred regulatory network for breast epithelium. Each circle is a regulon, containing approximately 50-300 genes regulated by a single transcription factor (TF). Many genes are regulated by more than one TF, so the regulons overlap, forming the network. The threshold for overlap here is set at Jaccard 0.4; the singleton regulons have overlap below that threshold. The regulons coloured yellow/orange/red are those enriched for genes whose expression is altered by variation at multiple GWAS loci. The cluster of regulons is centred around ESR1 (ref 1).
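
The Jaccard threshold mentioned in the caption can be illustrated as follows. The Jaccard index of two regulons is the number of genes they share divided by the number of genes in either; the sketch links two regulons only when this value reaches the threshold. The function names and the toy gene sets are hypothetical and far smaller than real regulons.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index: shared genes divided by genes present in either set."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_edges(regulons, threshold=0.4):
    """Return (regulon, regulon, jaccard) for pairs at or above the threshold."""
    names = sorted(regulons)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(regulons[first], regulons[second])
            if score >= threshold:
                edges.append((first, second, score))
    return edges


# Toy regulons keyed by hypothetical TF names.
regulons = {
    "TF_1": {"g1", "g2", "g3"},
    "TF_2": {"g2", "g3", "g4"},
    "TF_3": {"g7", "g8", "g9"},
}
for first, second, score in overlap_edges(regulons):
    print(first, second, round(score, 2))
```

Here TF_1 and TF_2 share two of four genes and are linked, while TF_3 shares no genes with the others and would appear as a singleton.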


  1. Castro MAA et al (2016). Regulators of genetic risk of breast cancer identified by integrative network analysis. Nature Genet 48, 12-21.
  2. Campbell TM et al (2016). FGFR2 risk SNPs confer breast cancer risk by augmenting estrogen responsiveness. Carcinogenesis 37, 741-750.
  3. Campbell TM et al (2018). ER alpha binding by transcription factors NFIB and YBX1 enables FGFR2 signaling to modulate estrogen responsiveness in breast cancer. Cancer Res 78, 410-421.