With the completion of the Human Genome Project in 2003 and the International HapMap Project in 2005, researchers gained a set of research tools capable of discovering the genetic contributions to common diseases. These tools include computerized databases containing the reference human genome sequence, a map of human genetic variation, and a set of technologies to analyze whole-genome samples for genetic variations that contribute to the onset of a disease. Genome-wide association studies (GWAS) allow researchers to scan markers across complete sets of DNA, or genomes, to find genetic variations and their possible associations with traits and diseases.
These GWAS, which examine single nucleotide polymorphisms (SNPs) across the genome, offer a chance to study complex, common diseases and the genetic variations that can contribute to a person's risk. They have already identified SNPs related to several conditions, including diabetes, heart abnormalities, Parkinson's disease, and Crohn's disease. These discoveries come with the hope that further identification of SNPs related to chronic diseases will enable the development of drugs that can influence interactions between a person's genes and the environment.
Genome-wide association studies can also identify trait and disease associations and produce genotype information that can be leveraged for clinical applications. These include polygenic risk scores for the early detection of disease, which can support prevention and treatment, as well as drug development, selection, and dosage. An understanding of genetic factors could also contribute to personalized medicine, and the generation of shareable genomic data from GWASs could facilitate analysis of large and diverse sample sets.
To carry out a genome-wide association study, two groups of participants are often used. The first includes people with the disease being studied, and the second includes similar people without the disease. DNA is extracted from each participant, either by drawing blood or by using a cotton swab in the participant's mouth, and each participant's DNA or genome is then purified, placed on tiny chips, and scanned on automated laboratory machines. These machines scan for the single nucleotide polymorphisms, which are strategically selected as markers for genetic variation.
These studies tend to require large sample sizes to ensure they can identify reproducible, genome-wide significant associations, and the desired sample size can be determined using power calculations in specified software tools. Study designs can involve the inclusion of cases and controls when a studied trait is dichotomous, or quantitative measurements on the whole study sample when the trait is quantitative. In addition, a study can make a choice between population-based and family-based designs.
Once a study group is assembled, the individuals are genotyped, usually with microarrays for common variants or with next-generation sequencing methods. Microarray-based genotyping is the most commonly used method for obtaining genotypes, but the choice of platform can depend on many factors, including the purpose of the GWAS. For example, in a consortium-led GWAS, it can be wise to have all individual cohorts genotyped on the same genotyping platform.
At this stage of a study, the input files for a GWAS contain anonymized individual ID numbers, coded family relations between individuals, sex, phenotype information, covariates, and genotype calls for the batch. Once these data are in place, generating reliable results requires quality control: removing rare or monomorphic variants, removing variants that deviate from Hardy-Weinberg equilibrium, filtering out SNPs that are missing in a fraction of individuals in a cohort, identifying and removing genotyping errors, and ensuring the phenotypes are well matched with the genetic data, often by comparing self-reported sex with sex inferred from the X and Y chromosomes. Software tools designed specifically for analyzing genetic data can be used to carry out these quality control steps.
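The variant-level filters described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any particular tool: the function names, the genotype coding (0, 1, or 2 copies of an allele, with None for a missing call), and the thresholds (minor allele frequency of 0.01, call rate of 0.98, Hardy-Weinberg p-value of 1e-6) are assumptions chosen for the example.

```python
import math


def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from genotypes coded as 0, 1, or 2 allele copies."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_p_value(genotypes):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(0), called.count(1), called.count(2)]
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of the counted allele
    p = 1 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(genotypes, min_maf=0.01, min_call_rate=0.98, min_hwe_p=1e-6):
    """Apply the three filters in turn; all thresholds are illustrative."""
    if call_rate(genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(genotypes) < min_maf:
        return False
    return hwe_p_value(genotypes) >= min_hwe_p
```

A variant would be kept only if `passes_qc` returns True. In practice, suitable thresholds depend on the study design and the genotyping platform.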
Typically, in a GWAS, linear or logistic regression models are used to test for associations, depending on whether the phenotype is continuous or binary. Covariates such as age, sex, and ancestry are included to account for stratification and to avoid confounding effects from demographic factors. Linear or logistic mixed models include an additional random effect term, which can be individual specific, to account for relatedness among individuals. This can improve the statistical power for genomic discovery and increase control for stratification, at the cost of requiring greater computational resources.
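As a rough illustration of a single-variant test for a continuous phenotype, the sketch below regresses the phenotype on the allele dosage. It is deliberately simplified: it omits the covariates and random effects discussed above, and it uses a normal approximation for the p-value rather than a t-distribution. The function name and input format are assumptions made for the example.

```python
import math


def linear_association(dosages, phenotypes):
    """Regress a continuous phenotype on allele dosage (0, 1, or 2).

    Returns the effect size, its standard error, and a two-sided p-value
    based on a normal approximation.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:
        # Perfect fit: the test statistic is unbounded.
        return beta, se, 0.0
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p_value
```

In a full analysis this test would be repeated for every variant, with covariates added to the model; a binary phenotype would call for logistic regression instead.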
Testing millions of associations between individual genetic variants and a phenotype of interest requires a stringent multiple-testing threshold to avoid false positives. The International HapMap Project and other studies have shown that there are approximately 1 million independent common genetic variants across the human genome. The threshold used will depend on the needs of the study; for example, a more stringent threshold may be needed for populations with larger effective population sizes. However, this can increase the testing burden, especially as sample sizes grow and thresholds become stricter. Complex traits, such as height, schizophrenia, or type 2 diabetes, tend to be highly polygenic, with many genetic variants of small effect contributing to the phenotype.
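The arithmetic behind a multiple-testing threshold can be shown with a Bonferroni correction. Assuming a conventional family-wise error rate of 0.05 and the roughly 1 million independent common variants mentioned above, the per-variant threshold comes to 5e-8. The function names and the SNP identifiers in the usage note are hypothetical.

```python
def bonferroni_threshold(alpha, n_independent_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_independent_tests


def significant_hits(p_values, alpha=0.05, n_independent_tests=1_000_000):
    """Keep only the variants whose p-value falls below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_independent_tests)
    return {snp: p for snp, p in p_values.items() if p < threshold}
```

For example, `significant_hits({"snp_a": 1e-9, "snp_b": 1e-5})` keeps only `snp_a`, because 1e-5 is above the corrected threshold.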
To increase sample sizes, genome-wide association studies can be carried out within consortia such as the Psychiatric Genomics Consortium, the Genetic Investigation of Anthropometric Traits consortium, or the Global Lipids Genetics Consortium, where data from multiple cohorts can be analyzed using different software tools. Combining these data involves ensuring that individual cohorts follow the same predefined data analysis plan, use harmonized phenotypes, and communicate results in a standardized way.
This can include scaling effect sizes to a standard normal distribution, as phenotypic measurements and their estimated absolute effect sizes cannot always be compared across cohorts. Cohort-level inspection of submitted results using a predefined quality control protocol can be carried out by at least two independent analysts, with issues resolved in individual cohorts. Finally, a meta-analysis can be performed on the summary statistics, which may involve assumptions about the error variances across cohorts.
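One common way to combine cohort-level summary statistics is a fixed-effect, inverse-variance-weighted meta-analysis, sketched below for a single variant. The text does not say which meta-analysis model a given consortium uses, so this choice is an assumption, as are the function name and the normal approximation for the p-value.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect meta-analysis of one variant across cohorts.

    Each cohort is weighted by the inverse of its squared standard error,
    so more precise estimates contribute more to the pooled effect.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, p_value
```

The effect sizes must be on a comparable scale and refer to the same effect allele in every cohort before they are pooled, which is why harmonization comes first.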
The rationale for genome-wide association studies (GWAS) includes the investigation of common diseases to improve human health and the study of plant genetics to improve agriculture. These studies can also shed light on the pathogenesis of common diseases and disorders and provide an unbiased, hypothesis-free method of examining the human genome. GWAS have so far proven successful in both medicine and agriculture, with the discovered genetic variants having notable impacts.
Discoveries from genome-wide association studies have provided insights into the influence of the genome on traits and disease, in human health and beyond.
GWAS results have been reported for hundreds of complex traits across a range of domains, including common diseases, quantitative traits that are risk factors for disease, brain imaging phenotypes, genomic measures such as gene expression and DNA methylation, and social and behavioral traits such as subjective well-being and educational attainment. These associations have proven highly replicable, both within and between populations, provided sample sizes are adequate. One unambiguous conclusion from these studies is that for almost any complex trait, many loci contribute to standing genetic variation; that is, for most traits and diseases studied, the mutational target in the genome appears large, so that polymorphisms in many genes contribute to genetic variation in the population.
The number of segregating variants in the human population is large but finite, and while it is not known what proportion of the segregating variants are associated with complex-trait genetic variation, many of the traits that have been studied are associated with variants at hundreds to thousands of loci in the genome. This suggests that some underlying causal variants are shared between traits. Multiple lines of evidence are consistent with widespread pleiotropy for complex traits. First, Mendelian mutations that cause specific syndromes or diseases are frequently associated with multiple phenotypes in an affected individual. Second, pedigree studies have reported genetic correlations between traits, which implies that a number of the same genetic variants affect two or more traits in a consistent direction. Third, GWAS have shown that the same genetic variants can be associated with multiple diseases and traits when phenotypes are measured on individuals and environmental associations are not driving the results.
Genome-wide association studies have led to new analysis methods that fall into a number of categories which depend on their purpose:
- Methods of better modeling population structure and relatedness in a sample during association analyses
- Methods of detecting new variants and gene loci on the basis of GWAS summary statistics
- Methods of estimating and partitioning genetic (co)variance
- Methods of inferring causality
In addition, GWAS discoveries and interpretation have benefited from improved algorithms in statistical imputation of unobserved genotypes and statistical imputation of human leukocyte antigen genes and amino acid polymorphisms.
In 2007, it was shown that GWAS data could be used to create genetic predictors for disease and other complex traits by estimating effect sizes at multiple loci and applying the estimated SNP effects to independent samples to generate a polygenic risk score (PRS) per individual. A key driver of prediction accuracy is the size of the discovery sample used for estimating the effects of individual variants. PRSs have also been applied in clinical settings for the prediction of an individual's risk of disease and in applications for new experimental designs and discoveries. They have also been useful for detecting new trait associations by correlating observed phenotypes in a sample or cohort with the genetic prediction of another trait. The PRS paradigm can also be applied to the prediction of molecular phenotypes such as gene expression and used for mining the human phenome for associations with predictors derived from diseases and other traits.
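At its simplest, a polygenic risk score is a weighted sum: each variant's effect-allele dosage in the target individual is multiplied by the effect size estimated in the discovery sample, and the products are added up. The sketch below assumes dictionaries keyed by SNP identifier and skips variants that were not genotyped in the individual; the function name and the identifiers in the usage note are hypothetical.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of effect-allele dosages for one individual.

    dosages:      SNP id -> copies of the effect allele (0, 1, or 2)
    effect_sizes: SNP id -> effect size estimated in a discovery GWAS
    SNPs without a genotype for this individual are skipped.
    """
    return sum(
        beta * dosages[snp]
        for snp, beta in effect_sizes.items()
        if snp in dosages
    )
```

For example, with effect sizes `{"snp_1": 0.2, "snp_2": -0.1}` and dosages `{"snp_1": 2, "snp_2": 1}`, the score is 2 × 0.2 + 1 × (−0.1) = 0.3. Real pipelines typically also select or reweight variants to account for linkage disequilibrium, which this sketch leaves out.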
Sharing genetic data has contributed to the success of the gene-mapping community. The availability of GWAS summary statistics in the public domain has increased, and hundreds of datasets are now publicly available. There were initial concerns that making these data available could lead to the identification of individuals, but identification has proved difficult because sample sizes and summary statistics are large and because providing average allele frequencies from a reference sample negates potential identification. Data sharing has been largely beneficial to the genomics field.
The variants mapped through GWAS provide a strong genetic anchor to complex disease biology and can lead to the development of new therapies. However, going from genetics to function requires robust model systems in which disease-causal cells and tissues can be probed and manipulated. For example, tumor-derived human cell lines have been relevant for the identification of novel drug targets in cancer. Such a model system can provide clues for drug target validation, as it offers a look into the molecular mechanisms of disease, helps identify relevant genes, and allows compounds with therapeutic potential to be screened at high throughput. However, for many complex diseases, discovering which cells are causal is the first step.
The translation of genetic loci into biological mechanisms that underlie disease has been an arduous task, and a major challenge has been exploring the functional consequences of identified variants, because the majority of GWAS-identified variants lie in the non-coding part of the genome. However, GWASs have generated an enormous amount of information that has fueled applied epidemiological research. Some of the most prominent applications are Mendelian Randomization (MR) and polygenic scores. MR is used to assess causality between an exposure and an outcome.
Genetic variants that are associated with the exposure are used to randomize a population into those with high exposure and those with low exposure. If the same genetic variants are also associated with the disease outcome through their association with the exposure, causality between exposure and disease can be inferred. MR analyses have been performed to confirm or refute causal relationships between numerous correlated traits and diseases. This approach has also been used to validate putative drug targets prior to the initiation of clinical trials and to determine potential side effects of therapeutic interventions.
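With summary statistics, the logic of MR can be sketched with two standard estimators: the Wald ratio for a single variant, and an inverse-variance-weighted (IVW) estimate that combines several variants. The text does not name a specific estimator, so this choice is an assumption, as are the function names. The sketch also takes for granted that the variants are valid instruments, meaning they affect the outcome only through the exposure.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Causal effect estimate from one variant: outcome effect per unit of exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(betas_exposure, betas_outcome, ses_outcome):
    """Inverse-variance-weighted MR estimate across several variants.

    Equivalent to a weighted regression through the origin of the
    variant-outcome effects on the variant-exposure effects.
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(betas_exposure, ses_outcome)
    )
    estimate = numerator / denominator
    standard_error = math.sqrt(1 / denominator)
    return estimate, standard_error
```

If every variant's effect on the outcome is exactly half its effect on the exposure, both estimators return 0.5, which would be read as the outcome changing by 0.5 units per unit change in the exposure.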
Three examples of adult-onset disease demonstrate some of the advances that have followed from the use of GWAS: type 2 diabetes, autoimmune diseases, and schizophrenia. In general, drug targets that are genetically informed have a higher probability of making it to a phase III trial or to market.
Scores of genes have been implicated in monogenic forms of diabetes. Recent efforts to extend GWASs beyond array-based genotyping and to access a broader range of variants through sequencing have revealed that most of the genetic variation that influences type 2 diabetes appears to reside at common variant sites. This agrees with a view of type 2 diabetes as a largely post-reproductive trait and is consistent with the failure to detect compelling empirical evidence that type 2 diabetes risk alleles have been subject to marked purifying selection.
GWASs have been undertaken for all major immune-mediated diseases, resulting in hundreds of associated loci. The development of statistical approaches for cross-disease studies to identify pleiotropic loci has been particularly productive in identifying new genes and in understanding the pathogenic relatedness of immune-mediated diseases. A cross-disease study involving conditions including ankylosing spondylitis, inflammatory bowel disease, primary sclerosing cholangitis, and psoriasis identified thirty new loci at genome-wide significance without any further genotyping.
Although psychiatric diseases had a slow start in GWAS locus identification, the research has picked up: a linear relationship between sample size and number of loci has been observed, and more than one hundred risk loci have been discovered. These risk loci have been found to be enriched in genes containing de novo mutations in schizophrenia, autism, and intellectual disabilities, and several identified loci contain genes relevant to major hypotheses of schizophrenia etiology and genes that extend previous observations of association with voltage-gated calcium channel subunits. Another finding was the highly polygenic nature of the common variants contributing to risk. This observation has since been replicated, and there is evidence of substantial pleiotropy with other psychiatric disorders.
One of the hopes for genome-wide association studies is that identified genetic loci can be translated into the biological mechanisms that underlie diseases. This is an arduous task, and one of the major challenges is exploring the functional consequences of identified variants, since the majority of variants identified in GWASs lie in the non-coding parts of the genome. However, further research has produced multi-omics data across multiple cell types and tissues. More computational pipelines are being developed to integrate these multi-omics data with genome-wide association data, to help determine the regulatory impact of a locus, prioritize the likely causal variant and gene, and determine the tissues that are key to the pathogenesis of a given disease.
For example, using these computational tools, more than 20 percent of the loci associated with type 2 diabetes have been mapped to the most likely causal variant. Subsequent validation, using targeted molecular experiments, can be critical to establishing the role of a prioritized gene or variant. The overall generation of more and new data and the development of technologies and analytical approaches can continue to offer translation of a growing number of GWAS loci into meaningful biology and provide clinical targets for further study.
Besides offering data for translational research, GWAS have also generated information that has been applied to epidemiological research. Among the more prominent applications are Mendelian Randomization (MR) and polygenic scores. MR is used to assess causality between an exposure and an outcome. Genetic variants associated with the exposure are used to randomize a population into those with high exposure and those with low exposure. If the same genetic variants are also associated with the disease through their association with the exposure, causality between exposure and disease can be inferred.
MR analyses have been performed to confirm or refute causal relationships between numerous correlated traits and diseases. This approach has also been used to validate putative drug targets prior to clinical trials and to determine the potential side effects of therapeutic interventions.
GWAS data have also been applied to polygenic risk scores (PRSs) for disease stratification and precision medicine. A PRS estimates an individual's lifetime genetic susceptibility to disease by aggregating the effects conferred by the millions of variants tested in a GWAS. The assumption is that individuals with a high PRS have an above-average lifetime genetic risk of developing a given disease. The clinical utility of the PRS still needs to be assessed in the context of existing clinical predictors of risk. Even so, the PRS offers healthcare providers a chance to inform decisions for their patients, with some of the most promising evidence for its utility seen in cardiovascular diseases and cancer.
There has been some criticism of GWAS, especially of its clinical applications. Some researchers argue that because GWAS loci confer only a small increase in disease risk and explain only a fraction of the heritability, the overall importance of GWAS is reduced. However, as more loci have been translated into biological data, the evidence for, and the strength of, associations at GWAS loci and their biological importance have increased. Another criticism has been the underrepresentation of individuals of non-European ancestry, who make up only around 10 percent of GWAS participants. This lack of representation can limit the transferability of GWAS results across populations.
Genome-wide association studies have been used in agriculture to find the genetic basis of plant traits such as flowering time, growth rate, and yield, with a focus on improving crops and understanding plant adaptation. One plant, A. thaliana, has been an attractive model for the study of natural variation and adaptation because of its wide distribution, the diversity of its habitats, and the unequaled genomic resources available for the species.
GWAS have acted as an important tool in plant breeding. Large genotyping and phenotyping datasets make GWASs a powerful tool for analyzing the complex inheritance modes of traits that are important yield components, including number of grains per spike, weight of each grain, and plant structure. One GWAS on spring wheat revealed a correlation of grain production with booting date, biomass, and number of grains per spike.
The emergence of plant pathogens has posed threats to plant health and biodiversity. With this in mind, identifying wild types with natural resistance to pathogens could be important, as the variation in the genes responsible could be used to breed greater resistance into a strain. GWAS can detect the relationships between genetic variants and resistance to a plant pathogen, which can be a first step in developing new cultivars or strains with greater pathogen resistance.
The potential for A. thaliana as a plant to study was demonstrated in the functional validation of the gene ACD6. Natural variation in ACD6 has been shown to underpin differences in vegetative growth and in resistance to microbial infection and herbivory. With a loss-of-function allele of ACD6, the plants displayed increased leaf necrosis with reduced growth and reduced susceptibility to different pathogens. Genome-wide association studies were performed for leaf necrosis on a set of ninety-six natural accessions. Nine of the fifteen SNPs with the lowest P-values in the scan were located close to or within the ACD6 gene. This study confirmed the ability of GWAS to detect true positives from results previously known.
The gene density seen in A. thaliana means that some genomic regions display extended clusters of high-scoring SNPs instead of sharp peaks. The broad "mountain range" of associations can make the selection of candidate genes difficult, and these "mountains" can be wide because of a recent selective sweep or because of low recombination. Further, genetic or allelic heterogeneity can create ranges with multiple peaks. Such ranges of tightly linked genes have been shown to underlie a complex association with growth rate variation.
A limitation of GWAS is the difficulty of identifying individual genes in the midst of false positives, which can be an artifact of population structure. The worldwide set of natural A. thaliana accessions is structured, and when the phenotypic variation of traits of interest overlaps with the patterns of population structure, strong confounding can occur. Nevertheless, GWAS in A. thaliana have been shown to have significant power in detecting known candidate genes and have detected hundreds of loci that are involved in the natural variation of complex traits. This new knowledge has increased the understanding of the number of genes underlying adaptive traits and the size of their effects, and it allows researchers to better understand the bases of flowering time, growth rate, and yield.
Two of the most important crops in the world, maize and rice, have been the focus of efforts to map the ancestral genetic variations underlying agronomic traits such as grain yield, disease resistance, and plant architecture. Maize is an outcrossing plant with a large genome that requires the typing of many SNPs to define a haplotype map. This has included a set of 1.6 million SNPs designed for maize GWAS, but dense genotyping of a large number of lines has been prohibitively expensive for research.
Initially, the approach taken was instead to genotype a limited number of lines and cross them to produce families, known as nested association mapping populations. The main advantages of this approach have been that high-density genotypes can be imputed with some fine-mapping resolution; that outcrossing reshuffles variation in the founder genomes and can provide some control of population structure effects; that joint-linkage mapping can be used to identify low-resolution QTL across families and to control for genetic background while performing nested association for fine mapping; and that repeated measures of phenotypes on the same lines, in common and different environments, allow for a precise estimation of variation in traits such as flowering time, leaf architecture, and blight resistance.
Rice, like A. thaliana, is a selfing species and an equally good candidate for GWAS. Researchers identified an unbiased set of common SNPs, which they used to identify strong associations between genetic loci and fourteen agronomic traits, including heading date, grain size, and starch quality. In this research, a strategy based on second-generation sequencing technology was used to develop a haplotype map for 517 Chinese landraces across the Oryza indica and Oryza japonica rice subspecies. GWAS was then performed using SNPs in a subset of 373 indica lines to avoid major confounding from population structure between subspecies.
This research identified between one and seven loci for each agronomic trait, which explained between 6 percent and 68 percent of the variation in the specific trait. Genome-wide association studies in rice have also discovered a few genes with large effects on traits involved in determining yield, morphology, stress tolerance, and nutritional quality. This research establishes a platform to link genomic variation and germplasm collections for molecular breeding.
Software plays an important role in genome-wide association studies, as it allows for the modeling of different types of gene action, including additive, simplex dominant, and duplex dominant. This software can also analyze data generated by GWASs and, in some cases, offers gene simulations. Two primary genotyping platforms used for many GWAS are products from Illumina and Affymetrix, which take two different approaches to measuring SNP variation. The Affymetrix platform prints short DNA sequences as a spot on the chip, each capable of recognizing a specific SNP allele. Illumina, on the other hand, uses a bead-based technology with longer DNA sequences to detect alleles. The Illumina chips tend to be more expensive but have been found to provide better specificity.
This software can also help researchers detect further loci through meta-analysis of studies from the same populations, which increases the sample size beyond that of an individual study. Some software analysis packages incorporate routines for meta-analysis but can be ill-equipped for the scale and complexity of the data generated from genome-wide association studies.