GWAS – Genes to Genomes https://genestogenomes.org A blog from the Genetics Society of America Tue, 01 Oct 2024 22:44:01 +0000 en-US hourly 1 https://wordpress.org/?v=6.6.2 https://genestogenomes.org/wp-content/uploads/2023/06/cropped-G2G_favicon-32x32.png GWAS – Genes to Genomes https://genestogenomes.org 32 32 Four new pipelines to streamline and improve genomic analyses https://genestogenomes.org/four-new-pipelines-to-streamline-and-improve-genomic-analyses/ Tue, 17 Sep 2024 13:00:00 +0000 https://genestogenomes.org/?p=87443 G3 reports exciting methods designed to make specific genomic analyses easier.]]>

As part of its scope, G3 Genes|Genomes|Genetics is dedicated to reporting new methods and technologies of significant benefit to the genetics community. Here, we highlight a selection of new analysis pipelines and software developments from the August 2024 issue that promise to improve research and practical applications in their respective subfields. These advances include easy and ready-to-use genomics tools that improve data management and analysis and overcome long-time challenges, emphasizing the ongoing progress and innovation happening in genomics.

An easy-to-use phylogenetic analysis pipeline

A new turn-key pipeline called OrthoPhyl has answered the call to improve the phylogenetic analysis of bacterial genomes. Developed by Middlebrook et al., OrthoPhyl can analyze up to 1,200 input genomes and reconstruct high-resolution phylogenetic trees based on whole genome codon alignments from diverse bacterial clades.

The beauty of OrthoPhyl is that it streamlines a usually complex, multi-step process requiring extensive bioinformatics expertise and computing resources into a multi-threaded tool that runs from a single command.

With more than 2 million publicly available bacterial genomes in NCBI’s GenBank database, OrthoPhyl can help research groups in the fields of bacterial phylogenetics and taxonomy take advantage of existing datasets to inform their ongoing analyses amid the ever-expanding sea of bacterial diversity.

Accurate genotype phasing and inference of grandparental haplotypes

To improve the analysis of complex plant genomes, Montero-Tena et al. have developed a new computational pipeline called haploMAGIC, which lets researchers identify locations of recombination known as genome-wide crossovers (COs) in multi-parent populations. haploMAGIC uses single-nucleotide polymorphism (SNP) data and known pedigree information to accurately phase genotypes, i.e., determine which alleles were inherited from each parent, and to reconstruct grandparental haplotypes, i.e., determine which alleles were inherited from each grandparent.

When tested on real-world data, haploMAGIC improved upon existing methods by using different levels of haploblock filtering to prevent false-positive COs—a common limitation—even as rates of genotyping errors increased. haploMAGIC can also distinguish between COs and gene conversions. By learning more about the position and frequency of genetic recombination events in complex plant genomes, breeders can better manage and expand genetic variation in their breeding programs.
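To make the idea of haploblock filtering concrete, here is a minimal sketch. It is not haploMAGIC's actual algorithm: the function names, the `min_block` parameter, and the toy data are all hypothetical. It assumes each marker along a chromosome has already been labelled with the grandparental haplotype it came from, and it counts a crossover wherever the label switches after very short runs have been discarded.

```python
from itertools import groupby

def haploblocks(origins):
    """Collapse per-marker grandparental labels into (label, run_length) pairs."""
    return [(label, len(list(run))) for label, run in groupby(origins)]

def count_crossovers(origins, min_block=1):
    """Count label switches after dropping runs shorter than min_block markers.

    min_block is a hypothetical filtering parameter used for illustration only.
    """
    kept = [label for label, length in haploblocks(origins) if length >= min_block]
    # Runs that became neighbours after filtering are merged before counting.
    merged = [label for label, _ in groupby(kept)]
    return max(len(merged) - 1, 0)

# Toy chromosome: 'A' and 'B' are the two grandparental haplotypes.
# The isolated 'B' at the fifth marker mimics a genotyping error.
origins = list("AAAABAAAABBBBBB")
unfiltered = count_crossovers(origins, min_block=1)
filtered = count_crossovers(origins, min_block=3)
print(unfiltered, filtered)
```

Without filtering, the single stray marker adds two spurious switches; with a minimum block length it is ignored. A real tool has to be more careful, since a short block can also be a genuine gene conversion rather than an error.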

A complete HiC/HiFi assembly pipeline

The USDA-ARS AgPest100 Initiative aims to create high-quality genome assemblies of pest insects that threaten agricultural production. However, the high cost and time currently needed to produce and manage these assemblies often hinder progress.

Molik et al. set out to address this challenge by developing a new Hi-C/high-fidelity (HiFi) sequencing genome assembly pipeline called only the best (otb), written in the Nextflow workflow language. They then used otb to create a HiC/HiFi genome of the two-lined spittlebug, a significant agricultural pest that is not well understood. Overall, otb was able to streamline the process and reduce manual input and analysis time, including time spent organizing data and installing and calibrating bioinformatic tools.

By saving time, otb can significantly reduce costs for large genomic projects like AgPest100 and pave the way for new discoveries. Indeed, the HiC/HiFi assembly of the spittlebug genome represents a first step toward better understanding this plant-eating pest, which may lead to new, sustainable ways to manage it.

Assigning triploids to their diploid parents

Roche et al. have developed the first publicly available, ready-to-use software for assigning triploid fish to their diploid parents. Triploidy means that an organism has three sets of chromosomes instead of two. Sterile triploids are commonly used in aquaculture breeding programs for their better yield and growth and to prevent genetic contamination of wild fish populations. The authors improve upon existing frameworks by updating the parentage assignment R package APIS to support triploids with diploid parentage.

When assessed with simulated and real datasets, APIS accurately assigned triploid offspring to their diploid parents using both likelihood and exclusion methods. The new software represents a key tool for establishing pedigrees in fish farming.


]]>
Balancing genetic privacy with open data in genomic research https://genestogenomes.org/balancing-genetic-privacy-with-open-data-in-genomic-research/ Wed, 05 Jun 2024 20:51:08 +0000 https://genestogenomes.org/?p=87188 A new encryption method published in GENETICS allows researchers to maintain human data confidentiality without the need for decryption in genomic analyses.]]>

Genome-wide prediction and association studies offer a powerful approach to connecting genotype to phenotype at a large scale, but performing genomic analyses in humans invokes genomic privacy concerns that complicate the sharing of data. In a study published in the March issue of GENETICS, Zhao and colleagues expand an existing encryption approach, offering a secure avenue to perform genomic analysis without compromising confidentiality.

In whole-genome analysis, such as genomic prediction and genome-wide association studies (GWAS), researchers use statistical methods to compare genetic variants across many genomes to calculate genetic effects and estimate heritability. Linear mixed models allow testing for associations in both continuous traits, such as height, blood pressure, and body mass index, and binary phenotypes, such as disease status. Information about covariates like age, sex, and family origin is critical to assess confounding effects originating from demographic factors. In these cases, linear mixed model analysis helps account for genetic relatedness among individuals, which is necessary to strengthen statistical inference for discoveries made from the genomics data.

Because of inherent privacy and intellectual property concerns, direct sharing of raw genotype and phenotype data is often prohibited, particularly in human research. Researchers must first anonymize sensitive information such as individual ID numbers, sex, disease status, family relations between individuals, and other covariates before performing any calculations.

So then, in a research landscape that values open-access data principles like FAIR (findable, accessible, interoperable, and reusable), how can population geneticists make their data widely available without compromising the privacy of the individuals in question?

Several data encryption approaches that obscure sensitive information have been developed. One of them, homomorphic encryption for genotypes and phenotypes (HEGP), encrypts genotype, phenotype, and covariate data in a way that cannot be linked back to original identifiers, thus maintaining data privacy. However, HEGP had previously been proposed only for single-marker regression in GWAS using linear mixed models. Zhao et al. therefore extended it for wider application in genome-to-phenome analyses and demonstrated that HEGP can be effectively applied to many popular mixed models for genomic analyses of quantitative traits, beyond single-marker regression.

The authors used the HEGP scheme to perform linear mixed model analysis without the need for data decryption before the analysis. They successfully measured random effects originating from covariates that matched the original sample data.
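Why can an analysis run on encrypted data give the same answer? The sketch below illustrates the general principle under a loudly stated assumption: that the encryption key is a random orthogonal matrix applied to the phenotype vector and to every column of the design matrix. It uses ordinary least squares on made-up numbers rather than a full mixed model, and a single Householder reflection stands in for the key, so it shows the idea rather than the authors' implementation. All names and values are hypothetical.

```python
import random

def householder(v):
    """Return the orthogonal reflection H = I - 2 v v^T / (v^T v)."""
    n = len(v)
    norm2 = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm2
             for j in range(n)] for i in range(n)]

def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def ols(ones, g, y):
    """Intercept and slope for y ~ ones + g, solved from the 2x2 normal equations."""
    a = sum(o * o for o in ones)
    b = sum(o * x for o, x in zip(ones, g))
    d = sum(x * x for x in g)
    r0 = sum(o * t for o, t in zip(ones, y))
    r1 = sum(x * t for x, t in zip(g, y))
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)

# Made-up genotypes (allele counts) and phenotypes for eight individuals.
genotypes = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0, 1.0, 0.0]
phenotypes = [1.1, 2.0, 3.2, 1.8, 0.9, 2.9, 2.2, 1.0]
ones = [1.0] * len(genotypes)

# Assumption: the secret key is an orthogonal matrix (here a toy reflection).
rng = random.Random(1)
key = householder([rng.gauss(0, 1) for _ in genotypes])

plain = ols(ones, genotypes, phenotypes)
masked = ols(matvec(key, ones), matvec(key, genotypes), matvec(key, phenotypes))
print(plain)
print(masked)
```

Because an orthogonal matrix preserves inner products, the normal equations are identical before and after masking, so the two estimates agree up to floating-point error even though no individual's masked row resembles their original data.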

They also demonstrated the HEGP method’s usefulness in analyzing genotype-phenotype characterization from multiple studies. In genomics, certain traits are difficult and expensive to measure, which often leads to studies with lower sample sizes. Researchers usually need to analyze multiple underpowered studies together to increase statistical power. Zhao et al. showed their HEGP expansion can combine multiple datasets for joint genomic analyses while preserving data confidentiality.

In conclusion, geneticists have an encryption method available for genomic analyses that allows them to perform necessary statistical analyses without disclosing sensitive information, thereby avoiding privacy concerns altogether. 


]]>
Hongyu Zhao joins GENETICS as new Senior Editor https://genestogenomes.org/hongyu-zhao-joins-genetics-as-new-senior-editor/ Tue, 16 Apr 2024 16:09:20 +0000 https://genestogenomes.org/?p=86974 A new senior editor is joining GENETICS in the Statistical Genetics and Genomics section. We’re excited to welcome Hongyu Zhao to the editorial team.]]>

Hongyu Zhao
Senior Editor, Statistical Genetics and Genomics

Hongyu Zhao is the Ira V. Hiscock Professor of Biostatistics, Professor of Genetics, and Professor of Statistics and Data Science at Yale University. He received his BS in Probability and Statistics from Peking University in 1990 and his PhD in Statistics from the University of California, Berkeley in 1995. His research interests are the development and application of statistical methods in molecular biology, genetics, therapeutics, and precision medicine, with a focus on genome-wide association studies, biobank analysis, and single-cell analysis. He is an elected fellow of the American Association for the Advancement of Science, the American Statistical Association, the Institute of Mathematical Statistics, and the Connecticut Academy of Science and Engineering. He received the Mortimer Spiegelman Award for a top statistician in health statistics from the American Public Health Association and the Pao-Lu Hsu Prize from the International Chinese Statistical Association.


]]>
Scientists pinpoint the “fight” in fighting chickens https://genestogenomes.org/scientists-pinpoint-the-fight-in-fighting-chickens/ Thu, 04 Apr 2024 14:09:00 +0000 https://genestogenomes.org/?p=86965 A genome-wide association study published in G3: Genes|Genomes|Genetics offers insights into the genetic origins of aggression in gamecocks.]]>

While the controversial practice of cock fighting was recently criminalized across the United States, it remains part of many cultures throughout the world. At first glance, fighting chickens look like ordinary chickens raised for livestock, but these so-called gamecocks have been selectively bred over thousands of years to be highly aggressive. Now, research published in the February issue of G3: Genes|Genomes|Genetics pinpoints the genetic origins of that aggression.

Evidence suggests that cockfights took place as early as 2700 BC in China and 2500 BC in the Indus Valley, which was then home to the Bronze Age Harappan civilization. Gamecocks continue to be bred for cock fighting worldwide with well over a hundred breeds sharing high levels of aggression. In contrast, ordinary “nongame” chickens, which were domesticated from wild red junglefowl, are far less likely to initiate or engage in a fight. Though scientists have identified genetic regions distinct to local populations of Chinese gamecocks, the larger genetic relationship between gamecocks and nongame chickens, as well as the genetic basis of gamecock traits, has not been definitively established.

To address these questions, Bendesky et al. sequenced the genomes of gamecocks from around the world alongside a representative set of nongame chickens. They combined their data with other published chicken genomes to form a large, diverse dataset that includes samples from 12 countries and 108 recognized chicken breeds.

Through a genome-wide association analysis, the authors found that gamecocks have a specific “game ancestry” related to a non-coding variant within the isoprenoid synthase domain containing gene (ISPD), which has been implicated in muscular dystrophy-dystroglycanopathy in humans. Present at a frequency of 89.5 percent in gamecocks compared to 3.7 percent in nongame chickens, the variant is what most strongly differentiates gamecocks and nongame chickens worldwide. Likely the result of selective breeding for aggression in gamecocks and against aggression in nongame chickens, the near absence of this variant in nongame chickens suggests it may be the key to gamecock aggression—finally giving us an explanation for why gamecocks are always in a “fowl” mood.

The researchers also report that genetic similarities among chickens are largely based on geographic proximity, with gamecocks clustering with other chickens of similar regions. What’s more, the evolutionary patterns reflect historical colonization events, with North and South American gamecocks most closely related to chickens from Spain.

Though gamecocks did not cluster together, when examining the ISPD locus, the authors found it to be highly differentiated and very similar among most gamecocks but not nongame chickens. Based on this ancestral pecking order, it’s likely that gamecocks originated from a common ancestor before intermixing with local chicken populations throughout the world, where selective breeding for cock fighting maintained their aggressive traits.

Although genetic variation at the ISPD locus could affect several genes, the authors posit that ISPD itself may be the target of that variation due to its proximity and known involvement in axon guidance during brain development. This offers a potential explanation of how variation around ISPD may contribute to behavioral changes associated with aggression in gamecocks beyond its effects on muscle. Many of the breeds included in the study had not previously been sequenced at this depth of coverage. By assembling the largest and most diverse chicken genome dataset to date, the authors hope to provide a much-needed genomic resource for agricultural genetics. Because variants associated with gamecocks can still be selected against in nongame chickens, it may be within our reach to selectively reduce aggression in nongame chickens to avoid ruffling any feathers.


]]>
Heights and pitfalls in detecting polygenic adaptation https://genestogenomes.org/heights-and-pitfalls-in-detecting-polygenic-adaptation/ https://genestogenomes.org/heights-and-pitfalls-in-detecting-polygenic-adaptation/#comments Wed, 04 Apr 2018 14:30:10 +0000 https://genestogenomes.org/?p=15341 Identifying signatures of polygenic adaptation is getting easier—but a commentary calls for caution in drawing conclusions. If you’ve ever wished for a stepstool so you could see the stage at a crowded concert, or, conversely, if you’re tired of being asked “How’s the weather up there?”, you’ve likely pondered what makes some of us tall…]]>

Identifying signatures of polygenic adaptation is getting easier—but a commentary calls for caution in drawing conclusions.


If you’ve ever wished for a stepstool so you could see the stage at a crowded concert, or, conversely, if you’re tired of being asked “How’s the weather up there?”, you’ve likely pondered what makes some of us tall and others short. You’re in good company; geneticists have been thinking hard about height since the dawn of the field.

Height is a classic example of a polygenic trait, meaning that its genetic component is dictated by the combined action of many genes and genetic loci acting together in a complex way. With the advent of new techniques, analyzing the evolution of these complex traits is becoming easier. But a new Commentary in GENETICS by editors John Novembre and Nick Barton cautions that such studies are ripe for misinterpretation by the public and policymakers—particularly when it comes to human traits more controversial than our height.

The Commentary was prompted by a useful new method reported in the same issue of GENETICS. This technique, developed by Racimo, Berg, and Pickrell, helps geneticists analyze how polygenic traits like height have adapted and changed over the course of evolution.

Through heritability studies, linkage analyses, and genome-wide association studies (GWAS), we now have a reasonable view of which genetic loci contribute to determining a person’s height. But piecing together how selection shapes such traits is much more complicated than studying how a single-gene trait evolves. In some cases, a trait responds to selection by a big change in the frequency of a single variant that strongly affects that trait. Other times, however, the trait adapts via the added effects of many tiny adjustments—small changes in frequency of many variants that subtly affect the trait in question. This phenomenon is known as polygenic adaptation.

While these subtle changes are too weak to have been picked up by more classical methodology, GWAS has made it possible to identify SNPs associated with polygenic traits, and it provides the power to detect small-effect variants. Comparing SNPs across populations can identify the signature of selection, and the new method does just that.
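As a rough illustration of how many small frequency shifts add up, the sketch below computes the expected additive genetic score of a population as the sum of 2 × allele frequency × GWAS effect size over all trait-associated SNPs. This is a standard simplification, not the admixture-graph method described in this post, and the effect sizes and frequencies are invented for the example.

```python
def mean_polygenic_score(effects, freqs):
    """Expected additive score in a diploid population: sum of 2 * p_i * beta_i."""
    return sum(2.0 * p * b for b, p in zip(effects, freqs))

# Hypothetical GWAS effect sizes for the trait-increasing or trait-decreasing allele.
effects = [0.02, -0.01, 0.03, 0.015, -0.02]

# Hypothetical allele frequencies in two populations. In population B every
# trait-increasing allele is 3 percentage points more common and every
# trait-decreasing allele is 3 points rarer: tiny shifts, all in one direction.
freqs_a = [0.50, 0.40, 0.30, 0.60, 0.55]
freqs_b = [0.53, 0.37, 0.33, 0.63, 0.52]

score_a = mean_polygenic_score(effects, freqs_a)
score_b = mean_polygenic_score(effects, freqs_b)
print(score_a, score_b, score_b - score_a)
```

No single SNP differs much between the two populations, yet the shifts are coordinated, so the scores diverge. Tests for polygenic adaptation ask whether such coordination is stronger than genetic drift alone would produce, which is exactly where the caveats discussed below come in.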

The authors use admixture graphs—a simplified representation of how populations have mixed and diverged over time—to explore the adaptation of over 40 traits measured in previous GWAS, finding preliminary evidence for selection on variants associated with height, self-reported unibrow, and educational attainment (years of schooling). By combining GWAS data with the known history of the populations in question, the authors were able to identify when in evolutionary history the selective pressures were most likely applied—they can pinpoint which branch of the graph shows signs of selection. That they found a signal for the polygenic adaptation of height is consistent with previous studies; however, the signals for self-reported unibrow in European populations and educational attainment in East Asian populations were more surprising.

What are we to conclude from these data? It’s tempting to make assumptions about the type of selective pressure acting on these traits and what the data say about fitness—especially when a trait like “educational attainment” is under discussion. But Novembre and Barton—together with the authors of the study—urge extreme caution in leaping to conclusions.

The editors provide context for interpreting any tests for polygenic adaptation, including those of Racimo, Berg, and Pickrell, critically urging care and attention when drawing conclusions from such data and communicating the implications. First, they discuss technical concerns like population stratification, transferring effect sizes, ascertainment bias, and accurate population modeling. Second, they remind us that it’s a difficult task to untangle the relationship between a trait that we can measure and a fitness advantage; it is misleading to assume that height itself, which we can measure, is directly conferring a fitness advantage such as increased survival or finding a mate. As in many areas of science, precision in language is key here: the Commentary points out that Racimo, Berg, and Pickrell also stress the many caveats of this type of analysis and choose their words carefully, discussing the signs of selection on loci associated with the studied traits—not on the traits themselves. Indeed, the target of selection may not be the actual trait measured in GWAS, but something genetically correlated with it.

While geneticists are getting a better view of the genetics behind complex human traits, it’s easy to sow confusion outside the field. Getting clear about what new data do and don’t say is the crucial first step in preventing our insights being misused.

CITATIONS

Detecting Polygenic Adaptation in Admixture Graphs
Fernando Racimo, Jeremy J. Berg and Joseph K. Pickrell
GENETICS April 2018. 208(4): 1565–1584.
DOI: 10.1534/genetics.117.300489
http://www.genetics.org/content/208/4/1565.full

Tread Lightly Interpreting Polygenic Tests of Selection
John Novembre and Nicholas H. Barton
GENETICS April 2018. 208(4): 1351–1355.
DOI: 10.1534/genetics.117.300786
http://www.genetics.org/content/208/4/1351.full

]]>
https://genestogenomes.org/heights-and-pitfalls-in-detecting-polygenic-adaptation/feed/ 4
New schizophrenia risk genes found by computational analysis https://genestogenomes.org/new-schizophrenia-risk-genes-found-by-computational-analysis/ Mon, 19 Dec 2016 13:00:21 +0000 https://genestogenomes.org/?p=7951 Symptoms of schizophrenia most commonly begin to creep up in young adulthood. Although genetics play a major role in this complex disorder, narrowing down the search for the genes involved has proven frustratingly difficult. Risk loci identified by genome-wide association studies (GWAS) may contain several genes, making it unclear which of these contribute to pathology.…]]>

Symptoms of schizophrenia most commonly begin to creep up in young adulthood. Although genetics play a major role in this complex disorder, narrowing down the search for the genes involved has proven frustratingly difficult. Risk loci identified by genome-wide association studies (GWAS) may contain several genes, making it unclear which of these contribute to pathology. Risk genes may even reside outside these loci if they are affected by regulatory elements in a risk locus. In the December issue of Genetics, Lin et al. report a new computational approach to parsing the challenging data on genetic variants that predispose to the disorder.

To determine which gene in a GWAS risk locus increases vulnerability to schizophrenia, the team used predictive features of schizophrenia risk genes derived from 56 genes with very strong evidence of being linked to the disorder. And to track down contributing genes that lie outside risk loci, their analysis included gene regulatory information, such as enhancer-promoter connections identified by the ENCODE and FANTOM5 projects. This approach identified many previously unknown risk genes, some of which are involved in processes related to schizophrenia, such as neural plasticity and synaptic transmission. Some other risk genes the researchers found are also linked to autism, a neurodevelopmental disorder with a few characteristics that overlap with schizophrenia.

The researchers also noticed an interesting pattern in the tissue types that express the risk genes. Unsurprisingly, a large number of the genes identified by the analysis are expressed in the central nervous system. But one in four of these genes are not, which implies that processes outside the brain contribute to the pathogenesis of schizophrenia. Consistent with other research that suggests immune dysfunction is involved in schizophrenia, many of these risk genes are highly expressed in B and T lymphocytes.

By analyzing the expression of the risk genes at different points in brain development using RNA-seq data from BrainSpan, the group also discovered a relationship between the risk genes’ strength of association with schizophrenia and the timing of their expression in the brain. The most strongly schizophrenia-associated genes were more often transcribed in the late stages of brain development (between 8 and 40 years), while less strongly associated risk genes were active during early or middle periods of brain development. This is consistent with the fact that schizophrenia most often presents in late adolescence or early adulthood.

The next steps are to understand how these genes might be related to the development of schizophrenia. Determining their roles in the disorder could help reveal more about the mechanisms of pathogenesis. Such insights could also inspire new drug treatments, which are needed because not all patients respond adequately to medication and existing drugs often have severe side effects, such as obesity and potentially permanent movement disorders. For the nearly one percent of the United States population affected by schizophrenia, who often require lifelong medication, further study of these risk genes is essential to developing better treatments.

CITATION:

Lin, J.; Cai, Y.; Zhang, Q.; Zhang, W.; Nogales, R.; Zhang, Z. Integrated Post-GWAS Analysis Shed New Light on the Disease Mechanisms of Schizophrenia.
GENETICS, 204(4), 1587-1600.
DOI: 10.1534/genetics.116.187195
http://www.genetics.org/content/204/4/1587

]]>
Anxious chickens as a model for human behavior https://genestogenomes.org/anxious-chickens-as-a-model-for-human-behavior/ https://genestogenomes.org/anxious-chickens-as-a-model-for-human-behavior/#comments Mon, 11 Jan 2016 14:00:25 +0000 https://genestogenomes.org/?p=4489 Chickens that “chicken out” in unfamiliar surroundings may shed light on anxiety in humans, according to research published in the January 2016 issue of the journal GENETICS. Domestic chickens are much less anxious than their wild cousins, the red junglefowl. The new research identifies genes that contribute to this behavioral variation and reveals that several…]]>

Chickens that “chicken out” in unfamiliar surroundings may shed light on anxiety in humans, according to research published in the January 2016 issue of the journal GENETICS.

Domestic chickens are much less anxious than their wild cousins, the red junglefowl. The new research identifies genes that contribute to this behavioral variation and reveals that several of the genes influence similar behaviors in mice. The authors argue that these results, combined with evidence from studies in humans, demonstrate the potential of the chicken to serve as a powerful model for understanding the genetic underpinnings of human behavior.

“By necessity, human genetic studies of behavior often focus only on susceptibility to a mental health disorder. But what about more subtle differences in behavior? For example, what makes one person a little more anxious than others? And what makes someone a little bolder?” said study leader Dominic Wright, of Linköping University in Sweden. “Animal models like the chicken allow us to address questions like these using controlled breeding experiments.”

But why choose the chicken as a model for anxiety? One reason is to take advantage of a “natural” genetics experiment, the transformation of red junglefowl in Asia into the modern domestic chicken. After thousands of years of breeding, the barnyard chicken has a different temperament from its jungle-dwelling counterpart: the chicken is tamer and less anxious. Anxiety behaviors in animals are typically measured by observing their activity in a brightly lit, featureless space that they have never encountered before (an “open field test”). In this setting, wild junglefowl spend most of their time either frozen with fear or darting around rapidly. They also avoid the exposed center of the test arena. Domestic chickens, in contrast, move around the whole area at a less erratic pace.

In addition, the chicken genome has properties that can make it easier to study than the human or mouse genomes. It is relatively small—around a third the size of the mouse genome—and it is grouped into smaller linkage disequilibrium blocks. These blocks are groups of neighbouring genes that tend to be inherited together, rather than being split up during recombination at each generation. Having smaller chunks gives researchers greater resolution in pinpointing genome regions associated with a trait.

To look for genome regions that contribute to variation in anxiety behaviors, the researchers performed a quantitative trait loci (QTL) analysis on the hybrid offspring of White Leghorn domesticated chickens and red junglefowl (using an experimental design called an eighth generation advanced intercross). The hybrid birds inherited a patchwork of gene variants from their chicken and junglefowl ancestors and varied widely in their anxiety levels as measured by the open field test. By correlating the behavior and genome data for each bird, the team identified fifteen QTLs that contributed to the variation in behavior.

Each of these genome regions included many genes, so the next step was to home in on specific genes of interest. The team narrowed down the search by examining the candidate genes’ activity in the hypothalamus, a region of the brain involved in regulating anxiety. The team examined expression QTLs (sequence variants that affected hypothalamic expression of a nearby gene) that were located within one of the behavior QTLs. These were considered plausible causal variants if they influenced gene expression in a pattern that correlated with the behavioral variation. For instance, the expression QTL might confer low expression of the candidate gene in individuals with high anxiety or vice versa.

Ten genes that fit these criteria were identified, of which six have previously been shown to have functions related to behavior. For example, the gene ADAM10 is needed for embryonic brain development and protection against amyloid plaques in neurodegenerative disease, and it influences learning and memory.

They then tested whether these ten genes also influenced behavior in studies of mice and humans. The mouse data came from a massive ongoing breeding experiment called the Mouse Heterogeneous Stocks cross, which includes behavioral data from open field tests just like those used in the chicken study. Four genes identified in the chicken data were also associated with anxiety behaviors in mice. In several cases, the genes influenced the same aspect of the open field test, activity, for both mouse and chicken.

The candidate genes were also examined in data from human genome-wide association studies (GWAS). Three genes were associated with schizophrenia or bipolar disorder. Although anxiety behaviors were not directly measured in the human studies, the authors argue that results for other disorders may be revealing. For instance, a large proportion of people with bipolar disorder have diagnosed anxiety disorders. There may also be some complex overlaps between schizophrenia symptoms and anxiety behaviors.

Using data from animal experiments to explore human GWAS in this way can help detect associations that would otherwise be difficult to distinguish from statistical noise, says Wright. Because GWAS often include huge numbers of markers, they must be analyzed using very stringent significance thresholds that could obscure true associations.
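The arithmetic behind that stringency is simple. The sketch below uses the Bonferroni correction, one common way of adjusting for multiple tests (the post does not say which correction the studies used), and compares a genome-wide scan with a short list of candidate genes. The marker count, candidate count, and p-value are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def passes(p_value, alpha, n_tests):
    """True if p_value survives Bonferroni correction for n_tests tests."""
    return p_value < bonferroni_threshold(alpha, n_tests)

alpha = 0.05
p_value = 1e-4            # hypothetical association signal for one gene

genome_wide = 1_000_000   # illustrative number of markers in a GWAS
candidates = 10           # illustrative number of genes prioritized from animal data

print(bonferroni_threshold(alpha, genome_wide), passes(p_value, alpha, genome_wide))
print(bonferroni_threshold(alpha, candidates), passes(p_value, alpha, candidates))
```

The same signal that is lost among a million tests clears the bar when only a handful of pre-selected candidates are examined, provided the candidates were chosen independently of the human data.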

“Though we can’t yet prove these genes have equivalent functions in chicken and humans, the data certainly raise the intriguing possibility that genes controlling variation in behavior can be remarkably conserved between a whole variety of species,” said Wright. “Understanding the genetics underlying the chicken results may provide fundamental insights into animal behavior, including normal behavioral variation in humans.”

 

CITATION

Genetical Genomics of Behavior: A Novel Chicken Genomic Model for Anxiety Behavior
Martin Johnsson, Michael J. Williams, Per Jensen, and Dominic Wright

Genetics, 202 (1), 327–340
http://www.genetics.org/content/202/1/327
http://dx.doi.org/10.1534/genetics.115.179010

]]>
https://genestogenomes.org/anxious-chickens-as-a-model-for-human-behavior/feed/ 1
Human Genetic Diversity and Social Inequalities https://genestogenomes.org/human-genetic-diversity-and-social-inequalities/ https://genestogenomes.org/human-genetic-diversity-and-social-inequalities/#comments Wed, 09 Sep 2015 23:37:00 +0000 https://genestogenomes.org/?p=2342 As ancient humans spread across the globe from their evolutionary birthplace in Africa, they tended to lose a little genetic diversity at each step along the way. New settlements were probably often founded by small groups that carried only a subset of the total diversity present in their homelands. Successive rounds of this “founder effect”…]]>

As ancient humans spread across the globe from their evolutionary birthplace in Africa, they tended to lose a little genetic diversity at each step along the way. New settlements were probably often founded by small groups that carried only a subset of the total diversity present in their homelands. Successive rounds of this “founder effect” mean that indigenous populations living farther from Africa along the old migration routes have lower genetic diversity today than those closer to Africa.
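The cumulative effect of repeated founding events can be sketched with a standard population-genetic result: sampling 2N gene copies from a source population multiplies expected heterozygosity by (1 - 1/(2N)). The example below is a simplified illustration, not the model used by any study mentioned here. It ignores drift between founding events, mutation, migration, and population growth, and the starting heterozygosity, founder group size, and number of events are hypothetical.

```python
def expected_heterozygosity(h0: float, founders: int, n_events: int) -> float:
    """Expected heterozygosity after a chain of founding events.

    Each event samples 2 * founders gene copies from the source population,
    which multiplies expected heterozygosity by (1 - 1 / (2 * founders)).
    """
    if founders < 1 or n_events < 0:
        raise ValueError("founders must be >= 1 and n_events must be >= 0")
    return h0 * (1 - 1 / (2 * founders)) ** n_events


# Hypothetical values: starting heterozygosity 0.75, 50 founders per event.
for events in (0, 10, 50):
    h = expected_heterozygosity(0.75, 50, events)
    print(f"after {events:>2} founding events: H = {h:.3f}")
```

Each step removes only a small fraction of diversity, but the losses compound along a migration route.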

These marks of our prehistoric movements can still affect people and society today. Differences in genetic diversity between human populations can complicate certain social disparities, argue Stanford geneticists Noah Rosenberg and Jonathan Kang in a review published in the latest issue of GENETICS. But they find no evidence that genetic diversity is related to geographic patterns of economic development, contrary to a controversial finding reported by economists. In fact, Rosenberg and Kang’s reanalysis suggests the economists’ result was just a statistical fluke.

This finding doesn’t mean we should ignore genetic effects in societal processes, say the authors. They describe three established examples of interactions between genetic diversity and societal disparities: in forensics, bone marrow transplantation, and genomic studies of health. As an increasing number of economic studies investigate population-genetic variables, the authors emphasize that geneticists and economists need to draw a careful distinction between approaches based on genetic principles and those that treat genetic data in the same way as non-biological data—as just one among many possible variables.

A schematic of the serial founder model. Each color represents a distinct allele. Migration events outward from Africa tend to carry with them only a subset of the genetic diversity from the source population, as some alleles are lost during migration events.

In 2012, Rosenberg was exploring the real-world impacts of differences in genetic diversity when an American Economic Review paper-in-press made a splash. The paper, by Quamrul Ashraf (Williams College) and Oded Galor (Brown University), argued that the intermediate levels of genetic diversity seen in Asian and European populations were optimal for their economic development, while low genetic diversity in Native American populations and high diversity in African populations had held their economic development back.

The paper was met with strong criticism from several prominent geneticists and anthropologists. A group of scholars from Harvard and the Broad Institute swiftly published an open letter criticizing the methods used in the project, claiming “Such haphazard methods and erroneous assumptions of statistical independence could equally find a genetic cause for the use of chopsticks.”

They also questioned whether pursuing such a project was ethical:

“….the suggestion that an ideal level of genetic variation could foster economic growth and could even be engineered has the potential to be misused with frightening consequences to justify indefensible practices such as ethnic cleansing or genocide.”

The debate grabbed Rosenberg’s attention. Like Ashraf and Galor, he was interested in how the genetic diversity differences between populations played out in societal processes. But the role of genetic diversity in the examples he was exploring was uncontroversial, in some cases even long established. What was the fundamental difference between Ashraf and Galor’s work and these examples?

In the process of examining this question, Rosenberg realized that he could make use of a much larger set of genetic diversity data that his lab had been preparing—237 populations from 39 countries subsequently published in the GSA journal G3: Genes|Genomes|Genetics, rather than the 53 populations in 21 countries studied by Ashraf and Galor. Using this expanded dataset and the economists’ own methods, Rosenberg and graduate student Kang repeated the analysis. They found no statistically significant relationship between genetic diversity and economic development when the larger dataset was used. They argue that Ashraf and Galor’s original result was likely a false positive, and that if they had happened to examine a different set of 21 countries, they would likely not have found a significant effect of genetic diversity in the first place.

In short, even if one disregards criticisms of the original study’s methods or ethics, it is likely the reported relationship between genetic diversity and economic variables was a coincidence.

But what about other real-world effects of genetic diversity?

Rosenberg and Kang’s first example is the use of familial identification in forensic genetic testing. Usually, when DNA is found at a crime scene, investigators look for a perfect match to a sample from law enforcement databases. But when no match can be found, it is becoming more common to search for partial matches that may identify relatives of the suspect.

The most famous example in the US is the 2010 identification and arrest of a suspect in the Los Angeles “Grim Sleeper” serial killer case. The suspect was tracked down through a partial match of crime scene samples to his son’s DNA profile from a database.

Critics point out that familial testing will intensify surveillance on those ethnic groups already overrepresented in law enforcement databases. The same groups will also bear more of the risk of being falsely implicated in a crime, a risk that is substantially higher for current familial identification than for standard “perfect match” testing.

The population genetic features of different populations introduce another disparity to familial identification. Rori Rohlfs (University of California, Berkeley) and colleagues have shown that familial identification has a higher false positive rate in populations with lower genetic diversity, such as Native American groups.

Another example of the societal implications of genetic diversity is bone marrow transplantation matching. To reduce the risk of dangerous immune responses to a transplant, donors are chosen according to their genetic similarity to the recipient at six genes of the human leukocyte antigen (HLA) system. But for populations with higher genetic diversity, recipients have a lower chance of finding a match.

This genetic disparity can exacerbate the uneven availability of bone marrow transplants. Social factors that influence the likelihood that someone from a particular ethnic group finds a match include the size of the population, its rate of inclusion in donor databases, and the rate at which members of the group participate as donors when they are identified as potential matches. African Americans seeking a donor have their chances of success lowered both because they are underrepresented in donor databases and because they belong to a population with higher genetic diversity.

The authors’ third example is the representation of different populations in genome-wide association studies (GWAS). GWAS have so far identified thousands of genetic links to hundreds of human traits and diseases. But by a 2011 estimate, 96% of GWAS participants had European ancestry. This extreme disparity means other ethnic groups are less likely to benefit from the results.

The reasons for the skew are partly sociological — the distribution of research funding, the structure of scientific collaboration networks, access to participants from each population — but such factors are exacerbated by genetic characteristics that make some populations more difficult to study by GWAS.

That’s because GWAS rely on the tendency for chromosomes to be passed on in chunks, co-inherited regions that show linkage disequilibrium (a statistical association between the alleles present at different loci). As a result, GWAS do not need to examine every possible genetic variant in the genome, but instead track a subset of a few hundred thousand or a million variants that each act as markers or tags for other variants within the same linkage disequilibrium “chunk”. Because recombination breaks up these chunks into smaller and smaller regions over the generations, and populations farther from Africa had smaller starting population sizes with fewer distinct chunks to disassemble, the diversity of chunks within a population decreases with distance from Africa. African populations today have comparatively low linkage disequilibrium, which in turn can mean more markers are needed to successfully “tag” the genome than for other populations.
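The “statistical association” that defines linkage disequilibrium is often summarized by two standard quantities, D and r², computed from haplotype and allele frequencies at a pair of loci. The sketch below shows that calculation for two biallelic loci. The function name and the example frequencies are hypothetical and are not taken from any dataset discussed here.

```python
from typing import Tuple


def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # excess of the AB haplotype over independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


# Hypothetical frequencies for illustration only.
examples = {
    "alleles always co-inherited": (0.30, 0.30, 0.30),
    "alleles independent": (0.09, 0.30, 0.30),
}
for label, freqs in examples.items():
    d, r2 = linkage_disequilibrium(*freqs)
    print(f"{label}: D = {d:.3f}, r^2 = {r2:.2f}")
```

An r² near 1 means one variant is a good tag for the other. When r² between neighboring variants is generally lower, as in populations with low linkage disequilibrium, more markers are needed to cover the same stretch of genome.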

As GWAS methods have matured, this technical issue has become less important, but in the early days of GWAS it helped limit research on African populations. This initial underrepresentation persisted as the existing European-skewed research was used in the development of new technologies and genomic resources.

The broader point of these three examples, says Rosenberg, is that each case includes sociological differences that contribute to the disparity (differences like population size, representation in databases, and funding), but he emphasizes that there are also contributions that relate to genetic diversity. To fully address these disparities, it will be important to take into account both types of effect: sociological differences and genetic diversity differences.

So what is the distinction between these examples and studies like that of Ashraf and Galor? Rosenberg says the arguments of the former were built using the theoretical machinery of population genetics. In comparison, the economists’ work looked only for correlations between genetic and economic variables.

Understanding this difference is an important part of evaluating claims about genetics, says Rosenberg, especially as studies investigating correlations between economic and biological variables become increasingly common. “They’re not necessarily wrong,” says Rosenberg. “They just don’t draw any support from principles of genetics.”

CITATION:

Noah A. Rosenberg and Jonathan T.L. Kang (2015). Genetic Diversity and Socially Important Disparities. Genetics, 201(1), 1-12. doi: 10.1534/genetics.115.176750
http://www.genetics.org/content/201/1/1.full 

 

]]>
https://genestogenomes.org/human-genetic-diversity-and-social-inequalities/feed/ 1
Turning spit and data into treasure https://genestogenomes.org/turning-spit-and-data-into-treasure/ Tue, 23 Jun 2015 23:03:34 +0000 https://genestogenomes.org/?p=1421 By the time President Obama announced the Precision Medicine Initiative in January 2015, the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort was already a trailblazing example of this new approach to medical research. GERA is a group of more than 100,000 members of the Kaiser Permanente Medical Care Plan who consented to…]]>

By the time President Obama announced the Precision Medicine Initiative in January 2015, the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort was already a trailblazing example of this new approach to medical research.

GERA is a group of more than 100,000 members of the Kaiser Permanente Medical Care Plan who consented to anonymously share data from their medical records with researchers, along with answers to survey questions on their behavior and background. Participants also shared their DNA—via saliva samples—to help with the project.

The result is a treasure trove of data, says GERA co-principal investigator Neil Risch (University of California, San Francisco). The study links genotype data from the saliva to environmental and lifestyle data from the surveys to clinical, pharmacy, imaging, and diagnostic laboratory data from electronic medical records—all derived from a large, ethnically-diverse population.

GERA was formed in 2009 by a collaboration between the Kaiser Permanente Northern California Research Program on Genes, Environment, and Health (RPGEH) and the Institute for Human Genetics at UCSF, and is led by Risch and RPGEH Executive Director Catherine Schaefer.

Today, in a series of three papers published Early Online in GENETICS, the research team formally describes the GERA resource, including the population structure and genetic ancestry of the participants, telomere length analysis, and details of the innovative methods that allowed them to perform the genotyping within 14 months.

Genes to Genomes spoke with Dr. Risch about GERA and the team’s research:

 

What makes the GERA data so useful?

In my view, it’s Kaiser’s incredibly comprehensive electronic health record system. They’ve been way ahead of the game. The records include pharmacy records, what procedures were performed, scans, lab tests, you name it, it’s all there. Really, the only thing missing is dental. And it all goes back twenty years to 1995. Kaiser places a big emphasis on prevention, so there are lots of screening results that greatly enhance the information on risk factors. Once you attach genetic information to data like this, it enables analysis of so many different phenotypes.

And it’s not just genetics. From their survey responses we learn about patients’ behavior and lifestyle, and from their addresses we can infer all kinds of things about their risk exposures, air quality, water quality, social environment, built environment, income, etc.

Historically, the way we’ve done genetic studies is to start from scratch. We would recruit a study population, collect all the information, and measure a few things—like disease status or biomarkers—at one specific point in time.

But when we instead use data that is routinely collected as part of care, we have a much richer dataset, often over many time points. For example, I’m interested in lipids. You might think if we have a cohort of 100,000 people, that translates to 100,000 lipid panels. In fact it’s 1.1 million because the average person in the cohort has records from 11 lipid panels. That means we can look at changes over time with age.

Because these records are also linked to the pharmacy database, we also know what each of those people has been prescribed. So we’ve been able to analyze how people’s LDL cholesterol levels change after they start taking statins. And then we can look at side effects and so on. We don’t have to create a proposal for each of those questions, we just go back to the database.

And better still, a cohort like this only gets more valuable over time, because the records get updated every night. That means we can now do prospective, rather than retrospective, analysis [prospective studies follow clinical outcomes in a cohort after enrollment in the study; retrospective studies record outcomes and risk factors in the cohort before enrollment. Retrospective studies suffer from more bias and confounding factors than prospective studies].

Skeptics thought electronic health data would turn out to be less reliable than targeted measurements, but that’s wrong. Over and over again, we’ve validated that electronic records are actually fantastic for these kinds of studies. In fact, I see this as a phase shift in the way genomics research will be done.

What findings has the GERA data yielded so far?

We have findings for prostate cancer, allergies, glaucoma, macular degeneration, high cholesterol, blood pressure—and those are only a few examples. It’s not just diseases either. For example, we have the results of PSA tests [prostate specific antigen screening tests for prostate cancer risk]. So we were able to find up to 30 novel variants that influence PSA levels.

The beauty of this resource is that no matter what phenotype we look at, we find associations—everything we touch! These are subtle effects, but in this cohort, if they exist, you’re going to find them. Even though people complain that the risks detected by GWAS are modest, I argue that this simply reflects reality—not everything is a Mendelian disorder. Model organism geneticists have known this for years: these traits are polygenic and there are many genes involved.

What did you learn from the population structure analysis?

Traditionally there’s been a bias in research participation toward people with Northern European ancestry. To make up for that bias, we had a mandate to maximize minority representation when we selected participants. In the end, around 20% of the cohort were from a minority ethnicity/race/nationality.

We were particularly interested in people who checked more than one box on the ethnicity questionnaire. More and more people are identifying as multi-ethnic, which can pose some technical challenges for genomic studies in terms of complexity. At the same time, it also presents opportunities for analyzing genetic and social contributions to disease differences between groups.

Yambazi Banda (UCSF), first author of the population structure paper, is very interested in the relationship between genetic ancestry and how people self-identify. We found that the relationship is very strong, and the way people describe their backgrounds generally matches their genetics.

One interesting aspect of the data is that we ended up with related individuals among the cohort, including around 2,000 pairs of full siblings. That meant we could tell whether these siblings described their ethnicity in the same way as each other. Most did, but those who reported different ethnicity from their siblings tended to be multi-ethnic. Multi-ethnic people also tended to be younger, which probably reflects social changes and increased intermarriage across racial and ethnic boundaries.

How and why did you genotype the samples so quickly?

Around 2008 we had 85,000 saliva specimens and consent to use them, but we needed funding. This was around the time of the economic recession and, as it turned out, when Arlen Specter and Congress pushed for 10 billion dollars in extra funding for the NIH as part of the economic stimulus package. We received Grand Opportunity Project funding from the National Institute on Aging (NIA) because the average age of the cohort was 63, and the NIA was interested in funding genomic analyses of age-related diseases. But we needed to finish the work in two years, or just 14 months in the lab.

In 2009, it was a big deal to do something like this so fast. We were under the gun to get this data, with assays running 24/7. Thankfully we had a lot of hands-on help from Affymetrix [manufacturers of the genotyping chips]. And Mark Kvale, our lead scientific programmer, and postdoc Stephanie Hesselson, and Pui-Yan Kwok, who directs the genomics core, did a huge amount of work to make the project a success.

Part of our solution to the time crunch was developing real-time turnaround in the data analysis. So within three hours after the results came out of the GeneTitan [the genotyping array processing stations], we knew if anything was going wrong. Working in this way probably saved us hundreds of thousands of dollars.

We also improved the way the genotypes were called [inferred], realizing that Affymetrix’s historical method was suboptimal for rare variants. The upshot is that Affymetrix has since changed its protocol and has used a lot of the lessons that we learned with the GERA project to benefit other very large genotyping projects using the same platform, for example the Million Veteran Program.

It wasn’t just genotyping either. Liz Blackburn and her group were assaying telomere length in all the samples at the same time [Blackburn is a UCSF geneticist who shared a Nobel Prize for the discovery of how chromosomes are protected by telomeres and the enzyme telomerase]. No one had done anything with telomeres on this scale before. The first author on the telomere paper, Kyle Lapham, had to create a robotic system for these very tricky experiments. In the end it only took four months to do the assays. It’s quite an achievement!

The results confirmed that the data is sound—for example, we see that telomeres get shorter with age as expected. We also observed a sex difference, where women tend to have longer telomeres than men.

Remarkably, there was some evidence that telomere length is related to survival. For those under 75, younger people tend to have the longer telomeres. But for the over 75s, there’s a reversal: the oldest people tend to have the longer telomeres.

What’s next for GERA?

We’re working on publishing more of the results; there are so many phenotypes that are just begging for analysis! At this stage we’re operating largely as a resource for other scientists. Researchers can apply for data access via Kaiser Permanente [the Kaiser Permanente Northern California Research Program on Genes, Environment, and Health] or via NIH’s database dbGaP.

The field is moving away from SNP genotyping and in the direction of sequencing, with the rationale that the SNP arrays don’t provide good coverage of rare variants. But in reality the amount of information you get from these arrays is vastly more than just the several hundred thousand sites on the array because you can impute the genotypes at other sites by using reference sequence panels.

Tom Hoffmann (UCSF), who helped design the GERA genotyping arrays, has done a lot of work on imputation in this cohort. For example, we’ve published analyses on a rare mutation in HOXB13 that raises the risk of prostate cancer. The carrier frequency in people with Northern European ancestry is only about 0.3%, but given we have 100,000 people in the cohort, we expect carriers among them. But how do we find them? That particular variant was not included on the SNP arrays.

We found we could identify carriers relatively well by imputing genotypes at the mutation site using reference sequence panels and the genotypes of surrounding SNPs. The beauty is, once we had identified those carriers, the health records allowed us to look at not only prostate cancer, but at all cancers. Sure enough, we showed that in fact this mutation is a risk factor for a lot of other cancers.

Using imputation, I believe it’s very realistic that the GERA cohort will end up with good coverage of variants with frequencies of around one in a thousand. That means we’ll have data on up to 50 million variants, rather than just the several hundred thousand on the array.

As you can tell, I’m enthusiastic about this project! At the beginning of a big project like this, you really don’t know if it’s going to work. It’s gratifying that after such a major investment of time and effort, we ended up with a resource that is so valuable and exciting.

CITATIONS:

Characterizing race/ethnicity and genetic ancestry for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort

Yambazi Banda, Mark N Kvale, Thomas J Hoffmann, Stephanie E Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A Croen, Brad P Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H Kushi, Dana Ludwig, Diane Olberg, Charles P Quesenberry Jr, Sarah Rowell, Marianne Sadler, Lori C Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P Somkin, Stephen K Van Den Eeden, Lawrence Walter, Rachel A Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch (2015). Genetics. Early Online June 19, 2015. doi: 10.1534/genetics.115.178616

http://www.genetics.org/content/early/2015/06/18/genetics.115.178616

 

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N Kvale, Stephanie Hesselson, Thomas J Hoffmann, Yang Cao, David Chan, Sheryl Connell, Lisa A Croen, Brad P Dispensa, Jasmin Eshragh, Andrea Finn, Jeremy Gollub, Carlos Iribarren, Eric Jorgenson, Lawrence H Kushi, Richard Lao, Yontao Lu, Dana Ludwig, Gurpreet K Mathauda, William B. McGuire, Gangwu Mei, Sunita Miles, Michael Mittman, Mohini Patil, Charles P Quesenberry Jr, Dilrini Ranatunga, Sarah Rowell, Marianne Sadler, Lori C Sakoda, Michael Shapero, Ling Shen, Tanu Shenoy, David Smethurst, Carol P Somkin, Stephen K Van Den Eeden, Lawrence Walter, Eunice Wan, Teresa Webster, Rachel A Whitmer, Simon Wong, Chia Zau, Yiping Zhan, Catherine Schaefer, Pui-Yan Kwok, and Neil Risch (2015). Genetics. Early Online June 19, 2015. doi: 10.1534/genetics.115.178905

http://www.genetics.org/content/early/2015/06/18/genetics.115.178905

 

Automated assay of telomere length measurement and informatics for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort.

Kyle Lapham, Mark N Kvale, Jue Lin, Sheryl Connell, Lisa A Croen, Brad P Dispensa, Lynn Fang, Stephanie Hesselson, Thomas J Hoffmann, Carlos Iribarren, Eric Jorgenson, Lawrence H Kushi, Dana Ludwig, Tetsuya Matsuguchi, William B McGuire, Sunita Miles, Charles P Quesenberry Jr, Sarah Rowell, Marianne Sadler, Lori C Sakoda, David Smethurst, Carol P Somkin, Stephen K Van Den Eeden, Lawrence Walter, Rachel A Whitmer, Pui-Yan Kwok, Neil Risch, Catherine Schaefer, and Elizabeth H. Blackburn (2015). Genetics. Early Online June 19, 2015. doi: 10.1534/genetics.115.178624

http://www.genetics.org/content/early/2015/06/18/genetics.115.178624

Photo credit: John Goode  CC BY 2.0

]]>
The trouble with HLA diversity https://genestogenomes.org/the-trouble-with-hla-diversity/ Thu, 28 May 2015 13:54:46 +0000 https://genestogenomes.org/?p=1378 The most diverse of all human genes encode a set of proteins at the frontline of our immune system. Many different Human Leukocyte Antigen (HLA) proteins are encoded by genes clumped together in one portion of the human genome known as the major histocompatibility complex region. HLA proteins sit on the surface of cells and…]]>

The most diverse of all human genes encode a set of proteins at the frontline of our immune system. Many different Human Leukocyte Antigen (HLA) proteins are encoded by genes clumped together in one portion of the human genome known as the major histocompatibility complex region. HLA proteins sit on the surface of cells and bind the chopped-up fragments of other proteins (antigens), presenting them for inspection by immune cells. If the presented antigens are recognized as foreign, the immune system may be triggered to attack, whether the invaders are pathogens, cancer cells, or transplanted tissue.

Remarkably, most HLA genes have dozens, or even hundreds of alleles present in the human population, so across the genome region as a whole there are thousands of different alleles. This variation can affect individual susceptibility to infectious and autoimmune diseases, and is of great interest to geneticists studying human evolution and population history.

But despite the functional and evolutionary importance of HLA genes, sequencing data from this region is biased in many population genomics studies. As a consequence, the results from this region are often treated as suspect, and in many cases are discarded from subsequent analyses.

The reason is that it’s difficult to make sense of HLA data generated by the next-generation sequencing (NGS) methods that are now standard for population genomics studies. NGS methods generate short sequence reads, and when these reads come from highly polymorphic genes like the HLA genes it can be challenging to correctly align them to the genome reference sequence. This problem is even worse when the gene is just one of a group of related polymorphic genes, as is the case for many of the HLA loci.

Genotyping errors for a highly polymorphic gene: The left hand side represents a case where sequence reads come from an individual who is heterozygous at a SNP, but where the rest of the gene is relatively similar to the reference for both haplotypes. The reads from both haplotypes can be aligned to the reference, and the SNP genotype is “called” (i.e. determined by the analysis software) correctly. The right hand side represents a case where one of the haplotypes is different to the reference sequence at more than one position. Reads from this haplotype won’t align with the reference and the genotype will be incorrectly called as homozygous at the SNP of interest. Image credit: Vitor R. C. Aguiar.

Though HLA loci are the worst-case scenario for this problem, other examples of polymorphic genes that come in related groups might suffer similar issues (such as the killer-cell immunoglobulin-like receptor (KIR) and olfactory receptor genes). But because the degree of polymorphism in other gene families is less extreme than in the HLA genes, the analysis issues may be less obvious and therefore less likely to be accounted for.

In the latest issue of G3, Brandt et al. demonstrate the scale of the challenge using HLA data from the 1000 Genomes project, which is a collection of high-coverage exome and low-coverage whole-genome sequences from 1092 people generated by NGS. The authors compared the NGS data to a parallel dataset in which 930 of the samples from the 1000 Genomes project were re-sequenced using the “gold standard” of Sanger sequencing, which doesn’t suffer from the same problems of short read alignment (the Sanger data were generated by Gourraud et al.).

Using the Sanger data as a benchmark, Brandt et al. showed that approximately 19% of single nucleotide polymorphism (SNP) genotypes for HLA genes in the NGS data were incorrect. And around a quarter of HLA SNPs had allele frequency estimates that differed between the two datasets by more than 0.1, with a bias towards overestimation of the reference allele frequency in the NGS data. They also found that the most “unreliable” SNPs in NGS data were those with the highest heterozygosity. In other words, the SNPs at which people were most likely to be heterozygous were those that were most difficult to genotype correctly.
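The two summary measures in this comparison, the genotype mismatch rate and the difference in allele frequency estimates, are simple to compute once calls from both methods are available for the same individuals. The sketch below uses made-up toy genotypes, not data from Brandt et al. The genotype coding and the function name are assumptions for illustration.

```python
def compare_calls(ngs, sanger):
    """Compare genotype calls for one SNP across the same individuals.

    Genotypes are coded as the number of reference alleles (0, 1 or 2).
    Returns (mismatch_rate, ref_freq_difference), where the difference is
    the NGS reference allele frequency minus the Sanger one.
    """
    if len(ngs) != len(sanger) or not ngs:
        raise ValueError("need two non-empty call lists of equal length")
    n = len(ngs)
    mismatches = sum(a != b for a, b in zip(ngs, sanger))
    ref_freq_ngs = sum(ngs) / (2 * n)
    ref_freq_sanger = sum(sanger) / (2 * n)
    return mismatches / n, ref_freq_ngs - ref_freq_sanger


# Toy data: two heterozygotes (1) are miscalled as homozygous reference (2).
sanger_calls = [1, 1, 2, 0, 1]
ngs_calls = [2, 1, 2, 0, 2]
rate, diff = compare_calls(ngs_calls, sanger_calls)
print(f"mismatch rate = {rate:.2f}, reference frequency difference = {diff:+.2f}")
```

In this toy case every error turns a heterozygote into a reference homozygote, so the reference allele frequency is inflated. This is the direction of bias expected when reads from the non-reference haplotype fail to align.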

The results also suggest the NGS problem probably can’t be solved by boosting the intensity of sequencing efforts (i.e. increasing coverage). Rather, the authors argue that better computational analysis is the way forward. For example, they suggest that a major part of the problem is that standard approaches align reads to a single reference sequence. For HLA genes, and perhaps other polymorphic genes, alignment to a database of multiple reference sequences (for example, Boegel et al. and Dilthey et al.) can greatly improve genotyping accuracy by accounting for the different alleles possible at each gene.

A computational fix would be a boon to the many genetic studies that currently struggle to characterize HLA sequence data, including efforts to seek disease associations, quantify gene expression changes, and examine population histories. After all, the diversity of HLA genes is not only a technical challenge, but also a mark of their profound importance to immune system function and human survival.

Genotype mismatches between the 1000 Genomes (next-generation sequencing) and PAG2014 (Sanger sequencing) datasets. Results per polymorphic site (“Position”) and per individual. Dark squares indicate mismatches between genotypes in the two datasets. From Brandt et al.

CITATION:

Brandt, D.Y.C, Aguiar, V.R.C., Bitarello, B.D., Nunes, K., Goudet, J., & Meyer, D. (2015). Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data
G3: Genes|Genomes|Genetics, 5(5):931-941 doi: 10.1534/g3.114.015784
http://www.g3journal.org/content/5/5/931.full

]]>