The end of genomics, the beginning of analysis

A Tree of Life

Evolutionary processes which play out across the tree of life are subject to distinct dynamics which can shape and influence the structure and characteristics of individuals, populations, and whole ecosystems. For example, imagine the phylogeny and population genetic characteristics of organisms which are endemic to the islands of Hawaii. Because the Hawaiian islands are an isolated archipelago, the expectation is that lineages native to the region are going to be less shaped by the parameter of migration, or gene flow between distinct populations, than might otherwise be the case. Additionally, there was presumably a “founding” event for each of these endemic Hawaiian lineages at some distant point in the past, so another expectation is that most of the populations would exhibit evidence of having gone through a genetic bottleneck, where the power of random drift was sharply increased for several generations. The various characteristics, or states, which we see in the present in an individual, population, or set of populations, are the outcome of a long historical process, a sequence of precise events. To understand evolution properly it behooves us to attempt to infer the nature and magnitude of these distinct dynamic parameters which have shaped the tree of life.
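To make the bottleneck intuition concrete, here is a minimal sketch in Python. It assumes an idealized neutral Wright-Fisher population, in which expected heterozygosity shrinks by a factor of (1 - 1/(2N)) each generation. The function name and the population sizes are hypothetical, chosen only for illustration; they are not estimates for any Hawaiian lineage.

```python
def expected_heterozygosity(h0, sizes):
    """Track expected heterozygosity under neutral drift.

    Each generation with diploid size N multiplies the expected
    heterozygosity by (1 - 1/(2N)), the standard Wright-Fisher result.
    """
    trajectory = [h0]
    for n in sizes:
        trajectory.append(trajectory[-1] * (1.0 - 1.0 / (2.0 * n)))
    return trajectory


# Hypothetical population-size histories (diploid individuals per generation).
constant = [10_000] * 20
bottleneck = [10_000] * 5 + [10] * 10 + [10_000] * 5

h_constant = expected_heterozygosity(0.5, constant)[-1]
h_bottleneck = expected_heterozygosity(0.5, bottleneck)[-1]
print(f"constant size:   {h_constant:.4f}")
print(f"with bottleneck: {h_bottleneck:.4f}")
```

Under these assumptions, ten generations at a size of ten individuals remove roughly 40 percent of the starting heterozygosity, while the same ten generations at a size of ten thousand remove almost none. That asymmetry is the signature one would look for in a founder lineage.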

Credit: Verisimilus

For many the image of evolutionary processes brings to mind something on a macro scale. Perhaps it is the changing nature of protean life on earth writ large, playing out over millions of years and across geological scales, as depicted on the broad canvas of David Attenborough’s majestic documentaries. But one can also reduce the phenomenon to a finer grain on a concrete level, as in specific DNA molecules. Or one can transform it into a more abstract rendering manipulable by algebra, such as trajectories of allele frequencies over generations. Both of these reductions emphasize the genetic aspect of natural history.
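As an illustration of the algebraic rendering, the sketch below iterates the textbook recursion for a favored allele under simple genic (haploid) selection, p' = p(1 + s) / (1 + sp). It assumes an effectively infinite population with no drift, mutation, or migration. The function name and the parameter values are hypothetical.

```python
def selection_trajectory(p0, s, generations):
    """Deterministic allele-frequency trajectory under genic selection.

    p0 is the starting frequency of the favored allele and s its
    selective advantage; each step applies p' = p(1 + s) / (1 + s*p).
    """
    trajectory = [p0]
    for _ in range(generations):
        p = trajectory[-1]
        trajectory.append(p * (1.0 + s) / (1.0 + s * p))
    return trajectory


# Hypothetical values: a rare allele with a 5% advantage over 400 generations.
path = selection_trajectory(0.01, 0.05, 400)
for generation in range(0, 401, 100):
    print(generation, round(path[generation], 4))
```

The output traces the familiar S-shaped curve: slow change while the allele is rare, rapid change at intermediate frequencies, and a slow approach to fixation.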

Credit: Johnuniq

Obviously evolutionary processes are not just fundamentally the flux of genetic elements, but genes are crucial to the phenomena in a biological sense. It therefore stands to reason that if we look at patterns of variation within the genome we will be able to infer in some deep fashion the manner in which life on earth has evolved, and conclude something more general about the nature of biological evolution. These are not trivial affairs; it is not surprising that philosophy-of-biology is often caricatured as philosophy-of-evolution. One might dispute the characterization, but it cannot be denied that some would contend that evolutionary processes in some way allow us to understand the nature of Being, rather than just how we came into being (Creationists depict evolution as a religion-like cult, which imparts the general flavor of some of the meta-science and philosophy which serves as intellectual subtext).

R. A. Fisher

But shifting from such near-metaphysical generalities to more in-the-trenches science as it is done, we are faced today with the swell of sequence data due to the genomic revolution. What does this matter for our understanding of evolution? Many of the original arguments of evolutionary geneticists such as R. A. Fisher and Sewall Wright were predicated on inferences from the inheritance patterns of a few genes which were easily identifiable by their phenotypic markers. But a more likely frame for their disputes was one where the inferences were purely theoretical, deduction with a minimal level of empirical messiness intervening. In contrast, today we live in an age where someone may pity you if you don’t have a very well assembled genome of your organism (on the order of billions of base pairs for mammals), and so have to make do with SNP marker data of a few thousand per individual!

These new data, first and foremost from humans due to the funding priorities of biomedical science, have stimulated a renaissance of method development to take advantage of the richness of the genetic variation now being uncovered. Consider PSMC, which allows one to make demographic inferences of population history from one genome by surveying patterns of heterozygosity within a single individual. Last week I reviewed a preprint which illustrated the power of extensive data analysis in shading and refining previous results which seemed straightforward on the face of it. The reformulation yielded the possibility of natural selection as being a pervasive parameter in human evolution over the past ~100,000 years. The authors compared variation at different categories of bases (synonymous vs. nonsynonymous) across the genome to reinforce both old intuitions and extract novel insights.
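To give a feel for what “surveying patterns of heterozygosity within a single individual” means in practice, here is a rough sketch that bins heterozygous sites along one sequence into windows and reports the density per window. This is not PSMC, which fits a hidden Markov model to such patterns; it only illustrates the kind of raw signal involved. The function name, the 0-based coordinates, and the toy positions are all assumptions made for the example.

```python
def windowed_heterozygosity(het_positions, sequence_length, window):
    """Fraction of heterozygous sites in consecutive windows of one genome.

    het_positions holds 0-based coordinates of heterozygous sites; the
    final window may be shorter than the others.
    """
    n_windows = (sequence_length + window - 1) // window
    counts = [0] * n_windows
    for pos in het_positions:
        if not 0 <= pos < sequence_length:
            raise ValueError(f"position {pos} lies outside the sequence")
        counts[pos // window] += 1
    densities = []
    for index, count in enumerate(counts):
        width = min(window, sequence_length - index * window)
        densities.append(count / width)
    return densities


# Toy data: made-up heterozygous positions on a 50-base stretch.
toy_sites = [0, 5, 12, 31, 33, 34, 47]
print(windowed_heterozygosity(toy_sites, 50, 10))
```

Long runs of low-density windows suggest that the two copies of that segment share a recent common ancestor, while high-density windows point to a deeper one. The distribution of those depths along the genome is what carries the demographic information.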

Citation: Voight, Benjamin F., et al. “A map of recent positive selection in the human genome.” PLoS biology 4.3 (2006): e72.

Looking at differences between synonymous and nonsynonymous substitutions is a tried and tested technique with a fine pedigree, but more recently haplotype based methods to detect natural selection have been all the rage, due to the emergence of dense genome-wide marker sets. These allow for the inference of correlated patterns of markers across adjacent genomic segments. This trend toward haplotype methods naturally triggered their antithesis, and the resulting synthesis to some extent can be seen in two papers, both Grossman et al., A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection, and Identifying recent adaptations in large-scale genomic data. These are improvements upon earlier work in the aughts, a reassessment which had already started to occur in the literature after the excesses of genomic methods in their detection of ubiquitous selection in human populations. More specifically, the newer techniques focused on recent selective events which leave long blocks of the genome within populations homogenized. As causal markers rapidly increase in frequency due to positive selection, they drag along flanking regions in sweep events. For many generations after the initial selection event these flanking regions will exhibit linkage disequilibrium, as recombination only slowly breaks apart the associations across loci. But a key drawback with these methods is that selection is not the only dynamic which results in long haplotypes and linkage disequilibrium. More specifically, demographic stochasticity, colloquially the vicissitudes of population history, can also generate long homogeneous blocks of markers. The initial candidate regions yielded by a statistic like iHS were saturated by the effects of population specific history.
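How slowly recombination erodes these associations can be sketched with the classic two-locus result that linkage disequilibrium D decays by a factor of (1 - r) per generation, where r is the recombination fraction between the loci. The example below assumes random mating and no further selection or drift. The function names and the recombination values are hypothetical and are not taken from the papers discussed.

```python
import math


def ld_after(d0, r, generations):
    """Expected linkage disequilibrium after some generations: D0 * (1 - r)**t."""
    if not 0.0 < r <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    return d0 * (1.0 - r) ** generations


def generations_to_decay(fraction, r):
    """Generations needed for D to fall to the given fraction of its start."""
    if not 0.0 < r <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    return math.log(fraction) / math.log(1.0 - r)


# Hypothetical recombination fractions for markers at increasing distance.
for r in (0.0001, 0.001, 0.01):
    half_life = generations_to_decay(0.5, r)
    print(f"r = {r}: D halves in about {half_life:.0f} generations")
```

Tightly linked markers keep their association for thousands of generations, which is why a recent sweep leaves a long, detectable block. The same arithmetic applies to blocks created by a bottleneck or by admixture, which is the drawback described above.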

CMS, debuted in Grossman et al. 2010, is an attempt to correct for this bug, while retaining the power of haplotype based methods. Natural selection within the genome leaves more evidence behind with regard to its operation than just long haplotype blocks and linkage disequilibrium. Selected alleles often exhibit greater between population difference than the average region of the genome (i.e., higher Fst). Additionally, a new derived allele segregating within one population at a high frequency is often a telltale marker of recent adaptation, as a de novo mutation in a specific locale turns out to be beneficial. By combining tests which survey patterns of variation across loci (i.e., haplotype based methods) with those within loci and across populations (Fst based methods), CMS zeroes in on a few precise narrow candidates by cross-checking with multiple tools. False positive hits aside, another major problem with relying upon a single coarse test is that it often highlights a large region as a target of natural selection. This does not necessarily allow for simple follow up when you have dozens of genes and millions of bases which are potential candidates.
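For readers who have not computed it, the Fst component can be sketched for the simplest case: one biallelic site, two populations weighted equally, and Fst defined as (H_T - H_S) / H_T from expected heterozygosities. This is only one ingredient of a composite approach and makes no attempt to reproduce how CMS combines its tests. The function name and the allele frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Fst at a biallelic site for two equally weighted populations.

    H_S is the mean within-population expected heterozygosity and H_T
    the expected heterozygosity of the pooled population.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # site is monomorphic overall, so Fst is set to zero
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical frequencies of one allele in two populations.
print(fst_two_populations(0.45, 0.55))  # modest differentiation
print(fst_two_populations(0.02, 0.98))  # nearly disjoint frequencies
```

A site where the two populations carry nearly opposite alleles gives a value close to one, far above what a pair of similar frequencies produces. That is the sense in which a selected allele can stand out from the genomic background.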

The second paper, Grossman et al. 2013, is less a map of genome-wide variation than a scan of genome-wide variation with an intent to select choice targets for more detailed analysis. To no one’s surprise, for human data sets loci implicated in salient physical characteristics such as height and pigmentation, metabolism, and immune response are high on the list of candidates. No matter the genuine issue of false positives, it does seem that recent human evolution (and frankly, evolution more generally) has a fixation on these traits, no pun intended. I do wonder sometimes if this is just a feature of the fact that we humans notice exterior phenotypes, as well as disease related markers (e.g., metabolic and immune illnesses). One of the major concerns in the second paper is that a selection signature without a phenotype is often without utility, but perhaps the phenotypes are lacking in utility because humans are blind in terms of what traits are of interest. I am still skeptical of explanations for what exactly the target of selection around the EDAR locus in East Asians is.

Two alleles of SLC24A5, citation: Norton, Heather L., et al. “Genetic evidence for the convergent evolution of light skin in Europeans and East Asians.” Molecular biology and evolution 24.3 (2007): 710-722.

One of the more intriguing results from CMS in Grossman et al. 2013 is that a locus with the strongest association with resistance to leprosy also contains SLC24A5. This locus has an allele within it that is almost disjoint in frequency between Europeans and Sub-Saharan Africans. By this, I mean that almost all Africans carry one base, while nearly all Europeans carry the other. The allele found in Europeans is dominant in West Asia, and present at frequencies as high as ~50% as far south and east as Sri Lanka. It is a gene which is famously correlated with lighter skin in humans and zebrafish. And yet there remains the mystery that it is present at very high frequencies rather far south, and it is certainly not a necessary condition for light skin. East Asians are nearly fixed for the ancestral variant which is common in Sub-Saharan Africa. A possible explanation is that these sorts of salient phenotypic loci have been reshaped due to very strong bouts of selection targeting particular diseases in the recent past. If this is correct, the phenotypic characteristics which we find salient in human beings may simply be pleiotropic side effects of selective sweeps anchored around disease resistance.

I am not proposing here that genomics can solve and explain evolution. The heirs of G. G. Simpson may have something to say about that. Rather, I am suggesting that the genetic piece of the puzzle will not be lacking in data to any extent within our lifetimes. My hunch is that many evolutionary genetic questions will be soluble when we have thousands of complete genomes of high quality on thousands of organisms. There is no likely windfall of fossils in the near future, so paleontology will have to continue to operate in a relatively data constrained environment. For those who work in the domain of evolutionary genetics and genomics the onus is on human ingenuity, and analytic skill and savvy. Thinking hard and deep about difficult problems, rather than putting in long hours on the bench to glean more data.

The post The end of genomics, the beginning of analysis appeared first on Gene Expression.

Source: Discover Magazine – Gene Expression