Mother of all microsatellites

MDS of all samples

Noah Rosenberg’s lab has put out the mother of all microsatellite papers, Population Structure in a Comprehensive Genomic Data Set on Human Microsatellite Variation. It seems to me that this is the culmination of all the work with microsatellite markers that has come out of his lab over the past decade, applying all sorts of fancy analytic techniques they’ve developed (for example, Procrustes transformation). The big thing to note is that the human sample is nearly 6,000 individuals typed at over 600 loci. Because microsatellites mutate and diverge very fast (mutation rates on the order of 10^-4 per generation, rather than 10^-8 as with SNPs), 600 loci is more than sufficient to differentiate populations. Because of this rapid mutation I’m a little dubious about their attempt to explore human-chimp differences using a smaller set ascertained on humans, though that may be simply a proof of principle (if the markers evolve too fast they might not tell you much that is informative about very deep divergences).
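A back-of-the-envelope sketch makes the deep-divergence worry concrete. Only the two mutation rates come from the text; the generation time and both split times below are illustrative assumptions, not figures from the paper, and `expected_mutations` is just a hypothetical helper name.

```python
# Back-of-the-envelope: expected mutations separating two lineages at one marker.
# Rates are the orders of magnitude quoted in the text; the times are assumptions.

MU_MICROSAT = 1e-4  # per microsatellite locus per generation
MU_SNP = 1e-8       # per nucleotide site per generation

GENERATION_YEARS = 25          # assumed generation time (illustrative)
HUMAN_CHIMP_YEARS = 6_000_000  # assumed human-chimp split time (illustrative)
WITHIN_HUMAN_YEARS = 50_000    # assumed depth of a human population split (illustrative)


def expected_mutations(mu, split_years, generation_years=GENERATION_YEARS):
    """Expected mutations accumulated on both lineages since the split: 2 * mu * t."""
    generations = split_years / generation_years
    return 2 * mu * generations


if __name__ == "__main__":
    for label, years in [("human-chimp", HUMAN_CHIMP_YEARS),
                         ("within humans", WITHIN_HUMAN_YEARS)]:
        micro = expected_mutations(MU_MICROSAT, years)
        snp = expected_mutations(MU_SNP, years)
        print(f"{label}: microsatellite locus ~{micro:.2g}, SNP site ~{snp:.2g}")
```

Under these assumptions a single microsatellite locus accumulates on the order of tens of mutations along the human-chimp branches, against less than one for a shallow split within humans. With that many step changes, later mutations can overwrite earlier ones, so allele lengths stop tracking elapsed time.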


Reading the paper it’s quite obvious that just merging the samples was a big feat. And it’s not just sample size; they had excellent population coverage (267 populations). As Dienekes observes, microsats are somewhat “retro,” but try to get this sort of population coverage with whole genomes, or even SNPs. You can get to N > 5,000, but with SNPs the set of markers shared across the merged data sets shrinks very quickly, to the point where they are far less informative than this number of microsats. Dienekes quite liked the tree to the left, and I’ve uploaded a rather large version of it for your enjoyment (just zoom in if your browser sizes it down).

But to some extent the tree above illustrates the limitations of this sort of analysis. Rather than an analysis, this is really more a useful data set that you have to slice and dice, and explore on a finer grain. Pooling all the samples together makes the result far less informative, and close to unintelligible. This is already obvious in their aggregation to create the large data set, as they had to prune very large subpopulations so they didn’t overwhelm the results. Even then, problems obvious to those familiar with the data crop up, though they might not be so clear to those who are reading superficially. The Gujarati data set among the South Asians separated out from all other populations on a two-dimensional visualization. This often occurs because it looks like the Gujaratis were sampled from a very specific caste, which exaggerates how closely related to one another members of this regional ethnicity appear. Similarly, pooling all the populations and representing them on a two-dimensional plot is more an aesthetic declaration than an informative visualization. You have to bracket out the populations to see value-added structure.

Finally, even the coarse and general observations need to be interpreted with caution. Rosenberg’s lab has been illustrating the decay of genetic diversity with distance from Ethiopia for nearly a decade now. It’s a classic result which shows up in graduate-level population genetics courses. But both the anthropology and the genetics tell us that Ethiopians are a compound population with Sub-Saharan African and Eurasian affinities. Most readers can be expected to know this, but I would not be surprised if some simply took the general plot at face value and applied the insight to all the populations, as if they really were subject to a serial founder effect (my specific point is that Ethiopians are the product of a synthesis due to back migration, a reversal of the general migration out of Africa that the decline in genetic diversity illustrates).
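To see why back migration complicates the picture, here is a toy sketch, not the paper's analysis. It assumes the textbook drift result that a one-generation founder event with N diploid founders retains a fraction 1 - 1/(2N) of expected heterozygosity, and it ignores mutation and migration. The founder size, starting heterozygosity, and allele frequencies are all made up, and `founder_path`, `heterozygosity`, and `admix` are hypothetical helper names.

```python
# Toy serial founder model plus a two-source admixture contrast.
# All numbers are hypothetical; this is not the paper's analysis.


def founder_path(h0, n_steps, founder_size):
    """Expected heterozygosity after each of n_steps one-generation founder events.

    Each event with founder_size diploid founders retains a fraction
    1 - 1/(2N) of heterozygosity (drift only; no mutation, no migration).
    """
    retain = 1 - 1 / (2 * founder_size)
    return [h0 * retain ** k for k in range(n_steps + 1)]


def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for allele frequencies at one locus."""
    return 1 - sum(p * p for p in freqs)


def admix(freqs_a, freqs_b, fraction_a):
    """Allele frequencies of a population drawing fraction_a of its ancestry from A."""
    return [fraction_a * a + (1 - fraction_a) * b
            for a, b in zip(freqs_a, freqs_b)]


if __name__ == "__main__":
    # Diversity declines smoothly with the number of founder steps.
    for step, h in enumerate(founder_path(h0=0.8, n_steps=5, founder_size=50)):
        print(f"founder step {step}: H = {h:.3f}")

    # Two made-up source populations and a 50/50 mixture of them.
    source_a = [0.7, 0.3, 0.0]
    source_b = [0.0, 0.3, 0.7]
    mixed = admix(source_a, source_b, fraction_a=0.5)
    for name, freqs in [("source A", source_a), ("source B", source_b), ("mixture", mixed)]:
        print(f"{name}: H = {heterozygosity(freqs):.3f}")
```

In the pure founder model diversity only ever goes down with each step, while the mixture ends up more diverse than either of its sources. A compound population's diversity therefore reflects its mixture history as well as its position along a migration route.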

Overall I find this an interesting paper which sets the backdrop for understanding the canvas of human genetic variation. The last caution I would offer is that microsatellites are atypical regions of the genome which evolve rapidly in a neutral fashion. This makes them excellent for pinpointing population differences and inferring history from a limited marker set. But I think people should be cautious about specific novel results, and not hold them up as authoritative when we have high-density SNP data.

Note: They’ve released the data. If readers are curious about doing different things with these data than were shown in this paper, TreeMix can handle microsats. Also, props to them for releasing this under a Creative Commons license.

Citation: Pemberton, Trevor J., Michael DeGiorgio, and Noah A. Rosenberg. “Population structure in a comprehensive genomic data set on human microsatellite variation.” G3: Genes|Genomes|Genetics 3.5 (2013): 891-907.

Source: Discover Magazine – Gene Expression