A genetic map of Italy

Since the Ralph & Coop paper on IBD patterns across Europe I’ve been keen to see what gets uncovered about Italy. Recall, if you will, that in that paper the authors noted that Italy, more than other European nations, exhibits a lot of deep population structure. Whereas the network of descent ties together many European nations and regions, in Italy there are deep regional differences which seem to go back to antiquity. Additionally, Sardinia has more recently come under focus as possibly particularly informative about the ethnogenesis of European peoples. Until recently I was moderately skeptical of the utility of Sardinian samples in the HGDP data set. After all, Sardinia is an isolated island, and perhaps subject to the peculiarities of low effective population size. Well, it turns out that modern Sardinians may be the best approximation we have today to Southern Europeans ~5,000 years ago.

A new paper in PLoS ONE has a huge sample of Italians, and applies standard techniques to ascertain population structure. An Overview of the Genetic Structure within the Italian Population from Genome-Wide Data:

In spite of the common belief of Europe as reasonably homogeneous at genetic level, advances in high-throughput genotyping technology have resolved several gradients which define different geographical areas with good precision. When Northern and Southern European groups were considered separately, there were clear genetic distinctions. Intra-country genetic differences were also evident, especially in Finland and, to a lesser extent, within other European populations. Here, we present the first analysis using the 125,799 genome-wide Single Nucleotide Polymorphisms (SNPs) data of 1,014 Italians with wide geographical coverage. We showed by using Principal Component analysis and model-based individual ancestry analysis, that the current population of Sardinia can be clearly differentiated genetically from mainland Italy and Sicily, and that a certain degree of genetic differentiation is detectable within the current Italian peninsula population. Pair-wise FST statistics Northern and Southern Italy amounts approximately to 0.001 between, and around 0.002 between Northern Italy and Utah residents with Northern and Western European ancestry (CEU). The Italian population also revealed a fine genetic substructure underscoring by the genomic inflation (Sardinia vs. Northern Italy = 3.040 and Northern Italy vs. CEU = 1.427), warning against confounding effects of hidden relatedness and population substructure in association studies.

The number of SNPs is rather good for the tasks which they attempted. My personal experience is that for clustering algorithms like ADMIXTURE, or for PCA, you’re hitting diminishing returns beyond 100,000 SNPs if you are looking at intra-national differences. And the sample size is rather large, though the authors admit that they could have had denser coverage of central Italy. For Italy they pooled a lot of data sets, including some from biomedical studies. Naturally they also took in the HGDP and HapMap Italians.

On some methodological notes, the PCA is really hard to read. I’m not quite sure if the labeling is correct (see figure 1 to check me here). So I’ll just report the ADMIXTURE results. I looked at the methods, and I do have some concerns here. I am not clear if they ran ADMIXTURE for K = 2 to 10 more than once. The reality is that you should. That’s because ADMIXTURE is sensitive to the value of the seed parameter (you should change it from the default and allow it to be generated pseudo-randomly from the computer’s time), and when you do statistical checks such as cross-validation that value itself can vary across runs! What I’m saying is that one run of ADMIXTURE may tell you that K = 4 is the best fit, but another run may tell you that K = 6 is the best fit. It’s happened to me. I once ran a data set up to K = 20 twenty times, and the cross-validation values themselves exhibited considerable variation across runs depending upon the K (there were some K’s though where the value seemed extremely stable, so I was more confident of the fit of that K).
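To make the replicate-run check concrete, here is a minimal sketch of how one might tabulate cross-validation errors across several ADMIXTURE runs. It assumes each run's log has already been saved as text and contains lines of the form "CV error (K=4): 0.52", which is what I understand ADMIXTURE prints when cross-validation is switched on. The function names parse_cv and summarise are hypothetical, and the numbers in the usage example are made up purely for illustration.

```python
import re
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Assumed log line format: "CV error (K=4): 0.52"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv(log_text):
    """Return {K: cv_error} for every CV line found in one run's log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def summarise(runs):
    """Summarise replicate runs.

    runs: list of {K: cv_error} dicts, one per replicate (different seeds).
    Returns (spread, best_counts), where spread maps K to the
    (mean, standard deviation) of its CV error across runs, and
    best_counts records how often each K had the lowest CV error in a run.
    """
    by_k = defaultdict(list)
    for run in runs:
        for k, err in run.items():
            by_k[k].append(err)
    spread = {k: (mean(v), pstdev(v)) for k, v in sorted(by_k.items())}
    best_counts = Counter(min(run, key=run.get) for run in runs if run)
    return spread, best_counts

if __name__ == "__main__":
    # Made-up values for two replicate runs, illustration only.
    run_a = parse_cv("CV error (K=2): 0.50\nCV error (K=3): 0.48\nCV error (K=4): 0.49\n")
    run_b = {2: 0.50, 3: 0.49, 4: 0.47}
    spread, best_counts = summarise([run_a, run_b])
    for k, (m, sd) in spread.items():
        print(f"K={k}: mean CV error {m:.3f}, sd {sd:.3f}")
    print("Best K per run:", dict(best_counts))
```

If the "best K" count is split across several values, or the standard deviation at a given K is large relative to the differences between K's, that is a hint that no single run should be trusted to pick the best fit.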

Also, there was one passage which makes me a little curious as to how clearly the authors understand the clustering techniques being used here, and what it tells us (and does not tell us):

The average admixture proportions for Northern European ancestry within current Sardinian population is 14.3% with some individuals exhibiting very low Northern European ancestry (less than 5% in 36 individuals on 268 accounting the 13% of the sample).

I’d be careful of labeling a modal component in Northern Europeans “Northern European ancestry.” I’ve posted on enough topics related to this to illustrate how easy it is to generate statistical artifacts which have little correspondence to the real biological world. It’s one thing when you have two populations which are genetically very distinct, and which cluster in a disjoint fashion almost immediately. For example, Africans and Europeans. But when you have intra-European variation, and the clusters don’t distribute in an exclusive fashion, one should be wary of reifying them into real populations. “Northern European modal cluster” may not roll off the tongue, but it has the benefit of being precise and not false.

So what about the results? Nothing too surprising. I invite you to peruse the figures and read the supplements yourself. I did note that the evidence of intra-Italian migration is very obvious in these results. People whose geographic origins are in the north often cluster with southerners (i.e., the southern cluster), but people whose origins are in the south rarely seem to cluster with northerners. In the 20th century there were massive flows of migration from the Italian south to northern cities like Turin, while Mussolini encouraged the migration of southerners to the German-speaking regions of the northeast. In contrast, few northerners headed south. In short, many people in northern Italy have grandparents or great-grandparents who left southern Italy. Far fewer southern Italians have grandparents or great-grandparents who left northern Italy (though they do exist; I actually met a young man recently whose mother was a Neapolitan whose parents were from the Veneto). Additionally, I’m curious about the fact that Sardinians seem to exhibit some level of genetic homogeneity. This surprises many people because of the history of Sardinia, which was under Carthaginian, Roman, and Vandal rule. I have a simple explanation for what’s going on: the coasts of Sardinia are malarial. The modern population of Sardinia are the descendants of the indigenous mountaineers, who repopulated the coastal cities periodically.

I want to note that if you look at the ADMIXTURE runs the Mozabites have nearly as much of the Sardinian modal component as mainland Italians. This doesn’t mean equal genetic distance; the Mozabite dominant cluster has a higher distance. But, it does suggest to me that it may be that in the Copper Age the western Mediterranean was dominated by a Sardinian-like population, which later was displaced and assimilated by newcomers.

Finally, I have no idea where to get this data. That’s sad, since it is so large a set. But I specifically noted the biomedical origin of some of the data because I suspect that’s going to make it difficult to get it into the public domain.

