Friday, January 13, 2012
Eurogenes' North Euro clusters - phase 1, exploring the data
I have some preliminary results from a new intra-North Euro cluster analysis, using a cutting-edge tool called ChromoPainter. More than 400 samples and 270K SNPs were tested, in linkage mode, and the output was then processed in fineSTRUCTURE at 200K burn-ins and iterations. As I said, the results should be treated as preliminary, but they already look better than any other cluster analysis I've ever seen dealing with Europe north of the Alps, Pyrenees and Balkans. The algorithm identified 21 clusters, with most located in Eastern and Northeastern Europe (see spreadsheet for details). Below are two plots showing how the clusters relate to each other via a tree diagram and heat maps – the first shows an aggregate view, and the second the individual samples.
It's interesting that the Baltic Finns seem to create clusters at the drop of a hat, but they also share more chunks, and longer chunks, than any other group. Indeed, all of the Finnish clusters are closely related, and many of the individuals, especially from East Finland, even look like distant relatives on the heat map (note the ultra-hot, blue squares). On the other hand, the large Northwestern European cluster, featuring samples from across the UK, as well as from several nearby countries, is holding firm, and might be tough to break up in this analysis.
I have some theories about the reasons for the obvious genetic homogeneity and diversity in Western Europe, and these include the effects of the Black Death. It decimated many populations in the western half of the continent, thus encouraging migrations into emptied areas, and eventually leading to more open, mobile societies. It's an interesting subject, and I might write much more on it in the future. In the meantime, here's a PCA plot from the ChromoPainter chunk counts data. Note the large distances spanned by groups from Northern and Eastern Europe, and the tight bundle of samples from the west, mostly from the UK, Ireland, France and the Low Countries. Interestingly, and perhaps counter-intuitively, it's the closely related Finns who take up most of the space on the plot.
The first component picked up by this PCA appears to be an Atlantic one. It peaks in the Cornish samples, but shows similar levels in all the British, Irish, French, Dutch and Belgians (post-Black Death mobility?). If we are to assume that I identified the component correctly, then it appears as if the East Finns, Vologda Russians, Erzya from the Middle Volga, and Lithuanians are the least “Atlantic” samples in this analysis. These groups, especially the East Finns, also happen to act like relative genetic isolates in many of my experiments (such as ADMIXTURE and MDS analyses). Thus, it seems they've been sheltered from significant gene flow from outside in recent times, including from the west, like German emigration to East Central Europe and Scandinavian influence in Western and Southwestern Finland.
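For anyone who wants to reproduce this kind of plot outside the program, here is a minimal Python sketch of a PCA on a chunk counts matrix. It is a plain column-centred PCA via SVD, not necessarily the normalisation fineSTRUCTURE applies internally, and the function name `pca_from_chunk_counts` and the toy matrix are made up for illustration (they are not real data from this analysis). Rows are assumed to be recipients and columns donors.

```python
import numpy as np

def pca_from_chunk_counts(counts, n_components=2):
    """Generic PCA of a recipients-by-donors chunk count matrix.

    Assumption: this is an ordinary column-centred PCA, used here only
    as an illustration; it may differ from fineSTRUCTURE's own PCA.
    """
    x = np.asarray(counts, dtype=float)
    # Centre each donor column so components describe variation among recipients
    centred = x - x.mean(axis=0, keepdims=True)
    # SVD of the centred matrix: left vectors times singular values give the scores
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # Share of total variance carried by each component
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Hypothetical toy data: two groups of three individuals who share many
# chunks within their group (10) and few between groups (1).
within, between = 10.0, 1.0
group = np.array([0, 0, 0, 1, 1, 1])
toy = np.where(group[:, None] == group[None, :], within, between)
np.fill_diagonal(toy, 0.0)  # individuals are not painted with themselves

scores, explained = pca_from_chunk_counts(toy)
print(scores[:, 0])   # first component: one sign per group
print(explained)
```

In this toy case the first component simply separates the two groups. With real chunk counts the scores can be plotted and labelled by population, which makes it easier to check whether a component really is, say, an Atlantic one.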
1 comment:
Nice work, I'm glad this is panning out. I'm the author of fineSTRUCTURE so I'm following with great interest!
I don't know if you are aware, but there is a "high contrast" colour scheme that might help visualise the individual level variation better (Organise->Change Colour Scale). If it isn't good, you can change it arbitrarily.
You can also name the populations which might help you (and the readers?) figure out what is going on in the PCA without the spreadsheet. I'm a little worried about using the PCA plots straight out of the program as it was just an exploratory data analysis feature I threw in - I'd recommend extracting the PCA matrix and moving to R for serious work.
You can also rotate the tree to make the order more sensible. Clearly there is a lot of admixture here which requires careful interpretation - it is usually easy to see this when you find an "optimal" order as some populations don't fit it. There is no statistical way of determining it at present; we are using ADMIXTURE as a supplementary approach.
Good luck! Dan