Sunday, August 24, 2014

Genetic structure in the Western Balkans


PLoS ONE has a new paper by Kovacevic et al. on the genetic structure of Western Balkan populations. Here's the abstract:

Contemporary inhabitants of the Balkan Peninsula belong to several ethnic groups of diverse cultural background. In this study, three ethnic groups from Bosnia and Herzegovina - Bosniacs, Bosnian Croats and Bosnian Serbs - as well as the populations of Serbians, Croatians, Macedonians from the former Yugoslav Republic of Macedonia, Montenegrins and Kosovars have been characterized for the genetic variation of 660 000 genome-wide autosomal single nucleotide polymorphisms and for haploid markers. New autosomal data of the 70 individuals together with previously published data of 20 individuals from the populations of the Western Balkan region in a context of 695 samples of global range have been analysed. Comparison of the variation data of autosomal and haploid lineages of the studied Western Balkan populations reveals a concordance of the data in both sets and the genetic uniformity of the studied populations, especially of Western South-Slavic speakers. The genetic variation of Western Balkan populations reveals the continuity between the Middle East and Europe via the Balkan region and supports the scenario that one of the major routes of ancient gene flows and admixture went through the Balkan Peninsula.

Among the most eye-catching figures from the study is this TreeMix graph with ten migration edges or admixture events. Note the 44% migration edge running from the base of the Eastern European branch to the French. Is this perhaps a legacy of the Proto-Celts and early Germanics? In any case, something similar can be seen on this TreeMix graph from the supplementary PDF to Skoglund et al. 2014, where a French genome is modeled as a clade closely related to Upper Paleolithic Siberian forager MA-1, but with considerable Sardinian admixture.


Also, the position of the Poles at the tip of the tree, and thus near the North Russians, is somewhat curious. However, I know that several of these individuals are ethnic Poles from Estonia, so that might be the problem.

Update 25/08/2014: Here's a typical Eurogenes Principal Component Analysis (PCA) of West Eurasia with the new samples from this paper (Bosnians, Kosovars, Macedonians, Montenegrins and Serbs).



Citation...

Kovacevic L, Tambets K, Ilumäe A-M, Kushniarevich A, Yunusbayev B, et al. (2014) Standing at the Gateway to Europe - The Genetic Structure of Western Balkan Populations Based on Autosomal and Haploid Markers. PLoS ONE 9(8): e105090. doi:10.1371/journal.pone.0105090


54 comments:

Unknown said...

The frequency of E1b1b the sole type of E in Kosovo has plummeted to 29.5% according to this study (it was used to 47%). I guess Kosovo is not as african as I thought it was and I might consider reconsidering it as a part of Europe. Also 3 out of 17 Kosovans had H5 (which I have).

Unknown said...

However, this study only sampled 61 kosovans the one that found 47.4% of E1b1b has 114. So maybe Kosovans are radical Islamic pale skinned Africans macerating as Europeans after all. And on top of that these monsters from hell marry women with H5!. The chechnans are muslims but they are Europeans (no E), the Kosovans ARE not European in any way shape or form, I have more in common with a Bangledesi muslim.
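
The sample-size point can be checked with a quick calculation. The sketch below is not from the study: it assumes counts of 18 of 61 and 54 of 114 (back-calculated from the 29.5% and 47.4% quoted above) and uses a Wilson score interval, which is one common choice for a binomial proportion.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Assumed counts, back-calculated from the quoted percentages.
this_study = wilson_interval(18, 61)       # 29.5% of 61 samples
earlier_study = wilson_interval(54, 114)   # 47.4% of 114 samples

for label, (low, high) in [("n=61", this_study), ("n=114", earlier_study)]:
    print(f"{label}: {low:.3f} to {high:.3f}")
```

With samples this small, each interval is roughly twenty percentage points wide, so part of the gap between the two estimates may simply be sampling noise.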

Davidski said...

As you can plainly see on the diagram above, Kosovars don't show any genome-wide genetic African admixture.

Unknown said...

If you look at the populations of the world's countries in 1900 and also 1915, when Europe had 3 times more people than Africa and 4 times more people than Sub-Saharan Africa, then a Eurasian origin of E would make sense. We need to see ancient DNA to see where D and E came from.
However, many people say E1b1b is associated with Basal Eurasian, so:
A.) Do the Kosovars show higher than average Basal Eurasian?
B.) Do the Western Balts as a whole have more Basal Eurasian?
C.) Do the Chechens lack (or almost lack) Basal Eurasian?
D.) Do the South Asians lack Basal Eurasian?
E.) Do the Greeks have more Basal Eurasian than the Turks?
If any of A, B, C, D or E is incorrect, then I don't think E1b1b is associated with Basal Eurasian.
E (and D) might not be African, because it's related to D, but it is not European, and it should cause populations with it to cluster closer to Africans. In short, Kosovars aren't European from a genetic perspective.

Unknown said...

I know that the Kosovans do not have african admixture

However, could someone please give me the WHG, EEF, ANE, and particularly the Basal Eurasian breakdown of the Kosovars, Serbs and Bulgarians, as well as the Chechens and Croatians?

Thank you.

Davidski said...

See Extended Data Table 3, Table S14.9 (page 111), and Table S14.15 (page 123) here...

http://arxiv.org/abs/1312.6639v2

Gui S said...

The Eastern European edge towards France is actually basal to all of Eastern and Central Europe, while the tree lacks any more Western groups like Germans or Scandinavians. So I reckon it probably represents the excess Northern European affinity in the French when compared to the North Italians, for which Central/Eastern Europe is the best proxy.
Still, I've been playing with your IBD data for the French and there are some interesting results. Even compensating for the variety of effective population sizes (France is a total sink, with samples sharing only 2.6 cM on average with each other, the lowest in Europe; compare with 9.6 cM in Poland and 30.6 cM among the Basques), the French share a deep affinity with an area ranging from Hungary to France running along the northern side of the Alps, so the Celtic thing is definitely off the cards.

Gui S said...

BTW, interesting to see the Kosovars in there. I remember them being very close to Corsicans and other more Western Mediterranean populations in Vadim Verenich's admixture runs. They stood out heaps among other Balkanians, Greeks included, and looked like they could be some sort of fossil population.
Do you have any Kosovar samples?

Davidski said...

I do have the Kosovar samples. I'm putting them into my dataset now.

Chad said...

An interesting addition to what Gui said is that Albanians and Kosovars have some very old subclades of R1b (M269* and L23*) on top of the genetic affinities. Maybe they escaped the excess Near Eastern admixture during the mid-late Neolithic, and a group of R1b-dominant migrants hopping off the Danube and mixing in, maybe on the way into Anatolia, could make them more Western-like.

Unknown said...

I would like to see if the Gallic/Celtic invasion along the Danube river and the Balkans, stopping at the failed attempt to take Greece and finally settling in the Kosovar region (the Scordisci tribe), has anything to do with this "western" Kosovar marker.

Davidski said...

I just ran a West Eurasian PCA with the new samples. The Kosovars cluster with northern Greeks. See above.

Gui S said...

That looks a lot more normal.

I wonder how MDLP had gotten those results.
Based on his world22:
MOracle("Kosovar", k=30)
[,1] [,2]
[1,] "Kosovar" "0"
[2,] "Italian-North" "5.6134"
[3,] "Corsican" "7.1253"
[4,] "Italian_North" "10.9982"
[5,] "Provancal" "11.463"
[6,] "Portugese" "12.7248"
[7,] "Iberian" "13.689"
[8,] "Spaniard" "13.9728"
[9,] "Swiss" "14.4108"
[10,] "Italian-Center" "14.8536"
[11,] "Greek_Center" "14.9077"
[12,] "Italian-South" "16.539"
[13,] "Romania" "16.9322"
[14,] "French" "17.152"
[15,] "Ashkenazim_V" "17.3445"
[16,] "Gagauz" "19.1966"
[17,] "Sicilian" "19.2653"
[18,] "Bulgarian" "19.3145"
[19,] "Montenegrin" "20.1668"
[20,] "Puerto-Rican" "20.6073"
[21,] "German-South" "20.9881"
[22,] "Macedonian" "21.0578"
[23,] "Basque" "21.7979"
[24,] "Serbian" "24.047"
[25,] "Greek_Cretan" "25.2893"
[26,] "Ashkenazim" "25.5039"
[27,] "British" "26.5313"
[28,] "CEU" "27.095"
[29,] "Tatar_Crim" "27.9278"
[30,] "Orcadian" "27.9698"

Unknown said...

"Note the 44% migration edge running from the base of the Eastern European branch to the French. Is this perhaps a legacy of the Proto-Celts and early Germanics"

Not an accurate summary, and rather presumptive. More like a basal north Eurasian one, which would have been more discernible had they actually included some northwest European populations (i.e. Danes, Germans or Swedes), as they'd have it too. Also, the authors somehow bizarrely think that Hungary is in "western Europe".

Davidski said...

You seem to be missing the big picture there Petemeister. Let me help you out with a quote:

"A geographically parsimonious hypothesis would be that a major component of present-day European ancestry was formed in eastern Europe or western Siberia where western and eastern hunter-gatherer groups could plausibly have intermixed. Motala12 has an estimated WHG/(WHG+ANE) ratio of 81% (S12.7), higher than that estimated for the population contributing to modern Europeans (Fig. S12.14). Motala and Mal’ta are separated by 5,000km in space and about 17 thousand years in time, leaving ample room for a genetically intermediate population."

http://arxiv.org/abs/1312.6639v2

Unknown said...

Yes I see.
But the authors still should have included NW Europeans; then at higher Ks they might have been able to discern between a NW and a NE ancestry for the French, Balkan populations, etc.

Nor do I think its presence in the French is necessarily due to Germanics or "Celts", as it might have been there as early as the Late Neolithic.

Finally, I am surprised at the relative scarcity of the off-red ("West Asian/Mediterranean") component in Sardinians; they rather have a predominance of the light blue (pan-central European?) elements.

Ryukendo K said...

Peter Schwartz,

Actually this is a pattern that is quite common. In fact this run replicates Dodecad K12b extremely strongly, so strongly that it was bit surprising when I first saw it.

Once again we have:
1. a component that peaks in Georgians, like K12b caucasus, that is not present in Basques specifically but present in Sardinians, and in all IE speakers.
What's interesting and eye-catching here is that it specifically avoids the Burusho vs the other South Asians, in addition to Basque vs other IE pops, when in most other runs a component peaking in Georgian becomes discriminatory for IE in Europe only.

2. K12b Mediterranean-like component, that peaks in Sardinian+Basque, and pan European, but low elsewhere.

3. K12b Gedrosia-like component, that peaks in Indus, is present in Basques and Western Europe generally, but is missing/low in the Balkans+ C. and E. Eur.

This is again interesting because an Indus component that shows this pattern in W.Eur vs E.Eur cannot be replicated in most other ADMIXTURE runs. So its presence in Dodecad K12b was always an oddity. But here it resurfaces.

Shaikorth said...

ryukendo, if you mean figure S1, at what K-value do you see this "gedrosia-like component" present in Western Europe and the Basques? It seems to me that K=4 is the only time Western Europeans (French) share a component peaking in the Indus Valley that East Europeans do not have. However, that component is high in the Balkans, absent in Basques and well over 50% in Saudis, which doesn't really resemble the Gedrosia stuff.

Pete Schwartz, it's probably not as much that Sardinians have a Central European component but that Central Europeans have a visible "Sardinian-like" (neolithic farmer) component. The differences between K5 (Sardinians have over 50% Saudi modal component) and K6 (light blue Sardinian modal component forms and most of the "Saudi" in Sardinia vanishes) suggest very high amount of Middle-Eastern like alleles in this component which fits what we know of the farmers.

Unknown said...

Thanks boys.

1) Yes, I agree with Shaikorth. It is not really a Gedrosia/South Asian component but rather a 'North Levantine'/'West Asian highland' component, which is present in all but Basques. Yes, it is tempting to link it to IEs, but it is present in all Arab and Caucasian groups, which are obviously not IEs.

2) You're right in that the Sardinian-like component is probably the Neolithic farmer intrusions. What, then, is the significance of the fact that it 'transforms' into a new component between K=5 and K=6?

3) The actual 'South Asian' component is rather rare on the whole in Europe, at least in this data set.

The problem is how to make sense of it from a more recent prehistoric (protohistoric) aspect. How much is due to 'common ancestry' and how much is due to recent admixture?
I am not sure I trust their 'migration paths' reconstruction. Some of the results are plain odd, i.e. Tuscan migration to Macedonia, Bosnians derived from Greeks? Not very historically plausible scenarios, unless one starts clutching at straws and resorts to distant myths, etc. Perhaps an approach more like Ralph & Coop's recent paper might have been worthwhile?

And yes, these results all match closely those done by enthusiasts from the community. Literally, they are years ahead, so well done to all !

Ryukendo K said...

Shaikorth,
Pete Schwartz,

Agreed that the Sardinian component approximates the Neolithic farmers.

The component mentioned is Dark green in K=7 in Figure 2 in the main text, where it is distinctly high in Basque. It is hard to see in S1.

The only components in Basque in K=7 are the N. Euro-centered, the Sardinian-centered, and the Burusho-centered.

The Basques have the most Burusho component in all of Europe, with the French close behind.

Sardinians have N. Euro-centered, self-centered, and Georgian-centered. No Burusho.

The segregation of an Indus and Georgian centered component separately into Basque and Sardinian is very striking and probably reflects the same underlying phenomenon, whatever that might be, as found in K12b.

I think it might be instructive to analyse ADMIXTURE in light of what we know about ANE+WHG+AME. Because we know that the neolithic is introduced into Europe with an AME+WHG population, also known as EEF, and secondly that Sardinians preserve this group the best, it makes sense that the Mideast+Euro in K=5 transforms into a component centered on them specifically in K=6.

Ryukendo K said...

@ Pete Schwarz

Lol I think Treemix is not a very good tool to do this when you have say more than 10 pops in a tree.
Add as many migration edges as these authors and everyone is mixed with everyone else.

Shaikorth said...

Thanks for pointing the component out.

Although now I also see one quite obvious difference between this "Burusho" and K12b Gedrosia. Northern Russians actually have as much "Burusho" as the French and Basques, and that was not the case with Gedrosia, which was as low in North Russians as in other Eastern Europeans.

Ryukendo K said...

@ Shaikorth
@ Peter Schwarz
@ Davidski

I agree.
But Wait!... I think there actually is an explanation, which ties in with the protohistorical movements question. I think ADMIXTURE can actually give us more information than we have hitherto realized. Sorry for the long post and hope you can bear with it, I've been taking note of a lot of things and there's quite a bit to explain lol.

One thing that we have been observing very consistently is that ANE is split in two ways in Eurasia by ADMIXTURE.

One component would peak in the Indus, and this will be well represented across South and Central Asia. The other will peak in North Europeans.

Georgian + Armenian and other South Caucasus populations, on the other hand, usually get their own component, and this makes sense because they are the groups with highest AME anywhere. c.f. David's last k=6 work.

These assignments sound daring, but I believe they constitute a plausible hypothesis, because:

1. The scenario above produces the following effect: North Caucasus populations, such as Lezgins, which we know are an ANE-rich regional peak in Eurasia and much more ANE-rich than South Cauc pops, will score higher in either the Indus-centered component or the North European centered component vs South Cauc pops, in addition to the already high Georgian-centered one which they share, to indicate their high ANE vis-a-vis South Caucs. We see this across the board, in Dodecad K12b, Lazaridis' ADMIXTURE, Eurogenes K10, and Skoglund+Raghavan.
This is very consistent. You can practically predict the difference between Lezgin and Georgian in every ADMIXTURE run beforehand as long as you see which populations the components peak in. Lezgin > Georgian for N.Euro for Laz, Lezgin > Georgian for both N.Euro and Indus in Eurogenes, Dodecad and the original Skoglund paper.

2. Furthermore, Laz's and Skoglund's ADMIXTUREs indicate that Mal'ta himself has N.Euro and Indus-centered components, but never Caucasus/Middle Eastern ones.

cont'd...

Ryukendo K said...


It seems to me that ADMIXTURE is actually uncovering the different contributions of ANE+WHG+AME in different pops, but chooses to assign secondary mixes of these as pure components (e.g. WHG+ANE as a N.Euro component, AME+ANE as a Indus component), which are nevertheless very clean and well-split (c.f. Mal'ta scoring zero, zilch, nada in Caucasus, despite great pressure to do so due to influence of North Cauc groups who are high in Caucasus, meaning that ANE in Caucasus region is cleanly assigned to N. Euro or Indus components only). The clean-splitting part is very important.

In fact, this means that when ADMIXTURE varies in assigning components, the variation is easy to predict.
e.g. Given that the 'larger fraction' of ANE can be dedicated to either the Indus or North Euro component, either scenario explains
1. N.Euro present in Kalash+Pashtun if Indus component is centered in Balochi/Brahui, as Kalash+Pashtun are more ANE than Balochi/Brahui and that excess is divided into N.Euro in that particular run. This is seen in Dodecad.
2. Indus present in E.Europe if N.Euro component centers in Poles (as happens here), as West-Siberian admixed pops (e.g. Erzya, and in this case East Russian) have more ANE than Poles and ANE is more centered in Indus in this run.
3. What is more, there's evidence of this in action from Laz vs Skoglund. In Laz, all excess ANE in N.Cauc vs S.Cauc is assigned to N.Euro only, not Indus. In Skoglund, it is assigned to both. When the Malta genome is analyzed in Laz, it scores much higher in the N.Euro and lower in Indus than we see in Skoglund, where it scores equally. This means that the 'larger fraction' of ANE has been partitioned into N.Euro in Laz vs equally in Skoglund.

I think there is a lot of information hidden here which we can choose to interpret, if we choose to do this in a rational and step-by-step manner, and that ADMIXTURE might finally live up to the buzz it generated originally now that we have some ancient DNA to figure out what it is actually saying.

This is also evidence, I think, of an increased role for secondary expansions of WHG+ANE+AME mixtures, during Neolithic and later, in shaping genetic profiles. We already see this in this paper, where EEF, which is AME+WHG, is practically the same thing as the component that peaks in Sardinians, which ADMIXTURE has no problems picking out.

Granted that we are not supposed to overanalyze this, but the picture is too consistent to not hypothesize at least :]

About Time said...

@Ryu, interesting insights. Keep it up bc this material requires good minds to penetrate it. Without that it's just dumb machine output.

One note. Admixture does not and cannot infer ancestral populations (ie past states). It looks only at present states and assigns proportional ancestry based only on those present states.

Inferring past states (sans ancient DNA) would require using segment analysis to identify+isolate pieces of the "past population" then putting those together without pieces of other past pops. Then synthesizing a reconstructed "past genome" --- which only then could be used in Admixture (like a synthetic Mal'ta etc).

Unknown said...

The study did an IBD analysis for the impact of Ottoman expansion, but not broader processes. Of course ancient DNA is optimal, but I think we can still learn quite a bit by doing IBD analysis of large sets of European data, as Ralph & Coop did: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001555

Chad said...

That Gedrosia signal isn't only present in K12b and here. It is also present in the Baloch component of Harappaworld. The British also score the highest (among Europeans) in the Indo-Iranian component of MDLP, which I believe comes from the Kalash. Harappaworld's Baloch component is the one where Lithuanians and Russians score 7%, to the Brits 11%, so it may be more correct. It's hard to say. The Indo-Iranian component kind of behaves like Gedrosia, as it is 'washed out' in the Baltic region. Either way, Western Europeans have a stronger connection to this group or type of ANE.

I still think that an entrance to the Steppes via the Caucasus is the best route available, when we consider the fact that old L23 is spread as far as Mongolia, and there is also m73 in high numbers in pockets of Western Siberia, East, Central, and South Central Asia. It makes no sense to have R1b enter Europe via Anatolia, and then migrate to Mongolia. Where is the evidence of such a movement? Having L23 split around Kazan, with one to the Baltic people and the other across Crimea to the Dniester, splitting to a Balkan/Aegean group and a Carpathian group, looks more logical and follows frequency maps and clades better.

M73 may have risen in the Near East, with most migrating North, along with m269/L23. It is almost non-existent in Europe, but dominates R1b in Asia, with movements that could be tied to Indo-Iranians and later absorbed and moved by Turkic groups.

http://www.eupedia.com/europe/autosomal_maps_dodecad.shtml#Gedrosian

MY/UFtElAEysSI/AAAAAAAAD1Q/ajahtf-3HUA/s1600/MDLPindoiranian.jpg

https://docs.google.com/spreadsheet/ccc?key=0AuW3R0Ys-P4HdDhib1M5OE1wWENNb2haUFFWZzNBMEE#gid=0

Chad said...

Correct link to the Indo-Iranian component. Sorry, for the cut-off.

http://2.bp.blogspot.com/-QKR7A_zu-MY/UFtElAEysSI/AAAAAAAAD1Q/ajahtf-3HUA/s1600/MDLPindoiranian.jpg

Chad said...

David,
Were those Scottish samples in Lazaridis' paper from Argyll? If so, then the 18%ANE could be correct. Argyll is where Gedrosia peaks in Europe, at 13.3%, to 11.8% for Orkney. Scotland may average 16-17%, but Argyll at 18%, could likely be true, based on these Gedrosia numbers.

Chad said...

As far as that 12k age that was reported by someone on here, it could be the fact that the ANE involved in the Gedrosia component is a result of some Ice Age refugia East of the Caspian, before expanding. This could be where R and maybe some Q were stuck. It could be also due to the fact that the AME component was formed around the same time, as farming begins roughly 12kya. Just another thought here, on how a 12k age might be found, if both reach their unique signature at around the same time.

About Time said...

@Chad, is that Eupedia map showing a mini Gedrosia peak in Albanians?

Davidski can check that with new Kosovar dataset.

Chad said...

Yes, Albanians are around 8-9% I believe, the rest of the Balkans are around 2-5%, or something like that. Not a surprise, considering that their R1b is quite a bit higher than R1a.

About Time said...

Romans thought Gauls, Celts, and Illyrians all had same pre-Roman origin. From Sicily if I remember.

People ignore that bc of stereotypes, but maybe there is something to it.

There's also the y-DNA hg E that peaks in part of Wales. Everyone assumes it must be from Roman garrisons, but again history didn't begin with the Romans.

Chad said...

Yeah, I'm not sure of the specific subclade or how it relates to E among the Western Balkans. They didn't follow the Aegean to Italy though. R1b and Gedrosia cut off around Croatia and Slovenia. Yes, there is a supposed Illyrian connection to the region and into SE Italy. Also, I2a-Disle, and I2a-Dinaric have an MRCA at like 5-6kya. So a mix around the region to the NE of here is very possible.

About Time said...

Info on E in Britain: http://www.jogg.info/32/bird.htm

I personally think it could be Brythonic or P-Celtic, not Roman.

Robert Graves thought that invasion had something to do with Gwydion and a change in alphabet with religious meaning. Pushing Q-Celtic "Sons of Mil" to very marginal / poor Ireland, separated from the important Atlantic trade routes.

Chad said...

It was probably brought with I2a-Disle, by the German beakers, if they have a MRCA in the same timeframe. I think that R1b could pick that up if they are at Cotefeni, or anything to the W/NW of there.

Chad said...

Circum-Pontic and R1b are an interesting fit. L23 does cover that region, from the Balkans, Northern Anatolia, Southern Ukraine, the Caucasus, and to the Urals.

Chad said...

from Dienekes...

Pericic et al. (2005) give a 7.3kya estimate for the expansion of E-M78α (almost perfectly equivalent to E-V13) for Southeastern European populations north of Greece. Due to their use of the 3.6x slower mutation rate, this figure needs to be converted to equivalent years. The Nea Nikomedeia time depth was estimated as 9.2kya by King et al. Therefore, the equivalent age for the Pericic et al. (2005) expansion is (7.3/9.2) * 149 generations or 118 generations (1,540-950BC). They note that STR variance is higher in Greece, Macedonia, and Apulia, all areas with well-known historical Greek connections.

Cruciani et al. (2007) propose that E-V13 arrived in Europe from West Asia and underwent an expansion in Europe at 4-4.7 kya. This age is calculated using effective mutation rates that are 2.4 or 2.8 slower than the germline rate, which seems to suggest a Late Bronze Age or even later expansion with a rate closer to the germline one.

In the Balkans, it is fairly clear that E-V13 is mostly concentrated south of the Jirecek Line which separated native Greek from Latin speakers. In Italy, the highest frequencies are found in the south, the areas of historical Greek colonization. High frequencies are also attained in Cyprus. Cyprus also has high STR diversity, consistent with an early arrival, suggestive of both early Mycenaean and later colonizations from the Aegean.
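
The rescaling in the Pericic et al. paragraph above is simple proportion arithmetic, sketched here. The generation lengths (25 and 30 years) and the reference year (2000) are assumptions, chosen because they reproduce the quoted 1,540-950 BC range; the quoted text does not state them.

```python
# Figures taken from the quoted paragraph.
pericic_kya = 7.3        # Pericic et al. (2005) estimate for E-M78-alpha
king_kya = 9.2           # King et al. time depth for Nea Nikomedeia
king_generations = 149   # generations paired with the 9.2 kya figure

# Rescale the Pericic et al. age into equivalent generations.
generations = pericic_kya / king_kya * king_generations

def calendar_year(n_generations, years_per_generation, reference_year=2000):
    """Convert generations before the reference year to a calendar year.

    Negative results are years BC (the missing year zero is ignored).
    The default reference year is an assumption, not from the source.
    """
    return reference_year - n_generations * years_per_generation

whole_generations = round(generations)
print(whole_generations)
# Assumed generation lengths of 30 and 25 years.
print(calendar_year(whole_generations, 30), calendar_year(whole_generations, 25))
```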

I think it may have hopped on with R1b to Britain. More testing of Beaker sites will tell the story. People were going to Britain from the Netherlands to Austria from around 2400BCE. I think that the Amesbury Archer was born in Austria, or there about. So they were coming in like mad. Word of metals and grazing land?!?

About Time said...

@Chad: I think it may have hopped on with R1b to Britain. More testing of Beaker sites will tell the story. People were going to Britain from the Netherlands to Austria from around 2400BCE. I think that the Amesbury Archer was born in Austria, or there about. So they were coming in like mad. Word of metals and grazing land?!?

Or trying to get away from something on the Continent.

Graves thought it was the War of the Trees, which wasn't really about trees: http://en.m.wikipedia.org/wiki/Cad_Goddeu

2400 BCE is when Battle Axe culture rolled into C Europe: http://en.m.wikipedia.org/wiki/Battle_Axe_Culture

Urnfield and Hallstatt might have been later "replays" of similar territorial conflicts. BB might have taken refuge in the islands to the west, looking for a safe hiding place.

Shaikorth said...

"Harappaworld's Baloch component is the one where Lithuanians and Russians score 7%, to the Brits 11%, so it may be more correct."

I think it is more correct than K12b Gedrosia at any rate. The issue with ADMIXTURE is that homogenous populations form their own components rather easily and that often masks other components. "Gedrosia" is not necessarily a signal absent in Eastern Europeans but something covered up by the "North European" component which is the Lithuanian modal component, and widely shared due to Balto-Slavic expansion less than 2000 years ago. Harappaworld's "Northeast European" probably masks something too, but obviously not to that extent because the difference in Baloch between Lithuanians and British is less than 5% while Gedrosia was totally missing from the former.

Chad said...

I think that it has something to do with the boom in Anatolia before 5000BCE. New cultures start popping up and getting absorbed into places like Cucuteni Trypillian. They may have had some E1b y-DNA, with G, J, and I2a. I don't think that R1b mixed too much with them, as the Caucasus component that is so high in the Balkans, is hardly present in the British and almost zero in the Basque. Their only West Asian link is Gedrosia. I think that most of the Caucasus appearance in Central and Western Europe is due to later events, like Urnfield, and later Celtic expansions.

This Caucasus component looks linked more to G, E, J, and later I2a-Dinaric. Basically, Anatolian/Levantine. R1b entering Europe through the Balkans would surely spread this around a little better and not spare the Basque and British. Western European R1b probably never went further south than the Danube, and maybe followed the Dniester to the Carpathian Basin, and onto Germany. Either that, or two other options. R1b went through so fast, it maintained its more West Asian/Gedrosian type, or it split into two groups, one that enters the Balkans, and another that went through Ukraine, the two never mixing before one enters Western Europe. I still think the spread of old clades of L23 support a Southern Steppe entrance for Western Europeans. L23 does make that V from near Kazan, to Ukraine and Lithuania. No one migrated from Anatolia, to the Balkans, and onto Mongolia. A ride to the East had to be caught with Afanasevo, or Andronovo.

Chad said...

I think that this is how G, J, and E, came to overpower R1b in its homeland of the time, Eastern Anatolia, Western Iran and the Caucasus. They moved on and picked up some Danubian and Corded influence on the way Northwest.

Ryukendo K said...

@ AT
Thanks for the encouragement :] I will present more of my hypotheses here as time goes on.

@ All
I'm studying stats now, which might help here. Also I am in a foreign country where there is no nightlife lol, so I've decided to do a few posts. Anyone with more stats experience feel free to correct me.
I think a proper mathematical understanding of the way ADMIXTURE works would help us all understand what it can and cannot do, but, failing that a visuospatial one would do. I will try to illustrate it with analogies, in a few following posts. No math, don't worry.
Every locus that we measure in people constitutes a dimension in which it is possible to be different, in which people have a 'difference distance' from each other. So distance is a measurement of difference. If we total a huge number of SNPs, this creates a 1000s-dimensional space, where each person is a point suspended in that space, and a population is a group of points that cluster together because of their similarity in all dimensions aka all loci.
If we introduce a population, aka a cluster into this space, and allow it to split and drift, the clusters will split and move in every direction. Each population will be maximised in its own direction, and is 'opposite' to everyone else. Everyone can be 'opposite' to everyone else because this space is more than 3-dimensional, and thus there is more space than we are used to, and more directions in which to drift.
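
The "each locus is a dimension" picture can be made concrete with a toy example. This is only an illustrative sketch: the genotypes are made up, the coding (count of one allele, 0 to 2, per SNP) is an assumed convention, and plain Euclidean distance stands in for the 'difference distance'.

```python
import math

# Hypothetical individuals: one coordinate per SNP, each value being
# the number of copies (0, 1 or 2) of a chosen allele at that SNP.
individuals = {
    "a1": [0, 1, 2, 0, 1, 2],
    "a2": [0, 1, 2, 1, 1, 2],   # differs from a1 at one SNP
    "b1": [2, 1, 0, 2, 0, 0],   # differs from a1 at most SNPs
}

def distance(x, y):
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

within = distance(individuals["a1"], individuals["a2"])
between = distance(individuals["a1"], individuals["b1"])
print(within, between)
```

Individuals from the same cluster sit close together, and the same function works unchanged with thousands of SNPs.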

Because every population is opposite to everyone else if all populations split cleanly in a tree-like manner, there will be no populations that occupy a position exactly in between two other populations such that they form a straight line. Imagine a square. Each corner occupies an extreme direction, and is not directly between any two other corners. Each corner started out by drifting from the center in a direction of its own.
This means that if a population is directly in between two other populations in the multidimensional space, it is a combination of the two others. It could not have reached that position otherwise.
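
The 'directly in between' idea can be sketched with allele frequencies. The frequencies below are hypothetical, and the model is deliberately idealised: a fresh mixture sits exactly on the line between its sources, whereas in real data later drift would move it slightly off that line.

```python
import math

def distance(x, y):
    """Euclidean distance between two allele-frequency vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mix(p_a, p_b, alpha):
    """Allele frequencies of a population that is alpha A and (1 - alpha) B."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_a, p_b)]

def is_between(x, a, b, tol=1e-9):
    """True if x lies on the straight segment from a to b."""
    return abs(distance(a, x) + distance(x, b) - distance(a, b)) < tol

# Hypothetical allele frequencies at four SNPs.
pop_a = [0.9, 0.1, 0.5, 0.2]
pop_b = [0.1, 0.7, 0.5, 0.8]
admixed = mix(pop_a, pop_b, 0.25)    # 25% A, 75% B
drifted = [0.5, 0.4, 0.9, 0.5]       # drifted in its own direction

print(is_between(admixed, pop_a, pop_b))
print(is_between(drifted, pop_a, pop_b))
```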

Ryukendo K said...

What about inside the square? Well only a diagonal through the center is in between two other corners. But if we want 'within three other corners' then that increases the number of points to half the points in the square. 'Four corners', all points.
In most ADMIXTURE datasets, populations would be strewn throughout the center of the square. When we force ADMIXTURE to measure Ks, we are forcing it to create 'virtual points', which are similar to the corners in the previous analogy; we then use this to capture information by finding distances of each point from the 'virtual' points, aka defining the rest of the population as combinations of contribution from these virtual points. If all the points are in a straight line, then two points will capture most of the information. If they are in a long ellipse/sausage/finger, two points will still be sufficient to capture most information as well, though some information is lost.
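To make 'combinations of contribution from these virtual points' concrete, here is a rough sketch for the two-point case. It uses plain least squares, which is NOT ADMIXTURE's actual likelihood model, only a geometric stand-in; the coordinates and names are invented. The residual is the 'information lost' when a point sits off the line, as in the sausage case.

```python
import math

def mixing_fraction(point, ref_a, ref_b):
    """Least-squares f such that point is closest to f*ref_a + (1-f)*ref_b."""
    ab = [a - b for a, b in zip(ref_a, ref_b)]
    pb = [p - b for p, b in zip(point, ref_b)]
    return sum(x * y for x, y in zip(pb, ab)) / sum(x * x for x in ab)

def residual(point, ref_a, ref_b):
    """Distance from the point to its best two-way fit (what the fit misses)."""
    f = mixing_fraction(point, ref_a, ref_b)
    fitted = [f * a + (1.0 - f) * b for a, b in zip(ref_a, ref_b)]
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(point, fitted)))

# Hypothetical 'virtual points' (allele frequencies at 3 loci).
virtual_a = [0.9, 0.1, 0.5]
virtual_b = [0.1, 0.9, 0.5]

on_line = [0.5, 0.5, 0.5]   # exactly between the two virtual points
off_line = [0.5, 0.5, 0.8]  # same mix, but displaced in a third dimension

for name, pt in [("on_line", on_line), ("off_line", off_line)]:
    print(name, mixing_fraction(pt, virtual_a, virtual_b), residual(pt, virtual_a, virtual_b))
```

Both points get the same half-and-half fraction, but only the second leaves a residual, which is the part of its position that two components cannot describe.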

If there were only 2 loci, aka a 2-dimensional space, ADMIXTURE would probably find the four corners of a square/4 ancestral components or thereabouts, because that would explain all the variation accurately. As to how the square is 'rotated' with respect to the clusters, it would depend on which is the most 'extreme' population in the cluster, aka earliest to split wrt the center of the cluster, with the rest of the corners orthogonal to that. If there were e.g. a 'San' cluster all the way out, that would become the first corner. Unless there is an extremely endogamous, drifted pop that 'corrupts' this scenario, which ADMIXTURE is known to be prone to.

The same thing happens in the 1000s of dimensions. Here the dataset is best thought of as a giant amoeba-like constellation of clusters. The reason why a component peaks in Georgian is because there is some dimension where most populations can be seen as 'in between' Georgian and something else, aka where Georgian is 'the most extreme', and because there are so many Georgians and so many of everyone else, and many, many pops are within this dimension, this dimension is very information-rich and appears.
The more accurate way to think about this is that a 'virtual point' appears very close to Georgian, and the distance of all pops is measured from this point as it maximises information. Because most pops happen to 'align' with this virtual point, aka form an ellipse/sausage-shape with this point at one end, this point is informative.
A point off to one side of the sausage shape, by contrast, is not informative.
There are also other dimensions where it is possible to think of Georgian as between other things. But these dimensions are not as specifically information-rich and would only appear if you force ADMIXTURE to reduce its K's. For example, there is a dimension where all Africans are spread out in an ellipse and far away from Eurasians, who are all jumbled together in a knot at one end, and if ADMIXTURE is forced to choose K=2, because this dimension is by far the longest (San vs Amerindian are the most 'extreme') and is the long axis of a sausage-shape, Georgian will be defined as a combination of African+East Eurasian.

Ryukendo K said...

This means that the following is true:
1. Any aDNA or real genome of a person constitutes a 'real' point in this space that is fixed wrt the others, not a virtual one. Virtual points are moved around in the constellation in relation to their distance to 'real' points. This makes it possible to anchor our components with respect to what we know about fst and cladistics. Will explain more later.
2. Areas with a concentration of points that all go off in one direction, aka a series of closely-related points stretching out like a 'finger/sausage', would likely receive their own component/'virtual point', peaking at the 'end' of the points aka the end of the finger.
3. Virtual points/components will try to be as far apart from each other as possible so as to maximise the amount of information collected. This is saying the same thing as 'reflecting the most accurate contribution of ancestral components'. Find the largest number of viewpoints from which the distribution of the constellation looks like a sausage, then plonk a 'virtual' point at one end of each of them. If K=2 it will find a single viewpoint, a single ellipse with San at one end and Amerindian at the other. If K=3 it will 'change perspective' and find another one, and so on. Thus the square shape when K=4.
It will try not to find 2 viewpoints which result in almost parallel fingers, because that is not efficiently revealing of information.
4. If A and B person/pop are close together in space, then they will either (i) score similarly with respect to all other 'virtual points', aka have the same admixture proportions, or (ii) score very high in the same component as they are very close to a 'virtual point' that is close by, aka share the same component.
5. If populations are close together in all dimensions, then by the dynamics of the drift+admixture I presented just now they are either a) recently split or b) admixed with each other. b) applies because, remember, if they split long ago, they would each drift in their own direction 'opposite' to each other and thus cannot be close together.
This final point, plus that of whether or not something is 'in between' others, is the central reason why ADMIXTURE works.

Ryukendo K said...

This means that we can apply the following:
1. If A and B both score in component alpha, (a.k.a have allele frequencies that are similar to some degree mathematically/closer in the 'difference space', implying some kind of homogenizing input) and distant from others, implying that they are close together at the end of a 'finger/sausage', this implies that either a) a population C even further out in the sausage/finger, rich in component alpha contributed to A+B, pulling them away from everyone else or b) B contributed to A or c) vice versa. I will call this the 'ABC' rule.
2. A grouping of points/populations with similar allele frequencies, aka occupying the same space, which is also very far and drifted in a chain from the rest (aka it is possible to view them as an end of a finger shape) will likely receive a component/'virtual pop/point' that occupies the extremity of that grouping. Because that degree of elongated extremity implies information-richness with respect to genetic difference/positioning in the difference space, so it's a good place to set up an axis.
This indicates that that component/'virtual point' is the closest thing to something that contributed to everyone in differing proportions, making the elongated finger shape, terminating in that grouping, by the logic we have mentioned prior.
So that's an 'inferred ancestral component'.

Ryukendo K said...

This explains:
1. A genome that is actually ancestral, e.g. Mal'ta, fails to form a component of its own. Since there are huge clusters of pops defining both south asia and Euro 'areas/fingers', each drifted in their own direction, but none around Mal'ta, the most informative way to arrange the data involves looking for a finger leading to Kalash/Balochi, and another finger leading to Europe, and placing the 'virtual pops' at the end of these fingers/measuring along the axis of these fingers. By the ABC rule this also implies that Mal'ta is far out along both fingers, since it contributed to both, but because of WHG input into Euro and AME input into Kalash, Mal'ta is to the 'side' of both the fingers, as the fingers have drifted in a direction 'opposite' to Mal'ta in some other unrelated, orthogonal dimension (all unrelated dimensions are orthogonal), creating distance. This implies that Mal'ta will score in both N.Euro and Indus, and also that info is lost in the case of Mal'ta, which is true.
On the other hand Mal'ta is not part of the finger leading to the 'virtual point' beside Georgian, as by the ABC rule Mal'ta did not contribute to AME. So Mal'ta does not score in any component high in Georgian.
2. If we fix an admixture component in Mal'ta, aka in the supervised mode, this virtually guarantees that Mal'ta contributed to those populations in the proportions described, because now we are using a real point for a component, though this is not as 'informative'. What I mean is, should an alien look at the Davidski k=6, he would not really know that such a thing as 'European' or 'South Asian' existed, because everything is defined in terms of distance to the axis defined by Mal'ta and other very old components; they are not defined according to dimensions that would bring out most of the recent drift in the 'fingers', which account for intercontinental differences most apparent now.
3. Populations themselves form fingers in dimensions of their own, but this is insufficiently informative and will not produce a 'virtual point' in a population-specific component unless it is extremely, extremely drifted.
4. This also means that we can talk about positions and shifting 'virtual points' when discussing ADMIXTURE components. The exciting part is, we can make accurate predictions based on this. Some examples:
We know now that Indus and N.Euro-centered components/'virtual points' are ones that tend to have a fixed distance between them as they peak in two 'real' clusters that are a fixed distance apart--Indian and Euro populations, with Mal'ta in between. We also know that the closer N.Euro is to Mal'ta, the closer Indus will be to AME and further from Mal'ta to maintain orthogonality and thus information richness, and the closer Indus is to Mal'ta, the closer N.Euro will be to WHG and further from Mal'ta for the same reason.
This means that if N.Euro is close to Mal'ta, then the Indus 'virtual point/axis' will be forced towards AME pops, e.g. Balochi, due to the fact that Indian ANE+AME pops all share the same 'finger', and they are real, 'fixed' pops who do not change position while the 'virtual' pop can do so. N.Euro axis will be closer to South Asian pops high in ANE, causing elevated levels of N.Euro in Pashtun+Kalash, who are two of the closest 'real' points beside Mal'ta's point and thus closest to N.Euro. This will also cause Indus to be more distant/orthogonal from Mal'ta, and N.Euro to be closer/parallel, which increases Mal'ta scoring in N.Euro and decreases that in Indus. This is what we observe in Laz ADMIXTURE vs Skoglund ADMIXTURE.

About Time said...

@Ryu, right. Thinking visuospatially, Admixture finds clouds. Say, at the vertices of a cube. It will use the center of each cloud as a reference point in assigning proportional admixture for unknown points.

Sort of.

Main point is that Admixture just sees those vertices. It does not and cannot infer "genetic space" beyond those vertices defined by known points.

So, in a PCA of Europe, La Brana is way off beyond Basques. It's "hyper Basque" sort of. Admixture never imagines that "hyper Basque" point exists from looking at Basques, only if it has La Brana.

Admixture is "blind" to undefined data points. When it shows Gedrosia in Britain, it has no idea why. Just that relative to other defined points in West Europe, Brits are pulled a little to Kalash. AS IF they are mixed with Kalash.

But why? We don't know and neither does the software. It's just a parsimonious mathematical expression of the given data points.
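The 'hyper Basque' point can be illustrated with the same kind of toy geometry. Assuming ancestry proportions are restricted to the range 0 to 1 (as they are in ADMIXTURE output) and using a least-squares stand-in rather than the real algorithm, a point lying beyond a reference vertex gets the same 100% score as the vertex itself. Coordinates are invented.

```python
def raw_and_clamped_fraction(point, ref_a, ref_b):
    """Return (unrestricted fraction, fraction clamped to 0..1) of ref_a in point."""
    ab = [a - b for a, b in zip(ref_a, ref_b)]
    pb = [p - b for p, b in zip(point, ref_b)]
    raw = sum(x * y for x, y in zip(pb, ab)) / sum(x * x for x in ab)
    return raw, min(1.0, max(0.0, raw))

# Two known reference clouds (hypothetical 2-locus frequencies).
vertex_a = [0.8, 0.2]
vertex_b = [0.2, 0.8]

at_vertex = [0.8, 0.2]      # a sample sitting right on vertex A
beyond_vertex = [1.0, 0.0]  # a 'hyper A' sample further out on the same axis

print("at vertex:    ", raw_and_clamped_fraction(at_vertex, vertex_a, vertex_b))
print("beyond vertex:", raw_and_clamped_fraction(beyond_vertex, vertex_a, vertex_b))
```

The unrestricted fraction for the outer sample exceeds 1, but once clamped it is indistinguishable from the vertex. Only adding the outer sample as a reference of its own would make that region of the space visible.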

Ryukendo K said...

Another prediction: If Indus is close to Mal'ta, then the N.Euro 'virtual point' will be forced towards the direction of European populations, not just increasing the amount of WHG but AME/EEF as well, as they are part of the same finger! In such runs, e.g. this one, the N.Euro component 'cannibalizes' a lot of the Sardinian one, resulting in East European populations with no Sardinian/EEF input, which we know is patently false. Compare this with Dodecad, where everything is shifted the other way, and Sardinian is 'exposed', represented deep into Eastern Europe.
This also means that Skaikorth's explanation for Gedrosia is very plausible. The European 'finger' is really 2 fingers, more 'parallel/closer-knit' in the dimension leading to Polish, but also smaller, and less 'parallel' in the dimension leading to Sardinian/Basque, but a larger cluster defined this way.

When the tighter finger around Poles gains its own component, this 'exposes' the Sardinian and Basque as being not really part of that finger in some way in that particular viewpoint/dimension; they are also not really part of the 'virtual point' where all Europeans cluster around Sardinian (which is probably EEF), they are 'to the side' of both a bit, so they score whatever they are closest to.

The more interesting thing is Basque. ADMIXTURE has made the decision to assign the excess variation (i.e. the Basques are off to the 'side' of the finger, close to the base) to Indus. Why should this be?

My own take on this is that Indus is so ANE-rich in this run (c.f. the comment before this chain) that it even scores high in Georgian and Armenian, which it almost never does. This also implies that any other component/'virtual point' is pushed away from ANE, including the Sardinian-dominated one. This, plus the fact that N.Euro is somewhat ANE-poor and very WHG-rich in this analysis (Poles and Belarusians are highest in WHG of any pop, higher than any Russian), shifts N.Euro and Sardinian further away from Basque and Indus closer to Basque, causing Basque+Western European to score a small amount in this. The only other ADMIXTURE runs in which an Indus component appears in Georgian are Dodecad and (thanks Chad) HarappaWorld, and in both, the same Indus appearing in W.Eur occurs as well, in exactly the same fashion, implying that this 'pole shifting' is real. Raghavan is not comparable because it does not have an EEF pole for Sardinians + Basque to dominate.

In Eurogenes k = 10, Georgians score a little in Gedrosia, and W.Eur + Basque score a little. In Laz Georgians score 0 in Gedrosia, and W.Europe and Basque score none. This makes perfect sense with pole-shifting.

This might still be insufficient, as Basque might not have sufficient ANE to be so close to Indus, given that N.Euro is quite close to ANE as well. This means that in the space, Basque and Western Europe might still be anomalously close to Indus and not just due to the ANE contribution. This is where we have to make a judgement.

Ryukendo K said...

@ About time
Wait, I haven't finished yet, read everything first :]

Thus ADMIXTURE can reveal differences in a hierarchical manner. The more recent and informative, the easier to reveal. The higher the K, the narrower the 'fingers' and the cleaner the splits. But also smaller, less significant components/fingers, e.g. Palestinian-specific ones, at high K.
For e.g. Georgian, ADMIXTURE will form a pole there because there is a finger leading to Georgian, but AT is right in saying that it cannot infer that this is due to a further point outwards of Georgian, AME, which is pulling everything there. What it will do is fix the point in Georgian.

This does not affect our analysis too much. Mal'ta and WHG have gained their own poles in David's K=6, so it doesn't matter if the 'virtual point' for AME is situated in Georgian and not as 'extreme' outwards as AME, because this just means we have to adjust for the distances to make the 'virtual point' more Georgian-like instead of AME-like. So we can gain a handle on quite a bit of structure in our dataset anyway. What we cannot know is how much e.g. WHG is remaining, hidden in the AME figure. Probably too much, judging from the Fst distances. But we just have to keep in mind that this is shared across all pops with the 'Georgian' component, in proportion to how much of that same component they carry.

If an actual AME pop is introduced here it will likely peak in its own component, because it is surrounded by a finger from the aforementioned Georgian. This is not the same as Mal'ta, which allows us to make another prediction:

An aDNA pop with its own finger will get its own component, but an aDNA pop in between two other fingers will be defined in terms of those.

From Loschbour and Mal'ta in Laz, this is indeed what we see.


Ryukendo K said...

However this also implies that any 'zombie' person you choose to create in any space is simply an aggregate of alleles, in other words he is the same as a real human being if you get all his SNPs. So while we can say that ADMIXTURE is 'fake', 'calculated', if we use admixture to create a zombie from a particular position, aka a 'virtual point', and compare this to SNPs of a person actually in that position, it will make no difference. They're the same thing.
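As a toy illustration of the 'zombie' idea, assuming each of the two allele copies at a SNP is drawn independently with the component's allele frequency (and that SNPs are independent of each other), a synthetic genotype can be sampled straight from a 'virtual point'. The frequencies below are made up.

```python
import random

def make_zombie(allele_freqs, rng):
    """Sample a diploid genotype (0, 1 or 2 allele copies) at each SNP.

    Assumes the two copies are independent draws at frequency p.
    """
    return [int(rng.random() < p) + int(rng.random() < p) for p in allele_freqs]

# Hypothetical allele frequencies of one 'virtual point' at 5 SNPs.
virtual_point = [0.10, 0.35, 0.50, 0.80, 0.95]

rng = random.Random(42)
zombies = [make_zombie(virtual_point, rng) for _ in range(2000)]

# Averaging many zombies should land back near the virtual point.
recovered = [
    sum(z[i] for z in zombies) / (2 * len(zombies))
    for i in range(len(virtual_point))
]
print([round(f, 2) for f in recovered])
```

Each zombie is just a list of genotype calls, the same kind of data you would have for a real person sitting at that position in the space.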

By remembering that real populations constitute fixed points in that multidimensional space, we can make informed judgements about the 'virtual' populations/points that ADMIXTURE chucks at us.

Last of all, if multiple ADMIXTURE runs generate similar 'virtual' components, that increases the likelihood that we would find an ancestral population whose real allele frequencies, aka real position in the space, match the position of that hypothetical population. Of course, remember that we are finding 'fingers' on different scales, so at the highest K this creates an ancestral pop for each population, and at the lowest K=2 it doesn't even really find an ancestral pop, just the longest distance in which it is possible to vary, which brings us all the way back to OOAfrica. So everything becomes a matter of scale and timing, with recent migrations (more distinct 'fingers') easier to find than ancient ones.

Davidski said...

Here you go Ponto.

https://drive.google.com/file/d/0B9o3EYTdM8lQM1pUd1R6OUN2LTg/edit?usp=sharing

J.S. said...

Gui was wrong about the French. It's with Germany we share a deep affinity, not Hungary...
The long green stick corresponds to Germany, not Hungary.
From 4335 to 2335 years ago and 2335 to 1515 also.

So the Celt hypothesis is not cut off.

Unknown said...

I have 10% Baloch ancestry on HarappaWorld autosomal... does this mean I have Baloch ancestry?