Saturday, March 19, 2016

Indo-European phylogeny + Y-DNA R subclades

Here's one of the most sensible Indo-European phylogenetic trees of this type that I've seen to date. It comes from a recent paper from Tuebingen University (see here). I added the red Y-chromosome haplogroup R subclade labels. Can something like this work as well, or even better, with other haplogroups?

My decision to mark the Tocharian branch with R-M417* is based on the latest available information on the paternal markers of the Tarim Basin mummies (scroll down to the second part of the post here). The rest of my choices should be self-explanatory.


Taraka Rama, Ancestry sampling for Indo-European phylogeny and dates, InProceedings (conference paper), 2016-03-07


capra internetensis said...

Looks pretty plausible, apart from Walloon being the most basal branch of Romance, lol. Mind you the other tree in the article dates PIE to like the beginning of the Holocene. :)

Onur said...

Looks pretty plausible, apart from Walloon being the most basal branch of Romance, lol.

I interpreted it to mean that Walloon is the most divergent Romance language according to the tree, then comes French, after which comes Provencal. These actually make some sense. On the pan-IE scale the most divergent languages are the Anatolian IE languages according to the tree, which also makes sense. Divergent does not mean more archaic; the opposite is more likely to be true: the most divergent languages tend to be the ones that are the most innovative.

capra internetensis said...


I agree with you that French is very divergent, but the tree is supposed to be estimating when the different languages diverged from each other historically, not grouping them just by how innovative they are. When your tree says that Walloon branched off of Old Latin in 700 BC and that Umbrian is an early French dialect, something has gone wrong with your method.

rozenfag said...

Afrikaans clustered with Frisian, Polish clustered with Russian, Ukrainian and Belarussian? Seems strange.

Davidski said...

It's probably hard to accurately classify modern languages with this type of methodology, because they weren't as isolated as the ancient languages during their development.

Maybe Afrikaans shows higher affinity to Frisian because of contacts with English? Similarly, there were a lot of contacts between Polish and Ruthenian during the Polish-Lithuanian Commonwealth days, while other West Slavic languages like Czech and Sorbian were almost wiped out by German.

George Okromchedlishvili said...

Polish and Russian definitely have a lot in common. The only "problem" for Russian speakers is the Polish pronunciation but reading the texts is relatively easy. Czech is certainly more distant.

Nirjhar007 said...

This is nonsense. Utterly premature and lacks practical approach.

Simon_W said...

The tree mixes together clades of R1a and R1b as if there was no R1a/R1b split. Since that split predates the time of PIE considerably, in principle you could also mix in other haplogroups typical for certain subclades of IE. And (especially as a Germanic R1a guy) I'm reluctant to identify Germanic with R1b-U106 only. In large parts of Scandinavia the incidence of R1a equals or surpasses the incidence of R1b-U106. I don't think this is evidence for a Proto-Balto-Slavic substrate in Scandinavia. I think it rather reflects the intermediate linguistic position of Germanic in between Italic/Celtic and Baltic/Slavic.

Simon_W said...

R1b-U106 is well associated with Germanic languages, but it peaks in the Netherlands and England. Not quite the oldest core Germanic areas I would say.

Ryan said...

I agree re: Simon_W's point. R is older than PIE, so merging R1b and R1a is fairly nonsensical unless you're trying to suggest a late Steppe origin for Berbers and Chadic speakers. Perhaps do two parallel trees in different colours?

Davidski said...


R is older than PIE, so merging R1b and R1a is fairly nonsensical.

Doesn't matter how old R is. It can be thousands of years older than PIE.

What matters is that both R1a and R1b were present in the populations that gave rise to PIE, and thus several subclades of R1a and R1b can be reliably associated with the expansions of certain IE groups, even though it's likely that these R1a and R1b subclades are much older than the linguistic groups, or even PIE itself.


R1b-U106 is well associated with Germanic languages, but it peaks in the Netherlands and England.

So do you think that R-U106 Bronze Age Scandinavian, RISE98, came from the Netherlands or England, or is it more likely that his descendants eventually ended up in the Netherlands and England?

And the fact that R1b almost reaches fixation in Iberia and Ireland today means that Romance and Celtic languages came from Iberia and Ireland?

Ryan said...

David - I agree, hence why I suggest two different but connected trees.

Nirjhar007 said...


I think it's a bit of free time and boring, yes.

Why not do some stuff like, for example, that Kum6 genome? :)

Anonymous said...

The Romance tree at least is completely wrong. It puts Catalan closer to Spanish than to Provencal, Walloon no closer to French than to any other Romance language, Walloon as diverging from Italic hundreds of years before Latin diverged from Umbrian - hundreds of years before any Italic speakers reached Wallonia! What, were the Proto-Wallonians a secret society ambling around in Italy secretly for a thousand years before accompanying the Roman legions to Wallonia and supplanting them? It has Romansch closer to Spanish than to French! It has several Romance languages further from each other than they are from Sardinian, which is so clearly the most distantly related of the languages!

I don't know as much about the rest of the tree, where there's been less serious research and there is less historical attestation, but there are some obvious issues elsewhere as well, as have been pointed out (Frisian, Polish; also Swiss German, which isn't actually closely related to Luxembourgish). And the dating is extremely suspect. Campidanese and Nuorese clearly did not diverge only 500 years ago! But Gothic probably has to have diverged from West/North Germanic more recently than 500 BC, and it's hard to believe that East, West and South Slavic only diverged around 1000 AD, because the Slavic population expanded 500 years earlier and the different groups had been largely out of contact for most of that time (the South Slavs, at the very least).

The problem seems to be that the method cannot adequately account for areal influences, which in linguistics can be extremely extensive. So neighbouring sibling languages like the Sardinian languages look more recently diverged than they really are. Languages that are not closely related but are geographically and culturally close, like Spanish and Catalan or Italian and Friulian, or Polish and Belarussian, or Frisian and Dutch, look more related than they are. And distant languages or peripheral languages with more influence from outside look less closely related (e.g. Gothic broke off later than the model shows, but was more isolated than the other Germanic languages were; the peripheral status of Walloon, with considerable archaism and extensive influence from Germanic, probably explains why it gets such a weird position).

This is why lexicostatistics is so widely regarded as a pseudoscience. Notice how in the paper the author admits that they had to rig their program to ensure that even the high-level families came out intact - rather implying that without being told, the program might not even have been able to recognise them itself!

Lexicostatistics might be useful as a first parsing of a little-known language family, an indication of where to look in more detail. But it's like trying to guess the contents of a room by listening to echoes and feeling the draughts and fumbling along the walls - you don't keep trying to do that once you've found the light switch! In Indo-European, we've found the switch, and serious work has been done, and lexicostatistical 'evidence' can now be ignored in favour of consistent applications of the comparative method (which recognises the importance of the relative ORDER of changes, not merely the fact of similar changes having happened), supplemented by analysis of historical records and archaeology. Continuing to draw up trees on a lexicostatistical basis is sort of like (and indeed analogous to) drawing up genetic trees on the basis of skull shape and skin colour...

Unknown said...


I posted a preprint of the forthcoming paper on the effect of tree priors in Indo-European phylogenetics on arXiv. The link is pasted below.