
Tuesday, June 19, 2018

An exploration of distance-based models of language relationships with a special focus on Indo-European (Kozintsev 2018)

The latest edition of the Journal of Indo-European Studies includes an interesting methodological paper by Alexander Kozintsev, in which the author tests the relationship between Indo-European and other language families using lexicostatistical data and a wide range of distance-based models (see here). My impression, after reading the paper a couple of times, is that we probably have a long way to go before someone comes up with a robust enough way to study languages with these sorts of methods, which are more widely used for the classification of living things.

However, note that Kozintsev's results are very consistent in placing Indo-European, including Hittite (HIT in the figure below), significantly closer to Uralic than to any of the language families south of the Caucasus. This is in line with the general consensus amongst historical linguists working with more traditional methods of studying languages, and, if true, has significant implications for the search for the Proto-Indo-European (PIE) homeland. Why? Because it's very difficult to imagine the PIE homeland being located anywhere south of the Caucasus considering the present-day distribution and likely homeland of Uralic languages well to the north of this region. Emphasis is mine:

The paper explores the informative potential of various distance-based methods of language classification such as cluster analysis, networks, and two-dimensional projections, using lexicostatistical data on 41 languages belonging to seven families (IE, Uralic, Altaic, Yupik-Chukchee, Kartvelian, Semitic, and North Caucasian) represented in the STARLING database. Rooting and weighting are of critical importance, radically affecting the graphic models. Special focus is made on two-dimensional charts generated by the multidimensional scaling and on the little-used minimum spanning tree method. The latter two techniques are employed to test the hybridization/Sprachbund theory of Indo-European origins. The “Semitic” tendency of IE relative to Uralic is significant whereas neither the “Kartvelian” tendency nor the North Caucasian substratum hypothesis are supported by the two-dimensional models.


Finally, having come full circle, we return to our working hypothesis––that IE is closer to Uralic than to any of the “southern” families. I did not test this assumption because it appeared almost self-evident; now it can be easily tested by the same analysis. But, in fact, even statistical testing is unnecessary, because the triangle data cited above speak for themselves. IE, according to these data, is 20.8% closer to Uralic than to West Caucasian; 18.4% closer to Uralic than to East Caucasian; 13.7% closer to Uralic than to Kartvelian; and 16.9% closer to Uralic than to Semitic. Given the statistical reliability of a 5.6% difference (see above), all these values are highly significant a fortiori.

Kozintsev, Alexander, On Certain Aspects of Distance-based Models of Language Relationships, with Reference to the Position of Indo-European among other Language Families, Journal of Indo-European Studies, Vol. 46, 2018, No. 1 & 2, pp. 1-264
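
For readers unfamiliar with the minimum spanning tree method mentioned in the abstract, here is a minimal sketch of the general idea in Python, using Prim's algorithm on a small distance matrix. Note that the distances below are invented purely for illustration and are not taken from Kozintsev's paper or the STARLING database, the function name is my own, and this is not necessarily how the author computed his trees.

```python
# Toy illustration only: the family names are real, but the pairwise
# distances are INVENTED and do not come from Kozintsev (2018).

def minimum_spanning_tree(labels, dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of (label_a, label_b, distance) edges that connect
    all items with the smallest possible total distance.
    """
    in_tree = {0}  # start growing the tree from the first item
    edges = []
    while len(in_tree) < len(labels):
        best = None
        # Find the shortest link from the tree to any item outside it.
        for i in in_tree:
            for j in range(len(labels)):
                if j not in in_tree and (best is None or dist[i][j] < best[2]):
                    best = (i, j, dist[i][j])
        i, j, d = best
        in_tree.add(j)
        edges.append((labels[i], labels[j], d))
    return edges


labels = ["IE", "Uralic", "Kartvelian", "Semitic"]
dist = [  # hypothetical lexicostatistical distances (0 = identical)
    [0.00, 0.70, 0.85, 0.88],
    [0.70, 0.00, 0.90, 0.95],
    [0.85, 0.90, 0.00, 0.92],
    [0.88, 0.95, 0.92, 0.00],
]

tree = minimum_spanning_tree(labels, dist)
for a, b, d in tree:
    print(f"{a} -- {b}: {d:.2f}")
```

The tree simply keeps, for each item, its cheapest connection to the rest, so with real data the pattern of links depends entirely on the input distances. That is also why the choice of word list and weighting matters so much for this kind of model.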


Slumbery said...

If this were due to ancient relatedness and not later connections, that would have implications for the Uralic homeland too. Namely, it would rule out far eastern or deep Siberian root areas. Either that or (proto/pre) Uralic-like languages were very widespread at some point.

Ric Hern said...

It makes sense looking at MtDNA Haplogroup U5a....

Ric Hern said...

I wonder how much Semitic Languages were influenced by Caucasian and Pre-Indo-European Anatolian Languages ? Maybe Semitic shares a common substrate with Indo-European due to Neolithic Anatolians ? Proto-Semitic doesn't seem to be older than Proto-Indo-European. So a direct link between Semitic and Indo-European does not look plausible...

EastPole said...

Kozintsev’s article is available at

That link between Hittite and Uralic is interesting as this would contradict crazy Carlos theory that Corded Ware was Uralic speaking and Yamnaya was IE. Actually it was the opposite: CWC was IE and Yamnaya was Uralic.

I have always had trouble believing that the Volga-Don steppes, the driest and easternmost part of the Pontic-Caspian steppe zone where the oldest Yamnaya variants appeared, were the place where PIE religion and culture were born. It contradicts everything I know about IE culture and religion. But it could be the Uralic homeland. As the steppes became even drier, Indo-Slavs from Sredny Stog moved first north and west, creating CWC, and pre-Ugro-Finns moved north along the Urals. Afanasievo could be Uralic as well. Some Yamnaya Uralics mixed with IE (pre-Hittite) moved south to Anatolia, and this would explain Kozintsev’s results.

So the expansion of Uralic languages (red arrows) and Indo-Slavic languages (blue arrows) could look like this:

It is all speculation of course, but we have to be open to many possibilities. The PIE problem is far from being solved.

Ric Hern said...

@ EastPole

What you say about Yamnaya doesn't make sense. Can you remind us how much Yamnaya like was Corded Ware ? Now compare this with the percentage of Uralic influence on Indo-European....

Davidski said...

Here's the direct link to the freely available version of the paper...

On Certain Aspects of Distance-based Models of Language Relationships, with Reference to the Position of Indo-European among other Language Families

EastPole said...

@Ric Hern
“Can you remind us how much Yamnaya like was Corded Ware ?”

I am not aware of any Yamnaya Y-DNA in CWC.

Dmytro said...

"I am not aware of any Yamnaya Y-DNA in CWC."

Most Yamna territory remains genetically unexplored re aDNA. I have no doubt that a lot of R1a will emerge.
But we know in any case that autosomally CWC and Yamna are very close
And Y-DNA constitutes a miserly portion of total genetic identity anyway.

So you might rethink your answer.

EastPole said...

“So you might rethink your answer.”

I can only present my way of thinking. I am speculating and not claiming to be right. Would very much appreciate learning about other ideas.
I believe from my own experience that IE is more related to Uralic than to other language families. This seems to be confirmed by Kozintsev's results.
Four genetic components participated in the formation of the population which first spoke PIE: EHG, CHG, EEF and WHG.
I cannot prove it but it seems reasonable to me to look for common IE and Uralic link in EHG. Just from geographical distribution. So I assume that EHG spoke Indo-Uralic languages.
Eneolithic steppe component was formed from EHG +CHG. What language families were spoken by Eneolithic steppe populations? My guess is pre-IE and pre-Uralic. I assume that pre-IE was spoken in Western Eneolithic steppe and pre-Uralic in Eastern Eneolithic steppe.
Then EEF + WHG arrived and mixed with Eneolithic steppe. This resulted in Sredny Stog culture R1a dominated in the west and Repin and early Yamnaya R1b dominated in the east.
I assume SS was PIE as from it CWC emerged which is linked with Balto-Slavic and Indo-Iranian languages.
Repin, early Eastern Yamnaya, Afanasievo could be proto-Uralic or some other unknown languages. The R1b clade present in Yamnaya is still present in this area and does not seem to be linked with IE.

Grey said...

"significantly closer to Uralic than to any of the language families south of the Caucasus"

if correct this might counter the argument that PIE was originally a mountain or near-mountain language

Ric Hern said...

@ EastPole

I see what you say but could it not also mean that the Proto-Indo-European which entered Uralic was transferred via the female line especially when looking at MtDNA Haplogroup U5a ?

Remember that both R1a and R1b were present on the Steppe at +-8000 BCE and in Khvalynsk. There were several Cold Dry periods on the Steppe from +-10 000 BCE onwards. The Steppe was warmer and less arid between 6000 and 4000 BCE. which in my opinion was the ideal time for the formation of Proto-Indo-European.

If I remember correctly, MtDNA Haplogroup U5a was present in both Samara and Sredny Stog, which could point to an underlying relatedness no matter the Y-DNA differences.

Ric Hern said...

It could be that several Sister Branches of a Steppe Language formed during the dry periods when some Steppe people moved North and others South admixing with Northern and Southern Populations and then recombined during the Wet period to form Proto-Indo-European before the big migration....

We simply do not know how many different Mesolithic Steppe Languages contributed towards the formation of some or most parts of Proto-Indo-European.

We see three Mesolithic Y-DNA Haplogroups at Derievka. R1a, R1b and I2a. Could it be that I2a was the Proto-Uralic guy in the Mix ? Did the Balkan R1b people speak Proto-Uralic ?

This is why I think that R1b were not Uralic.

About the near Mountain thing....The Ural Mountains the Carpathian Mountains and the Crimea have Mountainous terrain...Mouflon Sheep existed in the Crimea till about 3000 years ago if I remember correctly. The Caucasian Moose is another interesting species that went extinct in the 1800s....

wastrel said...

Unfortunately the results seem fairly meaningless.

Lexicostatistics is an extremely iffy business in any situation, as it only works in specific circumstances, and its basic assumptions are known to be wrong.

But to the extent that lexicostatistics can be useful, it's useful when you know you're dealing with related languages, and you're fairly confident that you're not dealing with areal influence. In that case, variation can be modeled as deviation from a common ancestor. Where languages differ from plants, however, is that the 'genes' from one language can pass into neighbouring languages without it reflecting a common ancestor!

PIE and Uralic, however, are not known to be related, and they ARE known to have been in an areal relationship. Common lexical items have been found - but not only are they believed to have been borrowed into Uralic, we know that they must specifically have been borrowed from Indo-Iranian (or an IE branch very similar to II, at least). Centuries if not millennia of areal influence make any attempt to trace variations from a common ancestor, in the absence of demonstrable sound laws established through the comparative method, effectively useless. [and there are no demonstrable lexical items shared that cannot be explained as later borrowings]

I'd also caution that a lot of what PIE and Uralic may have in common has also been argued to be in common with a bunch of other languages - Mongolic, Turkic, Tungusic, and even Eskimo-Aleut. If there is a relationship, therefore, it may be extremely deep, and given the geography it would probably have to be associated with the ANE component on the steppe - which would match the hypothesis that Uralic is itself a later migrant from Siberia.

Meanwhile, if PIE and Uralic DO have a particularly close connection, that we've somehow failed to work out so far, that could be a problem in its own right. Because proto-Uralic was probably spoken thousands of years after PIE, and the history of pre-Uralic to that point is even less clear than that of pre-PIE. Which means that if PIE WERE to have come from south of the Caucasus, as a pre-Indo-Hittite, it may just as well have been as pre-Indo-Uralic.

So, yes, it's fair to say that IE and Uralic languages look suspiciously similar and may be related - we've known that since the 18th century. But until we can actually prove that relationship isn't just coincidence and areal influence - and pin down when the split occurred and where the different groups might have lived - it doesn't really tell us much.

Ric Hern said...

There seems to be a probable ancient connection between Yukagir, Uralic and Indo-European, but this is also a fringe theory.

However seeing the oldest Y-DNA Haplogroup R was found near Lake Baikal (Mal'ta Buret Culture) there could have been some sort of linguistic contact or common origin.

However, we do not know whether these faint similarities could have been transferred via the Afanasevo Culture, which migrated to the East, or via even later probable Indo-Europeans...

Years ago I read about some very faint similarities between Irish Gaelic and Finnish but this is not generally accepted...but I can not help myself wondering about an apparent Finnish Myth/legend which mentions about how domesticated cattle arrived in Finland.

Apparently a Queen with red hair arrived with ships and with her she brought cattle. The earliest cattle remains in Finland date to +-2000 BCE. The other thing that is striking is that the Irish Moiled Cattle breed looks very much like the Finnish Cattle breed.

Could this have been Bell Beaker People who traded with the Finns ?

EastPole said...

@Ric Hern
Don’t forget about very powerful Uralic steppe tribes like Hungarians for example.
Look at the distribution of Altaic languages:

They all came from the steppe. The question is from which part of the steppe?

Turkic people have a lot of East Asian component and look Asian, so they most likely came from the eastern part of the steppe. They have a lot of R1a and R1b (Yamnaya R1b). This Y-DNA was most likely assimilated from Indo-European and Uralic steppe groups which originated from Andronovo, Afanasievo, Eastern Yamnaya or others.

Uralic people look European. Look at Mordvins, Estonians, Finns, Hungarians. They came from the steppe around the Urals and expanded north, conquered North-Eastern Europe, conquered Siberia, before Turks did the same. They assimilated some IE and DNA.

I wouldn’t have problems with accepting crazy Carlos theory that R1a was Uralic if Indo-Iranians didn’t exist and if the oldest pastoralist’s R1b on the steppe were not found east of the oldest pastoralist’s R1a on the steppe. There would be some problem with explaining very archaic and conservative nature of Balto-Slavic languages but one can argue that they were indoeuropeized very early and preserved their languages intact because they didn’t mix much with other people.

But Indo-Iranians do exist and the oldest pastoralist’s R1a on the steppe was located west of the oldest pastoralist’s R1b on the steppe. IE were located west of Uralics on the steppe. R1a-M417 is undoubtedly an IE Y-DNA. The only one undoubtedly IE-Y-DNA. It is IE Y-DNA by definition, because it is the only link between India and Europe.
One cannot question these facts by some pseudo-scientific linguistic reconstructions crazy Carlos or other crazy linguists are doing. Reading their papers is really depressing, it is so boring and seems so utterly useless because they cannot explain anything or even agree on anything.

Groo Salugg said...

- RISE98 for example
- CWC from before the female mediated EEF admixture are autosomally identical to Yamna
- CWC is just a Yamna flavor and nothing really hints at these speaking different language families
- spread of Uralic is a post-Yamna phenomenon

@EastPole after rethinking
The European<->East Asian look forms a gradient, not a split.
The Uralic vs Turkic distribution does not align with that either.
Nenets look fully East Asian, Turks mostly European with a minor East Asian admixture.
I think ProtoUralic comes from Okunevo, who adopted the warrior culture from Sintashta and went to conquer Siberia, imposing their language on various previously unrelated people.

I thought there were both PIE and Indo-Iranic layers in Uralic?

EastPole said...

@Groo Salugg
“I think ProtoUralic comes from Okunevo”

“Okunevo people could therefore be a remnant paleo-Siberian population with possible Afanasievo input, as suggested by the presence of the R1b1a1a2a subhaplogroup in one individual.”

They also say that Afanasievo has no link with Tarim Basin and Tocharians. So there is no evidence that Afanasievo was IE.
Genetic evidence suggests that bearers of the Afanasievo culture were partly ancestral to those of the Okunevo culture. There are also archeological links.
How do we know that Uralic languages came from paleo-Siberian population and not Afanasievo? How do you explain similarity between IE and Uralic languages, usually explained by close geographical vicinity and postulated Indo-Uralic origin. And how do you explain many Uralic populations with very high Yamnaya component?

Ric Hern said...

@ EastPole

Tocharian shares Proto-Indo-European archaisms with Celtic, Italic and Hittite and not with Indo-Iranian, Slavic and Baltic, plus Tocharian was apparently a Centum language and not Satem. How do you explain this ?

We see R1b among the Uyghurs. Did Tocharian speakers migrate into the Tarim Basin from the East(Mongolia) and replace or admix with some of the R1a Tarim Mummy people ?

Archaeology shows admixture and displacement of Afanasevo by Okunevo people who had more Mongoloid like features than Afanasevo. Where did the Afanasevo people who were not caught up in the admixture go ?

Shaikorth said...

@Groo Salugg, Eastpole
Okunevo and Neolithic/BA Western Baikal region apparently do not have the N-L1026 lineages found in all Uralic speaking populations and Balts so in a genetic sense they're poor fits unless matrilineal migrations spread the language.

Re: Tarim, R1b in more southern regions of Central Asia may not be Afanasievo-derived. PH155 at least is found in Uyghurs and its further distribution doesn't fit Afanasievo.

Ric Hern said...

How many actual Tocharian remains were tested ? None so far, so how can we be sure ?

Ric Hern said...

Maybe some connection to Uralic ?

Grey said...

so *if* the near-mountain thing is correct the current candidate list would be:
south of black/caspian seas (zagros etc)
tien shan

old europe said...


You forgot


Ric Hern said...

@ Grey

There is also mountainous terrain in the Crimea. Like I said before, the Mouflon Sheep, which are mountain dwellers, became extinct in the Crimea only 3000 years ago.

Ric Hern said...

@ Grey

Ebizur said...

Shaikorth said,

"Re: Tarim, R1b in more southern regions of Central Asia may not be Afanasievo-derived. PH155 at least is found in Uyghurs and its further distribution doesn't fit Afanasievo."

Another subclade of R1b in which I am interested is R-L23 (formed 6400 ybp, TMRCA 6100 ybp) > R-Z2103 (TMRCA 5500 ybp) > R-Z2106 (TMRCA 5300 ybp) > R-CTS8966 (TMRCA 4300 ybp). It appears that most known members of R-CTS8966 are Chechens, Armenians, or Sino-Tibetans (Beijing, Jiangsu, and Bhutan). There is a subclade, R-CTS8966 > R-CTS347 (TMRCA 4100 ybp) > R-Y37188 (TMRCA 3600 ybp), that currently is represented on YFull only by three individuals from the Chechen Republic. A basal member of a sister clade, R-CTS8966 > R-BY3295 (TMRCA 3800 ybp), who reports ancestry in Baden-Württemberg recently has been added to YFull.

It would be nice if someone could get a reading on CTS8966, CTS7763, CTS2437, or Y125191 for some ancient specimens. This clade is not at all common among extant European populations, but it is a subclade of R-Z2103, which predominates in specimens from the Yamnaya archaeological horizon, and many subclades of which are common among extant Armenians. It has been found in people in eastern China and in Bhutan, and those individuals' MRCAs with present-day people in Caucasia and Europe may date back to the third millennium BCE. I think it may be a candidate for an Afanasievan and/or "Tocharian" lineage, and may have been responsible for the introduction of metallurgy to Sino-Tibetans.

Grey said...

Old Europe / Ric Hern


Groo Salugg said...

I think the steppe ancestry of Uralics is actually Steppe_MLBA (75% Yamna, 25% EEF), rather than pure Yamna, is that correct?
The similarities between IE and Uralic are indeed deep and striking, but there are very few of them. There are orders of magnitude more features where IE and Uralic are completely different.
Overall it looks rather like an IE influence on Uralic than a (reasonably recent) common origin.

There is no N1c in Yamna.
Autosomally, Yamna and Afanasievo are identical, their YHgs are sister clades.
Almost certainly these cultures (and CWC) spoke (dialects of) the same language.

There is N1c in Okunevo, though a different subclade.
There could have been a higher variety of the Hg initially, which got subsequently reduced due to the pruning effect of the IE culture.
I do not see an easy way for a foreign male Hg (and potentially language) to enter an IE culture, but in Okunevo it apparently happened.
That said, it could have started as an originally Indo-Iranic expansion north, that got later somehow dominated by local foragers.

Shaikorth said...

@Groo Salugg

Okunevo's N1c is apparently this: , a branch formed considerably before Okunevo and its predecessor cultures. Looks too old and diverged for Uralic or Indo-European timeframes. Besides a few Turkic tribes and Okunevo itself it's been found in one of the later (autosomally Okunevo-admixed) Baikal HG's suggesting Okunevo spread east if it spread at all. Overall it looks like a dead end, as does Afanasievo - if it actually can be proven to be ancestral to Tocharian it just took longer to disappear.

J Pystynen said...

Don't read too much into the results: this entire study is distance-based instead of ancestral state reconstruction based (says so right in the name). As such it will likely latch on to spurious similarities instead of real etymological connections, and also fail to usefully root anything at all. This explains e.g. the bizarre neighbor-joining tree in Fig. 5, which shows Indo-European as paraphyletic versus everything not-IE-or-Uralic in a single branch … which should be all you need to know about the reliability of this type of study.