Eurogenes Blog: formal statistics

Saturday, June 27, 2020

Major updates to ADMIXTOOLS

An important message from Nick Patterson:

Dear Eurogenes bloggers,

Many of you use ADMIXTOOLS and you might like to know that there is a new release on github [LINK] with some important enhancements.

From the README

*** NEW ***

1) Version 7.0 has numerous upgrades.

a) Two new executables, qpfstats and qpfmv, allow precomputation of an f-statistic basis. This can greatly reduce computation costs.
b) qpAdm, qpWave, qpGraph support qpfstats output as input.
*** This is a much improved way of running with allsnps: YES. ***
c) A new experimental feature of qpGraph (halfscore: YES) allows comparison of 2 phylogenies + a (weak) goodness of fit score. Be careful if running with a large number of populations and consider reducing block size say blgsize: .005

2) Note that several of the new ideas implemented in version 7.0 were developed collaboratively with Robert Maier, who has implemented them along with the great majority of other ADMIXTOOLS functionality in R: See https://github.com/uqrmaie1/admixtools
Executables run fast, and it has features not available in this C version, such as interactive exploration of graph phylogenies.
A manuscript describing the algorithmic ideas and providing documentation of the methods is in preparation.

qpfstats is the most important new executable. This estimates f-statistics and covariance on a basis.

a) This can be passed into other programs of the package without having to reaccess the genotype files, greatly speeding the computations.
b) In allsnps: YES mode a new computation is carried out (explained in qpfs.pdf) that is much more logical when there is a lot of missing data. Sometimes standard errors are greatly reduced.
qpfstats can be used with up to 30 populations. Much beyond that the output files become large.

As usual there may be bugs...

Nick Patterson 6/27/2020
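On the 30 population limit mentioned in the message above, here's a rough sketch of why the output grows so quickly. It assumes the basis holds one f-statistic per pair of populations and that the covariance is stored as a full square matrix over that basis. That's an assumption made for illustration, not something stated in the README excerpt, and the function names are hypothetical.

```python
def basis_size(n_pops):
    """Number of basis f-statistics, assuming one per pair of populations."""
    return n_pops * (n_pops - 1) // 2


def covariance_entries(n_pops):
    """Entries in a full square covariance matrix over that basis."""
    k = basis_size(n_pops)
    return k * k


# The basis grows quadratically and the covariance roughly with the fourth power.
for n in (10, 20, 30, 40, 60):
    print(n, basis_size(n), covariance_entries(n))
```

Under these assumptions, 30 populations give 435 basis statistics and 189,225 covariance entries, and doubling the population count multiplies the covariance size by roughly sixteen.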

Update 29/06/2020: As pointed out above, qpfstats is the most important new executable. Indeed, Nick Patterson now recommends that qpAdm analyses run with the allsnps: YES flag should be based on qpfstats output.

Several of my recent blog posts featured qpAdm models run with the allsnps: YES flag, but they were based on genotype data because obviously I didn't know anything about qpfstats at the time.

So I went back and ran some of these models again, just to make sure that they were still relevant. Below are three examples which you can compare to the original analyses here, here and here, respectively.

TUR_Arslantepe_LC_Maykop
RUS_Maykop_Novosvobodnaya 0.281±0.042
TUR_Arslantepe_LC 0.719±0.042
chisq 10.923
tail prob 0.449752
Full output

TUR_Barcin_C
RUS_Vonyuchka_En 0.137±0.031
TUR_Buyukkaya_EC 0.863±0.031
chisq 15.074
tail prob 0.0889099
Full output

UKR_N_admixed
RUS_Progress_En 0.083±0.020
UKR_N 0.917±0.020
chisq 6.825
tail prob 0.65538
Full output

As far as I can tell, they're very similar to the original runs, which is a relief, because it means that the conclusions in my blog posts still make sense.
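For a rough sense of how tight these estimates are, the coefficients and standard errors listed above can be turned into z-scores and approximate 95% intervals. This is only a sketch. It assumes the estimates are roughly normally distributed, which is an assumption added here and not something taken from the qpAdm output, and the names used in the code are made up for illustration.

```python
# Admixture coefficients and standard errors copied from the three models above.
models = {
    "TUR_Arslantepe_LC_Maykop": [
        ("RUS_Maykop_Novosvobodnaya", 0.281, 0.042),
        ("TUR_Arslantepe_LC", 0.719, 0.042),
    ],
    "TUR_Barcin_C": [
        ("RUS_Vonyuchka_En", 0.137, 0.031),
        ("TUR_Buyukkaya_EC", 0.863, 0.031),
    ],
    "UKR_N_admixed": [
        ("RUS_Progress_En", 0.083, 0.020),
        ("UKR_N", 0.917, 0.020),
    ],
}


def summarize(estimate, std_err, z_crit=1.96):
    """Return (z-score, lower, upper) under a normal approximation."""
    z = estimate / std_err
    half_width = z_crit * std_err
    return z, estimate - half_width, estimate + half_width


for target, sources in models.items():
    print(target)
    for source, est, se in sources:
        z, lower, upper = summarize(est, se)
        print(f"  {source:28s} z={z:5.2f}  95% CI ~ [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions, the minor source in each of the three models sits more than four standard errors away from zero.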

Saturday, March 2, 2019

Maykop: a multi-ethnic layer cake?

Let's speculate about the linguistic affinities of the currently available ancient populations from the Caucasus and surrounds. I put together a series of outgroup f3-stats to help things along. They're available for download here.

Maykop
Georgian 0.258224
Abkhasian 0.257899
Latvian 0.257376
Swedish 0.257301
Turkish_Trabzon 0.256996
Basque_Spanish 0.256589
Chechen 0.256514
Icelandic 0.256418
Norwegian 0.256325
Lezgin 0.256272
Irish 0.256227
Tabasaran 0.256092
Italian_Bergamo 0.25605
English_Cornwall 0.256032
Polish_East 0.255991
Scottish 0.255955
Adygei 0.255913

Steppe_Maykop
Latvian 0.261845
Russian_North 0.26145
Estonian 0.260355
Finnish 0.260211
Lithuanian 0.260072
Udmurd 0.259804
Ingrian 0.259663
Surui 0.259637
Vepsa 0.259608
Karelian 0.259532
Karitiana 0.259482
Russian_West 0.259397
Russian_Central 0.259274
Wichi 0.259106
Saami 0.258982
Komi 0.258945
Icelandic 0.258854
Swedish 0.258814
Mordovian 0.258604
Irish 0.25859

Eyeballing the stats might be enough to get a general impression about what they mean, but to understand them properly it's necessary to get technical with something like PAST3 (see here). That's because f3-stats pick up shared genetic drift from all drift paths, and don't especially focus on more recently shared ancestry. This can often lead to confusing outcomes.
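To make the point about shared drift concrete, here's a minimal sketch of an outgroup f3-statistic, f3(Outgroup; A, B), computed as the average of (o - a) * (o - b) over SNPs, where o, a and b are allele frequencies. This is the naive textbook estimator, not the ADMIXTOOLS implementation (which also applies a sample size correction and block jackknife standard errors). The allele frequencies below are toy numbers invented purely for illustration.

```python
def naive_f3(outgroup, pop_a, pop_b):
    """Average (o - a) * (o - b) over SNPs, skipping SNPs with missing data."""
    terms = [
        (o - a) * (o - b)
        for o, a, b in zip(outgroup, pop_a, pop_b)
        if None not in (o, a, b)
    ]
    if not terms:
        raise ValueError("no SNPs with data in all three populations")
    return sum(terms) / len(terms)


# Toy allele frequencies (hypothetical, not real data).
outgroup = [0.0, 1.0, 0.0, 1.0]
pop_a = [0.5, 0.5, 1.0, 0.0]
pop_b = [0.5, 0.5, 0.0, 1.0]

print(naive_f3(outgroup, pop_a, pop_b))  # drift shared by A and B
print(naive_f3(outgroup, pop_a, pop_a))  # A with itself: all of A's drift
```

The statistic adds up every SNP where A and B both differ from the outgroup in the same direction, regardless of when that drift happened, which is why old and recent shared ancestry end up mixed together in a single number.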

Below are a few examples of linear models based on my f3-stats. Note that many Indo-European speakers, especially from Northern Europe, are foremost attracted to ancient samples from the Pontic-Caspian steppe. On the other hand, non-Indo-European speakers, from such far flung locations as the Caucasus and Iberia, show relatively stronger affinity to ancient samples from Anatolia and the Caucasus. Moreover, Uralic speakers show elevated affinity to ancient hunter-gatherer samples from Eastern Europe and Siberia. Makes sense, right?
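The general idea behind such a linear model can be sketched with ordinary least squares: regress one set of f3-stats on another and look at which populations sit above or below the line. The code below is not the PAST3 procedure itself, just a plain least squares fit, and it uses only the four populations that happen to appear in both the Maykop and Steppe_Maykop lists above. Because those lists only show the top-scoring populations, this is a biased and very small subset, so treat the output as an illustration of the mechanics only.

```python
# f3 values copied from the Maykop and Steppe_Maykop lists above,
# as (Maykop, Steppe_Maykop) pairs for populations present in both lists.
shared = {
    "Latvian": (0.257376, 0.261845),
    "Swedish": (0.257301, 0.258814),
    "Icelandic": (0.256418, 0.258854),
    "Irish": (0.256227, 0.25859),
}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [pair[0] for pair in shared.values()]
ys = [pair[1] for pair in shared.values()]
slope, intercept = fit_line(xs, ys)

# Positive residual: more drift shared with Steppe_Maykop than the
# fitted line predicts from the population's Maykop value.
residuals = {
    pop: y - (intercept + slope * x) for pop, (x, y) in shared.items()
}

for pop, res in sorted(residuals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pop:10s} {res:+.6f}")
```

With a full set of populations, the ones furthest above or below the fitted line are the ones that are "attracted" to one ancient group relative to the other.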

Based on these and other data, I'd say that Maykop and the culturally related Steppe Maykop were something of a multi-ethnic polity, with many near and far related languages spoken by its people, including perhaps Kartvelian, Northwest Caucasian, Yeniseian and Indo-European. But it seems to me that Proto-Indo-European was spoken by steppe foragers turned pastoralists just outside of the Maykop zone. And I'm quite sure that after the Maykop collapse various early Indo-European groups pushed across the Caucasus and deep into the Near East. Just take a look at the f3-stats and linear model for Hajji_Firuz_BA to see what I mean.

See also...

An exceptional burial indeed, but not that of an Indo-European

The Steppe Maykop enigma

Late PIE ground zero now obvious; location of PIE homeland still uncertain, but...
