
Saturday, May 19, 2018

Global25 PAST-compatible datasheets


I'm planning to run regular workshops over the next few months on how to get the most out of Global25 data with various programs, and especially PAST (see here). So if you have Global25 coordinates, please stay tuned.

To that end, I've put together four color-coded, PAST-compatible Global25 datasheets with thousands of present-day and ancient samples, available at the links below:

Global_25_PCA.dat

Global_25_PCA_pop_averages.dat

Global_25_PCA_scaled.dat

Global_25_PCA_pop_averages_scaled.dat

PAST is an awesome little statistical program and simple to use. The manual is available here. To kick things off, here's a quick guide on how to run a Neighbor Joining tree on your Global25 coordinates:

- download the Global_25_PCA_pop_averages_scaled.dat from the last link above

- open the dat file with something a little more advanced than Windows notepad, like, say, TextPad (see here)

- stick your scaled coordinates at the bottom of the sheet, so that they look exactly like those of the other samples, except give yourself an original symbol, like, say, a black star

- open the edited dat file with PAST and choose all of the columns and rows by clicking the empty tab above the labels

- then, at the top, go to Multivariate > Clustering > Neighbor joining
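For readers who want to see roughly what the Neighbor joining step is doing with the coordinates, here is a minimal sketch in Python. It is not PAST's implementation: it assumes plain Euclidean distances, ignores branch lengths and rooting, and the labels and coordinates in the usage note are made up for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbor_joining(coords):
    """Return the tree topology as nested tuples (branch lengths omitted).

    coords: dict mapping a label (string) to a list of coordinates.
    """
    nodes = list(coords)
    dist = {frozenset((a, b)): euclidean(coords[a], coords[b])
            for a, b in combinations(nodes, 2)}

    def d(a, b):
        return dist[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        totals = {a: sum(d(a, b) for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises the standard NJ criterion Q.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d(*p) - totals[p[0]] - totals[p[1]])
        merged = (i, j)
        rest = [k for k in nodes if k != i and k != j]
        # Distance from the new internal node to every remaining node.
        for k in rest:
            dist[frozenset((merged, k))] = 0.5 * (d(i, k) + d(j, k) - d(i, j))
        nodes = rest + [merged]
    return tuple(nodes)
```

Calling `neighbor_joining({"A": [0, 0], "B": [0, 1], "C": [10, 0], "D": [10, 1]})` should put the two close pairs together as sister leaves. With real Global25 rows you would pass all 25 columns per label.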

After a few seconds you should see a nice, color-coded tree like the one below, except you'll also be on it, in black text. I'm very happy with these results, by the way. As far as I can see, all of the populations and individuals cluster exactly where they should.


Those of you who are already very proficient in using PAST, feel free to go nuts with these new datasheets and show us the results in the comments below. I'll try to put together a workshop for beginners within the next couple of weeks.

See also...

Global25 workshop 1: that classic West Eurasian plot

Global25 workshop 2: intra-European variation

Modeling genetic ancestry with Davidski: step by step

47 comments:

Davidski said...

@Open Genomes & Roy King

Mycenaeans cluster near Minoans on my tree, while Hajji_Firuz_ChL are on a different branch next to Seh_Gabi_ChL.

This is in line with all of my other analyses of these two pairs of ancient populations.

Roy King said...

@Davidski
Looks really good! We have some differences in method--NJ vs Ward's--your trees have a lot of chaining, but overall we see very similar results. You can also cluster individual samples rather than population means which furnishes nuanced structure. There is definitely a pair of Aegean and Anatolian BA clusters, adding support to the view of the region as pre-IE/pre-Greek Koine culture undergirding Greek and Anatolian adstrates/superstrates.

Shaikorth said...

Speaking of individual samples, I0244 should definitely be separated from the rest of Potapovka. If it's not, the average is noticeably more ANE and divergent from Steppe_EMBA than was likely typical.

Open Genomes said...

@David

I like that you're doing trees now.
I think one problem is that the populations are not really clusters. You can see that there are still outliers in each population relative to the other, for example, MA2203, who has substantial Balkan Neolithic compared to the other Anatolian Bronze Age individuals.

I think an even better approach would be to create trees of individuals. This way we could easily identify outliers (e.g. the two Dzharkutan women) and their associations. It's much more informative for our purposes than population averages. (Notice all the discussion and posts about individual samples, and what their results might mean historically.)

I also agree with Roy King that the Neighbor-Joining tree will cause outliers to "attract" other subclades and change the ordinary topology of the tree. Because this is refined data (Global25), not raw data (SNPs), a simpler algorithm like Ward's distance-squared would be more informative.

Ward's euclidean distance-squared is implemented in PAST, so you can give it a try with individuals instead of populations and run it against both your scaled and unscaled data. It's worthwhile comparing the scaled and unscaled trees. The unscaled trees are more sensitive to minor admixture components than the scaled trees, because the unscaled trees use all 25 components while the scaled trees basically just use the first 6, and severely downweight all the others.
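To make the point about downweighting concrete, here is a small sketch of how per-dimension weights change each component's share of a squared Euclidean distance. The weights used in the usage note are invented for illustration; the actual factors behind the scaled Global25 sheets aren't given in this thread.

```python
def dimension_shares(a, b, weights=None):
    """Fraction of the squared Euclidean distance contributed by each dimension.

    weights: optional per-dimension multipliers (hypothetical stand-ins for
    whatever scaling is applied to the real data).
    """
    if weights is None:
        weights = [1.0] * len(a)
    parts = [(w * (x - y)) ** 2 for w, x, y in zip(weights, a, b)]
    total = sum(parts)
    if total == 0:
        # Identical samples: no dimension contributes anything.
        return [0.0] * len(parts)
    return [p / total for p in parts]
```

For two samples that differ by the same amount in four dimensions, `dimension_shares(a, b)` gives each dimension an equal share, while `dimension_shares(a, b, [3.0, 2.0, 1.0, 0.5])` hands most of the distance to the first two. That is the sense in which a minor component can matter in an unscaled tree and nearly vanish in a scaled one.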

Running the individual samples using Ward's distance-squared algorithm and doing it with both the scaled and unscaled data should be pretty easy to do. Try it and see what happens.

Open Genomes said...

@David

Do you update your Global25 eigenvectors with each new sample set?

Davidski said...

@Open Genomes

I generally use other methods based on raw SNP data to explore the relationships between the populations and individual samples.

But I find that this Neighbor joining tree based on scaled Global25 data produces essentially the same results.

I'm skeptical that any method that clusters, say, the Mycenaeans with Chalcolithic Iranians, is just as reliable, because I've never seen this happen using raw SNP data.

By the way, the samples are labeled and color coded by archaeological culture and geography, not genetic relationships, and yes, I update the Global25 datasheets whenever new samples come in. The links are always the same.

Open Genomes said...

@David

The question is, when you update your sheets, do you recalculate the values for existing samples? In other words, do you get different values for Global25 for existing samples when new ones are added? (That's what I meant by updated eigenvectors.)

Davidski said...

@Open Genomes

The question is, when you update your sheets, do you recalculate the values for existing samples?

No, the coordinates are always the same, and new samples are added to this framework.

This won't change unless I see a discrepancy between the Global25 and other types of analyses, like those based on formal stats.

At the moment, Global25 output is showing practically the same relationships and ancestry proportions (when modeled with nMonte) as my other analyses based on raw SNP data, so there's no need to change anything.

Open Genomes said...

@David

It seems that if the values (eigenvectors) aren't updated when new samples are added, then the analysis is based on the more limited sample set that was used when the initial Global25 calculator was created.

In the most recent batch of studies, new divergent populations have appeared that didn't exist before, for example Botai / Siberian_N and the various "outliers" that cluster with them, Okunevo, and the historical Sarmatian populations, to name a few. These don't really cluster with any modern samples, so it's not like these "fit" well on the existing tree. Of course, for samples like DA45, who clusters very closely with modern Han Chinese, a Global25 update isn't necessary, because the existing Han Chinese population was taken into account.

Leaving the same values based on a much less complete sample set loses information.

You might want to try recalculating the values for the 25 components now and see what happens.

Davidski said...

@Open Genomes

In the most recent batch of studies, new divergent populations have appeared that didn't exist before, for example Botai / Siberian_N and the various "outliers" that cluster with them, Okunevo, and the historical Sarmatian populations, to name a few.

But these new samples behave exactly as they should in the Global25. Have a look where they cluster in my NJ tree.

Leaving the same values based on a much less complete sample set loses information.

I check this whenever new samples come in and it's not a problem.

A much bigger problem is adding highly drifted populations, because they dominate many of the dimensions and screw up the analysis. Many, if not most, ancient populations are like that.

You might want to try recalculating the values for the 25 components now and see what happens.

I've seen what happens and there's no point changing something that already works very well.

Seinundzeit said...

PAST has a bunch of very cool functions.

Personally, I find K-means clustering to be pretty fun. For example, I used the Global_25_PCA_scaled.dat, and added some additional scaled Pashtun coordinates (a motley bunch from both Afghanistan and Pakistan). When trying to explain everyone with only 3 clusters (which obviously isn't enough), I was surprised to see an incredibly crisp/clean recapitulation of the African/West Eurasian/East Eurasian division.

Interestingly, both contemporary South Asia and contemporary Central Asia get divided between "West Eurasian" and "East Eurasian" clusters. All contemporary southern Central Asians (Tajiks, Pashtuns, the Kalasha, and the Kho) and contemporary northwestern South Asians/Upper Caste Indians (Kohistani, Gujar, Khatri, Sindhi, Brahmins, Kshatriya, etc) are bracketed under the "West Eurasian" cluster, while nearly all other contemporary populations in the two regions are bracketed under the "East Eurasian" cluster.

On a note of more parochial interest, when I try for 32 clusters I see the emergence of a Tajik cluster (all Tajikistanis, whether Pamiri, Yaghnobi, or Parsiwan, with the inclusion of Zafarshan_IA), a Balochistani cluster (Baloch, Brahui, and Makrani, with the inclusion of Iranian_Bandari, and one Sindhi), a Pashtun cluster (all Afghan Pashtuns and nearly all Pakistani Pashtuns. Although, a few of the Pakistani Pashtuns cluster instead with northwestern South Asians; included in the "Northwestern South Asian" cluster are 3 Yusufzai Pashtun, and 4 of the 10 samples used in David's Global_25 "Pashtun" average), a northwestern South Asian cluster (Kohistani, Gujar, Khatri, Sindhi, and a few Pakistani Pashtuns), a Burusho cluster, a Brahmin cluster, and much more. Very sensible stuff.

On the basis of these classifications, I'll produce new averages; we'll see if this has any effect on modelling.
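For anyone curious what K-means is doing under the hood, here is a bare-bones sketch. It is not PAST's routine: it assumes squared Euclidean distance and, to stay deterministic, seeds the clusters with the first k rows, whereas real implementations usually start from random points.

```python
def kmeans(points, k, iterations=100):
    """Plain K-means. Returns (labels, centres).

    points: list of coordinate lists; the first k rows seed the clusters.
    """
    centres = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres
```

The returned labels can be used to group rows before averaging them. Because the result depends on the starting centres, it's worth rerunning with the rows in a different order before trusting any particular split.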

Mike the Jedi said...

Nice, I look forward to this series. I'd especially like to know how to get PAST to churn out a West Eurasia PCA that looks similar to Dave's typical arrangement (with the Euro-Siberia HG cline on the left, West Asia on the right, Old Mediterranean/Neolithic cline along the bottom, etc.). I've fiddled with everything I can think of and can't get mine to look like that to save my life.

The Global 25 is shaping up very nicely. Looks like every modern meta-population in the world is represented at this point, with the exception of the Semang (Malaysian Negritos). They're significantly diverged from other SE Asians, so hopefully they'll make an appearance on the list in the near future for completeness' sake.

Open Genomes said...

@David

Here are three versions of the worldwide Global25 hierarchical clustering tree for individuals:

(You can open these in your web browser, or download the PDF files and open them in any recent PDF reader, but if you open them locally you will need to wait about two minutes for the PDF to render initially. Opening them locally allows you to search text. Either way, magnify the page/PDF to at least 100% and scroll all the way to the right and then down to where the tree begins.)

These worldwide trees all have 64 clusters, which seems just about optimal. Clusters are defined by clades that branch off at or above a certain tree height which produces the designated number of clusters. All members of a cluster have no more than a certain common maximum distance from each other, and clusters are not based on populations, even though they generally conform to them. However, the outlier individuals who cluster elsewhere are taken into account, regardless of the population label. The clusters are outlined in light grey boxes:

1. Ward's Euclidean distance-squared method using scaled data

2. Ward's Euclidean distance-squared method using unscaled data

3. Neighbor-joining Euclidean distance-squared using scaled data

In the neighbor-joining tree, ancient and divergent individuals don't form clusters in most cases, because the algorithm is weighted toward creating obvious low-level clusters first, and then attaching the others one by one. This isn't that useful because we can't see the clustering for all the samples, except in the broadest way.

For example, the Middle Late Bronze Age Hittite Era Anatolian MA2203, along with many other samples, appears as an outlier to a set of low-level clusters formed by closely-related East Indian tribals. This is the problem described when highly-drifted populations are added. Ward's Euclidean distance-squared method avoids this.

The two Ward's Euclidean distance-squared method trees, using scaled and unscaled data respectively, have differences:

For example, using the Middle Late Bronze Age Old Hittite Era Anatolian MA2203:

The relative positions of MA2203 using scaled and unscaled data

Here we can see that using scaled data, MA2203 clusters with Bronze Age Anatolians and a Cypriot, but using unscaled data MA2203 clusters with a Mycenaean, Balkans Iron Age, and the Germany Medieval outlier STR_300.

Using the Khvalynsk Eneolithic and EHG samples:

The relative positions of Khvalynsk Eneolithic and EHG using scaled and unscaled data

Here we can see that while Khvalynsk Eneolithic clusters with EHG and Latvia Middle Neolithic using scaled data, using unscaled data Khvalynsk clusters with MA1 and the Sintashta_o3 outliers in a larger clade along with Siberia Neolithic and Botai.

David, what do you think most closely conforms with what you're seeing with the SNP data? Ward's distance-squared method scaled or unscaled data?

Roy King said...

@Davidski
But I find that this Neighbor joining tree based on scaled Global25 data produces essentially the same results.

“I'm skeptical that any method that clusters, say, the Mycenaeans with Chalcolithic Iranians, is just as reliable, because I've never seen this happen using raw SNP data.”

I’m confused. I’ve never seen Ward’s clustering method on Global-25 data cluster Mycenaeans or Bronze Age Anatolians with Iran Chalcolithic samples!

Shaikorth said...

@Open Genomes

Could be that Sintashta_o3 is pulling Khvalynsk with it between the AG3 and EHG branches because o3's mostly EHG with some WSHG and at least one of the Khvalynsk samples might have WSHG too if scaled Global25 is accurate. In fig S3.22 of Narasimhan (f4(Mbuti.DG, Test, Anatolia_N, Han.DG) vs f4(Mbuti.DG, Test, West_Siberia_N, Ganj_Dareh_N)) Khvalynsk is on a cline between Sintashta_o3 and Yamnaya.

Khvalynsk_Eneolithic:I0433

Sidelkino:Sidelkino 77.20
CHG 13.35
West_Siberia_N 9.45
Botai:BOT2016 0.00
AfontovaGora3 0.00
Barcin_N 0.00

Davidski said...

@Open Genomes

David, what do you think most closely conforms with what you're seeing with the SNP data? Ward's distance-squared method scaled or unscaled data?

I don't know yet. It's something that needs to be worked out here.

In regards to the outliers, the obvious outliers from the homogeneous populations are marked with the _o suffix. Outliers from heterogeneous populations aren't singled out, because most individuals from such groups would be outliers, so there's not much point.

There are also plenty of subtle outliers in many of the homogeneous population sets, like Anatolia_MLBA MA2203. I'm aware of most of them, but I don't think they deserve any special attention. There are datasheets with individual samples, so people can create their own population sets and averages.
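Building your own averages from the individual datasheets is simple enough to script. Here is a minimal sketch, assuming each row has already been parsed into a sample ID, a group label and a list of coordinates; the IDs and values in the usage note are placeholders, not real Global25 rows.

```python
from collections import defaultdict


def population_averages(samples, exclude=()):
    """Average the coordinates of each group, skipping excluded sample IDs.

    samples: iterable of (sample_id, group, coords) tuples.
    exclude: sample IDs to leave out, e.g. individuals you consider outliers.
    """
    skipped = set(exclude)
    grouped = defaultdict(list)
    for sample_id, group, coords in samples:
        if sample_id not in skipped:
            grouped[group].append(coords)
    return {
        group: [sum(col) / len(members) for col in zip(*members)]
        for group, members in grouped.items()
    }
```

With rows such as `("ID1", "PopA", [0.0, 2.0])`, calling `population_averages(rows, exclude={"ID3"})` returns one averaged row per group with the listed individual left out, which is one way to test how much a single outlier moves an average.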

Shaikorth said...

The authors of Narasimhan et al. should have been slightly more precise when assigning their samples. For instance these two don't look like members of the same cluster:

"distance%=5.775 / distance=0.05775"

Sintashta_MLBA_o1:I0983

Sidelkino:Sidelkino 35.05
CHG 22.70
West_Siberia_N 21.40
AfontovaGora3 9.90
Ganj_Dareh_N 5.85
Barcin_N 5.10
ShamankaEN:DA249 0.00

"distance%=5.0196 / distance=0.050196"

Sintashta_MLBA_o1:I1007

West_Siberia_N 47.4
AfontovaGora3 35.1
CHG 9.7
Ganj_Dareh_N 7.8
Sidelkino:Sidelkino 0.0
Barcin_N 0.0
ShamankaEN:DA249 0.0

Aram said...

Arza , Supermord

I answered your questions here.

https://www.blogger.com/profile/05717857095182763668


---

Concerning Novosvobodnaya. It has differences from Maykop. The main difference is the increase of ANF/EEF. You can see it in the Admixture results.
The appearance of this G2a2a represents this increase. In this paper's Novosvobodnaya mitogenomes I didn't see typical EEF mitogenomes. But the other, older paper had a different set.

There is also the problem of the dolmens in the NW Caucasus. They look similar to European ones. Is it a coincidence, or did someone introduce them from Europe?
With current data I can't give a definitive answer to this question.

Aram said...

Oops. This is the good link.

https://eurogenes.blogspot.com/2018/05/on-genetic-prehistory-of-greater.html?showComment=1526807801478&m=0#c6525231009220856131

Matt said...

@Mike: I'd especially like to know how to get PAST to churn out a West Eurasia PCA that looks similar to Dave's typical arrangement (with the Euro-Siberia HG cline on the left, West Asia on the right

Well, dimension 3 v 9 basically has all the West Eurasian populations in the right relative positions. But there is the issue that Global 25 doesn't really seem to have anything like dimensions where world populations sit at zero and West Eurasian populations are distributed across the plot.

What I find easiest, if you want to use the Global 25 data but visualise purely the West Eurasian samples, is to remove all the samples you're not interested in (the non-West Eurasian samples) from the dataset, then run the remaining set through Past3's PCA function again. Once you rerun it, it'll look like a West Eurasian PCA. To do that quickly, Davidski has helpfully marked all the major regions by colour code, and for Central Asia, just sort by PC2 and remove samples that look too far down the cline.
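The subset-then-rerun idea can be sketched outside PAST as well. The code below is a rough illustration only: it assumes each sample has been tagged with a region label (the label names are hypothetical) and it finds the leading components by power iteration on the covariance matrix, which is fine for a demo but not what a statistics package would normally use.

```python
import math


def top_components(rows, n_components=2, iterations=200):
    """Return (scores, axes) for the leading principal components."""
    count, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / count for d in range(dims)]
    centred = [[r[d] - means[d] for d in range(dims)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / (count - 1)
            for b in range(dims)] for a in range(dims)]
    axes = []
    for _component in range(n_components):
        # Uneven start vector, so it is unlikely to be orthogonal to the answer.
        start = [1.0 / (d + 1) for d in range(dims)]
        length = math.sqrt(sum(x * x for x in start))
        v = [x / length for x in start]
        for _step in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(dims)) for a in range(dims)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(dims) for b in range(dims))
        axes.append(v)
        # Deflate: remove the variance explained by this axis.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(dims)]
               for a in range(dims)]
    scores = [[sum(r[d] * axis[d] for d in range(dims)) for axis in axes]
              for r in centred]
    return scores, axes


def subset_pca(samples, keep_regions, n_components=2):
    """Keep only samples from the chosen regions, then recompute the PCA.

    samples: list of (label, region, coords); region names are assumptions.
    """
    kept = [(label, coords) for label, region, coords in samples
            if region in keep_regions]
    scores, _axes = top_components([coords for _label, coords in kept],
                                   n_components)
    return [(label, score) for (label, _coords), score in zip(kept, scores)]
```

The important step is the filtering: because the components are recomputed from the retained rows only, the axes now describe variation within that subset rather than worldwide variation.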

The Global 25 is shaping up very nicely.

It really still looks pretty good, considering that even the Iberomaurusians, who are very distant from everyone else, don't really seem to suffer from compression. At some point an update as OpenGenomes would suggest is probably a good idea, but I think maybe not worthwhile until there's a really big stock of new aDNA and it looks quiet, as if not much new will come out for a while.

A little better is still probably the PCAs based on Martiniano's imputed diploid dataset. The "What You See Is What You Get" PCA - http://eurogenes.blogspot.co.uk/2018/01/genetic-maps-featuring-67-ancient.html. I think they better characterise distance and have more signs of continuity that make sense (e.g. the Lezgins to Yamnaya signal, which seems logical and intuitive, is better preserved, etc.) and that makes them more useful at finer scales. But G25 is more informative about the whole picture until such time as lots more ancient populations have decent coverage or are put through the same imputation.

Rob said...

@ Aram
The local expert - Trifonov - highlights the Caucasian dolmens are distinct
Also there is a massive chronological gap between European (4000s) and NW Caucasian

Matt said...

@All, one thing I'd add is that, in the individual datasheet, I find it really useful to add a Group column: https://imgur.com/a/Y9B4qFW

This allows you to:

a) process the data back through group differences functions which maximize between-group variation (and so should make group differences more prominent in the visualisation, at the cost of intra-individual differences within population). For instance, you can reprocess data back through group-based PCA and get slightly more between-group separation.

b) when visualising the data, you can use group labels and convex hulls to better understand the shape of the data and pop averages without having to compute an average separately.

...

One other thing, in case anyone is interested, I generated a simulated Deeply Diverged SE Asian, using a regression equation and the Man Bac samples, taking advantage of the proportion estimates in the paper where VN29 (sample labelled Man Bac here) is about 50% Deeply Diverged and the others are about 27-30%: https://pastebin.com/tXNn4cVi

(I've just used the Man Bac samples here. They give estimates for Deeply Diverged in all the samples, but one problem I did find is that the position on PC2 is hard to place depending on what you use, as VN29 is more displaced towards the East Eurasian end than the other Man Bac, but less so than Nui Nap, with mixed signal for Oakaie. I think using the other samples probably confounds Austroasiatic vs other East Asian ancestry with Deeply Diverged signal, so I haven't done it, though this does mean fewer samples to work with and so more noise in other ways.)

Anyway, really strong signal for this ghost in PC11, which splits Nganasan and Man Bac VN29 at one end from Naxi, Han_NChina, Korean at the other: https://imgur.com/a/nq06PuN

One of the things it seems to me about this, by the way, is that the Austroasiatic India populations probably have enriched Deeply Diverged relative to a simple mix of Man_Bac_All (27% Deeply Diverged) and an ASI representative like the Paniya.

It also seems really difficult to calculate whether their East Eurasian ancestry is all Man_Bac like because of this issue; it's not clear whether nMonte involving the India Austroasiatic SE Asian ancients prefer Man_Bac due to matching in the Austroasiatic farmer signal, or the Deeply Diverged signal, and this seems important to try and understand how populations were interacting in NW mainland SE Asia prior to Austroasiatic groups going to India.

Hieu Phamnhu said...

Can anyone show me the PCA map for Southeast Asian populations based on the above data?
I'm really curious to see it.

Open Genomes said...

@David

A few of the rows in your PAST tables have a trailing comma. These have to be removed manually or some programs treat them as an extra field. Can you fix that?
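Until the sheets are fixed, the stray commas can be removed with a few lines of Python rather than by hand. This sketch assumes the datasheet is plain comma-separated text with one sample per line; the file paths passed in are placeholders.

```python
from pathlib import Path


def strip_trailing_comma(line):
    """Drop the line ending and one trailing comma, if present."""
    text = line.rstrip("\r\n")
    return text[:-1] if text.endswith(",") else text


def clean_datasheet(source, destination):
    """Write a copy of the datasheet without trailing commas."""
    lines = Path(source).read_text().splitlines()
    cleaned = [strip_trailing_comma(line) for line in lines]
    Path(destination).write_text("\n".join(cleaned) + "\n")
```

Only a single trailing comma is removed per row, so a row that legitimately ends in an empty field followed by another comma would keep that field.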

Estess said...

@Lukasz M

I like the correlation function in PAST. It shows relationships in a simple and definite way.

Relationship of Poprad (scaled averages). First 100 values.


How did you obtain those relationships? Can't figure out which feature on PAST.

Thanks

Matt said...

To stick my oar in regarding Neighbour Joining vs Classical Hierarchical Trees with Ward's method, compare the position of ancient European HG in the two trees here (raw Euclidean distance): https://imgur.com/a/NA7vWJc

In the first NJ tree, they join with their nearest neighbour in the dataset, Ukraine_Eneolithic, which joins with CWC_Baltic and which joins with present day NE Europeans. Outliers generally get long branches from their nearest neighbour, rather than being on very diverged clades within the overall tree.

In the second tree, they form an outlying clade within the North European part of the tree, with no obvious preference among Bronze Age Europeans, or modern North Europeans.

You can argue whether this sort of chain effect is likely to lead to false inferences; if A is more related to B than D is, and B is most related to C, then A will end up closer on the tree to C despite A possibly not being closer to C than another population D.

But if we want to understand which are the closest populations at a glance, even if they are not very close at all, it seems like the NJ tree is probably superior, and not inferior at detecting outliers.

(Ultimately, admixture means that no tree without admixture edges can ever be a perfect representation of population history anyway, and even those with edges will have limits.)

Btw, Davidski, I believe the Dungan sample(s) have gone missing (not too important really, but they may be useful / important to some people out there).

Arza said...

Re: West Eurasian PCA inside G25

PC4/5/2 probably works best (or 2D PC4/5 with a subset of samples)

Re: NJ vs. Ward
It looks like NJ tree picked up nicely some clines (Baltic_BA-Ukrainian or a Balkan one).

Re: missing samples
I would add also Nenets, who in G10 were quite important in models involving Uralic populations.

ryukendo kendow said...

@ David

I didn't know that you did not run the entire PCA anew with new samples each time. Wouldn't this give us less opportunity to explore dimensions more strongly differentiating ancients than moderns, e.g. Matt's Danubian vs Cardial and presumably the new Steppe MLBA-IA vs EMBA dims (as the decorrelation is done on signals from moderns and then the newer ancients we get which you just add are forced onto old eigenvectors instead of the other way, especially now that floods of new aDNA are coming and all these may expose new dims).

Rob said...

Another level of analysis is always good, but these clusters obscure fine print detail because they match groups within broadly similar genetic profiles. Just like back in the old days of ‘classic markers’ Croats and French eg sometimes clustered together despite having little recent shared history

Davidski said...

@rk

I check how the Global25 performs against new data with PCA and formal stats to make sure that it's not missing anything, and so far it's not missing anything.

The problem with adding new samples, especially ancient samples, to the framework is that many of them aren't suitable because they're too drifted.

In other words, imagine the Global25 based on 100 populations that behave like the Kalasha people. Let me assure you that you wouldn't be able to get too many sensible models that match models based on formal statistics.

Davidski said...

@Rob

Another level of analysis is always good, but these clusters obscure fine print detail because they match groups within broadly similar genetic profiles. Just like back in the old days of ‘classic markers’ Croats and French eg sometimes clustered together despite having little recent shared history.

There's no single magic bullet. In other words, to get a complete picture it's necessary to run a variety of analyses, but I find that starting off with a nice, simple, accurate NJ tree is always useful.

Davidski said...

@All

As per Matt's suggestion, I've added "group" columns to the datasheets with individual samples, so feel free to download them again and try out this feature.

Davidski said...

@Matt

a) process the data back through group differences functions which maximize between-group variation (and so should make group differences more prominent in the visualisation, at the cost of intra-individual differences within population). For instance, you can reprocess data back through group-based PCA and get slightly more between-group separation.

I haven't actually tried this yet. Can you describe briefly how you're going about this?

Open Genomes said...

@David

What happened to BR2, Otzi, Stuttgart LBK, and Saqqaq?

Also, I agree that the big remaining missing piece here are the Orang Asli such as the Jehai (Jahai) and Senoi. These are the people closest to the Hoabinhians, the Mesolithic aboriginal people of Southeast Asia, formerly spread throughout the region.

Unravelling the Genetic History of Negritos and Indigenous Populations of Southeast Asia

"The SNP genotype data (devoid of any personal identification and anonymized) used in the population analyses will be made freely available on request to the corresponding author."

Can you please request the data and add them to Global25?

Davidski said...

@Open Genomes

BR2 should be in there, but under a different ID. Stuttgart might also be there under a different ID, but if not there are plenty of LBK samples in the datasheets exactly like that. Oetzi was too damaged to bother with. Not sure about Saqqaq. Might have to look into that.

If you want those Negrito samples in the Global25, then just request them, and mail them over.

Samuel Andrews said...

This is a super fun tool that can give us answers we would normally have to do lots of nMonte runs to get. I can also use this for mtDNA.

Ryan said...

What's the easiest way to convert a .dat file to excel in general?

Angriff Bernhard said...

You should be able to just import the .dat file into Excel directly.

Lukasz M said...

@ Davidski said...

@Matt

a) process the data back through group differences functions which maximize between-group variation (and so should make group differences more prominent in the visualisation, at the cost of intra-individual differences within population). For instance, you can reprocess data back through group-based PCA and get slightly more between-group separation.

I haven't actually tried this yet. Can you describe briefly how you're going about this?

--------------------------------------

When you have a group column added, you can change some options in the PCA Summary panel: Groups > Between-group or Within-group. Normally this option is inactive.

Open Genomes said...

Here's an updated version of the Global25 tree showing 64 clusters:

Global25 Ward's Euclidean distance-squared method hierarchical clustering tree with all 4222 individuals using scaled data

Zoom in to 100% and then scroll all the way to the right and then down to where the tree begins. You can download the PDF and open it locally, but it may take about two minutes to display. Once it displays, in your local PDF reader you can search for text in the tree.

The clusters are defined by cutting the tree at the height that produces 64 clusters, so they are all evenly defined, regardless of number of individuals in each. These clusters could be used instead of "populations" in nMonte, and this should give better results because unlike "populations", they don't have outliers outside of the cluster. The population averages would be specific to that one clade in the tree.
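A rough sketch of the idea, for readers who want to experiment: the code below runs a naive Ward-style agglomeration and simply stops once the requested number of clusters remains, which should correspond to cutting the finished tree at the height that leaves that many clades. It is an illustration under stated assumptions, not the procedure used for the PDFs above, and it is far too slow for thousands of individuals.

```python
def ward_clusters(points, k):
    """Merge clusters by Ward's criterion until k remain.

    points: list of coordinate lists. Returns lists of row indices.
    """
    # Each cluster is (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * gap
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _cost, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        size = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / size for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _centroid in clusters]
```

Each returned list holds row indices, so averaging the rows in a list gives a cluster average that could stand in for a population average in later modelling.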

Slumbery said...

Open Genomes:

Why are Bashkirs labelled as a Caucasus population? They are nowhere near. That should be Urals by the labeling-logic applied elsewhere.

Lukasz M said...

@Davidski
What similarity index do you use in the NJ tree? Every one gives more or less different results. Have you tried them all? I think that besides Euclidean, the best are Manhattan, Cosine and Correlation.

Davidski said...

@Lukasz M

What similarity index do you use in the NJ tree?

In the tree above, the default one, so Euclidean. But I do also use Manhattan.
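The choice of index matters because the two measures can rank neighbours differently. Here is a small sketch comparing them; the reference names and coordinates in the usage note are invented purely to show the effect.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(target, references, metric, top=3):
    """Names of the closest reference rows under the given metric.

    references: dict mapping a name to a coordinate list.
    """
    ranked = sorted(references, key=lambda name: metric(target, references[name]))
    return ranked[:top]
```

With `references = {"P": [3, 4], "Q": [0, 6]}` and a target at the origin, Euclidean distance ranks P first (5 versus 6) while Manhattan ranks Q first (6 versus 7), so it's worth checking whether a tree's key joins survive a change of index.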

Open Genomes said...

@David

The following are duplicate sample IDs, and some are clearly actual duplicates.

Can you fix these to be unique, and also remove the duplicates?

Id, Group
_________

495_R01C01,Mordovian
495_R01C01,Uzbek
495_R01C02,Mordovian
495_R01C02,Uzbek
495_R02C01,Mordovian
495_R02C01,Uzbek
495_R02C02,Mordovian
495_R02C02,Uzbek
ARI11,Ethiopian_Ari_blacksmith
ARI11,Ethiopian_Ari_cultivator
ARI5,Ethiopian_Ari_blacksmith
ARI5,Ethiopian_Ari_cultivator
ARI6,Ethiopian_Ari_blacksmith
ARI6,Ethiopian_Ari_cultivator
ARI7,Ethiopian_Ari_blacksmith
ARI7,Ethiopian_Ari_cultivator
G25001,Greek_Central_Anatolia
G25001,Greek_Trabzon
G25002,Greek_Central_Anatolia
G25002,Greek_Trabzon
G25003,Greek_Central_Anatolia
G25003,Greek_Trabzon
HG00171,Finnish
HG00171,Finnish_East
K-126,Khatri
K-126,Kohistani
azerB38,Azeri
azerB38,Azeri_Iran
azerB59,Azeri
azerB59,Azeri_Iran
azerB61,Azeri
azerB61,Azeri_Iran
azerB64,Azeri
azerB64,Azeri_Iran
azerB8,Azeri
azerB8,Azeri_Iran
azerE1,Azeri
azerE1,Azeri_Iran
azerE3,Azeri
azerE3,Azeri_Iran
azerE6,Azeri
azerE6,Azeri_Iran
azerE70,Azeri
azerE70,Azeri_Iran
azerb72,Azeri
azerb72,Azeri_Iran
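A check like this is easy to automate. Here is a minimal sketch, assuming the datasheet has been reduced to (ID, group) pairs; it only flags repeated IDs and can't tell a genuine duplicate sample from two different people who happen to share an ID.

```python
from collections import defaultdict


def duplicate_ids(rows):
    """Map each repeated sample ID to the groups it appears under.

    rows: iterable of (sample_id, group) pairs.
    """
    seen = defaultdict(list)
    for sample_id, group in rows:
        seen[sample_id].append(group)
    return {sid: groups for sid, groups in seen.items() if len(groups) > 1}
```

Comparing the coordinates of the flagged rows would be the next step for separating true duplicates from ID clashes.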

Onur Dincer said...

@Open Genomes

G25001,Greek_Central_Anatolia
G25001,Greek_Trabzon
G25002,Greek_Central_Anatolia
G25002,Greek_Trabzon
G25003,Greek_Central_Anatolia
G25003,Greek_Trabzon


I supplied these samples to David, they are all different individuals, I know them individually. So at least these are not actual duplicates. Fixing their IDs would be enough.

Open Genomes said...

@Slumbery

The mistake with the Bashkirs is fixed; they're labeled as Urals now.

@Onur

Do you know that these Pontic and Central Anatolian Greeks cluster rather closely with Pre-Hittites, presumably Hattic speakers?

Onur Dincer said...

@Open Genomes

Do you know that these Pontic and Central Anatolian Greeks cluster rather closely with Pre-Hittites, presumably Hattic speakers?

Can you provide the GEDmatch numbers of the post-Neolithic ancient Anatolian samples from the Damgaard et al. study? PCAs and dendrograms are strongly affected by sampling choice, so a GEDmatch comparison would better resolve the population relationships and distances.