Eurogenes Blog: genetic genealogy


Tuesday, November 5, 2019

Modeling your ancestry has never been easier

An exceedingly simple, yet feature-packed, online tool ideal for modeling ancestry with Global25 coordinates is freely available HERE. It works offline too, after downloading the web page onto your computer. Just copy and paste the coordinates of your choice under the "source" and "target" tabs, and then mess around with the buttons to see what happens. The screen caps below show me doing just that.

Another free, easy to use online tool that works with Global25 coordinates is the Principal Component Analysis (PCA) runner HERE. Below is a screen cap of me checking out one of the many PCAs that it offers.

See also...

Getting the most out of the Global25

Friday, July 12, 2019

Getting the most out of the Global25

The first thing you need to know about the Global25 is that I update the relevant datasheets regularly, usually every few weeks, but they're always at these links:

Global25 datasheet ancient scaled

Global25 pop averages ancient scaled

Global25 datasheet ancient

Global25 pop averages ancient

...

Global25 datasheet modern scaled

Global25 pop averages modern scaled

Global25 datasheet modern

Global25 pop averages modern

Global25 data for samples from a variety of papers that have been published recently will eventually be incorporated into the main datasheets linked above, but the process might take several weeks or even months. In the meantime, feel free to use the temporary datasheets below. Thanks for your patience.

Allentoft 2023

Chylenski 2023

Jeong 2024

Koptekin 2022

Olalde 2023

Peltola 2022

Penske 2023

Posth 2023

Sirak 2024

Skourtanioti 2023

Stolarek 2023

Varela 2023

Wang 2023

Yu 2023

Each sample has a population code and an individual code. The population codes represent the countries, ethnic groups and/or archeological affinities of the samples, and I often modify these codes to suit my needs. On the other hand, the individual codes are unique to most of the samples and I usually don't change them.

So if you'd like to know more details about the samples, try searching for their individual codes via a decent online search engine. Basic information about many of the samples is also available in the "anno" files here.
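As a rough illustration of how the two codes can be pulled apart in practice, here's a minimal Python sketch. It assumes each datasheet row looks like `Population:Individual,PC1,...`, with a colon between the population and individual codes. That layout, the function name `parse_g25_row` and the sample row are assumptions made up for this example, so check them against the actual datasheets before relying on them.

```python
def parse_g25_row(row):
    """Split one comma-separated row into (population, individual, coords)."""
    label, *values = row.strip().split(",")
    # Assumed label layout: "Population:Individual". Population averages
    # carry no individual code, so that part falls back to an empty string.
    population, _, individual = label.partition(":")
    coords = [float(v) for v in values]
    return population, individual, coords


# Hypothetical row with made-up values, truncated to three PCs for brevity.
pop, ind, coords = parse_g25_row("Example_Pop:sample001,0.125,-0.031,0.004")
print(pop, ind, coords)
```

The individual code recovered this way is the part worth pasting into a search engine, since the population code may have been modified.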

The main purpose of the Global25 is to provide data for mixture modeling. In other words, for estimating ancestry proportions, both ancient and modern (see here). This can be done on your computer with the R program and the nMonte R script, or online with a couple of different tools, which I discuss below.

If you don't have R installed on your computer, you can get it here, while nMonte is available here. For this tutorial please download nMonte and nMonte3, and store them in your main working folder (usually My Documents).

Once you have R set up, make sure its working directory is the same place where you stored nMonte. You can check this in R by clicking on "File" and then "Change dir". Additionally, you'll need two nMonte input files in the working directory titled "data" and "target". Examples of these files are available here. We'll be using them to test the ancient ancestry proportions of a sample set from present-day England.

Before you can begin the analysis, you first need to call the nMonte script by typing or pasting source('nMonte.R') into the R console window, and then hitting "enter" on your keyboard. This is what you should see in the R console window afterwards.

To start the mixture modeling process, type or paste getMonte('data.txt', 'target.txt') into the R console window, hit "enter", and wait for the results. After a short time, probably no more than a minute or two, you should see this output.

The data and target files contain population averages. And, as you can see, the results that these population averages have produced are in line with what one would expect from such a model focusing on the genetic shifts in Northern Europe during the Late Neolithic. Very similar ancient ancestry proportions have been reported for the English and other Northern Europeans recently in scientific literature.
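To give a feel for what this kind of mixture modeling involves, here's a small Python sketch. It is not nMonte's code. It's a generic random search that assumes a model is a weighted average of source coordinates and that the best model is the one with the smallest Euclidean distance to the target. The populations A, B and C, their two-dimensional coordinates, and all function and parameter names are made up for illustration.

```python
import math
import random


def mix(sources, weights):
    """Weighted average of source coordinates (weights sum to 1)."""
    dims = len(next(iter(sources.values())))
    return [sum(weights[name] * sources[name][d] for name in sources)
            for d in range(dims)]


def distance(a, b):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_mixture(sources, target, slots=200, steps=5000, seed=1):
    """Random search: shift one slot at a time and keep moves that help."""
    rng = random.Random(seed)
    names = list(sources)
    # Start from an even split of the slots across all sources.
    counts = {name: slots // len(names) for name in names}
    counts[names[0]] += slots - sum(counts.values())

    def to_weights():
        return {name: counts[name] / slots for name in names}

    best = distance(mix(sources, to_weights()), target)
    for _ in range(steps):
        giver, taker = rng.sample(names, 2)
        if counts[giver] == 0:
            continue
        counts[giver] -= 1
        counts[taker] += 1
        trial = distance(mix(sources, to_weights()), target)
        if trial < best:
            best = trial
        else:  # the move didn't help, so undo it
            counts[giver] += 1
            counts[taker] -= 1
    return to_weights(), best


# Hypothetical sources in two dimensions; the target is 50% A, 30% B, 20% C.
sources = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0]}
target = [0.3, 0.2]
weights, dist = fit_mixture(sources, target)
print(weights, dist)
```

In this toy case the search should land on or very near the 50/30/20 split that was used to build the target. Real Global25 models have 25 dimensions and noisier data, so the fit won't be anywhere near as clean.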

However, when focusing on exceptionally fine-scale genetic variation that isn't reflected too well in the Global25 population averages, a more effective strategy might be to use multiple individuals from each reference population and let nMonte3 aggregate and average the inferred ancestry proportions.

This is often the case when attempting to model ancestry proportions for more recent periods, such as the Middle Ages. So let's try this with the English sample set using a modified data file, which is available here.

Replace the old data file with the new one in your working directory, and, like before, paste the following two commands into the R console window, hitting "enter" after each one: source('nMonte3.R') and getMonte('data.txt', 'target.txt'). This is what you should eventually see.

It's difficult to say how accurate these estimates are. But they look more or less correct considering the limited and less than ideal reference samples. For instance, the individuals labeled SWE_Viking_Age_Sigtuna are supposed to be stand ins for Danish and Norwegian Vikings, but they're a relatively heterogeneous group from Sweden, possibly with some British or Irish ancestry, so they might be skewing the results.

However, I'll be adding many more ancient samples to the Global25 datasheets as they become available, including lots of new Vikings, which should greatly improve the accuracy of these sorts of fine-scale mixture models.
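The idea of fitting with individuals and then reporting results per population can be sketched in a few lines of Python. This isn't taken from nMonte3. It simply assumes that the aggregation amounts to summing each individual's share under its population code, and that labels follow a `Population:Individual` pattern. The labels and percentages below are made up.

```python
from collections import defaultdict


def aggregate_by_population(individual_props):
    """Sum per-individual percentages under their population codes."""
    totals = defaultdict(float)
    for label, pct in individual_props.items():
        population = label.split(":", 1)[0]
        totals[population] += pct
    # Largest contributions first, as in a typical results table.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up proportions for hypothetical individuals, purely for illustration.
fit = {
    "Pop_A:ind1": 30.0,
    "Pop_A:ind2": 25.0,
    "Pop_B:ind1": 40.0,
    "Pop_B:ind2": 5.0,
}
result = aggregate_by_population(fit)
print(result)
```

The point of working this way is that the algorithm gets to pick whichever individuals fit best, while the output stays readable at the population level.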

An exceedingly simple, yet feature-packed, online tool ideal for modeling ancestry with Global25 coordinates is VahaduoJS. It's freely available HERE, and it also works offline after downloading the web page. Just copy and paste the coordinates of your choice under the "source" and "target" tabs, and then mess around with the buttons to see what happens. The screen caps below show me doing just that.

However, it's important to note that the Global25 is a Principal Component Analysis (PCA), so it makes good sense to also use it for producing PCA graphs. To do this just plot any combination of two or three of its Principal Components (PCs) to create 2D or 3D graphs, respectively. This can be done with a wide variety of programs, including PAST, which is freely available here.

To produce a 2D graph, open a Global25 datasheet in PAST, choose comma as the separator, highlight any two columns of data, click on the "Plot" tab and, from the drop down list, pick "XY graph". Below is a series of graphs that I created in exactly this way. I also color coded the samples according to their geographic origins. This was done by ticking the "Row attributes" tab.
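If you'd rather script this step than click through it, the column-picking part might look like the Python sketch below. It assumes a comma-separated sheet with the sample label in the first column and one PC per column after that, plus a header row that can be skipped. The function name `pc_pairs`, its parameters and the tiny sheet are hypothetical.

```python
import csv
import io


def pc_pairs(csv_text, pc_x=1, pc_y=2, has_header=True):
    """Return {label: (x, y)} for the two chosen PCs (1-based numbers)."""
    rows = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(rows, None)  # skip the column names
    points = {}
    for row in rows:
        if not row:
            continue
        # Column 0 is the sample label, so PC n sits in column n.
        points[row[0]] = (float(row[pc_x]), float(row[pc_y]))
    return points


# Tiny made-up sheet with three PCs; the real sheets have 25.
sheet = """label,PC1,PC2,PC3
Pop_A:ind1,0.10,0.02,-0.01
Pop_B:ind1,-0.05,0.07,0.03
"""
points = pc_pairs(sheet, pc_x=1, pc_y=3)
print(points)
```

The resulting pairs can be handed to any plotting library to draw the XY graph.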

PAST can also be used to run PCA on subsets of the Global25 scaled data to produce remarkably accurate plots of fine-scale population structure. For instance, here's a plot based on present-day populations from north of the Alps, Balkans and Pyrenees.

To try this, create a new text file with your choice of populations from the Global25 scaled datasheet, open it with PAST and choose Multivariate > Ordination > Principal Components Analysis. I've already put together several datasheets limited to European, Northern European, West Eurasian and South Asian populations. They're available at the links below along with more details on how to run them with PAST.

Global25 workshop 1: that classic West Eurasian plot

Global25 workshop 2: intra-European variation

Global25 workshop 3: genes vs geography in Northern Europe

The South Asian cline that no longer exists
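For anyone curious about what a PCA on a subset actually computes, here's a bare-bones Python sketch that finds only the first principal component by power iteration on the covariance matrix. It makes no claim about how PAST does things internally, the function name `first_pc` is hypothetical, and the four data points are made up so that the answer is easy to check by eye.

```python
import math


def first_pc(rows, iterations=200):
    """Leading principal component of `rows` via power iteration."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    vec = [1.0] + [0.5] * (dims - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims))
               for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Project each centered row onto the component to get its score.
    scores = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, scores


# Made-up points lying on a straight line, so the first PC is the diagonal.
rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
vec, scores = first_pc(rows)
print(vec, scores)
```

Because the subset is re-centered and its own axes of variation are recomputed, structure that's swamped in a global analysis can stand out much more clearly. Note that the sketch assumes the rows aren't all identical, otherwise the normalization step divides by zero.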

Another free, easy to use online tool that works with Global25 coordinates is Vahaduo Global25 Views. Below is a screen cap of me checking out one of the many PCAs that it offers.

And if you're fond of tree-like structures as a means to describe fine-scale genetic variation, please see this blog post...

Global25 workshop 4: a neighbour joining tree

See also...

New Global25 interpretation tools

Saturday, June 1, 2019

They came, they saw, and they mixed

Y-chromosome haplogroup N is strongly associated with Uralic-speaking populations. That's probably because it was a salient feature of the gene pool of the earliest Uralic speakers, and it went with them as they migrated across northern Eurasia. However, some of its younger subclades appear to have spread with the speakers of Indo-European and Turkic languages.

For instance, N-Y10931 seems to be a marker of the Rurikids, a Varangian dynasty that, according to most sources, ruled the Kievan Rus in what are now Russia and Ukraine. And the Kievan Rus was a loose medieval political federation in which Slavic, Finnic (west Uralic) and Germanic languages were probably spoken. The latest on the genetic genealogy of the Rurikids was presented a couple of days ago at the Centenary of Human Population Genetics conference in Moscow, and there's an abstract of the talk available here (download the PDF and scroll down to page 84).

I'm not aware of any Rurikids among the thousands of ancients in my dataset, or even of any samples belonging to N-Y10931. But I do have the genome of someone who belongs to N-Y4339, which, as per the abstract linked to above, is proximally ancestral to N-Y10931. Not only does this person come from Viking Age Scandinavia, but he was buried in a crouched position typical of Slavic funerary customs of the time.

The individual in question is vik_84001. His genome was published recently along with a paper on the population structure of the Swedish town of Sigtuna way back when it was a Viking stronghold (see here). This is where his Y-chromosome sequence, labeled ERS2540883, is positioned on the YFull Y-chromosome phylogenetic tree. Click on the image to go to YFull.

However, the result is likely to be compromised to some extent by missing data. If so, it's possible that vik_84001 does indeed belong to N-Y10931 and ought to be sitting near or even among that cluster of Russian samples (Rurik descendants?) at the bottom of the page.

In any case, vik_84001 seems to be the closest individual in the ancient DNA record to a Rurikid. The Principal Component Analysis (PCA) below is based on my Global25 data. It features 18 other Viking Age individuals from Sigtuna alongside vik_84001 (look for the black dots). The relevant datasheet is available here. Interestingly, despite his eastern Y-haplogroup, vik_84001 is one of the few Sigtuna ancients who clusters strongly with present-day Swedes.

But here's what happens when I model his ancestry proportions with the Global25/nMonte method using a wide range of reference populations from Northern and Eastern Europe. The Swedes in this model are the same as those in the PCA.

vik_84001
Swedish,84.6
Ingrian,9.2
Russian_Tver,6.2
Belarusian,0
Estonian,0
Finnish,0
Finnish_East,0
Karelian,0
Latvian,0
Mordovian,0
Russian_Kostroma,0
Russian_Kursk,0
Russian_Orel,0
Russian_Pinega,0
Russian_Smolensk,0
Russian_Voronez,0
Ukrainian,0
Vepsian,0

[1] "distance%=2.3778"

Yep, despite his position in the PCA, vik_84001 shows a strong signal of ancestry related to the present-day populations of northwestern Russia. I'm not sure what this means exactly, but it's certainly fascinating stuff. And, by the way, I usually wouldn't use so many similar reference populations in a single Global25/nMonte model because of the problem of "overfitting", but in some cases it's OK to do so if the nMonte algorithm has enough recent genetic drift to latch onto.

See also...

More on the association between Uralic expansions and Y-haplogroup N

Fresh off the sledge

Uralic-specific genome-wide ancestry did make a significant impact in the East Baltic

It was always going to be this way

Conan the Barbarian probably belonged to Y-haplogroup R1a

Thursday, February 15, 2018

Modeling genetic ancestry with Davidski: step by step

There are many different ways to model your genetic ancestry but I prefer the Global25/nMonte method. This is a step by step guide to modeling ancient ancestry proportions with this simple but powerful method using my own genome.

As far as I know, the vast majority of my recent ancestors came from the northern half of Europe. This may or may not be correct, but it gives me somewhere to start, so that I can come up with a coherent model. If you don't have this sort of information, because, perhaps, you were adopted, then just look in the mirror, and work from there. Like I say, it's not imperative that you know anything whatsoever about your ancestry, because your genetic data will do the talking, but you do need a model when modeling.

In scientific literature nowadays Northern Europeans are often described as a three-way mixture between Yamnaya-related pastoralists, Anatolian-derived early farmers, and Western European Hunter-Gatherers (WHG). So let's see if this model works for me. Obviously, if it does, then it'll confirm the information that I have about my origins, but it might also reveal finer details that I'm not aware of. The datasheet that I'm using for this model is available here.

[1] distance%=6.9025 / distance=0.069025

Davidski

Yamnaya_Samara 53.9
Barcin_N 30.75
Rochedane 15.35
Tepecik_Ciftlik_N 0

Yep, the model does work, with a fairly reasonable distance of almost 7%. The ancestry proportions more or less match those from scientific literature and the plethora of analyses that I've featured at this blog on the topic. Please note that I've kept things very simple, using only four reference populations and individuals as proxies for four distinct streams of ancestry. But I've put my own twist on this Neolithic/Bronze Age model by including two populations from Neolithic Anatolia (Barcin_N and Tepecik_Ciftlik_N), just to see what would happen. The WHG proxy is Rochedane.
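The two figures in the output line are the same number in different units: distance% is simply the distance multiplied by 100 (0.069025 becomes 6.9025). The Python sketch below shows how such a figure could be computed, assuming the distance is the Euclidean distance between the target's coordinates and the weighted blend of the sources. That assumption, the function name `fit_distance` and the two-dimensional coordinates are for illustration only.

```python
import math


def fit_distance(target, sources, proportions):
    """Distance between a target and a blend of sources, plus its percent form."""
    dims = len(target)
    # Proportions are percentages, so divide by 100 to get weights.
    blend = [sum(proportions[name] / 100 * sources[name][d]
                 for name in proportions)
             for d in range(dims)]
    dist = math.sqrt(sum((t - b) ** 2 for t, b in zip(target, blend)))
    return dist, dist * 100


# Made-up coordinates: a 50/50 blend of A and B misses the target by 0.03.
sources = {"A": [0.0, 0.0], "B": [0.1, 0.0]}
target = [0.05, 0.03]
dist, dist_pct = fit_distance(target, sources, {"A": 50, "B": 50})
print(dist, dist_pct)
```

Read this way, a lower distance just means the blend of reference populations sits closer to the target in Global25 space.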

Admittedly, though, my Yamnaya cut of ancestry appears somewhat bloated at over 53%, and the model's distance is a little higher than what I normally see for really strong models. So let's check if I can get a better fitting and more sensible result by adding a slightly more easterly forager proxy than Rochedane: Narva_Lithuania.

[1] distance%=5.9331 / distance=0.059331

Davidski

Yamnaya_Samara 45.75
Barcin_N 31.45
Narva_Lithuania 22.8
Rochedane 0
Tepecik_Ciftlik_N 0

The statistical fit does improve, and when given a choice between Rochedane and Narva_Lithuania, the algorithm picks the latter as the only source of extra forager input in my genome.

What could this mean? It might mean that a large part of my ancestry derives from the Baltic region. Actually, I know for a fact that this is true. But even if I had no idea about my genealogy, this result would be a very strong hint about my genetic origins. Indeed, let's follow this trail and try to further improve the fit of the model by adding a more relevant Yamnaya-related proxy, such as early Baltic Corded Ware (CWC_Baltic_early).

[1] distance%=5.444 / distance=0.05444

Davidski

CWC_Baltic_early 54.95
Barcin_N 26.7
Narva_Lithuania 18.35
Rochedane 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Holy shit! To be honest, I wasn't expecting this sort of resolution and accuracy, and I can't promise that everyone using the Global25/nMonte method will see such incredibly nuanced outcomes, but this isn't a fluke. It can't be, because it gels so well with everything that I know about my ancestry. Please note also that I belong to Y-chromosome haplogroup R1a-M417, which is a lineage intimately associated with the Corded Ware expansion across Northern Europe (for instance, see here).

But of course, the Baltic and nearby regions haven't been isolated from migrations and invasions since the Corded Ware times. For instance, at some point, probably during the Bronze Age, Uralic-speaking groups moved west across the forest zone of Northeastern Europe and into the East Baltic and northern Scandinavia. It's generally accepted that they brought Siberian admixture with them (see here). Moreover, from the Iron Age to the Middle Ages, East Central Europe was under intense pressure from a wide range of nomadic steppe groups with complex ancestry, such as the Sarmatians, Avars, Huns, and Mongolians. Did any of these peoples leave their mark on my genome? At the risk of overfitting the model, let's explore this possibility by adding a few more reference populations.

[1] distance%=5.444 / distance=0.05444

Davidski

CWC_Baltic_early 54.95
Barcin_N 26.7
Narva_Lithuania 18.35
Han 0
Mongolian 0
Nganassan 0
Rochedane 0
Sarmatian_Pokrovka 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Nothing changes when I add the Han Chinese, Mongolians, Nganassans (a Uralic group from Siberia), and Sarmatians to the model. But what about if I throw in the only ancient Slav in my datasheet?

[1] distance%=2.9904 / distance=0.029904

Davidski

Slav_Bohemia 85.9
CWC_Baltic_early 7.7
Narva_Lithuania 6.4
Barcin_N 0
Rochedane 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Considering that the vast majority of my recent ancestors were Poles, thus a Slavic-speaking people from near the Baltic, this outcome makes perfect sense. And check out the new distance! But the problem now is that I'm overfitting the model by using two very similar and probably very closely related references, CWC_Baltic_early and Slav_Bohemia. And overfitting should be avoided at all costs. So it might be useful to break up this effort into two models: one focusing on the Neolithic and Bronze Age, and the other on the Iron Age and Middle Ages. I'll do that soon, but not just yet, because there are still too few Iron Age and Medieval samples available from the Baltic region and surrounds for meaningful analyses of this type.
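One quick way to spot references that are likely to compete with each other is to check how close they sit to one another before running a model. The Python sketch below does that with plain Euclidean distances. It's only a sanity check, not a method used in this post, and the threshold, the function name `close_pairs` and the coordinates are all made up.

```python
import math
from itertools import combinations


def close_pairs(references, threshold):
    """List reference pairs whose Euclidean distance is below `threshold`."""
    flagged = []
    for (a, va), (b, vb) in combinations(references.items(), 2):
        d = math.dist(va, vb)
        if d < threshold:
            flagged.append((a, b, round(d, 4)))
    return flagged


# Hypothetical two-dimensional coordinates; the threshold is arbitrary.
references = {
    "Ref_A": [0.10, 0.02],
    "Ref_B": [0.11, 0.02],
    "Ref_C": [-0.05, 0.08],
}
flagged = close_pairs(references, threshold=0.02)
print(flagged)
```

Any pair that gets flagged is a candidate for being split across separate models, such as one for the Neolithic and Bronze Age and another for later periods.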

See also...

Genetic ancestry online store (to be updated regularly)

Tuesday, October 31, 2017

Genetic ancestry online store

Update 05/02/2025: To get your Global25 coords, please use the app HERE. The whole process usually takes a couple of days. Feel free to spread the word.

Please don't order the Global25 unless you have experience in modeling Global25 data with the Vahaduo analysis tools.

Note that the conversion of VCF, BAM, CRAM and/or fastq files is 30 to 50€ extra depending on the case. For enquiries please email teepean47 at g25requests@gmail.com.

...

Following a rigorous testing phase, the awesome Global 25 analysis is now available at the store for $12 USD. What's so awesome about this test, you might ask? See here and here.

Please send your request and autosomal genotype data (from AncestryDNA, FTDNA, LivingDNA, MyHeritage or 23andMe) to eurogenesblog at gmail dot com.

However, note that this test is free for anyone who already has Global 10 coordinates (see here). That's right, if you already have Global 10 coordinates, all you have to do is to send me your data and say what it's for. Simple as that.

...

My Celtic vs Germanic Principal Component Analysis (PCA) is now available via the store for $6 USD (see here). Please note that this test is only really useful for people of Central, Northern and/or Western European origin, and indeed geared for those of overwhelmingly Northwestern European ancestry.

Please send your request and autosomal genotype data (from AncestryDNA, FTDNA, LivingDNA, MyHeritage or 23andMe) to eurogenesblog at gmail dot com.

...

The popular Basal-rich K7 admixture test is now available via the store for $6 USD. It's suitable for everyone, except people with significant (>10%) Sub-Saharan ancestry. For more information about this test and some ideas about what to do with the output see here and here.

Please send your request and autosomal genotype data (from AncestryDNA, FTDNA, LivingDNA, MyHeritage or 23andMe) to eurogenesblog at gmail dot com.

See also...

Global25 workshop 1: that classic West Eurasian plot

Global25 workshop 2: intra-European variation

Global25 workshop 3: genes vs geography in Northern Europe

Getting the most out of the Global25

Modeling genetic ancestry with Davidski: step by step
