Tuesday, March 22, 2011

Reconstructing the Ancestral North Indian (ANI) genome

Back in 2009, Reich et al. theorized that the current South Asian gene pool was basically made up of two founding genetic components: Ancestral North Indian (ANI) and Ancestral South Indian (ASI). The distilled ANI, they noted, was more similar to the genomes of modern Northwest Europeans than to those of the Adygei from the Caucasus. This is obviously out of whack with geography, but it does make sense based on what I've seen in my experiments on the Pakistani samples from the HGDP. Many of them, especially the Pathans, carry numerous segments, or haploblocks, that basically look North European. This gave me the idea to try to reconstruct the ANI genome from such fragments. The first chromosome of my composite sample, which I call the "ANI composite", is available for download here. It's a PLINK PED file in Illumina AB format with 19,261 SNPs.

Below are several PCA plots featuring the "ANI composite", obviously not including the HGDP samples used to make it (see below). Overall, it seems to most closely resemble my reference samples from Eastern Europe. I have to admit that I was very pleased to see it behaving like a set of genotypes from a real human subject across many dimensions of genetic variation. PCA is very sensitive to anomalies, such as unusually long runs of homozygosity, so the fact that my composite can pass for a normal sample on these plots is fantastic.

So how did I do this? Well, it wasn't very difficult, but a bit tedious, so I need a break before continuing. I used information from my earlier experiments with ADMIXMAP, HAPMIX and RHH Counter to locate and delineate North European-like segments in phased Pakistani HGDP samples. I phased the data myself with BEAGLE, in a pool of South Asian and Middle Eastern samples, so as not to bias the results of phasing and imputation towards Northern Europe. In order to keep the alleles in phase when loaded into PLINK, I duplicated the haplotypes, producing completely homozygous individuals out of each one. Then I created an ANI composite dummy with 100% no calls, and loaded the haplotypes into this sample with a Python script. The first to load were the Pathan haplotypes, followed by the Burusho. I chose individuals from these two groups to make up the backbone of the putative ANI genome because they always seem to come out most "North European" in my ADMIXTURE and PCA/MDS runs compared to other South Asians. The empty spaces were filled with haplotypes from the Brahui and Balochi. Below is a list of all the samples used:

Pathan HGDP00213
Pathan HGDP00214
Pathan HGDP00218
Pathan HGDP00224
Pathan HGDP00241
Pathan HGDP00243
Pathan HGDP00254
Pathan HGDP00258
Pathan HGDP00259
Pathan HGDP00262
Pathan HGDP00264

Burusho HGDP00338
Burusho HGDP00356
Burusho HGDP00364
Burusho HGDP00382
Burusho HGDP00392
Burusho HGDP00412
Burusho HGDP00417
Burusho HGDP00423
Burusho HGDP00428
Burusho HGDP00433

Brahui HGDP00007
Brahui HGDP00009
Brahui HGDP00017
Brahui HGDP00041
Brahui HGDP00047

Balochi HGDP00054
Balochi HGDP00058
Balochi HGDP00062
Balochi HGDP00072
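
The loading step described above can be sketched roughly as follows. This is not my actual script, just a minimal illustration under a few assumptions: each donor haplotype is a list of Illumina AB alleles, the North European-like segments are given as half-open (start, end) ranges of SNP indices, and a donor loaded later only fills positions that are still no calls. The names `build_composite` and `homozygous_ped_fields`, and the toy haplotypes, are made up for the example.

```python
MISSING = "0"  # PLINK's default code for a missing allele in a .ped file


def build_composite(n_snps, donors):
    """Fill an all-no-call composite from donor haplotypes in priority order.

    donors: list of (haplotype, segments) pairs, highest priority first.
    segments: half-open (start, end) SNP index ranges flagged as
    North European-like (assumed format, for illustration only).
    """
    composite = [MISSING] * n_snps  # the "dummy" with 100% no calls
    for haplotype, segments in donors:
        for start, end in segments:
            for i in range(start, end):
                # Assumption: earlier donors are never overwritten.
                if composite[i] == MISSING and haplotype[i] != MISSING:
                    composite[i] = haplotype[i]
    return composite


def homozygous_ped_fields(haplotype):
    """Duplicate each allele so the haplotype stays in phase as a
    fully homozygous PLINK genotype row."""
    return " ".join(f"{allele} {allele}" for allele in haplotype)


# Toy data: a Pathan haplotype is loaded first, then a Burusho one.
pathan = list("AABBA")
burusho = list("BBBAB")
donors = [(pathan, [(0, 2)]), (burusho, [(1, 4)])]

composite = build_composite(5, donors)
print(homozygous_ped_fields(composite))
```

With the toy data, the Pathan segment covers the first two SNPs, the Burusho segment fills the next two, and the last SNP stays a no call, so the printed genotype fields are `A A A A B B A A 0 0`.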

The phased data and the "ANI" haplotypes used in this experiment are available on request from eurogenesblog [at] hotmail [dot] com. I welcome feedback and suggestions on how to improve my methodology. Admittedly, this was a test run, so it's unlikely to be perfect.


wagg said...

I am highly skeptical of what Dienekes is claiming in "On the northern/southern Caucasoid contributions to Asia" (and for some strange reason, once again, my comment doesn't show up over there).

South Asians share a specific allele (for lactase persistence) with Europeans (clearly linked to R1a1a, R1b1b2 and R1b-V88, and to European mtDNA lineages such as H) and not with West Asians (it is very marginal there and obviously arrived from outside).
Difficult to explain from his perspective, I think.

Frequency and spread of T13910 alleles

South Asians and Europeans obviously share a specific relationship from which West Asians are excluded.

And claiming that the North European component he previously detected in autosomal data was not actually "real" is strange, because these could easily fit with a Northeast European component:

pic 1

pic 2

pic 3

pic 4

pic 5

pic 6

pic 7

pic 8

and so on...

Onur said...

Wagg, I already replied to your arguments above with this comment on Dienekes' blog:

"Wagg or Waggg (whatever), if you read my above comments carefully, you'll see that when I talked about the absence of the Northern European component in South Asians I was referring to Dienekes' supervised ADMIXTURE analysis. In unsupervised ADMIXTURE analyses, some components that are modal in Northern Europeans do indeed show up in some South Asian populations (especially those from the north), though in much smaller amounts. But they also show up in not so trivial amounts in many West Asian populations in the same unsupervised ADMIXTURE analyses, so there is nothing contrary to what I wrote in my above comments.

As for your lactase persistence allele example, just a single allele (and an allele, as you say, that also exists among West Asians) doesn't say anything about overall population relationships. Besides, I wrote in my above comments that we shouldn't make too many inferences from the distribution of one single haplogroup, and now you are doing the same for a single allele. :D

Lastly, the photos you present don't say anything about South Asians in general. Anyone who's been in South Asia (including its northernmost regions, including Afghanistan) should know better. I can show you many millions of photos of swarthy South Asians (much swarthier than the swarthiest West Asians). I can also show you millions of photos of very lightly pigmented West Asians. Note that most of the photos you show are those of children, who, we know, are more lightly pigmented than adults."

In response, you made the following argument, which was later deleted due to the Blogger failure:

"For such a specific characteristic to be this widespread in these specific regions, it has to be meaningful, I think.

The South Asians share a link with European populations that they do not share with West Asians (it's absent or very rare in West Asia, and given its spread and frequency in West Asia, it's not autochthonous; it seems to have arrived from outside), while from the results shown on this page we could expect the contrary.
Looking at the map, it seems that the connection/transmission occurred via the Central Asian steppes and not West Asia either (in Central Asia the frequency is not that high, but we know there was an important East Asian gene flow during the Iron Age, so the frequency of that allele was obviously higher in the past in this region).

These things are transmitted; they don't magically appear. Doesn't it argue for a certain quantity of R1a1a (not all, in my mind) coming from Central Asia and carrying this specific mutation in the past?"

To that I wrote the following reply, which was also later deleted due to the Blogger failure:

"Wagg, I didn't deny that they were transmitted; all I have been saying is that the effect of the transmissions you mention on the overall South Asian genome is small."

Here is the link of the relevant Dienekes thread:

BTW, I should add that we don't know where the relevant lactase persistence allele spread from. Also, we don't know how much of the R1a1a in South Asia came from Europe (I strongly suspect that a very small percentage of it came from Europe).