Update 02/02/2013: A 23andMe scientist explains what went wrong with the "overfitting" fix:
Last Friday the 25th we pushed out an update to Ancestry Composition to address the “overfitting” issue that’s been discussed. There was a problem with the update, in some cases leading to the reporting of spuriously high levels of Sub-Saharan African ancestry. We reverted the update during the weekend. We’ve figured out the source of the problem, and will be re-deploying a corrected update in the near future. Although we don’t have a timeline quite yet, look for another update soon.
Source: 23andMe Community
Update 27/01/2013: The Ancestry Composition overhaul is not going smoothly. Some clients are reporting unexpectedly high levels of Sub-Saharan ancestry in their results today, and this might be a bug caused by the "overfitting" fix. See here for live updates on the crisis.
I have some great news for 23andMe clients who were used as reference samples for the company's new Ancestry Composition (AC) tool. These people received AC outcomes which were largely based on their self-reported ancestry at 23andMe (as opposed to their genetic origins), but the issue has now been fixed and they will see changes to their results within 7 days. A 23andMe scientist explains:
Ancestry Composition (AC) works by learning (training) a set of useful features from reference individuals with known ancestry (the training set) and then using these features to predict the ancestry of our customers.
Our set of reference individuals consists in part of customers who reported their 4 grandparents were born in the same country. Remember that we also remove the outliers, or people whose genetic ancestry doesn't match their survey answers. From this set, AC learns to associate certain haplotypes with their geographical origin. AC is then able to recognize similar haplotypes and thus to predict the ancestry of other customers.
However, when predicting the ancestry of reference individuals, AC suffers from overfitting, a problem common to many supervised learning methods. As a consequence, AC predicts the ancestry of most reference individuals as being 100% from their grandparents’ birthplace.
We addressed this issue using a method inspired from cross-validation. We divided the training set into 5 folds, each containing 20% of the reference individuals. We then trained 5 AC models in which each fold in turn is excluded from the set of reference individuals. So each of these models is learned using 80% of the reference individuals. Additionally, we retain the model that was trained using all the reference individuals. From this process, we end up with 6 different models from which we can predict the ancestry of our customers.
Now, when predicting the ancestry of a customer, we start by figuring out if he/she is a reference individual. If yes, we identify the fold in which the customer belongs, and we use the corresponding model for prediction. If not, we use the fold containing all of the reference data. This way, we ensure that AC was never trained using the haplotypes of the individual it tries to predict.
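To make the scheme described above concrete, here is a minimal sketch in Python of how the six models and the per-customer model choice might fit together. It is not 23andMe's code: the function names (`assign_folds`, `train`, `build_models`, `model_for`) are hypothetical, and `train` is only a placeholder that records which reference IDs a model saw, standing in for the real haplotype-based training. The sketch also assumes that a non-reference customer gets the model trained on all reference individuals, which is how I read the last step of the explanation.

```python
import random

N_FOLDS = 5  # five folds of 20% each, as in the description above


def assign_folds(reference_ids, n_folds=N_FOLDS, seed=0):
    """Shuffle the reference IDs and deal them round-robin into folds."""
    ids = sorted(reference_ids)
    random.Random(seed).shuffle(ids)
    return {ref_id: i % n_folds for i, ref_id in enumerate(ids)}


def train(training_ids):
    """Placeholder for real training: just remember which IDs were used."""
    return frozenset(training_ids)


def build_models(fold_of, n_folds=N_FOLDS):
    """Return (held_out_models, full_model).

    held_out_models[k] is trained on every reference individual
    except those in fold k; full_model is trained on all of them.
    """
    held_out = [
        train(r for r, f in fold_of.items() if f != k)
        for k in range(n_folds)
    ]
    full = train(fold_of)  # iterating a dict yields its keys (the IDs)
    return held_out, full


def model_for(customer_id, fold_of, held_out, full):
    """Pick the model that never saw this customer during training."""
    fold = fold_of.get(customer_id)
    return full if fold is None else held_out[fold]
```

With 100 hypothetical reference IDs, `build_models` yields five models trained on 80 IDs each plus one trained on all 100, and `model_for` never returns a model whose training set includes the customer being predicted.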
For more info and a discussion about this update, see the relevant thread at the 23andMe Community.
I was one of the unlucky and unhappy people who were used as an AC reference sample, and as a result I generated some negative publicity for 23andMe across the blogosphere and on various fora. I'm planning to generate a lot of positive publicity for the company when my new result comes through, because I think AC can be one hell of a tool if done properly.
23andMe’s Ancestry Composition – a preliminary review