So, with Humata's help I analyzed all R1a1a individuals that have STR111 data. Again, I used adjusted distances to perform this analysis, which I described and used here and here.
To analyze the data I had to extend the IDs to 10 characters to prevent the Fitch software from malfunctioning (PHYLIP programs expect names that are exactly 10 characters long).
There are various ways to display the data. Here I present two layouts.
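The ID adjustment can be sketched in a few lines of Python. This is only a minimal illustration, assuming the IDs are padded with trailing spaces and the distances are written in PHYLIP's square matrix layout; the helper names `pad_phylip_name` and `format_phylip_distances` are hypothetical and are not the tooling used for the trees shown here.

```python
def pad_phylip_name(kit_id: str, width: int = 10) -> str:
    """Left-justify kit_id and pad it with spaces to exactly `width` characters."""
    if len(kit_id) > width:
        raise ValueError(f"{kit_id!r} is longer than {width} characters")
    return kit_id.ljust(width)


def format_phylip_distances(names, matrix) -> str:
    """Return a square distance matrix as PHYLIP-style text.

    The first line holds the number of entries; each following line starts
    with a 10-character name and continues with that row's distances.
    """
    if len(names) != len(matrix):
        raise ValueError("one name per matrix row is required")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        values = " ".join(f"{d:8.4f}" for d in row)
        lines.append(f"{pad_phylip_name(name)} {values}")
    return "\n".join(lines) + "\n"
```

A short ID such as `179005` becomes `179005` followed by four spaces, so every name occupies the same fixed-width field at the start of its row.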
Rectangular tree layout (see high resolution image):
Polar tree layout (see high resolution image):
Again, to better see "what is what" I annotated each ID with the proposed group used in the R1a1a and Subclades Project at FTDNA, and I used the same colors for the subclades as in this figure from FTDNA. Even with 111 STR values, the main R1a1a SNP groups (Z93, Z283, etc.) overlap.
So which gives more accurate results: unrooted network analysis or rooted tree analysis?
It is quite obvious that even STR111 data are not sufficient to differentiate between the major R1a1a subclades. Multiple haplogroups overlap in the tree (presented below) and in the network (presented previously); e.g. Z283 and Z93 overlap when only the STR111 data are considered. All previously presented trees add SNP information to correct this obvious overlap. As an example:
179005 Krikor Mirijanian, Arapkir, Turkey who is Z93+, L342+, L657-.
Based on his STR111 values 179005 is closest to Z283+, Z280+ and Z283+, Z284+ individuals and not to other Z93+ individuals.
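How "closest" can be determined from STR values is sketched below. The sketch uses a plain stepwise distance (the sum of absolute repeat-count differences per marker), which is an assumption made for illustration and not the adjusted distance used for the trees in this post. The function names and the five-marker toy haplotypes in the usage note are hypothetical.

```python
def stepwise_distance(a, b):
    """Sum of absolute repeat-count differences across all markers."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbors(target, haplotypes, k=3):
    """Return the names of the k haplotypes closest to `target`.

    `haplotypes` maps an ID to a list of STR repeat counts.
    Ties are broken alphabetically by ID.
    """
    ranked = sorted(
        (stepwise_distance(haplotypes[target], values), name)
        for name, values in haplotypes.items()
        if name != target
    )
    return [name for _, name in ranked[:k]]
```

With made-up data such as `{"A": [13, 25, 16, 11, 11], "B": [13, 25, 16, 10, 11], "C": [13, 24, 15, 11, 12]}`, calling `nearest_neighbors("A", data, k=1)` picks `B`, which differs from `A` by a single step. Comparing the SNP status of the nearest neighbors with that of the target is one way to spot cases like 179005.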
Update:
After Semargl was asked how he generated his R1a1a STR111 tree and why it shows clear clusters along the SNP branches, he responded that he uses not only STR but also SNP information to generate the tree. Additionally, his current phylogenetic tree is a cladogram, which means it carries no information about the age or diversity of the R1a1a branches. For example, the Ashkenazi-Jewish Z93+, L342+, L657- cluster and the MacDonald cluster take up a large part of the tree, even though these clusters are known to be very narrow (low diversity). Hopefully, he can generate a new tree that includes all this information.
From the Fitch software manual:
In Fitch you can also randomize the input order of the sequences with option "j", jumble. Often the input order of the sequences affects the outcome of the analysis. This can be assessed by randomizing the input order. The program also asks you to specify the number of times you want to randomize the input order of the sequences. It is advisable to do jumbling at least 10 times, because it almost certainly improves the results.
This is why I repeated the analysis with 10 runs as advised. Indeed, the R1a1a STR111 tree looks a little bit better now.
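What reordering the input means can be illustrated outside of Fitch. Fitch performs the jumbling internally, so the following is only a sketch of the idea, assuming the input is a list of names plus a square distance matrix; the function name `jumble` and the seeds are hypothetical.

```python
import random


def jumble(names, matrix, seed):
    """Return the names and distance matrix in a random but consistent order.

    The same permutation is applied to the names, the rows and the columns,
    so every pairwise distance is preserved; only the input order changes.
    """
    rng = random.Random(seed)
    order = list(range(len(names)))
    rng.shuffle(order)
    new_names = [names[i] for i in order]
    new_matrix = [[matrix[i][j] for j in order] for i in order]
    return new_names, new_matrix


def jumbled_inputs(names, matrix, runs=10):
    """Produce `runs` reordered copies of the same data set."""
    return [jumble(names, matrix, seed) for seed in range(runs)]
```

Each reordered copy describes exactly the same data. A tree-building program that adds entries one at a time may still arrive at different trees for different orders, which is why several orders are tried and the best-fitting tree is kept.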
Rectangular tree layout (see high resolution image):
Polar tree layout (see high resolution image):
One of the newly discovered SNPs is Z1282, downstream of L342. It was found in N77532 Sundardas Tulsyan, India. He appears in the tree as "N77532-2C*". Based on the presented tree and the previously presented network analysis, #184336, SAUD ABDUL AZIZ, Qatar would be a good candidate for Z1282, too (184336-2C* in the tree and in the network analysis).
Palisto,
I am interested in making these plots. I tried to use the ACD tool from http://vaedhya.blogspot.com/, but failed to download it. Could you help me make similar plots? I have used the McGee tools extensively.
Regards,
Wim
These are two different things. Do you want to use ACD tool or do you want to make STR trees?
If you want to use the ACD tool you need to know what your starting data set is: Admixture data with K=? (e.g. Dodecad K=12 or Harappa K=16)
If you want to generate STR trees you need to know the number of STRs in your data set. (e.g. STR67, STR111)
Palisto,
Thanks for your response. I realized I do not want to use the ACD tool at the moment; I want to generate the STR trees. So far I have succeeded in creating a tree of the L147.1* group: https://docs.google.com/file/d/0B-MTEoTmfh9YLUJydHpBdE9KU28/view
It was created using McGee and the PHYLIP data, with Kitsch as described at http://www.roperld.com/PHYLIPTreeViewUse.htm, using neighbouring of the McGee timescales. To display it I did not use TreeView but the ape package for R. I see a few possible improvements. I would prefer to use a "neighbouring" function of the STR values. Do you have suggestions?