I did a WGS test with Nebula Genomics, re-aligned the results to T2T and uploaded them to the YFull site. I did the FTDNA Y37 test because I couldn't find any close matches (I have no matches at Y25 and Y37). My ancestors were Balkan Turks living in Bulgaria, so I know that these haplogroups are very rare in my region. FTDNA: R-FT395781. YFull: R-Y278111. Waiting for your recommendations, thanks...
|
My personal opinion is that being part of an under-tested group is the perfect reason to be the pioneer: upgrade to BigY and start a haplogroup and/or regional group project.
|
One of my Big Y matches (a ninth cousin by paper trail) doesn't match me at Y-37, so lack of matches at Y-37 doesn't necessarily mean a lack of matches at Big Y.
On Tue, Jun 18, 2024 at 10:27 PM, Mark Miller <fuddaruski@...> wrote:
My personal opinion is that being part of an under tested group is the perfect reason to be the pioneer, upgrade to BigY and start a haplogroup and/or regional group project.
|
At Family Tree DNA, R-FT395781 is a six-person haplogroup and is about 2000 years old. I'll presume that you have looked at your results from YFull and found that you are FTA29830-, which limits the pool of people you could match more closely than 2000 years ago to the two FT395781* testers at Family Tree DNA. The general idea behind Y-STR matches is that matches are only given if you are related within the last 1000 years or so. The accuracy with which that 1000 years can be guessed depends on the size of the test. I've posted about this previously in messages 7996 and 8087, and more generally about TMRCAs in messages 312 and 3672 and on my website.
This means that some people who are related to you within the last 1000 years or so may not match your Y-37 test, simply because your ancestors or their ancestors have had a more-than-average number of mutations in their first 37 markers over the last 1000 years. The more markers you test, the less important this effect becomes, so there is definitely an advantage to upgrading to Y-111, because you could recover people that are otherwise lost to the system. If you still have no matches, then you can be confident that no-one in the Family Tree DNA database is related to you within the last few centuries. However, your test results will still be in the database in case a closer match does come along in the future.
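As a rough illustration of why a larger marker panel reduces this effect, here is a minimal sketch that treats the number of STR mutations separating two men as a Poisson count. It is not Family Tree DNA's matching algorithm. The per-marker mutation rate, the generation count, and the function name `relative_scatter` are all assumptions made for illustration.

```python
import math

def relative_scatter(markers, generations, rate_per_marker=0.003):
    """Relative (1-sigma) scatter of a Poisson mutation count.

    Assumptions (illustrative only): every marker mutates independently
    at `rate_per_marker` per generation, and both lineages accumulate
    mutations since the common ancestor, hence the factor of 2.
    """
    expected_mutations = 2 * generations * markers * rate_per_marker
    # For a Poisson count, standard deviation / mean = 1 / sqrt(mean).
    return 1.0 / math.sqrt(expected_mutations)

# Hypothetical case: a common ancestor 30 generations ago.
for markers in (37, 111):
    scatter = relative_scatter(markers, generations=30)
    print(f"Y-{markers}: relative scatter about {scatter:.0%}")
```

Under these assumptions, tripling the number of markers shrinks the relative scatter by a factor of the square root of three, so an unusually high or low run of mutations matters less at Y-111 than at Y-37.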
Upgrading beyond Y-111 to BigY-700 is a little more questionable. In essence, you're repeating the WGS test you've already taken: you're just paying for very similar data to be stored on a different system. We already know which haplogroup you're going to be in. Your only hope beyond this is that you may find that you are more closely related to one of the two FT395781* testers, and establish a new haplogroup for the two of you. You'd also get access to improved information on your haplogroup, as FTDNA would be able to add your test data to the statistics in their Discover platform, and it would help our efforts to reconstruct migration patterns in Europe, as we sorely lack information from south-eastern Europe. But whether the upgrade from Y-111 to BigY-700 would be worth it to you depends on whether you think the benefit from this extra information is worth the cost of re-testing.
Cheers,
Iain.
|
Thank you very much for your advice. I will upgrade to the Y111 test at the next sale. If FTDNA announces a new, improved test that offers T2T alignment, then I will do that.
|
I applied a few days ago to set up a group project for R-Z49 at FTDNA.
|
I will upgrade my test to Y111 in the next discount period.
|
FTDNA have said they will eventually re-align all Big Y 700 tests to so-called "T2T" reference genomes, and they've already done this for some kits.
Vince
|
The BigY-700 offers very inadequate coverage for the T2T reference, at around 25 million bases. Human Y-DNA, on the other hand, can reach approximately 80 million bases, although recent studies show that this varies from person to person.
|
I don't think it is reasonable to suggest that Big Y is inadequate.
For one thing, most of the 80 million bases are phylogenetically unhelpful. For this reason, full sequencing of the Y-chromosome - even if it were commercially available, which I don't think it currently is - provides only an incremental benefit over conventional sequencing with realignment to the novel reference.
I suspect affordable long-read sequencing is many years away. And when it IS available I think most of us will find the marginal benefit to be minimal.
Vince
On Sat, Jun 22, 2024 at 07:48 AM, <6458923@...> wrote:
The Big700 offers very inadequate coverage for the T2T reference, offering around 25 million bases. Human Ydna, on the other hand, can reach approximately 80 million bases, although this varies from person to person with recent studies.
|
Short-read WGS tests can currently call over 60 million bases. The BigY-700 is around 25 million, which is why I said it is insufficient. I think FTDNA will also offer a short-read WGS test in a few years. I also agree with the view that long-read sequencing will not become cheaper in the near future.
|
WGS files from Nebula, when aligned to T2T, almost never produce coverage of more than 45 Mbp. And FTDNA files average closer to 30 Mbp than 25.
And that still doesn't mean that the difference (45 vs 30) is phylogenetically significant. I suspect most of those 15 "additional" Mbp won't prove to be very useful.
On Sat, Jun 22, 2024 at 10:57 AM, <6458923@...> wrote:
WGSs, which offer short read, can currently call over 60 million bases. The BigY 700 is around 25 million, which is why I said it is insufficient. I think FTDNA will also offer a WGS test with short read in a few years. I also agree with the view that long read will not become cheaper in the near future.
|
On Sat, Jun 22, 2024 at 08:30 PM, vineviz wrote:
"WGS files from Nebula, when aligned to T2T almost never produce coverage of more than 45 mbp." This sentence is completely wrong. My own Nebula 30x WGS result (R-Y278111* -- T2T .BAM Nebula Genomics 11X, 59.8 Mbp, 150 bp) and several other results I know of have readings over 60 million bases. Please don't say "never" if you don't have enough information.
|
I agree. My Dante test realigned to 60.3 Mbp (from 23.6 Mbp), and my dad's Nebula test went from 23.6 Mbp to 60.1 Mbp.
I have seen some Nebula results at or around 45 Mbp, and I think these come from realigning just the Nebula Y BAM. I've read that others have had better luck realigning the complete BAM and then extracting the Y. I don't know enough about this to comment one way or the other, only that this is what I did on the two tests mentioned above.
On Jun 22, 2024, at 1:41 PM, 6458923@... wrote:
On Sat, Jun 22, 2024 at 08:30 PM, vineviz wrote:
"WGS files from Nebula, when aligned to T2T almost never produce coverage of more than 45 mbp." This sentence is completely wrong. My own Nebula 30x WGS result R-Y278111* -- T2T .BAM Nebula Genomics 11X, 59.8 Mbp, 150 bp and several other results I know have readings over 60 million bases. Please don't say "Never" if you don't have enough information.
|
Thank you for your comment. I have two questions: What were the benefits of testing your father? What were the advantages of two T2T WGS tests?
|
Good question. Probably not a good answer: I wanted my parents' WGS to put on a drive. I don't know what the future holds, and it is a record I thought I would be glad to have one day. If not me, then my kids or grandkids.
Of immediate benefit was my father's mtDNA information, and of course his autosomal. The cost was not terribly more than just doing an mtDNA test.
Y-DNA was not the reason for the test. That is a result of playing around with WGS Extract.
I do have my parents' ancestry kits; however, I went ahead and did WGS on them both. I think autosomal is only limited by the computing abilities of the various platforms that host such results. If I am wrong, oh well.
On Jun 22, 2024, at 2:38 PM, 6458923@... wrote:
Thank you for your comment I have two questions What were the benefits of testing your father? What were the advantages of two T2T WGS?
|
There are a lot of potentially iffy comparisons going on in the last few posts on this thread, so I think it's important to step back and think about what these numbers mean. I'm aware that many of the people posting are aware of most of the following, but I think it's important for the benefit of the wider audience - I'll try to pitch this at a fairly intermediate level. We first need to define two terms:
Total number of loci = the number of base pairs that have any reads in a test
Callable loci = the number of base pairs where a mutation can be securely identified or dismissed
The numbers are often very different from each other, and even different estimates of the same number by different individuals or organisations can disagree, because the numbers depend strongly on the quality thresholds being used. Two of the most important are the read quality and the mapping quality. The read quality describes how securely an allele can be called A, C, G or T. The mapping quality describes the probability that that section of DNA has been accurately mapped back onto the reference sequence. It's the mapping quality that is more important for this discussion.
James Kane keeps a set of benchmarks for each test type, which are homogeneously reduced and can therefore be compared directly between tests and companies: https://ydna-warehouse.org/benchmarks
Nebula's 30x 150bp test currently benchmarks at around 15 million callable loci and 23 million total loci when mapped to GRCh38. When a T2T reference is used, this increases to about 16 million callable loci and normally about 45 million total loci, although the total can be anywhere between about 30 and 60 million. Note that many people will not have 60 million base pairs in their entire Y chromosome to begin with.
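The two definitions can be made concrete with a toy sketch. This is not how any company or the Y-DNA Warehouse computes its figures. The `Position` record, the `count_loci` function, and the threshold values are all hypothetical, chosen only to show how the same data yields different counts depending on the thresholds applied.

```python
from dataclasses import dataclass

@dataclass
class Position:
    depth: int          # number of reads covering this base pair
    mean_mapq: float    # average mapping quality of those reads
    mean_baseq: float   # average read (base) quality at this base pair

def count_loci(positions, min_depth=4, min_mapq=30.0, min_baseq=20.0):
    """Return (total loci, callable loci) under the given thresholds.

    Thresholds are illustrative assumptions, not any provider's settings.
    """
    total = sum(1 for p in positions if p.depth > 0)
    callable_loci = sum(
        1 for p in positions
        if p.depth >= min_depth
        and p.mean_mapq >= min_mapq
        and p.mean_baseq >= min_baseq
    )
    return total, callable_loci

toy_positions = [
    Position(12, 60.0, 35.0),  # well covered and uniquely mapped
    Position(9, 3.0, 34.0),    # plenty of reads, but ambiguous mapping
    Position(2, 60.0, 36.0),   # confidently mapped, but too few reads
    Position(0, 0.0, 0.0),     # no reads at all
]

print(count_loci(toy_positions))                # (3, 1)
print(count_loci(toy_positions, min_depth=2))   # (3, 2)
```

Three of the four toy positions count towards the total, but only one is callable at the default thresholds, and relaxing a single threshold changes the callable count without touching the total.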
The Y chromosome contains around 23-24 million base pairs which can be termed the "readable Y". Typically, this is the limit for the total number of loci returned by BigY or WGS tests. If you have more than this total number of loci, then you must be mapping parts of the "non-readable Y", which may include the centromere, DYZ19 and Yq12 regions. These regions are considered non-readable because they contain many repeating sequences, which are all longer than the 100-150 base pair chunks that these tests are split up into. DYZ19 is just about readable in parts because the repeat length (125bp) is close to the read length. But the centromere and Yq12 regions contain many repeats of identical sequences that span many hundreds or even thousands of base pairs. There is no way you can take a read from a short-read test and accurately map it back onto the long repeats in the centromere and Yq12.
Any reads from a 100bp or 150bp WGS test that are recovered beyond about 24 million base pairs are effectively useless, because they cannot be accurately mapped back to a place on the chromosome, even if the alignment software suggests they can. Your map of the centromere and Yq12 regions may look very different from the T2T reference sequence due to large-scale mutations between haplogroups, which we simply don't yet know enough about, and the mapping software can easily mistake one SNP for another in such circumstances. For example, you might be able to state that there may(!) be a SNP somewhere in the DYZ3 region, but you won't be able to say whether it's a real SNP or just a bunching up of bad reads, or what the true location of the SNP is within the DYZ3 region, or whether it's phylogenetically identical to the same SNP in a different test (since it could be on a different repeat).
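The mapping problem can be seen in a toy sketch using exact string matching on a synthetic reference. This is not a real aligner, and the sequence lengths (a 400bp repeat unit copied ten times, a 150bp read) are assumptions picked for illustration rather than real Y-chromosome values.

```python
import random

def find_all(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits

rng = random.Random(0)  # fixed seed so the toy reference is reproducible

def random_dna(length):
    return "".join(rng.choice("ACGT") for _ in range(length))

flank = random_dna(500)          # unique sequence
repeat_unit = random_dna(400)    # repeat unit longer than a 150bp read
reference = flank + repeat_unit * 10 + random_dna(500)

read_length = 150
read_from_flank = reference[100:100 + read_length]
# A read taken from inside the fourth copy of the repeat unit.
repeat_start = len(flank) + 3 * len(repeat_unit) + 100
read_from_repeat = reference[repeat_start:repeat_start + read_length]

hits_flank = find_all(reference, read_from_flank)
hits_repeat = find_all(reference, read_from_repeat)
print(len(hits_flank), len(hits_repeat))  # 1 10
```

The read from the unique flank has exactly one possible origin, while the read from the repeat array fits equally well in every copy, so any position an aligner reports for it is essentially arbitrary.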
This brings us to callable loci. If you can accurately map several reads back to a reference chromosome, and if all those reads pass certain quality thresholds, then you can call whether or not a SNP exists in that location. The very best that's achievable with current 100 or 150 base pair read technology is about 23-24 million base pairs but, in practice, the limit is normally closer to 14-16 million base pairs. This is the useful part of the test and the real number that matters (unless you plan on combining multiple tests together). You cannot expect to go any higher than this without increasing the read length.
So it's the callable loci that matter for almost every genetic genealogy application. We won't be able to make meaningful use of these extra tens of millions of non-callable loci without long-read technology like T2T and a much better understanding of the structural variation of the Y chromosome on large scales across different haplogroups.
When that becomes available commercially, it will be an incremental addition for most people. All most people can expect are a bunch of extra SNPs within many of their haplogroups. That will help revise TMRCAs and hugely reduce the uncertainties, but it's not going to be a game-changer like BigY was when it arrived. The people it's going to be most useful for are those with recent surname problems - people who really need to squeeze every SNP and structural variant out of a test to separate individual generations in a genetic family tree. If the average interval between mutations can be brought substantially below one generation (instead of the current 83 years/SNP), then we can start to say a relationship between two testers might be (e.g.) two to four generations beyond their earliest known ancestor, rather than the swathes of centuries that current TMRCA estimates provide. There's real application there, but a large part of it is limited to a subset of people who have already taken BigY or similar tests.
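The arithmetic behind that last point can be sketched as follows, treating mutations as a Poisson process. Only the 83 years/SNP figure comes from the text above. The 30-year generation length and the 10 years/mutation long-read figure are assumptions made purely for illustration.

```python
import math

YEARS_PER_SNP_NOW = 83.0        # figure quoted above
YEARS_PER_GENERATION = 30.0     # assumed generation length (illustrative)

def mutations_per_generation(years_per_mutation,
                             years_per_generation=YEARS_PER_GENERATION):
    """Average number of detectable mutations per generation."""
    return years_per_generation / years_per_mutation

def prob_silent_generation(years_per_mutation,
                           years_per_generation=YEARS_PER_GENERATION):
    """Poisson probability that a generation carries no detectable mutation."""
    mean = mutations_per_generation(years_per_mutation, years_per_generation)
    return math.exp(-mean)

scenarios = [
    ("current short-read test", YEARS_PER_SNP_NOW),
    ("hypothetical long-read test", 10.0),  # assumed value, not a real benchmark
]
for label, years in scenarios:
    print(f"{label}: {mutations_per_generation(years):.2f} mutations per "
          f"generation, {prob_silent_generation(years):.0%} of generations "
          f"with none")
```

With these assumptions, the current rate gives roughly 0.36 mutations per generation, so about 70% of generations leave no SNP to mark them. Individual generations only become separable once the expected count per generation rises well above one.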
- Iain.
|
What Iain is missing about long-read technology is that there will be additional STRs reported. The current short-read NGS technology is actually worse than the original Sanger technology at identifying STRs and reporting on those with longer/larger motifs. FTDNA took a step backwards in reporting STRs when they moved from Sanger to NGS-based techniques. Long-read technologies will also provide better insight into structural deletions and rearrangements, which NGS may not correctly identify or report on.
Wayne
On Saturday, June 22, 2024 at 03:57:04 PM EDT, Iain via groups.io <gubbins@...> wrote:
There are a lot of potentially iffy comparisons going on in the last few posts on this thread, so I think it's important to step back and think about what these numbers mean. [...]
|
On Sat, Jun 22, 2024 at 09:57 PM, Iain wrote:
Thank you for your reply. We are far away from creating a complete T2T or de novo genome assembly. There is currently not a single company in the world offering telomere-to-telomere (T2T) sequencing. To create a real T2T sequence like hs1/CP086569.2, you need to use 18 flow cells of nanopore sequencing to get 166x coverage, plus many high-quality short-read genomes from a specially grown cell line using a special Pore-C protocol. They also used a couple of PacBio HiFi WGS runs. This is out of reach for most citizen scientists. YSEQ is taking its first baby steps towards making at least long-read nanopore sequencing available. How does the Nebula WGS 100x help for T2T alignment?
|