Re: Should I upgrade to Y37- BIGY Y700 ?

Hi Robert,

I have found when working with autosomal DNA that it is much easier to sort matches using my mother's results (over 5000 matches) and my father's (over 7000 matches) than using my own DNA (just over 700 matches). Having both parents' results also gives me my dad's mtDNA line as well as my mother's.

I also discovered that my mother and one of my sisters match my father-in-law and one of his daughters at the 4th-to-distant-cousin level. The connection would have been in central Europe in the early 1800s (Germany or the Netherlands). I can't quite find the common ancestor, but I do know the surname in common.

On Saturday, June 22, 2024 at 02:32:40 PM CDT, Robert McMillan via groups.io <tensawmac@...> wrote:

I agree. My Dante test went from 23.6 Mbp to 60.3 Mbp after realignment.

My dad’s Nebula test went from 23.6 Mbp to 60.1 Mbp.

I have seen some Nebula results at or around 45 Mbp, and I think these come from Nebula realigning just the Y BAM. I’ve read that others have had better luck realigning the complete BAM and then extracting the Y. I don’t know enough about this to comment one way or the other, only that this is what I did on the two tests mentioned above.

Robert McMillan

On Jun 22, 2024, at 1:41 PM, 6458923@... wrote:

On Sat, Jun 22, 2024 at 08:30 PM, vineviz wrote:

"WGS files from Nebula, when aligned to T2T almost never produce coverage of more than 45 mbp." This sentence is completely wrong. My own Nebula 30x WGS result R-Y278111* -- T2T .BAM Nebula Genomics 11X, 59.8 Mbp, 150 bp and several other results I know have readings over 60 million bases. Please don't say "Never" if you don't have enough information.

Re: Should I upgrade to Y37- BIGY Y700 ?

#8146

On Sat, Jun 22, 2024 at 4:43 PM Wayne via <dna_wayne=[email protected]> wrote:

What Iain is missing about long-read technology is that there will be additional STRs reported. The current short-read NGS technology is actually worse than the original Sanger technology at identifying STRs and reporting on those with longer/larger motifs. FTDNA took a step backwards in reporting STRs when they moved from Sanger to NGS-based techniques. Long-read technologies will also provide better insight into sequence structural deletions and rearrangements, which NGS may not correctly identify or report on.

Wayne

Re: Should I upgrade to Y37- BIGY Y700 ?

#8145

Hi Wayne - I was lumping STRs in with structural variants, since I think the applicability will be the same. Sure, we should be able to get the "original 111" without a separate test, but that's not a visible success to the end user unless it positively affects the pricing point. Otherwise the additional STRs are really only likely to create significant benefit from moderate improvements to the TMRCAs and splitting haplogroups at the generational level for the few who need that. That and, after nearly ten years, we still haven't got full benefit from the additional STRs beyond the first 111.

- Iain.

Re: Should I upgrade to Y37- BIGY Y700 ?

#8144

On Sat, Jun 22, 2024 at 09:57 PM, Iain wrote:

Thank you for your reply. We are far away from creating a complete T2T or de novo genome assembly. There is currently not a single company in the world offering telomere-to-telomere (T2T) sequencing. To create a real T2T sequence like hs1/CP086569.2, you need 18 flow cells of nanopore sequencing to get 166x coverage, plus many high-quality short-read genomes from a specially grown cell line using a special Pore-C protocol. On top of that, they used a couple of PacBio HiFi WGS runs. This is out of reach for most citizen scientists. YSEQ is taking the first baby steps to make at least long-read nanopore sequencing available. How does the Nebula WGS 100x help for T2T alignment?

Re: Should I upgrade to Y37- BIGY Y700 ?

#8143

What Iain is missing about long-read technology is that there will be additional STRs reported. The current short-read NGS technology is actually worse than the original Sanger technology at identifying STRs and reporting on those with longer/larger motifs. FTDNA took a step backwards in reporting STRs when they moved from Sanger to NGS-based techniques. Long-read technologies will also provide better insight into sequence structural deletions and rearrangements, which NGS may not correctly identify or report on.

Wayne

On Saturday, June 22, 2024 at 03:57:04 PM EDT, Iain via groups.io <gubbins@...> wrote:

There are a lot of potentially iffy comparisons going on in the last few posts on this thread, so I think it's important to step back and think about what these numbers mean. I'm aware that many of the people posting are aware of most of the following, but I think it's important for the benefit of the wider audience - I'll try to pitch this at a fairly intermediate level. We first need to define two terms:

Total number of loci = the number of base pairs that have any reads in a test

Callable loci = the number of base pairs where a mutation can be securely identified or dismissed

The numbers are often very different from each other, and even between different estimates of the same number by different individuals/organisations, because the numbers depend strongly on the quality thresholds that are being used by individuals or companies. Two of the most important are the read quality and the mapping quality. The read quality says the security with which an allele can be called A, C, G or T. The mapping quality says the probability that that section of DNA has been accurately mapped back onto the reference sequence. It's the mapping quality that is more important for this discussion. James Kane keeps a set of benchmarks from each test type here, which are homogeneously reduced and therefore can be directly compared between tests and companies:
https://ydna-warehouse.org/benchmarks
Nebula's 30x 150bp test currently benchmarks at around 15 million callable loci and 23 million total loci when mapped to GRCh38. When a T2T reference is used, this increases to about 16 million callable loci and normally about 45 million total loci, but can be between about 30 and 60 million. Note that many people will not have 60 million base pairs in their entire Y chromosome to begin with.
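For anyone unfamiliar with the read and mapping qualities mentioned above: in BAM files both are conventionally stored on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of the conversion follows. The function names are made up for illustration, and the scores printed are only examples, not the thresholds any particular company uses.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call (or a mapping) with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Example scores only: higher scores mean exponentially fewer expected errors.
for q in (10, 20, 30, 60):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q)):,}")
```

So a read with mapping quality 3 is more likely misplaced than not, while a read with mapping quality 60 is expected to be misplaced about once in a million times, at least according to the aligner's own model.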

The Y chromosome contains around 23-24 million base pairs which can be termed the "readable Y". Typically, this is the limit for the total number of loci returned by BigY or WGS tests. If you have more than this total number of loci, then you must be mapping parts of the "non-readable Y", which may include the centromere, DYZ19 and Yq12 regions. These regions are considered non-readable because they contain many repeating sequences, which are all longer than the 100-150 base pair chunks that these tests are split up into. DYZ19 is just about readable in parts because the repeat length (125bp) is close to the read length. But the centromere and Yq12 regions contain many repeats of identical sequences that span many hundreds or even thousands of base pairs. There is no way you can take a read from a short-read test and accurately map it back onto the long repeats in the centromere and Yq12.

Any reads from a 100bp or 150bp WGS test that are recovered beyond about 24 million base pairs are effectively useless, because they cannot be accurately mapped back to a place on the chromosome, even if the alignment software suggests they can - your map of the centromere and Yq12 regions may look very different from the T2T reference sequence due to large-scale mutations between haplogroups, which we simply don't yet know enough about, and the mapping software can easily conflate one SNP for another in such circumstances. For example, you might be able to state that there may(!) be a SNP somewhere in the DYZ3 region, but you won't be able to say whether it's a real SNP or just a bunching up of bad reads, or what the true location of the SNP is within the DYZ3 region, or whether it's phylogenetically identical to the same SNP in a different test (since it could be on a different repeat).
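The mapping problem described above can be shown with a toy example. This is only a sketch: the sequences are invented, the "reference" is a few hundred bases rather than a chromosome, and exact string matching stands in for a real aligner. It shows how a read that lies wholly inside a repeat array fits equally well in many places, while a read that overlaps unique flanking sequence has one placement.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Invented toy reference: unique flank + 50 copies of a repeat unit + unique flank.
left_flank = "ACGTTGCAAGCTTAGC"
repeat_unit = "GATTACAT"
right_flank = "CCGGATATCGGCTAAC"
reference = left_flank + repeat_unit * 50 + right_flank

read_len = 12
# A read that falls entirely inside the repeat array.
read_in_repeat = (repeat_unit * 3)[:read_len]
# A read that straddles the boundary between the left flank and the array.
edge = len(left_flank)
read_on_edge = reference[edge - 6 : edge + 6]

print("placements for read inside repeat:", len(find_all(reference, read_in_repeat)))
print("placements for read on the edge:  ", len(find_all(reference, read_on_edge)))
```

A real aligner still has to report one position for the in-repeat read, so it picks one and gives it a low mapping quality. That is why coverage in such regions adds to the total loci without adding to the callable loci.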

This brings us to callable loci. If you can accurately map several reads back to a reference chromosome, and if all those reads pass certain quality thresholds, then you can call whether or not a SNP exists in that location. The very best that's achievable with current 100bp or 150bp read technology is about 23-24 million base pairs but, in practice, the limitation is normally closer to 14-16 million base pairs. This is the useful part of the test and the real number that matters (unless you plan on combining multiple tests together). You cannot expect to go any higher than this without increasing the read length.
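The difference between the two counts can be sketched in a few lines. The thresholds here (minimum depth 4, mapping quality 30, base quality 20) are placeholders chosen for illustration, not the values used by any company or by the benchmarks linked above, and the `Site` records are invented data.

```python
from dataclasses import dataclass


@dataclass
class Site:
    depth: int    # number of reads covering this base
    mapq: float   # mean mapping quality of those reads (Phred scale)
    baseq: float  # mean base quality at this position (Phred scale)


def count_loci(sites, min_depth=4, min_mapq=30.0, min_baseq=20.0):
    """Return (total loci, callable loci) under the given hypothetical thresholds."""
    total = sum(1 for s in sites if s.depth > 0)
    callable_loci = sum(
        1
        for s in sites
        if s.depth >= min_depth and s.mapq >= min_mapq and s.baseq >= min_baseq
    )
    return total, callable_loci


toy_sites = [
    Site(depth=32, mapq=60.0, baseq=35.0),  # well covered, confidently mapped
    Site(depth=28, mapq=58.0, baseq=34.0),  # well covered, confidently mapped
    Site(depth=40, mapq=3.0, baseq=35.0),   # deep coverage but ambiguous mapping
    Site(depth=2, mapq=60.0, baseq=30.0),   # too few reads to call
    Site(depth=0, mapq=0.0, baseq=0.0),     # no reads at all
]

total, callable_loci = count_loci(toy_sites)
print(f"total loci: {total}, callable loci: {callable_loci}")
```

Changing the thresholds changes the callable count but not the total, which is one reason two people can quote quite different numbers for the same test.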

So it's the callable loci that matter for almost every genetic genealogy application. We won't be able to make meaningful use of these extra tens of millions of non-callable loci without long-read technology like T2T and a much better understanding of the structural variation of the Y chromosome on large scales across different haplogroups.

When that becomes available commercially, it will be an incremental addition for most people. All most people can expect are a bunch of extra SNPs within many of their haplogroups. That will help revise TMRCAs and hugely reduce the uncertainties, but it's not going to be a game-changer like BigY was when it arrived. The people it's going to be most useful for are those with recent surname problems - people who really need to squeeze every SNP and structural variant out of a test to separate individual generations in a genetic family tree. If the average interval between mutations can be brought substantially below one generation (instead of the current 83 years/SNP), then we can start to say a relationship between two testers might be (e.g.) two to four generations beyond their earliest known ancestor, rather than the swathes of centuries that current TMRCA estimates provide. There's real application there, but a large part of it is limited to a subset of people who have already taken BigY or similar tests.
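As a rough back-of-envelope illustration of the years/SNP point, here is a sketch that rests on two loudly stated assumptions: that the 83 years/SNP figure corresponds to roughly 15 million callable loci (the post does not say which coverage it refers to), and that any newly callable regions would mutate at the same per-base rate as the current ones (which may well not hold for repetitive regions). The 45 million figure is purely hypothetical.

```python
CURRENT_YEARS_PER_SNP = 83.0        # figure quoted in the post
ASSUMED_CALLABLE_LOCI = 15_000_000  # assumption: coverage behind that figure


def per_base_rate(years_per_snp: float, callable_loci: int) -> float:
    """Implied mutations per base pair per year."""
    return 1.0 / (years_per_snp * callable_loci)


def years_between_snps(rate: float, callable_loci: int) -> float:
    """Average years between SNPs for a given per-base rate and coverage."""
    return 1.0 / (rate * callable_loci)


rate = per_base_rate(CURRENT_YEARS_PER_SNP, ASSUMED_CALLABLE_LOCI)
print(f"implied rate: {rate:.2e} mutations per base pair per year")

# Hypothetical: a long-read test with three times as many callable loci.
hypothetical_loci = 45_000_000
print(f"years per SNP: {years_between_snps(rate, hypothetical_loci):.1f}")
```

Under those assumptions, tripling the callable loci would cut the interval to roughly 28 years per SNP, which is about one generation. STRs and structural variants would shorten it further, but by how much is not something this sketch can estimate.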

- Iain.

Re: Should I upgrade to Y37- BIGY Y700 ?

#8142

I understand, thank you

Re: Should I upgrade to Y37- BIGY Y700 ?

#8141