There are a lot of potentially iffy comparisons going on in the last few posts on this thread, so I think it's important to step back and think about what these numbers mean. I know that many of the people posting already understand most of the following, but I think it's worth spelling out for the benefit of the wider audience - I'll try to pitch this at a fairly intermediate level. We first need to define two terms:
Total number of loci = the number of base pairs that have any reads in a test
Callable loci = the number of base pairs where a mutation can be securely identified or dismissed
The two numbers are often very different from each other, and estimates of the same number can differ between individuals and organisations, because both depend strongly on the quality thresholds being used. Two of the most important are the read quality and the mapping quality. The read quality describes how securely a base in a read can be called A, C, G or T. The mapping quality describes how likely it is that that section of DNA has been accurately mapped back onto the reference sequence. It's the mapping quality that is more important for this discussion. James Kane keeps a set of benchmarks from each test type here, which are homogeneously reduced and therefore can be directly compared between tests and companies:
Nebula's 30x 150bp test currently benchmarks at around 15 million callable loci and 23 million total loci when mapped to GRCh38. When a T2T reference is used, this increases to about 16 million callable loci and normally about 45 million total loci, although the total can be anywhere between about 30 and 60 million. Note that many people will not have 60 million base pairs in their entire Y chromosome to begin with.
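To put those benchmark figures side by side, here is a small Python sketch that works out what fraction of the total is actually callable under each reference. The numbers are just the approximate values quoted above, and the names (`benchmarks`, `callable_fraction`) are made up for illustration.

```python
# Approximate benchmark figures quoted above, in millions of base pairs.
# These are rounded values from the post, not fresh measurements.
benchmarks = {
    "GRCh38": {"callable": 15, "total": 23},
    "T2T": {"callable": 16, "total": 45},
}


def callable_fraction(callable_mbp: float, total_mbp: float) -> float:
    """Fraction of the covered base pairs that are also callable."""
    return callable_mbp / total_mbp


for reference, figures in benchmarks.items():
    fraction = callable_fraction(figures["callable"], figures["total"])
    print(f"{reference}: {figures['callable']}M callable of "
          f"{figures['total']}M total ({fraction:.0%})")

# How much each number moves when switching reference.
extra_total = benchmarks["T2T"]["total"] - benchmarks["GRCh38"]["total"]
extra_callable = benchmarks["T2T"]["callable"] - benchmarks["GRCh38"]["callable"]
print(f"Switching to T2T adds about {extra_total}M total "
      f"but only about {extra_callable}M callable")
```

On these figures the total roughly doubles while the callable count barely moves, which is the pattern the rest of this post tries to explain.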
The Y chromosome contains around 23-24 million base pairs which can be termed the "readable Y". Typically, this is the limit for the total number of loci returned by BigY or WGS tests. If you have more than this total number of loci, then you must be mapping parts of the "non-readable Y", which may include the centromere, DYZ19 and Yq12 regions. These regions are considered non-readable because they contain many repeating sequences that are as long as, or longer than, the 100-150 base pair reads that these tests split the DNA into. DYZ19 is just about readable in parts because the repeat length (125bp) is close to the read length. But the centromere and Yq12 regions contain many repeats of identical sequences that span many hundreds or even thousands of base pairs. There is no way you can take a read from a short-read test and accurately map it back onto the long repeats in the centromere and Yq12.
Any reads from a 100bp or 150bp WGS test that are recovered beyond about 24 million base pairs are effectively useless, because they cannot be accurately mapped back to a place on the chromosome, even if the alignment software suggests they can. Your own centromere and Yq12 regions may look very different from the T2T reference sequence due to large-scale mutations between haplogroups, which we simply don't yet know enough about, and the mapping software can easily mistake one SNP for another in such circumstances. For example, you might be able to state that there may(!) be a SNP somewhere in the DYZ3 region, but you won't be able to say whether it's a real SNP or just a bunching up of bad reads, or what the true location of the SNP is within the DYZ3 region, or whether it's phylogenetically identical to the same SNP in a different test (since it could be on a different repeat).
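The repeat problem can be shown with a toy example. The Python sketch below builds a made-up "chromosome" out of random unique sequence plus a block of identical repeats, then counts how many places a 150bp read could have come from. Everything here is hypothetical (the sequence is random, the 300bp repeat unit and 20 copies are arbitrary choices, and real aligners use far more sophisticated matching than exact string search), but the ambiguity it shows is the same one described above.

```python
import random


def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def random_dna(length: int, rng: random.Random) -> str:
    """Generate a random string of A, C, G and T of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)  # fixed seed so the toy example is repeatable

# Hypothetical layout: unique flank, 20 identical copies of a 300bp unit, unique flank.
flank_len, unit_len, n_copies, read_len = 500, 300, 20, 150
repeat_unit = random_dna(unit_len, rng)
reference = (random_dna(flank_len, rng)
             + repeat_unit * n_copies
             + random_dna(flank_len, rng))

# A read taken from the unique flank, and one taken from inside the 4th repeat copy.
unique_read = reference[100:100 + read_len]
repeat_start = flank_len + 3 * unit_len + 50
repeat_read = reference[repeat_start:repeat_start + read_len]

print("unique read placements:", len(find_all(reference, unique_read)))
print("repeat read placements:", len(find_all(reference, repeat_read)))
```

The read from the unique flank has exactly one possible home, while the read from inside the repeat fits equally well in every one of the 20 copies. A SNP seen in that second read could belong to any of them, and a longer read is the only thing that would tie it down.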
This brings us to callable loci. If you can accurately map several reads back to a reference chromosome, and if all those reads pass certain quality thresholds, then you can call whether or not a SNP exists in that location. The very best that's achievable with current 100 or 150 base pair read technology is about 23-24 million base pairs but, in practice, the limit is normally closer to 14-16 million base pairs. This is the useful part of the test and the real number that matters (unless you plan on combining multiple tests together). You cannot expect to go any higher than this without increasing the read length.
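As a rough sketch of how the two counts come apart, the Python below treats each position as a pile of reads, each with a base quality and a mapping quality, and calls a position "callable" only if enough reads pass both thresholds. Both qualities are conventionally Phred-scaled (Q = -10 log10 of the error probability). The threshold values, the minimum depth and all the names here are assumptions for illustration only, not the settings used by any particular company or benchmark.

```python
from dataclasses import dataclass


@dataclass
class ReadBase:
    """One read's contribution at a single position (toy model)."""
    base_quality: int     # Phred-scaled confidence in the A/C/G/T call
    mapping_quality: int  # Phred-scaled confidence in where the read was placed


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality into an error probability."""
    return 10 ** (-q / 10)


def is_callable(pileup: list[ReadBase], min_base_q: int = 20,
                min_map_q: int = 30, min_depth: int = 4) -> bool:
    """Callable if at least min_depth reads pass both (hypothetical) thresholds."""
    good = sum(1 for r in pileup
               if r.base_quality >= min_base_q and r.mapping_quality >= min_map_q)
    return good >= min_depth


def summarise(pileups: dict[int, list[ReadBase]]) -> tuple[int, int]:
    """Return (total loci with any reads, callable loci)."""
    total = sum(1 for reads in pileups.values() if reads)
    callable_count = sum(1 for reads in pileups.values() if is_callable(reads))
    return total, callable_count


# Four made-up positions: well covered, covered only by poorly mapped reads
# (as in a long repeat), thinly covered, and not covered at all.
pileups = {
    1: [ReadBase(35, 60)] * 5,
    2: [ReadBase(35, 0)] * 5,
    3: [ReadBase(35, 60)] * 2,
    4: [],
}

total, callable_count = summarise(pileups)
print(f"total loci: {total}, callable loci: {callable_count}")
print(f"mapping quality 30 means a {phred_to_error_prob(30):.3f} chance of misplacement")
```

Position 2 is the interesting one: it has plenty of reads and so counts towards the total, but none of them can be trusted to be in the right place, so it adds nothing to the callable count.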
So it's the callable loci that matter for almost every genetic genealogy application. We won't be able to make meaningful use of these extra tens of millions of non-callable loci without long-read technology (of the kind used to build the T2T reference) and a much better understanding of the structural variation of the Y chromosome on large scales across different haplogroups.
When that becomes available commercially, it will be an incremental addition for most people. All most people can expect are a bunch of extra SNPs within many of their haplogroups. That will help revise TMRCAs and hugely reduce the uncertainties, but it's not going to be a game-changer like BigY was when it arrived. The people it's going to be most useful for are those with recent surname problems - people who really need to squeeze every SNP and structural variant out of a test to separate individual generations in a genetic family tree. If the average time between mutations can be brought substantially below one generation (instead of the current 83 years/SNP), then we can start to say a relationship between two testers might be (e.g.) two to four generations beyond their earliest known ancestor, rather than the spans of centuries that current TMRCA estimates provide. There's real application there, but a large part of it is limited to a subset of people who have already taken BigY or similar tests.
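To see how far 83 years/SNP is from resolving individual generations, here is a back-of-the-envelope calculation in Python. The generation length is an assumption (30 years is used purely as a round figure, and you can swap in your own), and the function names are made up for illustration.

```python
def generations_per_snp(years_per_snp: float, years_per_generation: float) -> float:
    """Average number of generations that pass between SNPs on one lineage."""
    return years_per_snp / years_per_generation


def snps_per_generation(years_per_snp: float, years_per_generation: float) -> float:
    """Average number of new SNPs expected per generation on one lineage."""
    return years_per_generation / years_per_snp


YEARS_PER_SNP = 83          # current figure quoted in the post
YEARS_PER_GENERATION = 30   # assumed round figure, not a measured value

gens = generations_per_snp(YEARS_PER_SNP, YEARS_PER_GENERATION)
rate = snps_per_generation(YEARS_PER_SNP, YEARS_PER_GENERATION)
print(f"about {gens:.1f} generations per SNP, or {rate:.2f} SNPs per generation")

# To average at least one SNP per generation, the interval between SNPs has to
# drop to the generation length or below.
improvement_needed = YEARS_PER_SNP / YEARS_PER_GENERATION
print(f"the interval between SNPs would need to shrink by a factor of about "
      f"{improvement_needed:.1f}")
```

Under that assumed generation length, most father-to-son steps currently carry no new SNP at all, which is why individual generations can't yet be separated reliably.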
- Iain.