© 2025 Groups.io

Re: #AncientDNA - Merovingian period in Flanders (Belgium) #AncientDNA

 

Hi Bertram, all,

A new paper concerning this study and these aDNA samples was recently published (on PubMed).

The 5 aDNA samples identified as belonging to R-U106 were confirmed by FTDNA, with more refined haplogroups.

For example, KOS015 (Koksijde 15), previously identified as R-FGC17465, is now identified by FTDNA as R-FGC17460. That moves the average TMRCA from 1500 BCE to 200 BCE (if we consider KOS015 positive for the 11 equivalent SNPs of this haplogroup).

BAM/FastQ files are available from the European Nucleotide Archive under project PRJEB70768.

Cheers,

Ewenn

On Sat, 20 May 2023 at 16:13, Bertram de Verdun via <u106verdun=[email protected]> wrote:
Below is a link to a paper about two burial sites from the Merovingian period in Belgium, close to the French border.

The paper is in Flemish, but we can see that several individuals belong to U106.

Regards

Bertram


Interesting article about Yamnaya and certain diseases

 

https://www.msn.com/en-gb/health/medical/ancient-dna-traces-multiple-sclerosis-origins-to-5-000-year-old-migrations/ar-AA1mVKJR

Dan D


Re: Family Finder haplogroups

 

"Apparently, the number of BigY tests processed by FTDNA has been a surprisingly constant 47 tests per day for the last three years."

BigY testing had a nice uptick last "month".

June 5: 111,760
July 7: 113,917

That is 2,157 results over that 32-day "month", which averages out to 67.4 results per day.

No new results have posted since July 7, but FTDNA has been dealing with severe storm difficulties since last Sunday or Monday.

It is too early for this to come from Father's Day sales, but perhaps it is a boost from DNA Day or RootsTech sales.
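
For anyone who wants to re-check the arithmetic above, here is a minimal Python sketch. The two cumulative totals are the ones quoted in this post; the year 2024 is an assumption (the post gives only day and month), and the variable names are purely illustrative.

```python
from datetime import date

# Cumulative BigY result counts quoted in the post.
# The year (2024) is assumed; the post only gives day and month.
start_day, start_total = date(2024, 6, 5), 111_760
end_day, end_total = date(2024, 7, 7), 113_917

new_results = end_total - start_total   # results posted during the window
days = (end_day - start_day).days       # length of the window in days
per_day = new_results / days            # average results per day

print(f"{new_results} results over {days} days = {per_day:.1f} per day")
```

The same three lines of arithmetic can be rerun with any later pair of totals to see whether the uptick holds.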


The untold story of the Human Genome Project: How one man's DNA became a pillar of genetics

 

An interesting article raising ethical concerns about the early days of the Human Genome Project:

Debbie Kennett


Resolving the source of branch length variation in the Y chromosome phylogeny

 

A new preprint: "Resolving the source of branch length variation in the Y chromosome phylogeny"

Debbie Kennett


Re: General DNA

 

I'd quibble with the use of the term "descendants" in the title. The deceased may be collateral relatives of GW, but GW has no documented descendants...


On Jul 6, 2024, at 22:54, Linda Wheaton via groups.io <lbucher@...> wrote:

Interesting,
Through my paper trail, George Steptoe Washington, the nephew, is my 4th cousin five times removed through my Dabney/Maupin connection. I think MANY, MANY people are descended from some of those families in the early USA.

On Saturday, July 6, 2024 at 04:35:47 PM CDT, Al <alholdcroft@...> wrote:


Possibly of wider DNA interest to American members -

DNA Study IDs Descendants of George Washington from Unmarked Remains, Findings to Aid Service Member IDs Going Back to World War II

https://www.sciencedaily.com/releases/2024/03/240328111048.htm

You can thank the remarkable developments in DNA analysis, and Major Ferguson and his Light Infantry lads at Brandywine in 1777, lying in cover in long grass, who opted not to dishonourably shoot a general officer wearing an unfeasibly large hat.

Al


Re: Y111 STR counts for U106

 

It's been very interesting to look at the STR frequencies.

I have over 5000 matches at Y12 but only a few at Y37 (other than several known Y 5th cousins). I've wondered why there is such a steep decline, and from the frequencies I can see why.

I extracted a subset of frequencies for R-S1774, which is below Z30>Z8 (Group 77B). There are 99 men in this sample.

I have no unusual markers in the first 12, and only one in the first 25. But from STRs 26 to 37, I get hammered. Out of the 10 single-copy markers, I have three low-frequency markers, two of which are very low (IMO).

The following is based upon the frequencies of the R-S1774 haplotypes (group 77B).
  • DYS437: my value occurs only 3 times out of 99 (Y25 group)
  • Y-GATA-H4: my value occurs only 4 times out of 99
  • DYS456: my value occurs 18 times out of 99
  • DYS570: my value occurs 8 times out of 99

In addition, my pair of values for the multi-copy marker CDY has a frequency of approximately 1 in 10 (for the whole U106 database). The same 1 in 10 frequency applies to my 4 values for DYS464.

By the time I go through the first 37 markers, I already have 6 low-frequency values. That sets up a challenge for matching.

My STRs from 38 to 111 are less volatile, with only 4 markers out of the norm. Sometimes I get matches at Y67 and Y111 from men who are AWOL at Y37.

Three of the six low-frequency values were flagged by FTDNA as red fast-moving markers.

This has been a good exercise in Excel. I don't mean to imply that any of the above supersedes BigY SNP testing.



Re: Y111 STR counts for U106

 

Way back in ancient DNA times, DYF371X results did appear to support some branching under Z326. Overlaying DYF371X results on top of the current branch structure would be an interesting exercise. I also wonder whether relevant DYF371X branching is observed in some of the well-tested lines such as Cecil. I have not done DYF371X across my broad paternal line.

Wayne

On Monday, July 1, 2024 at 12:51:05 PM EDT, Myles Twete <matwete@...> wrote:


Thanks Martin.

And I'm glad you included the ZERO counts. On my leg of U106, we are all DYS425=0, which evidently formed about 1500 BCE with the Z326 haplogroup.

But that ZERO itself doesn't tell the actual story (at least as regards DYS425). I tested my father's sample at YSEQ for DYF371X (DYS425 is a subset) and it appears that my father and I are DYS425=10c.

Our sample: DYF371X 10c-10c-13c-14c

Most common DYF371 haplotype in R1b is 10c-12t-13c-14c

You see, in Y67/Y111/etc., DYS425 only reports an STR count if the repeats are type "T"; if instead it flipped to "C" at some point, the test does not read or report it.

Of the 297 "Zero" samples for DYS425 in the database, I'd guess that most or all of them have repeats that are "12c", "11c" or "10c".

We are currently testing 3 more samples in our tree for DYF371X: one with an MRCA at about 500 ybp, another at perhaps 1000 ybp and another at about 1500 ybp.

Since we see a few DYS425=12 in our Project (Cecil-Cessill), I suspect that the change that happened around Z326 simply flipped from 12t to 12c, then over time on our part of the tree mutated further to 11c, then 10c. Or maybe some other scenario...

If any of you are also under Z326 and see a DYS425=0, you might also consider testing for DYF371X.

There are 10x more DYS425=0 than any other result except DYS425=12, which is about 16x the number who have zero as the result.

For sure, the STR difference counts reported for sample comparisons under Z326 are ignoring any mutations at DYS425 unless they back-mutate to "t".

This means that while we might think we have a GD:1 match at Y67 with someone, the actual number could be GD:2; it's just that the second one is masked by this ZERO.
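
To make the masking effect concrete, here is a small Python sketch. It assumes a simple step-wise genetic distance (sum of absolute repeat differences), which is only an approximation of how match distances are actually computed, and all marker values below are invented for illustration, including the two C-type repeat counts.

```python
def step_distance(a, b):
    """Sum of absolute repeat-count differences over the markers in `a`."""
    return sum(abs(a[marker] - b[marker]) for marker in a)

# Reported panel values: both men show the null (0) at DYS425.
# All values are made up for illustration.
reported_1 = {"DYS393": 13, "DYS390": 23, "DYS425": 0}
reported_2 = {"DYS393": 13, "DYS390": 24, "DYS425": 0}

# Hypothetical underlying C-type repeat counts, as a DYF371X test might show.
underlying_1 = dict(reported_1, DYS425=10)
underlying_2 = dict(reported_2, DYS425=11)

gd_reported = step_distance(reported_1, reported_2)        # null hides the difference
gd_underlying = step_distance(underlying_1, underlying_2)  # difference now counted

print("GD from reported values:  ", gd_reported)
print("GD from underlying values:", gd_underlying)
```

With these toy values the reported comparison gives GD 1 while the underlying repeats give GD 2, which is the situation described above.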

Not all zeroes are the same...

Thanks!

-Myles

From: [email protected] [mailto:[email protected]] On Behalf Of Martin Abrams via groups.io
Sent: Sunday, June 30, 2024 9:04 PM
To: [email protected]
Subject: [R1b-U106] Y111 STR counts for U106

I needed a project to re-teach myself Excel. I used to be pretty good at Excel, but that was 15 years ago and I have not really touched it for the last 10 years.

For my re-teaching "moment" (hopefully of interest to some people):
I have captured all the STRs in the U106 project. I then counted how many times each value occurs for each of the 111 STRs (except the multi-copy STRs).

Example, the first STR, DYS393:

Total 7240

11 2
12 192
13 6659
14 338
15 49
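
For readers who would rather do the same tally outside Excel, a minimal Python sketch follows. The rows are a made-up stand-in for the project export (the real data has 7000-plus rows), and `value_counts` is a hypothetical helper name, not part of any FTDNA tool.

```python
from collections import Counter

# Toy rows standing in for the project export; the values are invented.
rows = [
    {"DYS393": 13, "DYS390": 23},
    {"DYS393": 13, "DYS390": 24},
    {"DYS393": 12, "DYS390": 23},
    {"DYS393": 14, "DYS390": 23},
    {"DYS393": 13, "DYS390": None},  # tester without a result for this marker
]

def value_counts(rows, marker):
    """Count how often each repeat value occurs for one marker, skipping blanks."""
    return Counter(r[marker] for r in rows if r.get(marker) is not None)

counts = value_counts(rows, "DYS393")
total = sum(counts.values())

print("Total", total)
for value in sorted(counts):
    print(value, counts[value], f"{counts[value] / total:.1%}")
```

Running `value_counts` once per single-copy marker gives the same kind of summary as the table above; the multi-copy markers would need separate handling, as in the spreadsheet.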


There are several Basement Subgroups at the end of the Classic or Colorized results, a few of which are not very U106ish. At this point, I have kept them in the stats because, in this preliminary effort, I wanted to stay as near as possible to the original totals for each STR.

By the way, the totals at each level (approximately):
Y12 7240
Y25 7151
Y37 7093
Y67 6187
Y111 4976

I also did a subset of the Blue groups, Z30 and Z8 (combined): Groups 65 to 99.

The attached Excel files are the summaries. They do not contain the 7000-plus rows of STR data.


Re: Y111 STR counts for U106

 

Thank you, Martin, for sharing this.

I went through and put in my own STRs, found those values where my numbers were not common, and finally compared those results with my BigY matches using the information on a couple of project sites. In the end, the 10 testers you identified with a value of 16 at DYS532 all seem to be positive for R-S3997. This seems to be a unique value we share in contrast to others who fall under U106.

I'm not sure that really means much, but I found it interesting nonetheless.

Ed


Re: Should I upgrade to Y37- BIGY Y700 ?

 

Hi Robert,
I have found when working with autosomal DNA that it is much easier to sort matches using my mother's results (over 5000 matches) and my father's (over 7000 matches) rather than my own DNA (just over 700 matches). Having both tested also gives me my dad's mtDNA line as well as my mother's.

I also discovered that my mother and one of my sisters match my father-in-law and one of his daughters at the 4th-to-distant-cousin level. The connection would have been in middle Europe in the early 1800s (Germany or the Netherlands). I can't quite find the common ancestor, but I do know the surname in common.

On Saturday, June 22, 2024 at 02:32:40 PM CDT, Robert McMillan via groups.io <tensawmac@...> wrote:


I agree, my Dante test realigned to 60.3 Mbp (from 23.6 Mbp).
My dad's Nebula test went from 23.6 Mbp to 60.1 Mbp.

I have seen some Nebula results at or around 45 Mbp, and I think these come from Nebula realigning just the Y BAM. I've read that others have had better luck realigning the complete BAM and then extracting the Y. I don't know enough about this to comment one way or the other, only that this is what I did on the two tests mentioned above.

Robert McMillan

On Jun 22, 2024, at 1:41 PM, 6458923@... wrote:

On Sat, Jun 22, 2024 at 08:30 PM, vineviz wrote:
"WGS files from Nebula, when aligned to T2T almost never produce coverage of more than 45 mbp."
This sentence is completely wrong. My own Nebula 30x WGS result R-Y278111* -- T2T .BAM Nebula Genomics 11X, 59.8 Mbp, 150 bp and several other results I know of have readings over 60 million bases. Please don't say "Never" if you don't have enough information.


Re: Should I upgrade to Y37- BIGY Y700 ?

 



On Sat, Jun 22, 2024 at 4:43 PM Wayne via <dna_wayne=[email protected]> wrote:
What Iain is missing about long read technology is that there will be additional STRs reported. The current short read NGS technology is actually worse than the original Sanger technology in terms of identifying STRs and reporting out on those with longer/larger motifs. FTDNA took a step backwards in reporting STRs when they moved from Sanger to NGS based techniques. Long read technologies will also provide better insight into sequence structural deletions and rearrangements which NGS may not correctly identify or report on.

Wayne

On Saturday, June 22, 2024 at 03:57:04 PM EDT, Iain via <gubbins=[email protected]> wrote:


There are a lot of potentially iffy comparisons going on in the last few posts on this thread, so I think it's important to step back and think about what these numbers mean. I'm aware that many of the people posting are aware of most of the following, but I think it's important for the benefit of the wider audience - I'll try to pitch this at a fairly intermediate level. We first need to define two terms:


Total number of loci = the number of base pairs that have any reads in a test


Callable loci = the number of base pairs where a mutation can be securely identified or dismissed


The numbers are often very different from each other, and even between different estimates of the same number by different individuals/organisations, because the numbers depend strongly on the quality thresholds that are being used by individuals or companies. Two of the most important are the read quality and the mapping quality. The read quality describes how securely an allele can be called A, C, G or T. The mapping quality describes the probability that that section of DNA has been accurately mapped back onto the reference sequence. It's the mapping quality that is more important for this discussion. James Kane keeps a set of benchmarks from each test type here, which are homogeneously reduced and therefore can be directly compared between tests and companies:
https://ydna-warehouse.org/benchmarks
Nebula's 30x 150bp test currently benchmarks at around 15 million callable loci and 23 million total loci when mapped to GRCh38. When a T2T reference is used, this increases to about 16 million callable loci and normally about 45 million total loci, but can be between about 30 and 60 million. Note that many people will not have 60 million base pairs in their entire Y chromosome to begin with.
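
The difference between the two definitions can be illustrated with a toy Python sketch. The per-site records and the three thresholds below are invented for illustration; real pipelines (including the benchmarks linked above) apply their own, more detailed criteria.

```python
from dataclasses import dataclass

@dataclass
class Site:
    depth: int         # number of reads covering this base pair
    mean_mapq: float   # mean mapping quality of those reads
    mean_baseq: float  # mean read (base) quality at this position

# Hypothetical thresholds, chosen only for this example.
MIN_DEPTH, MIN_MAPQ, MIN_BASEQ = 4, 30.0, 20.0

def count_loci(sites):
    """Return (total loci, callable loci) for a list of per-site records."""
    total = sum(1 for s in sites if s.depth > 0)
    callable_loci = sum(
        1
        for s in sites
        if s.depth >= MIN_DEPTH
        and s.mean_mapq >= MIN_MAPQ
        and s.mean_baseq >= MIN_BASEQ
    )
    return total, callable_loci

sites = [
    Site(30, 60.0, 35.0),  # well covered and uniquely mapped: callable
    Site(12, 5.0, 34.0),   # covered, but reads map ambiguously (repeat region)
    Site(2, 60.0, 30.0),   # too few reads to call
    Site(0, 0.0, 0.0),     # no reads at all: not even a "total" locus
    Site(25, 40.0, 15.0),  # poor read quality
]

total, callable_loci = count_loci(sites)
print(f"total loci: {total}, callable loci: {callable_loci}")
```

Tightening or loosening any one threshold changes the callable count without touching the total, which is why two organisations can report quite different numbers for the same test.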


The Y chromosome contains around 23-24 million base pairs which can be termed the "readable Y". Typically, this is the limit for the total number of loci returned by BigY or WGS tests. If you have more than this total number of loci, then you must be mapping parts of the "non-readable Y", which may include the centromere, DYZ19 and Yq12 regions. These regions are considered non-readable because they contain many repeating sequences, which are all longer than the 100-150 base pair chunks that these tests are split up into. DYZ19 is just about readable in parts because the repeat length (125bp) is close to the read length. But the centromere and Yq12 regions contain many repeats of identical sequences that span many hundreds or even thousands of base pairs. There is no way you can take a read from a short-read test and accurately map it back onto the long repeats in the centromere and Yq12. Any reads from a 100bp or 150bp WGS test that are recovered beyond about 24 million base pairs are effectively useless, because they cannot be accurately mapped back to a place on the chromosome, even if the alignment software suggests they can. Your map of the centromere and Yq12 regions may look very different from the T2T reference sequence due to large-scale mutations between haplogroups, which we simply don't yet know enough about, and the mapping software can easily confuse one SNP with another in such circumstances. For example, you might be able to state that there may(!) be a SNP somewhere in the DYZ3 region, but you won't be able to say whether it's a real SNP or just a bunching up of bad reads, or what the true location of the SNP is within the DYZ3 region, or whether it's phylogenetically identical to the same SNP in a different test (since it could be on a different repeat).


This brings us to callable loci. If you can accurately map several reads back to a reference chromosome, and if all those reads pass certain quality thresholds, then you can call whether or not a SNP exists in that location. The very best that's achievable with current 100 bp or 150 bp read-length technology is about 23-24 million base pairs but, in practice, the limit is normally closer to 14-16 million base pairs. This is the useful part of the test and the real number that matters (unless you plan on combining multiple tests together). You cannot expect to go any higher than this without increasing the read length.


So it's the callable loci that matter for almost every genetic genealogy application. We won't be able to make meaningful use of these extra tens of millions of non-callable loci without long-read technology like T2T and a much better understanding of the structural variation of the Y chromosome on large scales across different haplogroups.


When that becomes available commercially, it will be an incremental addition for most people. All most people can expect are a bunch of extra SNPs within many of their haplogroups. That will help revise TMRCAs and hugely reduce the uncertainties, but it's not going to be a game-changer like BigY was when it arrived. The people it's going to be most useful for are those with recent surname problems - people who really need to squeeze every SNP and structural variant out of a test to separate individual generations in a genetic family tree. If the average interval between mutations can be brought substantially below one generation (instead of the current 83 years/SNP), then we can start to say a relationship between two testers might be (e.g.) two to four generations beyond their earliest known ancestor, rather than the swathes of centuries that current TMRCA estimates provide. There's real application there, but a large part of it is limited to a subset of people who have already taken BigY or similar tests.
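
As a rough illustration of what the 83 years/SNP figure means per generation, here is a short Python sketch. The 83-year figure is the one quoted above; the 30-year generation length is an assumption made only for this example, and `expected_snps` is a hypothetical helper.

```python
YEARS_PER_SNP = 83         # figure quoted in the post
YEARS_PER_GENERATION = 30  # assumed average generation length, not from the post

generations_per_snp = YEARS_PER_SNP / YEARS_PER_GENERATION
snps_per_generation = YEARS_PER_GENERATION / YEARS_PER_SNP

def expected_snps(years, years_per_snp=YEARS_PER_SNP):
    """Expected number of SNPs accumulated along one line over `years`."""
    return years / years_per_snp

print(f"{generations_per_snp:.1f} generations per SNP")
print(f"{snps_per_generation:.2f} SNPs per generation")
print(f"{expected_snps(830):.0f} SNPs expected over 830 years")
```

Under that assumed generation length, a new SNP currently turns up only about once every three generations on average, which is why individual generations cannot yet be separated reliably.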


- Iain.


Re: Should I upgrade to Y37- BIGY Y700 ?

 

Hi Wayne - I was lumping STRs in with structural variants, since I think the applicability will be the same. Sure, we should be able to get the "original 111" without a separate test, but that's not a visible success to the end user unless it positively affects the price point. Otherwise the additional STRs are really only likely to create significant benefit through moderate improvements to the TMRCAs and through splitting haplogroups at the generational level for the few who need that. That, and after nearly ten years we still haven't got the full benefit from the additional STRs beyond the first 111.

- Iain.


Re: Should I upgrade to Y37- BIGY Y700 ?

 

On Sat, Jun 22, 2024 at 09:57 PM, Iain wrote:
Thank you for your reply. We are far away from creating a complete T2T or de novo genome assembly. There is currently not a single company in the world offering a telomere-to-telomere (T2T) assembly. To create a real T2T sequence like hs1/CP086569.2, you need to use 18 flow cells of nanopore sequencing to get 166x coverage, plus many high-quality short-read genomes from a specially grown cell line using a special Pore-C protocol. They also used a couple of PacBio HiFi WGS runs. This is out of reach for most citizen scientists. YSEQ is taking its first baby steps towards making at least long-read nanopore sequencing available. How does the Nebula WGS 100x help for T2T alignment?


Re: Should I upgrade to Y37- BIGY Y700 ?

 

I understand, thank you

