Sequencing a fully complete human genome has been the work of decades. Twenty years ago, the Human Genome Project (HGP) declared its work finished, with an asterisk. Even a decade later, eight percent of the genome – so-called “junk DNA” – was beyond our understanding. But the idea of junk DNA stuck in the collective scientific craw. Mother Nature is a cheapskate and genes are expensive. Living things tend to lose genes for resisting threats they no longer face. Why would DNA violate its own principle of parsimony? Many tweedy arguments ensued.
Even as we unraveled that final eight percent, a few obnoxious holdouts remained. But now, in a flurry of more than a dozen peer-reviewed articles, a coalition of researchers reports that it has sequenced an entire human reference genome, from start to finish – telomere to telomere. Thanks to their efforts, we not only know what “junk” DNA does, but we know how it does it.
“It’s a big deal,” said co-author Erich D. Jarvis. “Every base pair in a human genome is now complete.”
“You would think that with 92% of the genome completed a long time ago, another 8% wouldn’t contribute much,” Jarvis added. “But from that missing eight percent, we’re now gaining a whole new understanding of how cells divide.”
“Like a broken record”
A single “representative” copy of the human genome is about three billion base pairs long. This is gigantic. During sequencing, researchers break DNA molecules into pieces of manageable length, read each piece, and then reassemble the full sequence from the overlaps. With euchromatin – the 92% of our genome sequenced by the HGP – it’s easy to stitch the sequence back together. The problem arises when it comes time to sequence and reassemble heterochromatin: the DNA of that last eight percent.
Far from being junk, heterochromatin makes up important cogs in the cellular machinery that runs our DNA. Instead of coding for “normal” proteins, heterochromatin forms structural regions of the chromosome, including the centromeres. A centromere is the region that holds a chromosome’s two sister chromatids together, and it is an indispensable part of cell division. But until now, centromeres have been a major obstacle in the search for a reference genome.
Some stretches of heterochromatin loop over the same series of a few nucleobases, repeating them over and over like a broken record. Others are just long stretches of the same nucleobase – think “AAAAAAAAAAAAAAAA”, but thousands of bases long. Centromeres have both. Historically, it’s been difficult to tell exactly how long these repetitive stretches last, let alone align them correctly. However, an international group of geneticists decided to pool their efforts, calling themselves the Telomere-to-Telomere (T2T) Consortium. Jarvis’ lab used a number of tools to help T2T clean up “messy” DNA sequences and generate error-free results.
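To see why repeats are so hard to measure, consider a toy example. The sketch below is an illustration only: the sequences, the read lengths, and the helper name `all_reads` are invented, and real sequencing reads are noisy and randomly placed rather than exhaustive. It builds two tiny “genomes” that differ only in how many times a three-base unit repeats, then compares the reads each one would produce.

```python
def all_reads(genome, read_len):
    """Return the set of every error-free read of a given length.

    Hypothetical helper: a real sequencer samples reads at random and
    makes errors, but the ideal case is enough to show the ambiguity.
    """
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Two made-up genomes: identical flanks, different repeat counts.
left, right = "GATTACAT", "TTGACCGA"
genome_a = left + "CAG" * 10 + right   # 30-base repeat
genome_b = left + "CAG" * 15 + right   # 45-base repeat

# Reads shorter than the repeat: both genomes yield exactly the same reads,
# so no assembler could tell how long the repeat really is.
short_reads_match = all_reads(genome_a, 12) == all_reads(genome_b, 12)

# Reads long enough to span the whole repeat plus some flanking sequence:
# the two genomes now produce different reads and can be told apart.
long_reads_match = all_reads(genome_a, 40) == all_reads(genome_b, 40)

print("12-base reads identical:", short_reads_match)
print("40-base reads identical:", long_reads_match)
```

With short reads the two genomes are indistinguishable. Only a read that crosses the entire repeat and reaches unique sequence on both sides pins down its length.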
One of these tools is Merfin. Merfin is a software tool for finding and correcting errors in assembled DNA sequences, which T2T has used to clean up some of the most error-prone stretches of the human genome, including the centromeres.
“The genomes we generate in the lab can contain many errors,” said Giulio Formenti, a postdoctoral fellow in Jarvis’ lab, who developed Merfin. “If even one or a few base pairs are wrong, it can have a big impact on the overall accuracy of the genomic sequence.” Centromeres are long and repetitive, so they are very sensitive to this type of error. But they are important enough that we need to correct them.
“Extents of identical base pairs, such as AAA, are difficult to assess for existing technology,” Formenti added. “There are often errors in these sequences, even now. Merfin corrects them.”
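Formenti’s point about runs of identical bases can be made concrete with a toy example. Merfin’s published approach validates candidate corrections against k-mers (short subsequences of length k) observed in the sequencing reads. The sketch below illustrates only that general idea and is not Merfin’s actual algorithm. The function names, the sequences, and the choice of k = 9 are all invented for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer that appears in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def unsupported_kmers(candidate, counts, k):
    """Number of k-mers in the candidate that no read contains."""
    return sum(
        1
        for i in range(len(candidate) - k + 1)
        if counts[candidate[i:i + k]] == 0
    )


def best_candidate(candidates, reads, k):
    """Pick the candidate sequence whose k-mers are best backed by the reads."""
    counts = kmer_counts(reads, k)
    return min(candidates, key=lambda c: unsupported_kmers(c, counts, k))


# Made-up "true" sequence containing a run of five A's.
truth = "ACGTTGCT" + "A" * 5 + "CGGATCCT"

# Idealized accurate reads: every 12-base window of the true sequence.
reads = [truth[i:i + 12] for i in range(len(truth) - 12 + 1)]

# Three candidate versions of the assembly that disagree on the run length.
candidates = [
    "ACGTTGCT" + "A" * n + "CGGATCCT" for n in (4, 5, 6)
]

chosen = best_candidate(candidates, reads, k=9)
print("chosen run length:", chosen.count("A", 8, len(chosen) - 8))
```

A candidate with one A too few or too many contains k-mers that never show up in the reads, so it scores worse than the correct version. Note that k has to be longer than the run itself for this to work, which is one reason very long runs remain hard.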
The T2T team focused on a unique genome, derived from a kind of non-viable cell created when a sperm fertilizes an egg that lacks a nucleus. Because of this glitch in their development, these cells have two copies of DNA from the father – and no information from the mother. They are diploid cells, but they have a single genetic lineage. This made them prime candidates for a single end-to-end genome, and prime targets for Merfin.
In addition to Merfin, the researchers used Pacific Biosciences’ HiFi sequencing, as well as the Oxford Nanopore sequencing method. Nanopore is able to read up to a million base pairs at a time, while HiFi excels in precision. With both in hand, the centromeres became much easier to sequence and align. “That was the last piece of the puzzle – like putting on a new pair of glasses,” said T2T co-author and co-chair Adam Phillippy, an NIH researcher.
Looking forward
Although the new reference genome is complete, it comes from a single genetic line. Therefore, it does not automatically represent the full diversity of human haplotypes. “To address this bias,” the researchers write in a report, “The Human Pangenome Reference Consortium has partnered with the T2T Consortium to build a collection of high-quality reference haplotypes from a diverse set of samples.” In this way, the researchers intend to pursue a reference genome for the entire human race.
In the meantime, scientists intend to use this reference genome to better understand genetic diseases, aging and the process of human evolution.
“Since we had the first draft human genome sequence, determining the exact sequence of complex genomic regions has been a challenge,” T2T consortium co-chair Evan Eichler said in a statement. “I’m glad we got the job done. The comprehensive plan will revolutionize the way we think about human genome variation, disease and evolution.”
Yes, “Merfin’ DNA” is a lame Beach Boys joke. I still think it’s funny, and I’ll die on that hill.