NEW YORK – Researchers from the Dana-Farber Cancer Institute and their collaborators have developed a new algorithm that can achieve near telomere-to-telomere (T2T) genome assemblies using only standard nanopore reads.
Described in a bioRxiv preprint, the tool, named Hifiasm (ONT), lowers the data input requirement to construct near-complete genomes while eliminating the need for multiple sequencing technologies or ultra-long nanopore reads.
According to Heng Li, an associate professor of biomedical informatics at Dana-Farber and the corresponding author of the paper, Hifiasm (ONT) is an evolution of Hifiasm, a de novo genome assembly tool initially developed for Pacific Biosciences High-Fidelity (HiFi) sequencing reads.
Although Hifiasm is often deemed one of the best-performing assemblers, the limited read lengths of PacBio HiFi sequencing meant that it previously struggled with complex repetitive regions such as centromeres or long segmental duplications at 10 kb to 20 kb, the study authors noted.
To achieve more complete human genome assemblies, Li's team released Hifiasm (UL) last year, an iteration of Hifiasm that combines PacBio HiFi data with ultra-long nanopore reads of more than 100 kb. Still, generating ultra-long nanopore reads can be costly and technically challenging, as it requires a large amount of high molecular weight DNA, Li pointed out.
With the goal of making near-T2T genome assembly more accessible, Li and collaborators developed Hifiasm (ONT), which only requires standard simplex nanopore reads. While the new tool is a 'very similar algorithm' compared to other Hifiasm assemblers, given their shared base code, developing it brought unique challenges to the team, Li said.
For instance, the Hifiasm framework includes an error correction step to make HiFi reads nearly error-free before assembly, assuming the errors are rare and random. However, such an error correction algorithm was not adequate for nanopore simplex reads given their higher raw error rates, Li noted.
'Nanopore reads generally have a little bit higher error rate, and it is not so random,' said Haoyu Cheng, an assistant professor of biomedical informatics at Yale University and the first author of the study. 'This makes it a little bit harder for us to separate which one is the sequence error versus which one is a real [genetic] variant.'
To cope, the researchers developed a new error correction algorithm for Hifiasm (ONT) that leverages haploid phasing information while considering base quality scores. Using that, Hifiasm (ONT) can correct most nanopore simplex reads to near error-free, the study authors noted.
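The distinction Cheng describes, between a sequencing error and a real variant, can be illustrated with a toy example. The sketch below is not the Hifiasm (ONT) algorithm, whose details are not given in this article. It is a hypothetical heuristic that looks at one genomic position, ignores low-quality bases, and asks whether an allele tracks a single haplotype (as a heterozygous variant would) or appears at a low rate on both haplotypes (as a recurrent error might). The function name, labels, and thresholds are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def classify_alleles(observations, min_qual=20,
                     min_hap_fraction=0.8, max_other_fraction=0.2):
    """Label each allele seen at one position (toy heuristic, hypothetical).

    observations: iterable of (base, phred_quality, haplotype) tuples,
    where haplotype is 1 or 2 (the phase assigned to the read).
    """
    hap_depth = Counter()                 # confident bases per haplotype
    support = defaultdict(Counter)        # allele -> haplotype -> count
    for base, qual, hap in observations:
        if qual < min_qual:
            continue                      # low-quality bases do not vote
        hap_depth[hap] += 1
        support[base][hap] += 1

    calls = {}
    for base, per_hap in support.items():
        # Fraction of confident reads on each haplotype carrying this allele.
        fractions = [
            per_hap[hap] / hap_depth[hap] if hap_depth[hap] else 0.0
            for hap in (1, 2)
        ]
        high, low = max(fractions), min(fractions)
        if low >= min_hap_fraction:
            calls[base] = "homozygous"
        elif high >= min_hap_fraction and low <= max_other_fraction:
            calls[base] = "heterozygous_variant"
        else:
            # Present at a low rate, or spread over both haplotypes.
            calls[base] = "likely_error"
    return calls


# Toy pileup: 'A' tracks haplotype 1, 'G' tracks haplotype 2,
# and 'T' shows up at a low rate on both (a recurrent-error pattern).
pileup = (
    [("A", 30, 1)] * 9 + [("T", 25, 1)]
    + [("G", 30, 2)] * 9 + [("T", 25, 2)]
    + [("A", 5, 2)]                       # low quality, ignored
)
calls = classify_alleles(pileup)
print(calls)
```

In this toy pileup, 'A' and 'G' are labeled as haplotype-specific variants while 'T' is labeled a likely error, because it never dominates either haplotype. A real assembler works on read overlaps across many positions rather than on a single column.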
To evaluate the performance of Hifiasm (ONT), Li and his team generated standard nanopore simplex reads for seven Genome in a Bottle (GIAB) human samples: HG001, HG002, HG003, HG004, HG005, HG006, and HG007.
Each sample was sequenced on a PromethION 48 device using R10.4 flow cells. The target coverage for the samples was 50X or higher, and the read length N50 of the datasets, a length-weighted summary of read length, was 30 kb.
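For readers unfamiliar with the two dataset metrics mentioned here, the following minimal sketch shows how read N50 and mean coverage are conventionally computed. The function names and the toy numbers are assumptions for illustration and are not taken from the preprint.

```python
def read_n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_coverage(lengths, genome_size):
    """Return mean depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size


# Toy dataset (hypothetical read lengths in bases, tiny toy genome).
toy_reads = [50_000, 40_000, 30_000, 20_000, 10_000, 5_000, 5_000]
n50 = read_n50(toy_reads)
depth = mean_coverage(toy_reads, genome_size=16_000)
print(n50, depth)
```

Because N50 weights each read by its length, it is usually larger than the arithmetic mean read length. In the toy data, the N50 is 40,000 bases while the mean is under 23,000.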
As part of benchmarking efforts, the researchers applied Hifiasm (ONT) to the GIAB human genomes and compared it with another commonly used T2T assembler named Verkko, which was developed by researchers at the National Human Genome Research Institute (NHGRI). Given that Verkko cannot directly assemble nanopore simplex reads due to their error rates, the authors preprocessed the data using a recently developed error correction tool called Herro.
Overall, the authors concluded that the assemblies made by Hifiasm (ONT) 'consistently exhibit higher quality' than those produced by Verkko paired with Herro, while the former is also faster and does not require high-end computing power.
Additionally, the researchers compared genomes assembled from nanopore standard simplex reads using Hifiasm (ONT) and those generated with PacBio HiFi reads. From that experiment, they concluded that nanopore assemblies showed comparable quality to their HiFi counterparts across all samples while showing substantially higher contiguity.
'If we use PacBio HiFi data to do the assembly, we cannot get any chromosome telomere to telomere,' Cheng said. 'But with the nanopore data, we can get tens of them.'
Still, Li pointed out that compared with PacBio HiFi data, nanopore simplex reads have lower raw base accuracy, especially in homopolymer regions. 'If you want to get best base accuracy, PacBio HiFi data is still better,' he noted. 'Although PacBio HiFi data also have homopolymer errors, it is better than nanopore.'
Furthermore, the researchers compared Hifiasm (ONT) against Verkko paired with Herro using nanopore simplex ultra-long read data. Although the assembly quality for Verkko paired with Herro improved substantially with ultra-long reads, Hifiasm (ONT) still outperformed the combo, the researchers noted.
Besides human genomes, Hifiasm (ONT) also demonstrated its performance in nonhuman genomes, such as Arabidopsis, tomato, and zebrafish, Cheng noted.
'Hifiasm (ONT) represents a significant methodological advance in genome assembly,' Jan Korbel, head of data science at the European Molecular Biology Laboratory (EMBL), who was not involved in the study, wrote in an email.
Judging by the proof-of-concept data presented by the authors, Korbel said Hifiasm (ONT) demonstrated 'impressive' computational efficiency and assembly performance, while it 'effectively overcomes well-known limitations of ONT simplex reads, namely, high and recurrent error rates, to enable near-T2T assemblies without relying on ultra-long reads.'
While his team has not tried out the algorithm, Korbel said, it is interested in applying Hifiasm (ONT) to large-scale cohort assemblies using nanopore simplex data, which could provide further insights into structural variation and population genetics of repetitive regions in the genome. With the algorithm's lower input requirements, Korbel said, he is also interested in exploring the use of Hifiasm (ONT) in clinical genomics.
'For those of us who have used Oxford Nanopore sequencing for a long time, this is what we have wanted,' said Danny Miller, a nanopore sequencing expert at the University of Washington who was also not involved in the study. While the results of the preprint study still need to be rigorously tested, Miller noted, his team has tried out Hifiasm (ONT) and is so far seeing similar results.
Specifically, Miller said his team has tested Hifiasm (ONT) on samples from the 1000G ONT Sequencing Consortium, where it was able to make more complete genome assemblies with less data input while avoiding the need for ultra-long reads. Miller also praised Hifiasm (ONT) for its speed, noting that his team 'can pretty easily assemble a whole genome in less than a day on a modest server' using the algorithm.
Moreover, Miller, who is a physician-scientist, said Hifiasm (ONT) can also be useful for analyzing medically relevant genes that were previously challenging to fully solve. For instance, in one of the samples his team tested, Hifiasm (ONT) was able to discern the correct copy number of the highly homologous SMN1 and SMN2 genes, which are implicated in spinal muscular atrophy (SMA).
'Clinically, we don't necessarily care or need to get true T2T assemblies, but we do need confidence that we are assembling the more complex regions that are clinically relevant,' Miller said. 'From a clinical perspective, that is what we are interested in.'