Consortium Catalogs Complex Structural Variation in Human Genomes With Hybrid Long-Read Assembly

Published by GenomeWeb and other sources, 2024-11-27 10:05


NEW YORK – Using a plethora of sequencing technologies and computational tools, researchers from the Human Genome Structural Variation Consortium (HGSVC) have assembled dozens of near-complete human genomes to elucidate complex structural variants in the human genome that were previously deemed intractable.

The researchers hope that the database, described in a preprint posted on bioRxiv in September, can serve as a resource for the scientific community to further explore the biomedical relevance of complex variants in the genome.

The findings also add to data previously released by the international consortium, which is funded by the US National Institutes of Health and aims to systematically catalog structural variants, primarily using samples from the 1000 Genomes Project.

'This is the first study [for the consortium] where we have a sizable number of telomere-to-telomere chromosomes,' said Jan Korbel, head of data science at the European Molecular Biology Laboratory (EMBL) and one of the corresponding authors of the preprint. 'Our aim is to understand structural variation throughout the genome, in particular in regions where the structural variation is more complex.'

For their study, the HGSVC researchers sequenced 65 human genome samples that represented five continental groups and 28 populations, generating 130 haplotype-resolved genome assemblies. Sixty-three of these samples were from the 1000 Genomes Project, with the remaining two from the International HapMap Project and the Genome in a Bottle Consortium.

The broad consent granted for the use of these samples enables the consortium to distribute the results openly, including primary sequencing data as well as structural variant calls, Korbel noted.

To construct the genome assemblies, the study deployed both HiFi sequencing from Pacific Biosciences and nanopore sequencing from Oxford Nanopore Technologies, leveraging the former's high accuracy and the latter's ultra-long-read capabilities, Korbel said.

On average, the study achieved 47X coverage per sample with PacBio HiFi sequencing on the Sequel II or Revio platforms and 56X coverage with nanopore sequencing on the Oxford Nanopore PromethION device with R9.4.1 flow cells. For nanopore sequencing, ultra-long reads (those longer than 100 kb) averaged 36X coverage per sample, according to the study.
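
For readers unfamiliar with how such coverage figures are derived, the following is a minimal sketch, not the consortium's pipeline. It assumes coverage is computed as total sequenced bases divided by genome size; the ~3.1 Gb default genome size and the function and variable names are illustrative assumptions, while the 100 kb ultra-long threshold comes from the study as reported above.

```python
from typing import Iterable

ULTRA_LONG_MIN_BP = 100_000      # "longer than 100 kb", as reported in the article
GENOME_SIZE_BP = 3_100_000_000   # assumed haploid human genome size (~3.1 Gb); not from the article


def coverage(read_lengths: Iterable[int],
             genome_size: int = GENOME_SIZE_BP,
             min_length: int = 0) -> float:
    """Mean depth contributed by reads longer than `min_length` bases."""
    total_bases = sum(n for n in read_lengths if n > min_length)
    return total_bases / genome_size


# Toy example: a 1 Mb "genome" and five reads (hypothetical values).
reads = [150_000, 120_000, 30_000, 20_000, 80_000]
all_cov = coverage(reads, genome_size=1_000_000)
ul_cov = coverage(reads, genome_size=1_000_000, min_length=ULTRA_LONG_MIN_BP)
print(f"total: {all_cov:.2f}X, ultra-long: {ul_cov:.2f}X")
# prints "total: 0.40X, ultra-long: 0.27X"
```

Under this definition, the ultra-long coverage is always a subset of the total nanopore coverage, which is consistent with the 36X and 56X figures reported.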

Additionally, the authors performed single-cell template strand sequencing (Strand-seq), optical genome mapping, Hi-C sequencing, isoform sequencing (Iso-seq), and RNA sequencing.

The HGSVC team constructed haplotype-resolved assemblies using Verkko, an automated hybrid genome assembly algorithm developed by researchers at the National Human Genome Research Institute (NHGRI). The phasing signal for the assembly process was generated using Graphasing, which leverages Strand-seq data to globally phase assembly graphs, allowing researchers to produce chromosome-scale de novo haplotypes for diploid genomes without parental sequencing data.

In certain challenging genomic regions, such as centromeres or the Yq12 region, the researchers also supplemented Verkko with Hifiasm, a de novo assembly tool developed by Dana-Farber Cancer Institute researcher Heng Li and his team.

By using two long-read sequencing technologies, the study authors said they were able to close 92 percent of previously reported gaps in genome assemblies that used only PacBio HiFi reads. Moreover, they achieved telomere-to-telomere status for 39 percent of the chromosomes analyzed in the study.

In these near-complete genomes, the HGSVC researchers identified 188,500 structural variants (SVs), 6.3 million indels, and 23.9 million single-nucleotide variants (SNVs) by comparing them against the T2T-CHM13v2.0 reference. When using GRCh38-NoALT as a reference, the researchers cataloged 176,531 SVs, 6.2 million indels, and 23.5 million SNVs.

As part of the study, the researchers also delved into many disease-associated genomic regions, where structural variants had not been comprehensively studied due to their challenging sequences.

One such analysis focused on the 5 Mb Major Histocompatibility Complex (MHC) region. After analyzing 130 complete or near-complete MHC haplotypes, the researchers identified 170 SVs that had not been previously reported. They also uncovered a previously unknown copy number variant — a deletion of HLA-DPA2 on one haplotype — as well as low-frequency gene-level SVs, such as a deletion of MICA on one haplotype.

Another disease-relevant and structurally complex part of the genome is the region containing the SMN1 and SMN2 genes, which are implicated in spinal muscular atrophy (SMA). During the study, the researchers were able to assemble, validate, and profile two-thirds of haplotypes in that region, fully resolving the structure and copy number of SMN1/2, SERF1A/B, NAIP, and GTF2H2/C.

Lastly, the HGSVC team sought to tackle centromeres, often considered the most structurally challenging regions of the human genome due to α-satellite tandem repeat DNA. They completely assembled and validated 1,246 human centromeres, uncovering 4,153 new α-satellite high-order repeat (HOR) variants and novel array organization among the active α-satellite HOR arrays.

'For me, this [study] is really exciting,' said Danny Miller, a physician-scientist and nanopore sequencing expert at the University of Washington. 'I think it shows that we can now consistently and reproducibly resolve complex variations using long-read sequencing.'

Additionally, Miller, whose team is currently applying nanopore long-read sequencing in a separate study reanalyzing 1000 Genomes Project samples to build a comprehensive structural variant catalog, said the paper will help researchers gain a better understanding of the structural variants in some of the most challenging regions of the genome.

For instance, the HGSVC researchers demonstrated the diversity of haplotypes spanning the SMN region. With such information, clinicians can now start to ask whether there are individuals who are more susceptible to having an SMN1 deletion or other mutational events, he noted.

Miller also applauded the authors' efforts to investigate structural variants in challenging genomic regions such as centromeres. Their findings will help other researchers generate hypotheses and study the clinical relevance of these SVs moving forward, he said.

According to Korbel, the data for the current study are available on the International Genome Sample Resource (IGSR) server hosted by EMBL. The consortium also plans to share the data, including the raw sequencing reads, on the Amazon cloud to facilitate computing, he noted.

Despite the progress made by the HGSVC to fill the gaps in the human genome, there are still some thorny regions remaining where the team was 'underpowered to see everything,' Korbel said. Most of these unresolved segments are in the acrocentric short arms of chromosomes 13, 14, 15, 21, and 22, he said, which are known to undergo extensive ectopic recombination and have the highest degree of sequence homology.

Korbel noted that his team will collaborate with the Human Pangenome Reference Consortium (HPRC) to further tackle these remaining dark spots of the genome moving forward.

'As the sequencing quality goes up further, we will immediately look into those regions that we are currently still not able to fully resolve to see what they reveal to us in terms of structural variants,' Korbel said.
