The GIAB genomic stratifications resource for human reference genomes

Published by Nature and other sources, 2024-10-19 09:50



Abstract

Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software.

Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses.

Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example.

The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly used reference genomes.

Introduction

The last few decades have brought a vast array of increasingly powerful sequencing platforms and associated software to read DNA molecules. However, no tool or pipeline performs equally across all genomic contexts within the human genome. Particularly difficult genomic contexts include large duplications and large repeats.

Additionally, many sequencing platforms have relatively low performance in homopolymers, and platforms that perform better in homopolymers use short reads, which lack the mapping advantage that long reads have in large repeats. The mappers and variant callers used to analyze reads from these platforms also bring context-specific performance implications due to the assumptions (implicit or explicit) they often make when processing sequencing data [1].

Therefore, improving and fully utilizing the sequencing landscape will require detailed analysis of how different tools perform in a given genomic context. To this end, we previously developed “genome stratifications,” which are carefully defined browser extensible data (BED) files that divide the human genome into meaningful contexts for benchmarking.

The genomic stratifications were originally developed in collaboration with the Global Alliance for Genomics and Health (GA4GH) [2] and are being further developed by the Genome in a Bottle Consortium (GIAB). Coding regions, low mappability regions, high GC content regions, and various types of repetitive regions are examples of such genomic stratifications, and these are currently defined with regard to two linear references, GRCh37 and GRCh38.
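To make the idea of stratified benchmarking concrete, the following minimal Python sketch assigns labelled variant calls to a single stratification and reports precision and recall inside and outside it. The function names, the toy interval, and the TP/FP/FN labels are assumptions made for illustration; real evaluations would normally rely on a dedicated benchmarking tool rather than hand-written code like this.

```python
from bisect import bisect_right
from collections import Counter

def build_index(regions):
    """Group BED-style (chrom, start, end) intervals by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index

def in_stratum(index, chrom, pos):
    """True if 0-based `pos` lies in a half-open interval [start, end) of the stratum."""
    intervals = index.get(chrom, [])
    # Find the last interval whose start is <= pos (intervals assumed non-overlapping).
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

def stratified_metrics(calls, index):
    """Precision and recall for calls inside and outside the stratum.

    `calls` is an iterable of (chrom, pos, label), label in {"TP", "FP", "FN"}.
    """
    counts = {True: Counter(), False: Counter()}
    for chrom, pos, label in calls:
        counts[in_stratum(index, chrom, pos)][label] += 1
    result = {}
    for inside, c in counts.items():
        called = c["TP"] + c["FP"]   # everything the caller reported
        truth = c["TP"] + c["FN"]    # everything in the benchmark
        result["inside" if inside else "outside"] = {
            "precision": c["TP"] / called if called else None,
            "recall": c["TP"] / truth if truth else None,
        }
    return result

# Hypothetical toy data: one "difficult" interval and five labelled calls.
low_map = build_index([("chr1", 100, 200)])
calls = [
    ("chr1", 50, "TP"), ("chr1", 60, "TP"),
    ("chr1", 120, "TP"), ("chr1", 150, "FP"), ("chr1", 199, "FN"),
]
metrics = stratified_metrics(calls, low_map)
```

In this toy case the calls inside the interval have lower precision and recall than those outside it, which is the kind of context-specific difference the stratifications are meant to expose.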

These stratifications are designed to be used with benchmarks such as those developed by GIAB, which generates variant benchmarks for a set of human genomes to enable development, optimization, evaluation, and compari.

FTBL: https://ftp.ncbi.nlm.nih.gov//genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_feature_table.txt.gz

GFF: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz

Additionally, the script required a .fai index file, which was created from the CHM13v2.0 reference assembly.

Generating GC content BED files using seqtk for CHM13v2.0

We used an existing script that was created to generate the GRCh38 GC content stratification BED files. The script required seqtk v1.3-r106, bedtools v2.27.1, and tabix v1.9.

Three essential data files were required to run the script, including the CHM13v2.0 FASTA and the CHM13 genome file. The genome file was converted to BED format by inserting a middle column of 0, so that each line spanned the full length of a chromosome. We ran seqtk for various fractions of GC content, all within windows of 100 bp.
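The two steps described above can be sketched in Python as follows. This is an illustration only: it assumes the genome file is a two-column, tab-delimited listing of chromosome name and length, and the `gc_windows` helper is a plain fixed-window GC calculation written for this example, not a reimplementation of how seqtk identifies GC-rich or GC-poor regions. All names and toy inputs are hypothetical.

```python
def genome_to_bed(genome_lines):
    """Turn 'chrom<TAB>length' lines into 'chrom<TAB>0<TAB>length' BED lines."""
    bed = []
    for line in genome_lines:
        line = line.strip()
        if not line:
            continue
        chrom, length = line.split("\t")[:2]
        # Inserting a start of 0 makes each line span the whole chromosome.
        bed.append(f"{chrom}\t0\t{int(length)}")
    return bed

def gc_windows(seq, window=100):
    """Yield (start, end, gc_fraction) for consecutive, non-overlapping windows."""
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield start, start + len(chunk), gc / len(chunk)

# Hypothetical toy inputs for illustration only.
bed_lines = genome_to_bed(["chr1\t1000", "chr2\t500"])
windows = list(gc_windows("GC" * 50 + "AT" * 50, window=100))
```

Windows whose GC fraction falls above or below a chosen threshold could then be written out as BED intervals for the corresponding GC stratification.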

After running seqtk, we added 50 bp of slop to each BED file and merged the resulting regions.

Lift-over for OtherDifficult regions

To find the coordinates of well-studied genes that are considered difficult regions, including MHC, KIR, and VDJ, we lifted these regions over from GRCh38 to CHM13v2.0. The GRCh38 OtherDifficult regions were taken from the GIAB release at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.1/GRCh38/OtherDifficult/.
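The slop-and-merge step applied to the GC content BED files (50 bp of padding followed by merging) can be illustrated with a small pure-Python sketch. It mimics the general behavior of padding intervals, clipping them to chromosome bounds, and merging overlapping or adjacent results; the function name and toy data are assumptions, and the original work used bedtools for this rather than custom code.

```python
def slop_and_merge(intervals, chrom_sizes, slop=50):
    """Pad each (chrom, start, end) by `slop` bp, clip to chromosome bounds, merge."""
    padded = sorted(
        (chrom, max(0, start - slop), min(chrom_sizes[chrom], end + slop))
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Hypothetical toy intervals on a 1000 bp chromosome.
result = slop_and_merge(
    [("chr1", 100, 200), ("chr1", 280, 400), ("chr1", 900, 980)],
    {"chr1": 1000},
)
```

Here the first two intervals are closer than 100 bp, so their padded versions overlap and collapse into one region, while the last interval is clipped at the chromosome end.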

To perform the lift-over, we used the minimap2 (v2.24) aligner with arguments -ax asm5, followed by bedtools bamtobed and merge (v2.30.0). The resulting BED files are provided as part of the GIAB stratification resource.

Snakemake pipeline

Overview

This work (first done as part of a hackathon) was incorporated into a snakemake pipeline, which can be found at https://github.com/usnistgov/giab-stratifications-pipeline and https://github.com/usnistgov/giab-stratifications.

The latter repository holds the global configuration for the three references in this work, and references the former repository as a submodule.

Only contained valid chromosomes (i.e., 1-22, X, Y).

File was bgzip compressed.

File was a valid BED file (three columns, tab-delimited, with 2nd and 3rd columns as non-negative integers with 3rd greater than 2nd).

All regions in the BED file were sorted in numeric order (i.e., chromosomes ordered 1-22, X, then Y with each region then sorted by start and end).

No regions overlapped with each other.

No region overlapped a gap region (which included the PAR on chromosome Y).

No region fell outside chromosomal boundaries.
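Several of the checks listed above can be expressed as a short validation routine. The sketch below is an assumption-laden illustration, not the pipeline's actual code: it assumes "chr"-prefixed chromosome names, takes already-decompressed lines as input (so the bgzip check is omitted), and skips the gap-region check because it would need a separate gap BED file. The function name and toy inputs are hypothetical.

```python
# Expected chromosome order: chr1..chr22, chrX, chrY (assumes "chr" prefixes).
CHROM_ORDER = {f"chr{c}": i for i, c in enumerate([*range(1, 23), "X", "Y"])}

def validate_bed(lines, chrom_sizes):
    """Return a list of problems found in tab-delimited, three-column BED lines."""
    problems = []
    previous = None
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            problems.append(f"line {n}: expected 3 columns")
            continue
        chrom, start_s, end_s = fields
        if chrom not in CHROM_ORDER:
            problems.append(f"line {n}: invalid chromosome {chrom}")
            continue
        if not (start_s.isdigit() and end_s.isdigit()):
            problems.append(f"line {n}: coordinates must be non-negative integers")
            continue
        start, end = int(start_s), int(end_s)
        if end <= start:
            problems.append(f"line {n}: end must be greater than start")
        limit = chrom_sizes.get(chrom)
        if limit is not None and end > limit:
            problems.append(f"line {n}: region exceeds chromosome length")
        key = (CHROM_ORDER[chrom], start, end)
        if previous is not None:
            if key < previous:
                problems.append(f"line {n}: regions are not sorted")
            elif key[0] == previous[0] and start < previous[2]:
                problems.append(f"line {n}: overlaps previous region")
        previous = key
    return problems

# Hypothetical toy inputs.
sizes = {"chr1": 1000, "chr2": 1000, "chrX": 1000}
good = ["chr1\t0\t100", "chr1\t100\t200", "chrX\t5\t10"]
bad = ["chr2\t50\t40", "chr1\t10\t20", "chrM\t0\t5", "chr1\t15\t30"]
good_problems = validate_bed(good, sizes)
bad_problems = validate_bed(bad, sizes)
```

Returning a list of messages rather than raising on the first failure makes it easier to report every issue in a file at once, which is one reasonable design for this kind of check.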

Evaluating the utility of stratifications for benchmarking

We created an assembly-based benchmark from the Q100 assembly for HG002. Specifically, the HG002 Q100 small variant benchmark was created using v0.011 of DeFrABB (https://github.com/usnistgov/giab-defrabb), the T2T-HG002-Q100v1.0 diploid assembly (https://github.com/marbl/hg002), and the GRCh38 reference (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/).

DeFrABB (Development Framework for Assembly-Based Benchmarks) is a snakemake-based pipeline created to facilitate the iterative development of benchmark sets for evaluating variant callsets using high-quality diploid assemblies (https://github.com/usnistgov/defrabb).

DeFrABB first generates assembly-based variant calls using dipcall v0.3 (https://github.com/lh3/dipcall) [44]. Dipcall was run with default parameters except for the Z-drop parameter, which was set to -z200000,10000,200; this yielded more contiguous assembly-to-assembly alignments than the default value. After reformatting and annotation, the variant set reported by dipcall (VCF) was used as the draft benchmark variants.

Note that we call these “draft” variants since this benchmark has not been officially evaluated and released by GIAB yet; however, GIAB and the Telomere to Telomere Consortium have polished and curated the assembly and variant calls sufficiently for it to be used for this analysis.

The benchmark regions (analogous to the “confident regions” in the GIAB v4.2.1 small variant benchmarks) are defined as regions with a 1:1 alignment between each assembled haplotype and the reference (except chromosomes X and Y).
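One way to picture the 1:1 alignment criterion is as an interval computation: keep only reference positions covered by exactly one alignment from each haplotype. The Python sketch below works on a single chromosome with toy coordinates and is not DeFrABB's implementation; it ignores the special handling of chromosomes X and Y and any additional exclusions, and all names are hypothetical.

```python
def exactly_once(alignments):
    """Return sorted (start, end) regions covered by exactly one alignment interval."""
    events = []
    for start, end in alignments:
        events.append((start, 1))    # alignment begins
        events.append((end, -1))     # alignment ends
    events.sort()
    regions, depth, prev = [], 0, None
    for pos, delta in events:
        if depth == 1 and prev is not None and pos > prev:
            if regions and regions[-1][1] == prev:
                # Contiguous with the previous depth-1 region: extend it.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((prev, pos))
        depth += delta
        prev = pos
    return regions

def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical alignments of each haplotype to one reference chromosome.
hap1 = [(0, 100), (50, 150)]   # positions 50-100 are covered twice
hap2 = [(0, 120)]
benchmark = intersect(exactly_once(hap1), exactly_once(hap2))
```

In this example the doubly covered stretch of the first haplotype and the part of the reference not reached by the second haplotype both drop out of the resulting regions.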

These regions excluded gaps in the assembly and their flanking sequences, as well as any large repeats (sat.

Data availability

All versions of the genome stratifications up to v3.5 (the latest as of this writing) are available on an FTP site hosted by NCBI at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/.

Code availability

The initial work for this study (which originally took place at a hackathon) is freely available at https://github.com/collaborativebioinformatics/NIST-GREX. The preliminary version of the code to generate stratifications is available at https://github.com/genome-in-a-bottle/genome-stratifications. The full pipeline in snakemake is available at https://github.com/usnistgov/giab-stratifications.

A copy of the GitHub repository and the HTML output of the snakemake pipeline are archived on Zenodo at https://zenodo.org/records/11176260.

References

1. Olson, N. D. et al. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022).
2. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
3. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).
4. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
5. Xiao, C., Zook, J., Trask, S. & Sherry, S. Abstract 5328: GIAB: Genome reference material development resources for clinical sequencing. Cancer Res. 74, 5328–5328 (2014).
6. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
7. Majidian, S., Agustinho, D. P., Chin, C.-S., Sedlazeck, F. J. & Mahmoud, M. Genomic variant benchmark: if you cannot measure it, you cannot improve it. Genome Biol. 24, 221 (2023).
8. Olson, N. D. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. 24, 464–483 (2023).
9. English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
10. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
11. Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: A joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2018).
12. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
13. Rhie, A. et al. The complete sequence of a human Y chromosome. Nature 621, 344–354 (2023).
14. Antonarakis, S. E. Short arms of human acrocentric chromosomes and the completion of the human genome sequence. Genome Res. 32, 599–607 (2022).
15. Foox, J. et al. Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study. Nat. Biotechnol. 39, 1129–1140 (2021).
16. Pyke, R. M. et al. Computational KIR copy number discovery reveals interaction between inhibitory receptor burden and survival. Pac. Symp. Biocomput. 24, 148–159 (2019).
17. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).
18. Behera, S. et al. FixItFelix: improving genomic analysis by fixing reference errors. Genome Biol. 24, 31 (2023).
19. Cleary, J. G. et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. bioRxiv 023754. https://doi.org/10.1101/023754 (2015).
20. Dunn, T. & Narayanasamy, S. vcfdist: accurately benchmarking phased small variant calls in human genomes. Nat. Commun. 14, 8149 (2023).
21. Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797–803 (2022).
22. English, A. C. et al. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02225-z (2024).
23. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
24. Smolka, M., Rescheneder, P., Schatz, M. C., von Haeseler, A. & Sedlazeck, F. J. Teaser: Individualized benchmarking and optimization of read mapping results for NGS data. Genome Biol. 16, 235 (2015).
25. Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).
26. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).
27. Majidian, S., Kahaei, M. H. & de Ridder, D. Hap10: reconstructing accurate and long polyploid haplotypes using linked reads. BMC Bioinforma. 21, 253 (2020).
28. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
29. Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).
30. Dwarshuis, N. et al. StratoMod: Predicting sequencing and variant calling errors with interpretable machine learning. Commun. Biol. 7, 1613 (2024).
31. Wagner, J. et al. Small variant benchmark from a complete assembly of X and Y chromosomes. Nat. Commun. in press. bioRxiv 2023.10.31.564997. https://doi.org/10.1101/2023.10.31.564997 (2023).
32. Pedersen, B. S. et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. NPJ Genom. Med. 6, 60 (2021).
33. Majidian, S. & Sedlazeck, F. J. PhaseME: Automatic rapid assessment of phasing quality and phasing improvement. Gigascience 9, giaa078 (2020).
34. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
35. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
36. Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).
37. Cheung, M.-S., Down, T. A., Latorre, I. & Ahringer, J. Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Res. 39, e103 (2011).
38. Yip, K. Y., Cheng, C. & Gerstein, M. Machine learning and genome annotation: a match meant to be? Genome Biol. 14, 205 (2013).
39. Fotsing, S. F. et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 51, 1652–1659 (2019).
40. Turner, S. et al. Quality control procedures for genome-wide association studies. Curr. Protoc. Hum. Genet. Chapter 1, Unit1.19 (2011).
41. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).
42. Shumate, A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol. 21, 129 (2020).
43. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012).
44. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
45. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. bioRxiv 2020.12.11.422022. https://doi.org/10.1101/2020.12.11.422022 (2020).

Acknowledgements

We thank Sierra Miller and Katherine Gettings for their feedback. Certain commercial equipment, instruments, or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments, or materials identified are necessarily the best available for the purpose.

Author information

Author notes

These authors contributed equally: Sina Majidian, Justin M. Zook.

Authors and Affiliations

Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
Nathan Dwarshuis, Jennifer McDaniel, Nathan D. Olson, Justin Wagner & Justin M. Zook

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Divya Kalra & Fritz J. Sedlazeck

University of Applied Sciences Upper Austria - FH Hagenberg, Hagenberg im Mühlkreis, Austria
Philippe Sanio

Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, 20892, USA
Pilar Alvarez Jerez

Department of Neurodegenerative Disease, UCL Queen Square Institute of Neurology, University College London, London, UK
Pilar Alvarez Jerez

Department of Genetics and Genomic Sciences and Mindich Child Health and Development Institute, Icahn School of Medicine at Mount, Hess Center for Science and Medicine, New York, NY, USA
Bharati Jadhav

Department of Computer Science, College of E.

Contributions

N.D., F.J.S., J.W., S.M., and J.M.Z. designed the study. N.D. implemented the pipeline. N.D., D.K., J.M., N.D.O., P.S., P.A.J., B.J., E.H., R.M., and S.M. performed the analyses. N.D., B.B., F.J.S., S.M., and J.M.Z. organized the study. All authors reviewed and approved the manuscript.

Corresponding authors

Correspondence to Sina Majidian or Justin M. Zook.

Ethics declarations

Competing interests

F.J.S. receives research support from Genentech, Illumina, ONT, and PacBio. B.B. is a full-time employee of DNAnexus. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information, Reporting Summary, Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dwarshuis, N., Kalra, D., McDaniel, J. et al. The GIAB genomic stratifications resource for human reference genomes. Nat Commun 15, 9029 (2024). https://doi.org/10.1038/s41467-024-53260-y

Received: 07 November 2023

Accepted: 07 October 2024

Published: 19 October 2024

DOI: https://doi.org/10.1038/s41467-024-53260-y
