Orthology inference at scale with FastOMA

Published by Nature and other sources, 2025-01-03 19:43 (via 动脉网)










Abstract

The surge in genome data, with ongoing efforts aiming to sequence 1.5 M eukaryotes in a decade, could revolutionize genomics, revealing the origins, evolution and genetic innovations of biological processes. Yet, traditional genomics methods scale poorly with such large datasets. To address this, FastOMA provides linear scalability for orthology inference, enabling the processing of thousands of eukaryotic genomes within a day. FastOMA maintains the high accuracy and resolution of the well-established Orthologous Matrix (OMA) approach in benchmarks. FastOMA is available via GitHub at https://github.com/DessimozLab/FastOMA/.

Main

Within the decade, the Earth BioGenome initiative aims to sequence 1.5 M eukaryotes. This paves the way for understanding how all species evolved from life’s common origin. Yet, due to processing limitations, even the thousands of genomes we have access to today are studied only piecemeal in practice. A fundamental step to comparative genomics analyses is to identify orthologs, genes of common ancestry that originated by speciation events. When performed systematically, orthology delineation conveys how sequences were gained, lost or duplicated, assuming that their basic mode of inheritance is vertical descent. Deriving orthology enables many types of downstream analysis, such as annotation propagation, phylogenomics or phylogenetic profiling.

State-of-the-art orthology methods face acute scalability issues. Methods relying on all-against-all sequence comparisons can no longer keep up with today’s data, let alone tomorrow’s. For state-of-the-art pipelines such as our own Orthologous MAtrix (OMA) algorithm and database, this translates to >10 million central processing unit (CPU) hours to derive the orthology relationships of >2000 genomes that have been processed thus far. Methods relying on whole-genome alignment, such as TOGA (Tool to infer Orthologs from Genome Alignments), are more efficient, but the genome alignment requirement limits their applicability to relatively closely related species. While small-scale comparative genomics has achieved remarkable progress, a more integrated, large-scale approach would be transformative.

To address this challenge, we introduce FastOMA, which dramatically speeds up orthology inference without sacrificing accuracy or resolution.

FastOMA is a complete rewrite of the OMA algorithm focused on scalability from the ground up (Fig. 1). By combining ultrafast homology clustering using k-mers, taxonomy-guided subsampling and a highly efficient parallel computing approach, it achieves linear performance in the number of input genomes. First, we leverage our current knowledge of the sequence universe (with its evolutionary information stored in the OMA database) to efficiently place new sequences into coarse-grained families (hierarchical orthologous groups (HOGs) at the root level) using the alignment-free k-mer-based OMAmer tool. In an attempt to detect homology among unplaced sequences (which could belong to families that are absent from our reference database), we then perform a round of clustering using the highly scalable Linclust software. Next, we resolve the nested structure of the HOGs (Supplementary Information) corresponding to each ancestor, in an efficient leaf-to-root traversal of the species tree. By avoiding sequence comparisons across different families, the number of computations is drastically reduced compared with conventional approaches (see Methods for details).

Fig. 1: FastOMA algorithm overview.

Input proteomes are mapped to reference gene families using the OMAmer software, forming hierarchical orthologous groups (HOGs) at the root level (rootHOGs); see Methods. HOGs are inferred using a ‘bottom-up’ approach, starting from the leaves of the species tree and moving towards the root. At each taxonomic level, HOGs from the child level are merged, resulting in HOGs at the current level. To decide which HOGs should be merged, sequences from the child HOGs are used to create an MSA, followed by gene tree inference to identify speciation and duplication events. Child HOGs are merged if their genes evolved through speciation (see Methods and Supplementary Information for details). Credit: human silhouette, T. Michael Keesey (Public Domain Mark 1.0); chimpanzee silhouette, Jonathan Lawley (CC0 1.0 Universal); mouse silhouette, Soledad Miranda-Rottman (CC BY 3.0), PhyloPic.


FastOMA has high scalability without sacrificing accuracy in a diverse range of benchmarks. We assessed the accuracy of FastOMA on the Quest for Orthologs (QfO) suite of benchmarks. FastOMA retains OMA’s high precision and even improves upon it in terms of recall, positioning it on the Pareto frontier of orthology inference methods. For instance, on the SwissTree reference gene phylogeny benchmark, FastOMA outperforms other methods with a precision of 0.955 (Fig. 2a). With a recall in line with most state-of-the-art methods (0.69, lower than those of Panther and OrthoFinder), the balance of these metrics indicates a well-tuned approach to orthology inference, with a focus on minimizing false positives. Likewise, on the generalized species tree benchmark at the Eukaryota level, FastOMA is among those with the lowest topological error, with a normalized Robinson–Foulds distance (the number of different edges between two trees normalized by the total number of internal edges) of 0.225 to the reference tree, at moderate recall (Fig. 2b and Supplementary Information).

Fig. 2: FastOMA is not only fast but also accurate.

a, QfO benchmark, agreement with SwissTree reference phylogeny covering 19 manually curated gene trees. The error bars indicate 95% confidence intervals comparing FastOMA with EnsemblCompara, Domainoid, OrthoMCL, OrthoInspector, SonicParanoid, PANTHER, OrthoFinder, Hieranoid and the OMA family including OMA pairs, OMA groups and OMA GETHOGs (graph-based efficient technique for HOGs). b, QfO benchmarking of the generalized species discordance test on the Eukaryota clade, where the gene tree inferred from orthologous genes is compared with the reference species tree considering up to 3,000 gene trees per method (see Supplementary Information 2.1 for details). c, A computation time comparison of FastOMA and state-of-the-art alternatives. d, The impact of species tree resolution on the complexity of the gene family evolutionary scenario (proxied by the number of gene losses over the gene family history). Each point represents a gene family (a rootHOG), whereby the size of a gene family corresponds to the number of genes in it (the figure is truncated to focus on the most relevant region; see the corresponding Supplementary Figure for a version with all data, and see Methods for the implied losses calculation).


A key achievement of FastOMA is its linear scaling behavior (Fig. 2c), which opens up the possibility of processing extensive datasets rapidly. FastOMA inferred orthology among all 2,086 eukaryotic UniProt reference proteomes in under 24 h, using 300 CPU cores. In the same timespan, the original OMA algorithm could process only 50 genomes. Even methods optimized for speed such as OrthoFinder or SonicParanoid still exhibit quadratic time complexity (Fig. 2c). Thus, FastOMA’s linear scalability breaks new ground.

The initial sequence placement step using OMAmer helps FastOMA achieve its speed, but the subsequent alignment and tree inference steps are critical for its accuracy. Indeed, sequence placement alone is considerably less accurate than state-of-the-art methods in benchmarks (Supplementary Information).

FastOMA exploits known taxonomic relationships to reduce the number of sequence comparisons. By default, it relies on the commonly used National Center for Biotechnology Information (NCBI) taxonomy, but users can specify any reference species phylogeny as input. To assess the impact of the resolution of the input tree on orthology accuracy, we compared FastOMA’s performance on UniProt reference proteomes with a more resolved species tree derived from the TimeTree resource. Compared with the NCBI taxonomy, this resulted in improved ortholog predictions, with more parsimonious gene family evolution history, lowering the number of implied gene losses across all gene families (Fig. 2d). FastOMA is also robust to errors artificially introduced in the species taxonomy (see the Supplementary Figures). FastOMA can thus use advances in taxonomic knowledge for better orthology predictions and will benefit from the higher resolution that is brought by new genomic sequences from large-scale sequencing projects.

FastOMA contains additional features that make it easier to deal with complex and noisy genomic data. It is designed to handle multiple isoforms for the genes resulting from alternative splicing and select the most evolutionarily conserved ones, and can also deal with fragmented gene models. Both features lead to noticeable improvements in FastOMA inferences (Supplementary Information). As it uses the same data structure as OMA, FastOMA benefits from its rich ecosystem of downstream applications, including phylogenetic profiling, efficient gene family visualization, ancestral synteny inference and advanced phylostratigraphy, enabling researchers to trace gene family histories and understand gene emergence, duplication and loss events.

In conclusion, the FastOMA algorithm offers a unique solution for accurate orthology inference, making it possible to study evolutionary history at the scale of massive genomics projects. Future work will aim to further refine orthology inference by integrating structural protein data to improve resolution at deeper evolutionary levels, as well as gene order conservation as an additional layer of information.

Methods

FastOMA algorithm outline

FastOMA is a method for inferring orthology relationships. The input to FastOMA includes the proteome sets of species and the species tree. The FastOMA algorithm consists of two main steps: finding rootHOGs and inferring the nested structure of HOGs (Fig. 1).

Step 1: FastOMA gene family inference

The FastOMA algorithm infers gene families from the provided proteomes. The process begins by mapping the input proteomes onto the reference HOGs (Supplementary Information) using the OMAmer tool (Fig. 1). Proteins mapped to the same reference HOG are then grouped together, forming query rootHOGs, with the exclusion of proteins already present in the database. Thus, proteins in the database reference HOGs are not used in the next steps in FastOMA.

Although each rootHOG ideally represents a single gene family, instances may arise where a gene family of query proteomes is split among multiple rootHOGs. To address this, FastOMA tries to find those query rootHOGs that are associated with the same gene family. FastOMA leverages the ability of OMAmer to report multiple rootHOGs to which the sequences could be mapped, along with their score. This score (‘family_p’) is the P value of having as many or more k-mers in common between the protein sequence and the HOG under a binomial distribution, reported in negative natural logarithm. Considering a minimum threshold of 70 (by default), we construct a graph of rootHOGs, where each node represents a query rootHOG. In such a graph, we add an edge between two nodes (rootHOGs) when a minimum of ten proteins (by default) are mapped to both query rootHOGs and these shared proteins represent at least either 80% of all proteins mapping to the bigger rootHOG or 90% of those mapping to the smaller one. This ensures a high overlap of protein content of the merged rootHOG. Finally, we group the members of all HOGs in each highly connected component of this graph in a single query rootHOG.
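The rootHOG merging rule can be illustrated with a minimal sketch. It assumes the OMAmer hits have already been filtered by the family_p threshold and are available as a mapping from protein ID to the set of rootHOG IDs it maps to. The function name `merge_roothogs` and its parameters are hypothetical and not part of FastOMA, and plain connected components are used here as a simplification of the grouping step.

```python
from collections import defaultdict
from itertools import combinations


def merge_roothogs(hits, min_shared=10, frac_big=0.8, frac_small=0.9):
    """Group rootHOG ids that share enough mapped proteins (illustrative only).

    hits: dict protein_id -> set of rootHOG ids the protein maps to
          (assumed to be pre-filtered by the family_p threshold).
    Returns a list of sets of rootHOG ids, one set per connected component.
    """
    members = defaultdict(set)  # rootHOG id -> proteins mapped to it
    for protein, hogs in hits.items():
        for hog in hogs:
            members[hog].add(protein)

    # Add an edge when the shared proteins pass the count and fraction rules.
    adjacency = {hog: set() for hog in members}
    for a, b in combinations(sorted(members), 2):
        shared = len(members[a] & members[b])
        big = max(len(members[a]), len(members[b]))
        small = min(len(members[a]), len(members[b]))
        enough_overlap = shared >= frac_big * big or shared >= frac_small * small
        if shared >= min_shared and enough_overlap:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Collect connected components with an iterative depth-first search.
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components
```

Each returned set would then be treated as one merged query rootHOG. The pairwise loop is quadratic in the number of rootHOGs and is kept only for readability.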

It is worth noting that some proteins may not be assigned to any reference HOGs owing to no recognizable homologs in the reference database. In addition, there is a scenario where only one protein is mapped to the rootHOG, referred to as a singleton, representing an individual rather than a group. To ensure those genes are not lost to FastOMA’s orthology inference, these singletons and unmapped sequences are combined into a FASTA file on which we run Linclust, the clustering tool from the MMseqs package. This yields new query rootHOGs.

Critically, assigning proteins to rootHOGs (gene families) allows us to avoid unnecessary all-against-all comparisons of unrelated proteins (those without homology), thanks to the speed of OMAmer and Linclust. All the query rootHOGs are written as FASTA files to be used in the next step and can be handled in parallel.

Notably, the OMA team provides regular updates to the OMA database, increasing the number and diversity of species included in the database used by OMAmer. This results in higher resolution for k-mer-based grouping. As more taxa get included, we foresee FastOMA’s inference will improve as more sequences are placed into rootHOGs.

Step 2: FastOMA orthology inference

For every query rootHOG, FastOMA infers the nested structure of the HOG (as depicted in Fig. 1). The objective is to identify the genes that are grouped together at each taxonomic level as a HOG, which means they descended from a single gene at that specific level. Note that the number of HOGs at each level reflects the number of copies of the gene present in the ancestral species.

To achieve this, FastOMA follows a bottom-up approach by traversing the species tree. Starting from the leaves of the tree (extant species), each gene in the species’ proteome is treated as a HOG. At each level in the traversal, certain HOGs from the child level are combined. The determination of which HOGs will be merged is guided by a gene tree containing the proteins of species descending from this node. The merging is done for all HOGs that descended from the same common ancestor by a speciation event. The entire process is detailed below.
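The leaf-to-root traversal can be sketched as a post-order walk over the species tree. This is a simplified illustration, not FastOMA's code: the species tree is assumed to be a dict from internal node to its children, and the `merge` callback is a hypothetical placeholder for the gene-tree-guided merging decision.

```python
def hogs_bottom_up(species_tree, root, proteomes, merge):
    """Infer HOGs per taxonomic level by post-order traversal (illustrative).

    species_tree: dict internal node -> list of child nodes (leaves have no entry).
    proteomes:    dict extant species -> list of gene ids.
    merge:        callable(node, child_hogs) -> list of HOGs at this node.
    Returns a dict node -> list of HOGs, each HOG being a frozenset of genes.
    """
    hogs_at = {}

    def visit(node):
        children = species_tree.get(node, [])
        if not children:
            # Extant species: every gene starts as its own HOG.
            hogs_at[node] = [frozenset([gene]) for gene in proteomes[node]]
        else:
            # A node is processed only after all of its children.
            child_hogs = []
            for child in children:
                visit(child)
                child_hogs.extend(hogs_at[child])
            hogs_at[node] = merge(node, child_hogs)

    visit(root)
    return hogs_at
```

Because a node only depends on its own subtree, sibling subtrees could be processed independently, which is what makes parallelization over taxonomic levels possible.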

Gene tree inference

All the proteins in HOGs at the child level are collectively used for generating a multiple sequence alignment (MSA) using the MAFFT package. As part of the FastOMA Python script, the MSA undergoes column-wise trimming with a default threshold of 0.2, meaning that we remove columns of the MSA that have more than 80% gap elements (Supplementary Information). Aligned sequences (rows in the MSA) with more than 50% gaps (the default threshold) are subsequently removed from the alignment; they remain in the HOG but are not used for tree inference. Subsequently, we employ FastTree to infer the gene tree, and this tree is rooted using the midpoint approach.
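The two trimming thresholds can be made concrete with a small sketch. It assumes the alignment is held as a dict of equal-length strings with `-` as the gap character, and that the row filter is applied after column trimming. The function name `trim_msa` and its parameters are illustrative, not FastOMA's API.

```python
def trim_msa(msa, col_threshold=0.2, row_gap_limit=0.5, gap="-"):
    """Trim gappy columns, then drop gappy rows (illustrative only).

    msa: dict sequence name -> aligned sequence, all of equal length.
    A column is kept when at least col_threshold of its characters are
    residues, i.e. columns with more than 80% gaps are removed by default.
    A row is dropped when more than row_gap_limit of it is gaps.
    """
    names = list(msa)
    if not names:
        return {}
    n_rows = len(names)
    n_cols = len(msa[names[0]])

    kept_columns = [
        j for j in range(n_cols)
        if sum(msa[name][j] != gap for name in names) / n_rows >= col_threshold
    ]
    trimmed = {
        name: "".join(msa[name][j] for j in kept_columns) for name in names
    }
    # Rows dropped here would still belong to their HOG; they are only
    # excluded from the alignment passed on to tree inference.
    return {
        name: seq for name, seq in trimmed.items()
        if seq and seq.count(gap) / len(seq) <= row_gap_limit
    }
```

For example, a six-sequence alignment in which one column is a gap in five sequences loses that column (5/6 is above 80%), and a sequence that is then two-thirds gaps is left out of tree inference.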

To expedite the orthology inference process at deeper levels of the trees where the number of children is prohibitively high, we implement a subsampling approach, retaining only a specified number of proteins per HOG for the MSA and tree inference (by default, 20 proteins are randomly selected; see the Supplementary Figures). The unsampled sequences will have the same fate as the rest of the proteins in the same group at the defined taxonomic level. Note that the subsampling strategy is key to the speed of FastOMA, and expectedly, there is a trade-off between accuracy and speed. Our benchmarking results indicate that FastOMA performs well with the subsampling approach, but users can change the degree of the subsampling in the parameter file.
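A minimal sketch of the subsampling idea, assuming each HOG is represented simply as a list of protein IDs. The function name `subsample_hogs` and the `seed` argument are assumptions made for reproducibility of the example, not FastOMA parameters.

```python
import random


def subsample_hogs(hogs, max_per_hog=20, seed=None):
    """Pick at most max_per_hog proteins per HOG at random (illustrative).

    hogs: dict HOG id -> list of protein ids.
    Returns a dict HOG id -> list of sampled protein ids, to be used for the
    MSA and gene tree; unsampled proteins stay in their HOG and follow its fate.
    """
    rng = random.Random(seed)
    sampled = {}
    for hog_id, proteins in hogs.items():
        proteins = list(proteins)
        if len(proteins) > max_per_hog:
            sampled[hog_id] = rng.sample(proteins, max_per_hog)
        else:
            sampled[hog_id] = proteins
    return sampled
```

Capping the number of sequences per HOG bounds the alignment and tree size at each level, which is where the speed gain comes from; a larger cap trades runtime for potentially more accurate gene trees.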

Duplication and speciation event labeling

Each internal node in the gene tree is classified as either a duplication or a speciation event using the species overlap method. For each node in the gene tree, this involves calculating the ratio of the number of shared species between its two subtrees divided by the number of all species (union). If the ratio equals zero, the node is labeled as a speciation event; otherwise, it is labeled as a duplication event. When the species overlap ratio is less than 0.1 (as per default settings), indicating very low support for a duplication event, all leaves from the child subtree with the least number of proteins are excluded from merging decisions (described in ‘HOG merging’ section). In other words, these proteins will stay in the corresponding HOGs as in the previous taxonomic level, and only the taxonomic label of the HOG is updated to the current taxonomic level (assuming no other merging happens in another part of the gene tree for this HOG). This is done to ensure that errors in gene annotation or inaccurate tree inference only minimally affect the orthology inference.
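The species overlap rule lends itself to a short sketch. It assumes a binary, rooted gene tree whose leaves carry a species name. The class `GeneTreeNode` and the function `label_events` are hypothetical names for illustration, and the sketch only flags low-support duplications rather than carrying out the exclusion step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneTreeNode:
    children: List["GeneTreeNode"] = field(default_factory=list)
    species: Optional[str] = None   # set on leaves only
    label: Optional[str] = None     # "speciation" or "duplication"
    overlap: float = 0.0            # shared species / union of species
    low_support: bool = False       # duplication with overlap below cutoff


def label_events(node, low_support_cutoff=0.1):
    """Label internal nodes in post-order; return the species set below node."""
    if not node.children:
        return {node.species}
    left, right = (label_events(child, low_support_cutoff)
                   for child in node.children)
    node.overlap = len(left & right) / len(left | right)
    node.label = "speciation" if node.overlap == 0 else "duplication"
    node.low_support = 0 < node.overlap < low_support_cutoff
    return left | right
```

For a tree ((human, chimp), (human, mouse)), both inner nodes have no shared species and are labeled speciation, while the root shares one of three species (ratio 1/3) and is labeled duplication.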

HOG merging

Starting from the root of the gene tree, evidence of a speciation event (that is, the internal node annotated as a speciation event due to no species overlap) prompts the merging of the HOGs of the leaves descending from the nodes. This is achieved by constructing a HOG graph, where each node represents a HOG. An edge is introduced between HOG1 and HOG2 if protein 1 (located in HOG1) and protein 2 (in HOG2) coalesce at a speciation event in the gene tree. Subsequently, each connected component within this graph constitutes a HOG at the current level of the species tree. Furthermore, FastOMA has a mechanism to handle spuriously merged subHOGs; at the deeper taxonomy level, when genes within a subHOG coalesce at a duplication event in the gene tree, FastOMA splits the subHOG into two, ensuring copies of ancestral genes are not co-present in a subHOG.
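The HOG graph and its connected components can be sketched with a union-find structure. The input format is an assumption made for illustration: internal gene tree nodes are two-element lists, and leaves are `(species, child_hog_id)` tuples. The sketch omits the low-support exclusion and the subHOG splitting mechanism.

```python
def merge_child_hogs(gene_tree):
    """Merge child HOGs whose proteins coalesce at a speciation node.

    gene_tree: nested two-element lists for internal nodes; each leaf is a
               (species, child_hog_id) tuple.
    Returns a list of sets of child HOG ids, one set per HOG at this level.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    def visit(node):
        # Returns (species below node, child HOG ids below node).
        if isinstance(node, tuple):
            species, hog = node
            find(hog)  # register the HOG even if it is never merged
            return {species}, {hog}
        (sp_left, hogs_left), (sp_right, hogs_right) = visit(node[0]), visit(node[1])
        if not (sp_left & sp_right):  # no species overlap: speciation
            for a in hogs_left:
                for b in hogs_right:
                    union(a, b)
        return sp_left | sp_right, hogs_left | hogs_right

    visit(gene_tree)
    groups = {}
    for hog in parent:
        groups.setdefault(find(hog), set()).add(hog)
    return list(groups.values())
```

With the tree `[[("human", "H1"), ("chimp", "H2")], [("human", "H3"), ("mouse", "H4")]]`, the two inner speciation nodes merge H1 with H2 and H3 with H4, while the root is a duplication and keeps the two groups apart.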

Inferring orthology relationship

Once the species tree traversal is complete, the nested structure of the query HOG is fully resolved. From the HOG structure inferred this way, all orthology and paralogy relationships can be efficiently deduced.

Note on parallelization

Scalability has been a major challenge in the field of orthology inference highlighted by the QfO community for many years. FastOMA is optimized to process taxonomic levels in parallel (when possible) by inferring HOGs at all taxonomic levels, accounting for dependencies among child HOGs, that is, a node will be processed after all its child nodes are processed. To optimize parallelization efficiency by avoiding unnecessary overheads of Nextflow and Slurm management workflows, FastOMA groups approximately 150 small- to medium-sized query rootHOGs together, treating them as a single job. Conversely, large rootHOGs are processed individually (to infer nested structure of HOGs) for optimal performance using Python-future for which taxonomic parallelization is activated. The default rootHOG file size threshold for this purpose is 400,000 bytes, or ~500 proteins (Supplementary Information).
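The job-planning idea can be sketched as follows, assuming the rootHOG FASTA file sizes are already known. The function name `plan_jobs`, the strict "greater than" comparison and the fixed batch size of 150 are assumptions for illustration; FastOMA's actual grouping logic may differ.

```python
def plan_jobs(file_sizes, big_threshold=400_000, batch_size=150):
    """Split rootHOG files into batched small jobs and individual big jobs.

    file_sizes: dict file path -> size in bytes.
    Returns (batches, big_files): batches is a list of lists of small files,
    each list meant to run as one job; big_files run one per job.
    """
    big_files = sorted(p for p, size in file_sizes.items() if size > big_threshold)
    small_files = sorted(p for p, size in file_sizes.items() if size <= big_threshold)
    batches = [
        small_files[i:i + batch_size]
        for i in range(0, len(small_files), batch_size)
    ]
    return batches, big_files
```

Batching many small families into one job amortizes the scheduler's per-job overhead, whereas large families are worth a job of their own because they can use within-family parallelism across taxonomic levels.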

FastOMA outputs

The main output of FastOMA is an OrthoXML file that stores HOGs and their nested structures, allowing their evolutionary histories to be reconstructed. Furthermore, FastOMA reports the protein list in each rootHOG (gene family) in TSV format. A final FastOMA output is a list of proteins in strict orthologous groups, wherein all genes within the group are orthologous to each other, which can be used as marker genes for phylogenetic analyses. In addition, the user can store the gene trees and MSAs of the subsampled HOGs for all taxonomic levels.

Isoform selection

FastOMA is capable of handling proteomes that feature multiple protein isoforms for a gene due to alternative splicing. Users can provide an isoform file where each row lists comma-separated protein IDs associated with a gene. FastOMA selects the isoform with the highest ‘family_p’ score, the one with the best fit to known proteins in the reference rootHOG based on k-mer content. For the evaluation of isoform selection, we used the UniProt reference proteomes and their splice information (https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota).
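Isoform selection by highest family_p can be sketched in a few lines. The sketch assumes the isoform file has been read into a list of rows and that family_p scores are available as a dict. The function name `select_isoforms` and the treatment of unscored proteins (ranked last) are assumptions, and the separator is left as a parameter in case the file uses a different delimiter.

```python
def select_isoforms(isoform_rows, family_p, sep=","):
    """Keep one isoform per gene: the one with the highest family_p score.

    isoform_rows: iterable of strings, one row per gene, listing the
                  protein ids of its isoforms separated by `sep`.
    family_p:     dict protein id -> family_p score; proteins without a
                  score are ranked last (assumption).
    Returns the list of selected protein ids, in row order.
    """
    selected = []
    for row in isoform_rows:
        ids = [pid.strip() for pid in row.split(sep) if pid.strip()]
        if ids:
            best = max(ids, key=lambda pid: family_p.get(pid, float("-inf")))
            selected.append(best)
    return selected
```

Because family_p is a negative log P value, a higher score means a stronger k-mer match to the reference family, so taking the maximum picks the isoform that fits the family best.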

FastOMA software

The FastOMA codebase is composed of multiple subpackages written in Python. FastOMA benefits from the Nextflow workflow to parallelize different steps and subpackages considering the dependencies modeled as a directed acyclic graph (Supplementary Information). The software is publicly available on GitHub (https://github.com/DessimozLab/FastOMA) and on DockerHub (https://hub.docker.com/r/dessimozlab/fastoma).

Time comparison on eukaryotic dataset

We considered all 2,181 eukaryotic UniProt reference proteomes (accessed on 25 January 2023) and filtered them to keep those with a minimum BUSCO (benchmarking universal single-copy orthologs) completeness of 50%, resulting in 2,086 proteomes in total. We ran SonicParanoid, OrthoFinder and FastOMA on datasets with different sizes ranging from 10 to 2,086 species.

OrthoFinder 2.5.4 was run in two steps. First, to generate all-against-all sequence comparisons, we used the -op parameter to generate and execute command lines for Diamond. Then, the rest of OrthoFinder was conducted. SonicParanoid 2.0.4 was used with default parameters using 48 CPUs with a limit of 3 days wall clock. It is neither possible to parallelize SonicParanoid2 on different computation nodes nor to feed it with the result of Diamond; hence, we could not obtain compute time for the larger datasets during the mentioned time limit. For FastOMA, the NCBI tree was downloaded via the ETE3 package. The comparison of tools in terms of wall-clock time in hours is reported in the Supplementary Figures. The Diamond part of OrthoFinder and all steps of FastOMA use different nodes on the cluster, so the reported wall-clock time might have been affected by the availability of nodes at the time of each run. However, the CPU times reported in Fig. 2c are more accurate.

Analysis on tree resolution

We ran FastOMA on both the TimeTree and the NCBI tree. For the TimeTree analysis, we uploaded the list of species names to the TimeTree webserver (https://timetree.org). This resulted in a species tree with 1,757 leaves since some of the species were not available in TimeTree. We ran FastOMA with default parameters on the dataset of 1,757 proteomes and with both the TimeTree tree and NCBI tree as the species tree. We used pyHAM for calculating the implied gene losses.

To calculate the estimated proportion of proteomes composed of fragments, we ran OMArk v0.3 on all proteomes. We used the BUSCO statistics downloaded from the UniProt website for the full eukaryotic dataset.

We also conducted another analysis to study the impact of the species tree for the QfO dataset where five pairs of species are swapped. The results are provided in the Supplementary Information and Supplementary Figures, where FastOMA shows a moderate level of robustness. However, having an erroneous species tree impacted the orthology inference by introducing false positives.

To conclude, we highlight that the orthologous and paralogous genes are found using the species overlap method on the gene tree and the species tree is used to determine the order of comparisons, defining the HOG structure. Thus, a fully resolved species tree is not needed to infer orthology information with FastOMA. However, errors in the species tree can potentially propagate through the orthology inference process.

Benchmarking against the QfO reference proteome set

We ran FastOMA on the 78 reference proteomes used in the QfO benchmark and the associated standard species trees as input. We then submitted the results to the QfO benchmarking service and obtained the results on the 11 available benchmarks. In these benchmarks, FastOMA is compared with several state-of-the-art methods that are available in the QfO public resource, including EnsemblCompara, Domainoid, OrthoMCL, OrthoInspector, SonicParanoid, PANTHER, OrthoFinder, Hieranoid and the OMA family. QfO analysis is described in detail in Supplementary Information. The orthogroup benchmarking for the clade Bilateria is provided in Supplementary Information.

Analysis of the QfO reference proteome set using InterProScan classification of protein families

To study the influence of the OMA database and OMAmer on the performance of FastOMA, we replaced the first step of the procedure, in which OMAmer normally places query genes into the rootHOGs of the OMA database, with InterProScan. We used InterProScan to group the QfO proteomes into gene families predefined by InterProScan. To do so, we first ran InterProScan with the argument -appl Pfam on the QfO dataset, which grouped the proteins into InterProScan families. We then created the rootHOGs from those groups, keeping the InterProScan family identifier of each group, and ran the rest of FastOMA on these rootHOG FASTA files. The QfO benchmarking results are shown in Supplementary Information and Supplementary Figs.

Note that users can provide their own initial grouping of proteins to be used with FastOMA. This can be done in two ways: (1) running the last two processes of FastOMA.nf (hog_rest and collect_subhog) on the user’s protein families in FASTA format or (2) providing a group mapping of proteins in the OMAmer format.
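The grouping step described above can be sketched as follows, assuming InterProScan's tab-separated output (protein accession in column 1, analysis name in column 4, signature accession in column 5, score in column 9). The function name and the sample rows are hypothetical, and the rule used here for proteins with several Pfam hits (keep the hit with the lowest e-value) is an assumption, since the text does not state how such cases were resolved.

```python
from collections import defaultdict


def group_by_pfam(tsv_lines):
    """Assign each protein to one Pfam family from InterProScan TSV rows."""
    best_hit = {}  # protein accession -> (e-value, Pfam accession)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[3] != "Pfam":
            continue  # skip malformed rows and other member databases
        protein, pfam_id = fields[0], fields[4]
        try:
            evalue = float(fields[8])
        except ValueError:
            continue  # score column can be "-" when no score is reported
        # Assumption: keep only the most significant Pfam hit per protein.
        if protein not in best_hit or evalue < best_hit[protein][0]:
            best_hit[protein] = (evalue, pfam_id)

    families = defaultdict(list)
    for protein, (_, pfam_id) in best_hit.items():
        families[pfam_id].append(protein)
    return dict(families)


# Hypothetical rows in the InterProScan TSV column layout.
sample_rows = [
    "\t".join(["P1", "md5", "300", "Pfam", "PF00001", "desc", "10", "200", "1e-30", "T", "01-01-2024"]),
    "\t".join(["P1", "md5", "300", "Pfam", "PF00002", "desc", "210", "290", "1e-5", "T", "01-01-2024"]),
    "\t".join(["P2", "md5", "280", "Pfam", "PF00001", "desc", "12", "190", "2e-10", "T", "01-01-2024"]),
    "\t".join(["P3", "md5", "150", "Gene3D", "G3DSA:1.10", "desc", "5", "100", "1e-50", "T", "01-01-2024"]),
]
families = group_by_pfam(sample_rows)
```

Each resulting family could then be written to its own FASTA file, named after the Pfam accession, to serve as a user-provided rootHOG.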

Computations

All analyses were conducted on the high-performance computing cluster of the University of Lausanne, which houses 96 compute nodes. Each node is equipped with two 24-core AMD (Advanced Micro Devices) CPUs, totaling 48 cores per node. Data were written to and read from a 150 TB SSD (solid-state drive) scratch drive.

For the QfO analysis, most steps of FastOMA needed less than 10 GB of memory, with a maximum of 32 GB.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

UniProt reference proteomes and splice information (_additional.fasta.gz) were downloaded from https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota. The 2020 version of the QfO proteomes was downloaded from the EBI repository at http://ftp.ebi.ac.uk/pub/databases/reference_proteomes/previous_releases/qfo_release-2020_04_with_updated_UP000008143/. The OMAmer database used in this study is available at https://omabrowser.org/All/LUCA.h5. The OMAmer database, an archive of the FastOMA code, the TimeTree with annotated internal nodes for 1,757 species in Newick format, the UniProt IDs and the inferred HOGs for 1,757 eukaryotic species in OrthoXML format are all deposited on Zenodo at https://doi.org/10.5281/zenodo.10403053 (Majidian et al., 2023).
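HOGs stored in OrthoXML format can be inspected with standard-library Python. The sketch below is illustrative only: it assumes the usual OrthoXML layout (`gene` elements carrying `id` and `protId`, and top-level `orthologGroup` elements under `groups` that contain `geneRef` elements), and the embedded document and function names are made up for the example rather than taken from the deposited files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop the XML namespace prefix, e.g. "{http://orthoXML.org/2011/}gene".
    return tag.rsplit("}", 1)[-1]


def root_hog_members(orthoxml_text):
    """Map each top-level ortholog group to the protein IDs it contains."""
    root = ET.fromstring(orthoxml_text)

    # Internal gene id -> protein identifier.
    prot_ids = {
        el.get("id"): el.get("protId")
        for el in root.iter()
        if local_name(el.tag) == "gene"
    }

    hogs = {}
    for section in root:
        if local_name(section.tag) != "groups":
            continue
        for index, group in enumerate(section):
            if local_name(group.tag) != "orthologGroup":
                continue
            # Collect gene references at any nesting depth (sub-HOGs included).
            members = [
                prot_ids.get(ref.get("id"), ref.get("id"))
                for ref in group.iter()
                if local_name(ref.tag) == "geneRef"
            ]
            hogs[group.get("id", f"group_{index}")] = members
    return hogs


# Hypothetical miniature OrthoXML document.
example = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.5" origin="example" originVersion="1">
  <species name="HUMAN" NCBITaxId="9606"><database name="db" version="1"><genes>
    <gene id="1" protId="HUMAN_a"/><gene id="2" protId="HUMAN_b"/>
  </genes></database></species>
  <species name="MOUSE" NCBITaxId="10090"><database name="db" version="1"><genes>
    <gene id="3" protId="MOUSE_a"/>
  </genes></database></species>
  <groups>
    <orthologGroup id="HOG:0001">
      <geneRef id="3"/>
      <paralogGroup><geneRef id="1"/><geneRef id="2"/></paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>"""
hogs = root_hog_members(example)
```

For files covering many species, a streaming parser such as `ET.iterparse` would likely be preferable to loading the whole document at once.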

Code availability

FastOMA is free, open-source software (Mozilla Public License 2.0) available via GitHub at https://github.com/DessimozLab/FastOMA. We used the publicly available code for the QfO benchmarking test, which is available via GitHub at https://github.com/qfo/benchmark-webservice. A copy of the FastOMA software is available via Zenodo at https://doi.org/10.5281/zenodo.10403053 (Majidian et al., 2023).

References

Lewin, H. A. et al. Earth BioGenome Project: sequencing life for the future of life. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).

Fitch, W. M. Distinguishing homologous from analogous proteins. Syst. Zool., 99–113 (1970).

Glover, N. et al. Advances and applications in the Quest for Orthologs. Mol. Biol. Evol., 2157–2164 (2019).

Linard, B. et al. Ten years of collaborative progress in the Quest for Orthologs. Mol. Biol. Evol. https://doi.org/10.1093/molbev/msab098 (2021).

Altenhoff, A. M. et al. OMA orthology in 2024: improved prokaryote coverage, ancestral and extant GO enrichment, a revamped synteny viewer and more in the OMA Ecosystem. Nucleic Acids Res. https://doi.org/10.1093/nar/gkad1020 (2023).

Dessimoz, C. et al. OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements. In RECOMB 2005 Workshop on Comparative Genomics (eds McLysaght, A. & Huson, D. H.) 61–72 (Springer, 2005).

Kirilenko, B. M. et al. Integrating gene annotation with orthology inference at scale. Science 380, eabn3107 (2023).

Rossier, V., Vesztrocy, A. W., Robinson-Rechavi, M. & Dessimoz, C. OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches. Bioinformatics https://doi.org/10.1093/bioinformatics/btab219 (2021).

Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun., 2542 (2018).

Altenhoff, A. M. et al. Standardized benchmarking in the quest for orthologs. Nat. Methods, 425–430 (2016).

Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol., 238 (2019).

Cosentino, S., Sriswasdi, S. & Iwasaki, W. SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models. Genome Biol., 195 (2024).

Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).

Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol., 1635–1638 (2016).

Kumar, S. et al. TimeTree 5: an expanded resource for species divergence times. Mol. Biol. Evol., msac174 (2022).

Nevers, Y. et al. Quality assessment of gene repertoire annotations with OMArk. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02147-w (2024).

Zajac, N. et al. Gene duplication and gain in the trematode Atriophallophorus winterbourni contributes to adaptation to parasitism. Genome Biol. Evol., evab010 (2021).

Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 772–780 (2013).

Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE, e9490 (2010).

Huerta-Cepas, J., Dopazo, H., Dopazo, J. & Gabaldón, T. The human phylome. Genome Biol., R109 (2007).

Vilella, A. J. et al. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res., 327–335 (2009).

Persson, E., Kaduk, M., Forslund, S. K. & Sonnhammer, E. L. L. Domainoid: domain-oriented orthology inference. BMC Bioinf., 523 (2019).

Li, L., Stoeckert, C. J. Jr & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res., 2178–2189 (2003).

Nevers, Y. et al. OrthoInspector 3.0: open portal for comparative genomics. Nucleic Acids Res., D411–D418 (2019).

Mi, H. et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res., D394–D403 (2021).

Schreiber, F. & Sonnhammer, E. L. L. Hieranoid: hierarchical orthology inference. J. Mol. Biol. 425, 2072–2081 (2013).

Altenhoff, A. M. et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res., 1152–1163 (2019).

Altenhoff, A. M., Gil, M., Gonnet, G. H. & Dessimoz, C. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS ONE, e53786 (2013).

Train, C.-M., Glover, N. M., Gonnet, G. H., Altenhoff, A. M. & Dessimoz, C. Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics, i75–i82 (2017).

Train, C.-M., Pignatelli, M., Altenhoff, A. & Dessimoz, C. iHam & pyHam: visualizing and processing hierarchical orthologous groups. Bioinformatics https://doi.org/10.1093/bioinformatics/bty994 (2018).

Nevers, Y. et al. The Quest for Orthologs orthology benchmark service in 2022. Nucleic Acids Res., W623–W632 (2022).

Dylus, D., Altenhoff, A., Majidian, S., Sedlazeck, F. J. & Dessimoz, C. Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01753-4 (2023).

Dylus, D. et al. How to build phylogenetic species trees with OMA. F1000Res., 511 (2020).

Altenhoff, A. & Dessimoz, C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol., e1000262 (2009).

Cosentino, S. & Iwasaki, W. SonicParanoid: fast, accurate and easy orthology inference. Bioinformatics, 149–151 (2019).

Zahn-Zabal, M., Dessimoz, C. & Glover, N. M. Identifying orthologs with OMA: a primer. F1000Res., 27 (2020).

Emms, D. & Kelly, S. Benchmarking orthogroup inference accuracy: revisiting orthobench. Genome Biol. Evol., 2258–2266 (2020).

Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics, 1236–1240 (2014).

Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res., D344–D354 (2021).

Majidian, S. et al. Orthology inference at scale with FastOMA. Zenodo https://doi.org/10.5281/zenodo.10403053 (2023).

Acknowledgements

We thank C. Train for updating pyHam, as well as B. Sipos and S. K. Bhurji for helpful feedback on FastOMA. This work was funded by a Swiss National Science Foundation grant (205085) to C.D.

Author information

Authors and Affiliations

Department of Computational Biology, University of Lausanne, Lausanne, Switzerland

Sina Majidian, Yannis Nevers, Ali Yazdizadeh Kharrazi, Alex Warwick Vesztrocy, Stefano Pascarelli, David Moi, Natasha Glover & Christophe Dessimoz

Swiss Institute of Bioinformatics, Lausanne, Switzerland

Sina Majidian, Yannis Nevers, Alex Warwick Vesztrocy, Stefano Pascarelli, David Moi, Natasha Glover, Adrian M. Altenhoff & Christophe Dessimoz

Department of Computer Science, ETH Zurich, Zurich, Switzerland

Adrian M. Altenhoff

Contributions

S.M., A.M.A. and C.D. developed the method. S.M. and A.M.A. implemented the software. S.M., A.Y.K., Y.N., A.W.V., N.G., D.M. and S.P. contributed to the analysis. C.D. and S.M. wrote and edited the manuscript. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Christophe Dessimoz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Michael Hiller, Bui Minh, Johannes Soeding and Thomas Wong for their contribution to the peer review of this work. Primary handling editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Information 1–11, Table 1 and Figs. 1–25.

Reporting Summary

Peer Review File

Rights and permissions

Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Majidian, S., Nevers, Y., Yazdizadeh Kharrazi, A. et al.