Distributed transformer for high order epistasis detection in large-scale datasets

Published by Nature and other sources, 2024-06-25 13:19



Abstract

Understanding the genetic basis of complex diseases is one of the most important challenges in current precision medicine. To this end, Genome-Wide Association Studies aim to correlate Single Nucleotide Polymorphisms (SNPs) to the presence or absence of certain traits. However, these studies do not consider interactions between several SNPs, known as epistasis, which explain most genetic diseases.

Analyzing SNP combinations to detect epistasis is a major computational task, due to the enormous search space. A possible solution is to employ deep learning strategies for genomic prediction, but the lack of explainability derived from the black-box nature of neural networks is a challenge yet to be addressed.

Herein, a novel, flexible, portable, and scalable framework for network interpretation based on transformers is proposed to tackle any-order epistasis. The results on various epistasis scenarios show that the proposed framework outperforms state-of-the-art methods for explainability, while being scalable to large datasets and portable to various deep learning accelerators.

The proposed framework is validated on three WTCCC datasets, identifying SNPs related to genes known in the literature that have direct relationships with the studied diseases.

Introduction

Advancements in DNA sequencing in the past 40 years have paved the way from analyzing small sequences to mapping the entire human genome. This technological breakthrough has allowed for the emergence of Genome-Wide Association Studies (GWAS) [1], a research approach that aims to unveil correlations between complex diseases and Single Nucleotide Polymorphisms (SNPs), a common type of genetic variation.

GWAS study a phenotype (a set of observable characteristics, such as a disease) and classify the individuals of a population by the presence (case) or absence (control) of the studied traits. The rationale for this methodology lies in assuming that common diseases have common underlying influential genetic variants across a population [2].

Some examples of the success of GWAS include the association between the IL-12/IL-23 pathway and the development of Crohn’s Disease [3], as well as the discovery of the PTPN22 gene’s influence in autoimmune diseases [4].

The GWAS approach makes a crucial assumption: SNPs are independently correlated to the studied phenotype.

Therefore, SNPs can be tested individually for statistical relevance to the disease, while neglecting gene-environment and gene-gene interactions. This omission is linked to what the literature calls the “missing heritability” problem [5]. The combinatorial effect that arises when two or more SNPs interact is known as epistasis and may play a fundamental role in the missing heritability problem.

Research on epistasis has focused on the detection of SNP interactions to explain complex diseases, such as Late Onset Alzheimer’s Disease [6].

Finding the optimal interacting SNP combination to explain a disease implies the exhaustive evaluation of all possible combinations, which remains a computational challenge.

As an example, on WTCCC datasets, as ma.
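To give a sense of why exhaustive evaluation is costly, the following minimal sketch counts the SNP sets of a given interaction order using the binomial coefficient. The SNP count of 1000 is an illustrative assumption and is not taken from any dataset discussed in the text; the function name is likewise hypothetical.

```python
from math import comb


def count_snp_combinations(num_snps: int, order: int) -> int:
    """Number of distinct SNP sets of size `order` among `num_snps` SNPs."""
    return comb(num_snps, order)


# Illustrative SNP count only (an assumption, not a figure from the text).
NUM_SNPS = 1000
for order in (2, 3, 4):
    print(order, count_snp_combinations(NUM_SNPS, order))
```

Each additional interaction order multiplies the number of candidate sets by roughly the number of SNPs divided by the order, which is why higher orders quickly become impractical to search exhaustively.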

(1)

where D is the embedding size. The inner product $QY^T$ measures each SNP’s importance for predicting the current label. When the embeddings of a SNP t and a label are similar, they are mapped to similar keys and queries, so $Q{Y_t}^T$ should have a large value. Conversely, dissimilar embeddings lead to a small product, indicating no relationship between the SNP and the label.

In Eq. (1), the softmax function is given by

$$\begin{aligned} \mathrm{Softmax}\left(QY_{t}^T/\sqrt{D}\right) = \frac{\exp\left(QY_{t}^T/\sqrt{D}\right)}{\sum _{t'} \exp\left(QY_{t'}^T/\sqrt{D}\right)}, \end{aligned}$$

(2)

where exp(.) denotes the exponential function. Applying softmax to $QY^T/\sqrt{D}$ outputs a probability distribution over the SNPs, known as attention scores, which are used to combine the value vectors. As interacting SNPs should have a large $Q{Y_t}^T/\sqrt{D}$ value, the corresponding attention score should also be large.
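As a rough illustration of Eq. (2), the sketch below computes attention scores between a single label query and a handful of SNP keys in plain Python. The embeddings are made-up toy values and the function name is hypothetical; it is not the authors' TensorFlow implementation.

```python
import math


def attention_scores(query, keys):
    """Softmax over scaled dot products between one query and each key."""
    dim = len(query)  # embedding size D
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy embeddings (assumed values): the first key is aligned with the query,
# the second is orthogonal to it, and the third points the opposite way.
label_query = [1.0, 0.0, 1.0, 0.0]
snp_keys = [
    [2.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [-1.0, 0.0, -1.0, 0.0],
]
scores = attention_scores(label_query, snp_keys)
print(scores)
```

The scores sum to one, and the key most similar to the query receives the largest share, which is the property the text relies on to rank SNPs.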

Therefore, keeping the SNPs with the highest attention scores after training provides a method to identify potential epistatic interactions. An exhaustive search can be performed afterwards on the chosen SNPs to find the optimal SNP combination.

While this approach works, it has some drawbacks. In epistatic datasets, it is unlikely that many SNPs have a true correlation to the label.

Therefore, calculating attention simultaneously between all SNPs and the label may hinder the identification of epistatic interactions if most SNPs are noisy. To overcome this problem and boost the transformer’s prediction power, a possible solution is to split the key vector (which represents the SNPs) into several partitions, $Y_i$, and calculate attention between the query and a partition (the query cannot be split because it represents a single token, the patient’s label).

As each partition has a smaller number of SNPs, noise is reduced, increasing the chances of identifying true epistatic SNPs. However, there is no guarantee that a single partition holds all possible interacting SNPs. Therefore, attention should be calculated between combinations of partitions, allowing for all possible subsets of SNPs to be evaluated together.

Figure 4 provides an example of this strategy.

In this example, the key vector is split into three partitions and mixed in combinations of two, resulting in three different options (1 and 2, 1 and 3, 2 and 3). Attention sc.
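The partition scheme in this example can be sketched as follows. The contiguous, near-equal split and both function names are assumptions made for illustration; the text does not specify how SNPs are assigned to partitions.

```python
from itertools import combinations


def split_into_partitions(snp_indices, num_partitions):
    """Split SNP indices into contiguous partitions of near-equal size."""
    size, extra = divmod(len(snp_indices), num_partitions)
    partitions, start = [], 0
    for p in range(num_partitions):
        end = start + size + (1 if p < extra else 0)
        partitions.append(snp_indices[start:end])
        start = end
    return partitions


def partition_groups(partitions, group_size):
    """Yield (partition ids, merged SNP indices) per combination of partitions."""
    for ids in combinations(range(len(partitions)), group_size):
        merged = [snp for i in ids for snp in partitions[i]]
        yield ids, merged


# Six toy SNPs, three partitions, mixed in combinations of two.
parts = split_into_partitions(list(range(6)), 3)
groups = list(partition_groups(parts, 2))
for ids, merged in groups:
    print(ids, merged)
```

Attention would then be computed between the label query and the keys of each merged group, so every pair of partitions is evaluated together at least once.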

(3)

where $\odot$ represents the Hadamard product (element-wise product), ${h_i}^L$ is the output of the i-th token from the last Transformer layer, L, and $\nabla {h_i}^L$ is given by

$$\begin{aligned} \nabla {h_i}^L = \frac{\partial y^c}{\partial {h_i}^L}, \end{aligned}$$

(4)

where $y^c$ is the transformer’s final output for class c. Therefore, $\nabla {h_i}^L$ illustrates a partial linearization from ${h_i}^L$ that captures the importance of the i-th token to a target class c. Attentive CAT is then calculated as

$$\begin{aligned} {AttCAT_i}^L = ({\alpha _i}^L \cdot {CAT_i}^L)_H, \end{aligned}$$

(5)

where ${\alpha _i}^L$ denotes the attention scores of the i-th token at the L-th layer. This result is averaged over the attention heads, H.

For the proposed framework, only one transformer layer exists, with a single encoder. After training, $\nabla {h_i}^L$ is calculated for each SNP between the transformer’s final output and the encoder’s output, as well as attention scores.
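A minimal sketch of the single-layer, single-head case is given below. It assumes that the CAT term is the Hadamard product of the gradient and the encoder output, which is how the definitions around Eqs. (3) to (5) read, and it further assumes that each token's vector is reduced to one scalar by summing over the embedding dimension. All values and the function name are illustrative, not taken from the source.

```python
def attcat_scores(attention, hidden, gradients):
    """Per-token score: attention * sum over dims of (gradient * hidden)."""
    scores = []
    for alpha, h, g in zip(attention, hidden, gradients):
        cat = [gi * hi for gi, hi in zip(g, h)]  # Hadamard product per token
        scores.append(alpha * sum(cat))  # assumed reduction to a scalar
    return scores


# Toy values for three SNP tokens with embedding size 2 (assumed).
toy_attention = [0.7, 0.2, 0.1]
toy_hidden = [[1.0, 2.0], [0.5, 0.5], [1.0, -1.0]]
toy_gradients = [[0.5, 0.25], [1.0, 1.0], [0.2, 0.2]]
toy_scores = attcat_scores(toy_attention, toy_hidden, toy_gradients)
print(toy_scores)
```

With a single attention head there is nothing to average over, so the per-head expectation in Eq. (5) reduces to the product itself.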

While Attentive CAT suggests an element-wise multiplication between attention scores and gradients, for the proposed framework, the element-wise sum is also calculated. Furthermore, for these calculations, both gradients and attention scores are scaled from 0 to 1, to mitigate differences in the order of magnitude of both metrics.

For element-wise sum and multiplication, averaging along the attention heads is not necessary, as the proposed network architecture works with a single attention head.
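The scaling and combination steps described above can be sketched as follows, assuming one attention score and one gradient value per SNP are already available. Min-max scaling is used as one plausible reading of "scaled from 0 to 1"; the function names, the toy inputs, and the choice to return the scaled versions of the two separate metrics are assumptions.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input carries no ranking
    return [(v - lo) / (hi - lo) for v in values]


def interpretation_metrics(attention, gradients):
    """Return per-SNP scores: each input scaled, plus their sum and product."""
    att = min_max_scale(attention)
    grad = min_max_scale(gradients)
    return {
        "attention": att,
        "gradient": grad,
        "sum": [a + g for a, g in zip(att, grad)],
        "product": [a * g for a, g in zip(att, grad)],
    }


def top_k(scores, k):
    """Indices of the k highest-scoring SNPs, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy inputs (assumed): the two metrics differ by orders of magnitude.
metrics = interpretation_metrics(
    [0.125, 0.5, 0.25, 0.125], [0.0, 30.0, 40.0, 10.0]
)
print(top_k(metrics["sum"], 2), top_k(metrics["product"], 2))
```

Because min-max scaling preserves order, it changes the sum and product rankings but not the rankings produced by attention or gradients alone.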

In addition to these two metrics, both attention scores and gradients can also be employed separately, adding to the framework’s flexible configuration. A hyperparameter search is done to analyze the optimal network parameters, as well as which of these four interpretation metrics provides the best detection power.

Software and hardware

The transformer model is implemented and trained using TensorFlow.

Different TensorFlow versions are employed depending on the hardware. Most of the experiments are run on the LUMI supercomputer, on nodes with 8 AMD MI250X GPUs (TensorFlow 2.11, 128 GB memory). For scalability and comparison purposes, the model is also trained on systems with Intel PVC (TensorFlow 2.12, 48 GB memory), NVIDIA A100 (TensorFlow 2.12, 80 GB memory), Google TPU V4 (TensorFlow 2.12, 32 GB memory) and GraphCore IPU GC-200 (TensorFlow 2.6.3, 900 MB memory).

Dataset generation

Sy.

Data availability

The source code of this work is available at https://github.com/hiperbio/episdet-transformer.

References

Visscher, P. M. et al. 10 Years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101, 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005 (2017).

Hemminki, K., Försti, A. & Bermejo, J. L. The ‘common disease-common variant’ hypothesis and familial risks. PLoS ONE 3, e2504 (2008).

Wang, K. et al. Diverse genome-wide association studies associate the il12/il23 pathway with crohn disease. Am. J. Hum. Genet. 84, 399–405 (2009).

Siminovitch, K. A. Ptpn22 and autoimmune disease. Nat. Genet. 36, 1248–1249 (2004).

Maher, B. Personal genomes: The case of the missing heritability. Nature 456, 18–21. https://doi.org/10.1038/456018a (2008).

Turton, J. C. et al. Investigating Statistical Epistasis in Complex Disorders. J. Alzheimer’s Dis 25, 635–644. https://doi.org/10.3233/JAD-2011-110197 (2011). Publisher: IOS Press.

Mattson, D. L. & Liang, M. From gwas to functional genomics-based precision medicine. Nat. Rev. Nephrol. 13 (2017).

Ponte-Fernández, C., González-Domínguez, J. & Martín, M. J. Fast search of third-order epistatic interactions on cpu and gpu clusters. Int. J. High Perform. Comput. Appl. 34, 20–29 (2020).

Marques, D. et al. Unlocking personalized healthcare on modern cpus/gpus: Three-way gene interaction study. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 146–156 (IEEE, 2022).

Nobre, R., Ilic, A., Santander-Jiménez, S. & Sousa, L. Tensor-accelerated fourth-order epistasis detection on gpus. In Proceedings of the 51st International Conference on Parallel Processing, 1–11 (2022).

Ponte-Fernández, C., González-Domínguez, J. & Martín, M. J. Fiuncho: A program for any-order epistasis detection in cpu clusters. J. Supercomput. 78, 15338–15357 (2022).

Nobre, R., Ilic, A., Santander-Jiménez, S. & Sousa, L. Fourth-order exhaustive epistasis detection for the xpu era. In Proceedings of the 50th International Conference on Parallel Processing, 1–10 (2021).

Ribeiro, G., Neves, N., Santander-Jiménez, S. & Ilic, A. Hedacc: Fpga-based accelerator for high-order epistasis detection. In 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (2021).

Niel, C., Sinoquet, C., Dina, C. & Rocheleau, G. A survey about methods dedicated to epistasis detection. Front. Genet. 6, 285 (2015).

Pérez-Enciso, M. & Zingaretti, L. M. A guide on deep learning for complex trait genomic prediction. Genes 10 (2019).

Mieth, B. et al. DeepCOMBI: explainable artificial intelligence for the analysis and discovery in genome-wide association studies. NAR Genom. Bioinf. 3, lqab065 (2021).

Liu, Y. et al. Phenotype prediction and genome-wide association study using deep convolutional neural network of soybean. Front. Genet. 10, 1091. https://doi.org/10.3389/fgene.2019.01091 (2019).

Graça, M., Marques, D., Santander-Jiménez, S., Sousa, L. & Ilic, A. Interpreting high order epistasis using sparse transformers. In ACM/IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies (2023).

Marchini, J., Donnelly, P. & Cardon, L. R. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 37, 413–417 (2005).

González-Seoane, B., Ponte-Fernández, C., González-Domínguez, J. & Martín, M. J. Pytoxo: A python tool for calculating penetrance tables of high-order epistasis models. BMC Bioinformatics 23, 1–13 (2022).

Jing, P.-J. & Shen, H.-B. MACOED: A multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics 31, 634–641. https://doi.org/10.1093/bioinformatics/btu702 (2015).

Jia, Z., Tillman, B., Maggioni, M. & Scarpazza, D. P. Dissecting the graphcore ipu architecture via microbenchmarking. arXiv preprint arXiv:1912.03413 (2019).

Jouppi, N. et al. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, 1–14 (2023).

Gomes, W. et al. Ponte vecchio: A multi-tile 3d stacked processor for exascale computing. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 42–44 (IEEE, 2022).

Choquette, J., Gandhi, W., Giroux, O., Stam, N. & Krashinsky, R. Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro 41, 29–35 (2021).

Zwinger, T., Heikonen, J. & Manninen, P. Lumi supercomputer for european researchers. Copernicus Meetings (2023).

Wellcome Sanger Institute. Wellcome trust case control consortium. [Online; visited July-2023].

Roth, G. A. et al. Global burden of cardiovascular diseases and risk factors, 1990–2019: Update from the gbd 2019 study. J. Am. Coll. Cardiol. 76, 2982–3021 (2020).

Finckh, A. et al. Global epidemiology of rheumatoid arthritis. Nat. Rev. Rheumatol. 18, 591–602 (2022).

Ferré, M. P. B., Boscá-Watts, M. M. & Pérez, M. M. Crohn’s disease. Medicina Clinica (English Edition) 151, 26–33 (2018).

Pers, T. H., Timshel, P. & Hirschhorn, J. N. SNPsnap: A Web-based tool for identification and annotation of matched SNPs. Bioinformatics 31, 418–420 (2014).

Tokuhiro, S. et al. An intronic snp in a runx1 binding site of slc22a4, encoding an organic cation transporter, is associated with rheumatoid arthritis. Nat. Genet. 35, 341–348 (2003).

Ji, J. D. et al. Association of stat4 polymorphism with rheumatoid arthritis and systemic lupus erythematosus: A meta-analysis. Mol. Biol. Rep. 37, 141–147 (2010).

Briggs, F. et al. Supervised machine learning and logistic regression identifies novel epistatic risk factors with ptpn22 for rheumatoid arthritis. Genes Immunity 11, 199–208 (2010).

Holmdahl, R. Association of mhc and rheumatoid arthritis: Why is rheumatoid arthritis associated with the mhc genetic region? an introduction. Arth. Res. Ther. 2, 1–2 (2000).

Connelly, J. J. et al. Genetic and functional association of fam5c with myocardial infarction. BMC Med. Genet. 9 (2008).

Hägg, S. et al. Multi-organ expression profiling uncovers a gene module in coronary artery disease involving transendothelial migration of leukocytes and lim domain binding 2: the stockholm atherosclerosis gene expression (stage) study. PLoS Genet. 5, e1000754 (2009).

Turner, A. W. et al. Functional interaction between col4a1/col4a2 and smad3 risk loci for coronary artery disease. Atherosclerosis 242, 543–552 (2015).

Cummings, F. J. et al. Confirmation of the role of atg16l1 as a crohn’s disease susceptibility gene. Inflamm. Bowel Dis. 13, 941–946 (2007).

Franke, A. et al. Genome-wide meta-analysis increases to 71 the number of confirmed crohn’s disease susceptibility loci. Nat. Genet. 42, 1118–1125 (2010).

Stoll, M. et al. Genetic variation in dlg5 is associated with inflammatory bowel disease. Nat. Genet. 36, 476–480 (2004).

Martinez-Chamorro, A. et al. Epistatic interaction between tlr4 and nod2 in patients with crohn’s disease: Relation with risk and phenotype in a spanish cohort. Immunobiology 221, 927–933 (2016).

Pang, B., Nijkamp, E. & Wu, Y. N. Deep learning with tensorflow: A review. J. Educ. Behav. Stat. 45, 227–248 (2020).

Feng, T. & Zhu, X. Genome-wide searching of rare genetic variants in wtccc data. Hum. Genet. 128, 269–280 (2010).

Li, H. et al. Complex-disease networks of trait-associated single-nucleotide polymorphisms (SNPs) unveiled by information theory. J. Am. Med. Inform. Assoc. 19, 295–305 (2012).

Jiang, Y. et al. Meta-analysis of 125 rheumatoid arthritis-related single nucleotide polymorphisms studied in the past two decades. PLoS ONE 7, e51571 (2012).

Phuong, M. & Hutter, M. Formal algorithms for transformers. arXiv preprint arXiv:2207.09238 (2022).

Qiang, Y. et al. Attcat: Explaining transformers via attentive class activation tokens. Adv. Neural. Inf. Process. Syst. 35, 5052–5064 (2022).

Urbanowicz, R. J. et al. GAMETES: A fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining 5, 16. https://doi.org/10.1186/1756-0381-5-16 (2012).

Acknowledgements

This work was supported by European Union HE Research and Innovation programme under grant agreement No 101092877 (SYCLOPS), and FCT (Fundação para a Ciência e a Tecnologia, Portugal) through the UIDB/50021/2020 project and the UI/BD/154603/2022 research grant.

The research presented in this paper has benefited from the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053, and the FCT+Google Advanced Computing Project (CPCA-IAC/AV/478750/2022). Finally, we acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC (Finland) and the LUMI consortium through a EuroHPC Benchmark Access call (EHPC-BEN-2023B01-002).

Author information

Author notes

These authors contributed equally: Miguel Graça, Ricardo Nobre, Leonel Sousa and Aleksandar Ilic.

Authors and Affiliations

INESC-ID, Instituto Superior Técnico, 1000-029, Lisbon, Portugal

Miguel Graça, Ricardo Nobre, Leonel Sousa & Aleksandar Ilic

Contributions

All authors contributed equally to this work: M.G., A.L., and L.S. designed the experiments; M.G., R.N., L.S., and A.L. performed the experiments and analyzed the data; M.G., R.N., L.S., and A.L. wrote the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Miguel Graça.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Graça, M., Nobre, R., Sousa, L. et al. Distributed transformer for high order epistasis detection in large-scale datasets. Sci Rep 14, 14579 (2024). https://doi.org/10.1038/s41598-024-65317-5

Received: 24 February 2024

Accepted: 19 June 2024

Published: 25 June 2024

DOI: https://doi.org/10.1038/s41598-024-65317-5

Keywords

Bioinformatics

Machine learning

High performance computing
