Robust evaluation of deep learning-based representation methods for survival and gene essentiality prediction on bulk RNA-seq data

Published by Nature and other sources, 2024-07-24 20:23

Abstract

Deep learning (DL) has shown potential to provide powerful representations of bulk RNA-seq data in cancer research. However, there is no consensus regarding the impact of design choices of DL approaches on the performance of the learned representation, including the model architecture, the training methodology and the various hyperparameters.

To address this problem, we evaluate the performance of various design choices of DL representation learning methods using TCGA and DepMap pan-cancer datasets and assess their predictive power for survival and gene essentiality predictions. We demonstrate that baseline methods achieve comparable or superior performance compared to more complex models on survival prediction tasks.

DL representation methods, however, are the most efficient at predicting the gene essentiality of cell lines. We show that auto-encoders (AE) are consistently improved by techniques such as masking and multi-head training. Our results suggest that the impact of DL representations and of pretraining is highly task- and architecture-dependent, highlighting the need for adopting rigorous evaluation guidelines.

These guidelines for robust evaluation are implemented in a pipeline made available to the research community.

Introduction

Precision medicine and the development of new therapies require accurate disease diagnosis and outcome prediction. The field of omics research has experienced an unprecedented data revolution fueled by high-throughput technologies, enabling the generation of high-dimensional omics data at an exponential pace.

This wealth of data provides interesting opportunities to unravel the molecular landscape of diseases, including cancer, and emphasizes the need for robust computational approaches to extract meaningful insights. In particular, RNA sequencing (RNA-seq) is now ubiquitous in molecular biology and oncology1 and was shown to be the most informative omics modality for predicting phenotypes of interest such as patient survival2 or gene essentiality in cell lines3.

In parallel, deep learning-based representation learning approaches have shown remarkable potential in analyzing complex data, ranging from images to texts4,5,6.

These methods, powered by artificial neural networks, excel at capturing intricate patterns, detecting subtle relationships, and making accurate predictions. Applying deep representation learning techniques (DRL) to RNA-seq data for cancer research holds the potential to revolutionize our understanding of cancer progression, classification, and treatment response.

Therefore, the integration of deep learning-based approaches within the field of omics research holds immense promise for advancing our understanding of cancer biology7.

Nonetheless, despite the vast potential of DRL algorithms and demonstrated success in vision and Natural Language Processing (NLP) domains, they still face challenges in surpassing traditional tree-based methods on tabular data8. Importantly, their application to omics data remains underexplored when considering gen.

In the case of the per-cohort OS prediction task, the previous procedure has to be adapted in order to ensure a fair comparison with the non-pre-trained case. We therefore took the union of the top 2,000 most variable genes for each of the 11 cohorts selected in the downstream task to make sure relevant genes per indication were also selected in the final features rather than genes solely linked to cancer type that would be considered when looking at the genes’ variances across the whole TCGA dataset.

This resulted in a set of 5046 unique gene identifiers.
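
The per-cohort selection described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names `top_variable_genes` and `union_of_top_genes`, the toy cohort labels and the small value of `k` are all assumptions made for the example.

```python
import numpy as np


def top_variable_genes(expr, gene_ids, k):
    """Return the IDs of the k genes with the largest variance across samples."""
    variances = expr.var(axis=0)
    top_idx = np.argsort(variances)[::-1][:k]
    return {gene_ids[i] for i in top_idx}


def union_of_top_genes(expr, gene_ids, cohorts, k):
    """Union of the per-cohort top-k most variable genes (hypothetical helper)."""
    selected = set()
    for cohort in np.unique(cohorts):
        mask = cohorts == cohort
        selected |= top_variable_genes(expr[mask], gene_ids, k)
    return sorted(selected)


# Toy data: 60 samples x 50 genes, three hypothetical cohorts of 20 samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 50))
cohorts = np.repeat(np.array(["A", "B", "C"]), 20)
gene_ids = [f"gene_{i}" for i in range(50)]

selected = union_of_top_genes(expr, gene_ids, cohorts, k=10)
print(len(selected), "unique genes selected")
```

Because the union is taken over cohorts, the number of selected genes lies between `k` and `k` times the number of cohorts, which is consistent with a union over 11 cohorts of 2,000 genes each yielding fewer than 22,000 identifiers.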

For the gene essentiality task, we took the top 5000 most variable genes in TCGA after intersection with the genes present within the CCLE dataset, similar to DeepDEP’s procedure of selecting genes with a standard deviation greater than 1 in TCGA.
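
A minimal sketch of this intersect-then-rank step is given below. It uses a tiny deterministic matrix so the ranking is easy to follow; the function name `top_variable_shared_genes` and the gene labels are hypothetical, and the real pipeline may order or filter genes differently.

```python
import numpy as np


def top_variable_shared_genes(tcga_expr, tcga_genes, ccle_genes, k):
    """Keep genes present in both datasets, then rank them by TCGA variance."""
    ccle_set = set(ccle_genes)
    shared_idx = [i for i, g in enumerate(tcga_genes) if g in ccle_set]
    variances = tcga_expr[:, shared_idx].var(axis=0)
    order = np.argsort(variances)[::-1][:k]
    return [tcga_genes[shared_idx[i]] for i in order]


# Deterministic toy matrix: each column is the same profile times a scale,
# so the variance of a gene grows with the square of its scale.
base = np.linspace(-1.0, 1.0, 11)
scales = np.array([1.0, 5.0, 2.0, 8.0, 3.0, 0.5])
tcga_expr = np.outer(base, scales)
tcga_genes = ["g0", "g1", "g2", "g3", "g4", "g5"]
ccle_genes = ["g0", "g1", "g2", "g4"]  # g3 and g5 are absent from "CCLE"

shared_top = top_variable_shared_genes(tcga_expr, tcga_genes, ccle_genes, k=2)
print(shared_top)
```

Note that `g3` has the largest variance in the toy TCGA matrix but is never selected, because the intersection with the cell line genes is applied before ranking.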

The TCGA data used for pretraining was normalized within each fold with mean standard scaling and learned statistics were saved for potential usage on the downstream datasets (pretraining experiments).

Repeated holdout cross-validation framework

In this study, we aim to compare the performance of different representation learning algorithms on downstream survival and gene essentiality prediction tasks using bulk RNA-seq data.

Each representation model is trained and used to transform the input expression data before feeding the learned low-dimensional embeddings to a task-specific prediction model, fitted for each representation model tested. To achieve a comprehensive evaluation, we adopt a validation pipeline that focuses on exploring the learning algorithm's variability to diverse hyperparameter settings.
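
The two-stage setup can be sketched as below, under loud assumptions: PCA stands in for the representation model, closed-form ridge regression stands in for the task-specific prediction model, and the data are synthetic. The class `PCARepresentation` and the function `fit_ridge` are hypothetical names for this example. The point illustrated is that scaling statistics and the representation are learned on the training set only and then applied unchanged to the test set.

```python
import numpy as np


class PCARepresentation:
    """PCA used here as a simple stand-in for a representation model."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, x):
        # Scaling statistics and components are learned on training data only.
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0) + 1e-8
        z = (x - self.mean_) / self.std_
        _, _, vt = np.linalg.svd(z, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, x):
        # Reuse the saved training statistics on any new data.
        z = (x - self.mean_) / self.std_
        return z @ self.components_.T


def fit_ridge(emb, y, alpha=1.0):
    """Closed-form ridge regression on the low-dimensional embeddings."""
    y_mean = y.mean()
    gram = emb.T @ emb + alpha * np.eye(emb.shape[1])
    weights = np.linalg.solve(gram, emb.T @ (y - y_mean))
    return weights, y_mean


# Synthetic data: 100 samples x 30 "genes" driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
x = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(100, 30))
y = latent[:, 0] + 0.1 * rng.normal(size=100)
x_train, x_test, y_train, y_test = x[:80], x[80:], y[:80], y[80:]

model = PCARepresentation(n_components=5).fit(x_train)
emb_train, emb_test = model.transform(x_train), model.transform(x_test)
weights, intercept = fit_ridge(emb_train, y_train)
predictions = emb_test @ weights + intercept
```

Any other representation model exposing the same `fit` and `transform` steps could be swapped in without changing the prediction stage, which is what makes a like-for-like comparison across representation methods possible.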

Our validation pipeline involves a repeated holdout cross-validation approach45 in which the dataset is repeatedly split in two to create pairs of training and test sets, comprising 80% and 20% of the original data respectively (Supplementary Fig. S1). For experiments without pretraining, the training sets are used to select jointly the optimal HPs for the representation and prediction models by performing a fivefold cross-validation for a given set of HPs.
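
As an illustration of the splitting scheme, the sketch below generates repeated 80%/20% train/test index pairs with independent random shuffles. The function name `repeated_holdout_splits`, the seed and the default of 10 repeats are assumptions for this example; stratification or grouping, if used in the actual pipeline, is not shown.

```python
import numpy as np


def repeated_holdout_splits(n_samples, n_repeats=10, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs from independent random shuffles."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * n_samples))
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]


splits = list(repeated_holdout_splits(n_samples=100))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), "train samples,", len(test_idx), "test samples")
```

Unlike K-fold cross-validation, the test sets of different repeats may overlap, since each repeat reshuffles the whole dataset.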

The HP tuning is performed using a Tree-structured Parzen Estimator (TPE Sampler) implemented by Optuna56 with a fixed budget of 50 iterations. Then, we select the set of HPs with the best average performance over the validation folds on the downstream tasks to train the representation and prediction models on the whole training set before evaluating it on the test set.

This procedure is repeated 10 times to generate a distribution of scores over the different test sets, providing robust performance assessments com.

Data availability

The cancer TCGA data was downloaded from recount3 https://rna.recount.bio/, the associated clinical data from TCGA-CDR hosted on https://gdc.cancer.gov and the cell lines datasets from the DepMap portal https://depmap.org/portal/. The code and the processed data used in our study are available on GitHub: https://github.com/owkin/drl-evaluation.

References

1. Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: The teenage years. Nat. Rev. Genet. 20, 631–656 (2019).
2. Vale-Silva, L. A. & Rohr, K. Long-term cancer survival prediction using multimodal deep learning. Sci. Rep. 11, 13505 (2021).
3. Chiu, Y.-C. et al. Predicting and characterizing a cancer dependency map of tumors with deep learning. Sci. Adv. 7, eabh1275 (2021).
4. Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
5. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational Linguistics, 2019). https://doi.org/10.18653/v1/N19-1423.
6. Misra, I. & Van Der Maaten, L. Self-Supervised Learning of Pretext-Invariant Representations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6706–6716 (IEEE, 2020). https://doi.org/10.1109/CVPR42600.2020.00674.
7. Chaudhary, K., Poirion, O. B., Lu, L. & Garmire, L. X. Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 24, 1248–1259 (2018).
8. Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Mach. Learn. https://doi.org/10.48550/ARXIV.2207.08815 (2022).
9. Liu, J. et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173, 400-416.e11 (2018).
10. Gönen, M. et al. A community challenge for inferring genetic predictors of gene essentialities through analysis of a functional screen of cancer cell lines. Cell Syst. 5, 485-497.e3 (2017).
11. Zhakparov, D. et al. Assessing different feature selection methods applied to a bulk RNA sequencing dataset with regard to biomedical relevance, https://doi.org/10.3929/ETHZ-B-000565782 (2023).
12. Liu, Y. et al. Post-modified non-negative matrix factorization for deconvoluting the gene expression profiles of specific cell types from heterogeneous clinical samples based on RNA-sequencing data. J. Chemom. 32, e2929 (2018).
13. Chen, R. et al. Large-scale bulk RNA-seq analysis defines immune evasion mechanism related to mast cell in gliomas. Front. Immunol. 13, 914001 (2022).
14. Wei, Q. et al. Molecular subtypes of lung adenocarcinoma patients for prognosis and therapeutic response prediction with machine learning on 13 programmed cell death patterns. J. Cancer Res. Clin. Oncol. 149, 11351–11368 (2023).
15. Sauta, E. et al. Combining gene mutation with transcriptomic data improves outcome prediction in myelodysplastic syndromes. Blood 142, 1863–1863 (2023).
16. Li, Q. et al. XA4C: eXplainable representation learning via autoencoders revealing critical genes. PLoS Comput. Biol. 19, e1011476 (2023).
17. De Weerd, H. A. et al. Representational learning from healthy multi-tissue human RNA-Seq data such that latent space arithmetics extracts disease modules. bioRxiv https://doi.org/10.1101/2023.10.03.560661 (2023).
18. Withnell, E., Zhang, X., Sun, K. & Guo, Y. XOmiVAE: An interpretable deep learning model for cancer classification using high-dimensional omics data. Brief. Bioinform. 22, bbab315 (2021).
19. He, D., Liu, Q., Wu, Y. & Xie, L. A context-aware deconfounding autoencoder for robust prediction of personalized clinical drug response from cell-line compound screening. Nat. Mach. Intell. 4, 879–892 (2022).
20. Chen, J. et al. Deep transfer learning of cancer drug responses by integrating bulk and single-cell RNA-seq data. Nat. Commun. 13, 6494 (2022).
21. Dincer, A. B., Celik, S., Hiranuma, N. & Lee, S.-I. DeepProfile: Deep learning of cancer molecular profiles for precision medicine. bioRxiv https://doi.org/10.1101/278739 (2018).
22. Rampášek, L., Hidru, D., Smirnov, P., Haibe-Kains, B. & Goldenberg, A. Dr.VAE: Improving drug response prediction via modeling of drug perturbation effects. Bioinformatics 35, 3743–3751 (2019).
23. Shen, H. et al. Miscell: An efficient self-supervised learning approach for dissecting single-cell transcriptome. iScience 24, 103200 (2021).
24. Han, W. et al. Self-supervised contrastive learning for integrative single cell RNA-Seq data analysis. bioRxiv https://doi.org/10.1101/2021.07.26.453730v1 (2021).
25. Li, X. et al. Network embedding-based representation learning for single cell RNA-seq data. Nucleic Acids Res. 45, e166 (2017).
26. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
27. Cui, H. et al. scGPT: Towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv https://doi.org/10.1101/2023.04.30.538439 (2023).
28. Shen, H. et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. iScience 26, 106536 (2023).
29. Smith, A. M. et al. Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. BMC Bioinform. 21, 119 (2020).
30. Cantini, L. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat. Commun. 12, 124 (2021).
31. Bengio, Y. & Grandvalet, Y. No unbiased estimator of the variance of K-fold cross-validation. In Advances in Neural Information Processing Systems Vol. 16 (eds Thrun, S. et al.) (MIT Press, 2003).
32. Nadeau, C. & Bengio, Y. Inference for the generalization error. Mach. Learn. 52, 239–281 (2003).
33. Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. https://doi.org/10.1038/s41576-021-00434-9 (2021).
34. Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255, https://doi.org/10.1109/CVPR.2009.5206848 (2009).
35. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins Struct. Funct. Bioinform. 89, 1607–1617 (2021).
36. Althubaiti, S. et al. DeepMOCCA: A pan-cancer prognostic model identifies personalized prognostic markers through graph attention and multi-omics data integration. bioRxiv https://doi.org/10.1101/2021.03.02.433454 (2021).
37. Zhang, X., Xing, Y., Sun, K. & Guo, Y. OmiEmbed: A unified multi-task deep learning framework for multi-omics data. Cancers 13, 3047 (2021).
38. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
39. Fang, Z., Zheng, R. & Li, M. scMAE: A masked autoencoder for single-cell RNA-seq clustering. Bioinformatics https://doi.org/10.1093/bioinformatics/btae020 (2024).
40. Yoon, J., Zhang, Y., Jordon, J. & van der Schaar, M. VIME: Extending the success of self- and semi-supervised learning to tabular domain. In Proc. of the 34th International Conference on Neural Information Processing Systems (Curran Associates Inc., 2020).
41. Arslan, M., Guzel, M., Demirci, M. & Ozdemir, S. SMOTE and Gaussian noise based sensor data augmentation. In 2019 4th International Conference on Computer Science and Engineering (UBMK), 1–5 (IEEE, 2019). https://doi.org/10.1109/UBMK.2019.8907003.
42. Huang, Z. et al. Deep learning-based cancer survival prognosis from RNA-seq data: Approaches and evaluations. BMC Med. Genom. 13, 41 (2020).
43. Multiple Myeloma DREAM Consortium et al. Multiple myeloma DREAM challenge reveals epigenetic regulator PHF19 as marker of aggressive disease. Leukemia 34, 1866–1874 (2020).
44. Filiot, A. et al. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv https://doi.org/10.1101/2023.07.21.23292757 (2023).
45. Varoquaux, G. & Colliot, O. Evaluating machine learning models and their diagnostic value. In Machine Learning for Brain Disorders Vol. 197 (ed. Colliot, O.) 601–630 (Springer US, 2023).
46. Lachmann, A. et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat. Commun. 9, 1366 (2018).
47. Barretina, J. et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
48. Wilks, C. et al. recount3: Summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 22, 323 (2021).
49. Harrell, F. E., Lee, K. L. & Mark, D. B. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 15, 361–387 (1996).
50. Dempster, J. M. et al. Extracting biological insights from the project Achilles genome-scale CRISPR screens in cancer cell lines. bioRxiv https://doi.org/10.1101/720243 (2019).
51. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
52. Rosenski, J., Shifman, S. & Kaplan, T. Predicting gene knockout effects from expression data. BMC Med. Genom. 16, 26 (2023).
53. Ma, J. et al. Few-shot learning creates predictive models of drug response that translate from high-throughput screens to individual patients. Nat. Cancer 2, 233–244 (2021).
54. Hou, J. et al. Distance correlation application to gene co-expression network analysis. BMC Bioinform. 23, 81 (2022).
55. Paton, V. et al. Assessing the impact of transcriptomics data analysis pipelines on downstream functional enrichment results. bioRxiv https://doi.org/10.1101/2023.09.13.557538 (2023).
56. Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: A next-generation hyperparameter optimization framework. Mach. Learn. https://doi.org/10.48550/ARXIV.1907.10902 (2019).
57. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
58. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. Preprint at http://arxiv.org/abs/1412.6980 (2017).
59. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Mach. Learn. https://doi.org/10.48550/ARXIV.1312.6114 (2013).
60. Ramirez, R. et al. Prediction and interpretation of cancer survival using graph convolution neural networks. Methods 192, 120–130 (2021).
61. Perez, L. & Wang, J. The effectiveness of data augmentation in image classification using deep learning. Comput. Vis. Pattern Recognit. https://doi.org/10.48550/ARXIV.1712.04621 (2017).
62. Faraggi, D. & Simon, R. A neural network model for survival data. Stat. Med. 14, 73–82 (1995).
63. Katzman, J. et al. DeepSurv: Personalized treatment recommender system using a cox proportional hazards deep neural network. BMC Med. Res. Methodol. 18, 24 (2018).

Acknowledgements

The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. We thank Oussama Tchita, Omar Darwiche Domingues and Thomas Chaigneau for their valuable contributions and comments to strengthen our pipeline and coding best practices. We thank Gilles Wainrib for initial ideas and discussions, Nicolas Loiseau for his advice and statistical expertise, Floriane Montanari, Benoît Schmauch, Gilles Wainrib and Jean-Philippe Vert for their detailed proofreading and insightful comments.

Funding

This study has been funded by Owkin, Inc., New York, NY, USA.

Author information

Author notes

These authors contributed equally: Baptiste Gross, Antonin Dauvin, Vincent Cabeli.

Eric Y. Durand and Alberto Romagnoni jointly supervised the work.

Authors and Affiliations

Owkin, Inc., New York, NY, USA

Baptiste Gross, Antonin Dauvin, Vincent Cabeli, Virgilio Kmetzsch, Jean El Khoury, Gaëtan Dissez, Khalil Ouardini, Simon Grouard, Alec Davi, Regis Loeb, Christian Esposito, Louis Hulot, Ridouane Ghermi, Michael Blum, Yannis Darhi, Eric Y. Durand & Alberto Romagnoni

Contributions

B.G., An.D., V.C, V.K., M.B., Y.D and A.R designed the evaluation framework and the different tasks; An.D, B.G, V.K, G.D, V.C, J.E.K, K.O, S.G., Al.D, L.H., R.G., R.L and C.E wrote the code to develop the evaluation pipeline; B.G, An.D, V.C, V.K, J.E.K, G.D, S.G, K.O., Al.D, R.L, C.E, Y.D, A.R analyzed the results; Y.D., A.R. and E.D. coordinated and supervised the work; B.G., An.D., V.C, Y.D, E.D. and A.R wrote the paper with the assistance and feedback of all the other co-authors. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Baptiste Gross.

Ethics declarations

Competing interests

All authors are employees of Owkin, Inc., New York, NY, USA.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material.

You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Gross, B., Dauvin, A., Cabeli, V. et al. Robust evaluation of deep learning-based representation methods for survival and gene essentiality prediction on bulk RNA-seq data. Sci Rep 14, 17064 (2024). https://doi.org/10.1038/s41598-024-67023-8

Received: 15 April 2024

Accepted: 08 July 2024

Published: 24 July 2024

DOI: https://doi.org/10.1038/s41598-024-67023-8

Keywords

RNAseq, Representation learning, Deep learning, Survival prediction, Gene essentiality, Benchmarking
