
Random survival forest for predicting the combined effects of multiple physiological risk factors on all-cause mortality

Published by Nature and other sources, 2024-07-06 14:30



Abstract

Understanding the combined effects of risk factors on all-cause mortality is crucial for implementing effective risk stratification and designing targeted interventions, but such combined effects are understudied. We aim to use survival-tree based machine learning models as more flexible nonparametric techniques to examine the combined effects of multiple physiological risk factors on mortality.

More specifically, we (1) study the combined effects between multiple physiological factors and all-cause mortality, (2) identify the five most influential factors and visualize their combined influence on all-cause mortality, and (3) compare the mortality cut-offs with the current clinical thresholds.


Data from the 1999–2014 NHANES Survey were linked to National Death Index data with follow-up through 2015 for 17,790 adults. We observed that the five most influential factors affecting mortality are the tobacco smoking biomarker cotinine, glomerular filtration rate (GFR), plasma glucose, sex, and white blood cell count.


Specifically, high mortality risk is associated with being male, active smoking, low GFR, elevated plasma glucose levels, and high white blood cell count. The identified mortality-based cutoffs for these factors are mostly consistent with relevant studies and current clinical thresholds. This approach enabled us to identify important cutoffs and provide enhanced risk prediction as an important basis to inform clinical practice and develop new strategies for precision medicine.

Introduction

Understanding the relative importance of risk factors and their combined effects on all-cause mortality is key for risk stratification and helping design targeted interventions [1, 2]. However, little is known about the combined effects of multiple physiological factors on mortality risk [3, 4].

Survival analyses based on left-truncated and right-censored data are often conducted using linear models such as the Cox proportional hazards (CPH) model and its extensions. In our previous study, we used CPH models to assess non-linear associations between all-cause mortality and each physiological indicator.

We did this by discretizing the physiological indicator into nine quantiles and by using a weighted sum of cubic polynomials (spline) [5]. While these models offer valuable insights into the relationship between individual risk factors and mortality, their ability to measure the effects of multiple factors is limited to additive relationships [6, 7].
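To make the additive limitation concrete, the following is the standard textbook form of the Cox model and its spline extension, not necessarily the exact specification used in the previous study. The symbols (baseline hazard h_0, coefficients β, basis functions B_k, number of basis functions K) are notation assumed here for illustration.

```latex
% Standard Cox proportional hazards model with p covariates (textbook form).
h(t \mid x) = h_0(t)\,\exp\!\Big(\sum_{j=1}^{p} \beta_j x_j\Big)

% Spline extension: each covariate enters through its own smooth function,
% but the contributions are still summed on the log-hazard scale.
\log\frac{h(t \mid x)}{h_0(t)} = \sum_{j=1}^{p} f_j(x_j),
\qquad f_j(x_j) = \sum_{k=1}^{K} \beta_{jk}\, B_k(x_j)

% A combined effect of x_1 and x_2 only appears if it is specified by hand,
% for example through an explicit product term:
\log\frac{h(t \mid x)}{h_0(t)} = f_1(x_1) + f_2(x_2) + \beta_{12}\, x_1 x_2
```

In this form, the effect of one factor on the log-hazard does not depend on the level of any other factor unless an interaction term is added explicitly, which becomes impractical as the number of factors grows.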

This limitation may prevent such models from capturing complex combined effects of multiple risk factors. In contrast, survival trees and random survival forests (RSF) offer a nonparametric alternative that helps to identify the most influential factors leading to increased mortality risk [8]. These models can also account for complex correlations, detect interactions and non-linear associations between multiple risk factors [9, 10], and maintain high predictive power [11].
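As a rough illustration of how a survival tree can yield a mortality-based cutoff, the sketch below searches one factor for the split that maximizes the two-sample log-rank statistic. It assumes the common log-rank splitting rule and right-censored data only (no left truncation); the function names, the toy GFR-like values, and the `min_node_size` parameter are hypothetical and are not taken from the authors' code. A random survival forest would repeat such searches over bootstrap samples and random subsets of factors.

```python
from typing import Optional, Sequence, Tuple


def logrank_statistic(
    times: Sequence[float], events: Sequence[bool], in_left: Sequence[bool]
) -> float:
    """Two-sample log-rank chi-square statistic for right-censored data."""
    observed_minus_expected = 0.0
    variance = 0.0
    # Loop over the distinct times at which at least one death occurred.
    for t in sorted({ti for ti, ei in zip(times, events) if ei}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_left = sum(1 for i in at_risk if in_left[i])
        deaths = [i for i in at_risk if times[i] == t and events[i]]
        d = len(deaths)
        d_left = sum(1 for i in deaths if in_left[i])
        # Observed deaths in the left node minus the number expected
        # if both nodes shared the same hazard.
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            p_left = n_left / n
            variance += d * p_left * (1 - p_left) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_cutoff(
    values: Sequence[float],
    times: Sequence[float],
    events: Sequence[bool],
    min_node_size: int = 2,
) -> Optional[Tuple[float, float]]:
    """Return (cutoff, statistic) for the split with the largest log-rank value."""
    best: Optional[Tuple[float, float]] = None
    unique = sorted(set(values))
    # Candidate cutoffs are midpoints between consecutive observed values.
    for lo, hi in zip(unique, unique[1:]):
        cut = (lo + hi) / 2
        in_left = [v <= cut for v in values]
        n_left = sum(in_left)
        if min(n_left, len(values) - n_left) < min_node_size:
            continue  # skip splits that leave a node too small
        stat = logrank_statistic(times, events, in_left)
        if best is None or stat > best[1]:
            best = (cut, stat)
    return best


# Toy data (hypothetical): low GFR-like values die early, high values are censored.
gfr = [30, 35, 40, 45, 90, 95, 100, 105]
years = [4, 3, 2, 1, 10, 10, 10, 10]
died = [True, True, True, True, False, False, False, False]

cutoff, statistic = best_cutoff(gfr, years, died)
print(f"cutoff={cutoff}, log-rank statistic={statistic:.3f}")
```

Because the split is chosen from the data rather than fixed in advance, the resulting cutoff can be compared directly with an existing clinical threshold for the same factor.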

Such capabilities are crucial for providing needed information to guide clinical prioritization and improve patient outcomes through early interventions and tailored treatments. Furthermore, it is crucial to visualize the combined effects of risk factors to enable effective risk stratification, providing a direct understanding of the intricate patterns of diverse factors.

While RSF mode.


Data availability


Data for the study were collected by the Centers for Disease Control and Prevention, and our curated data are publicly available on Kaggle (https://www.kaggle.com/datasets/nguyenvy/nhanes-19882018?select=dictionary_nhanes.csv), figshare (https://figshare.com/articles/dataset/NHANES_1988-2018/21743372), and Hugging Face (https://huggingface.co/datasets/nguyenvy/cleaned_nhanes_1988_2018).

The analytic code used in this report is publicly available on GitHub (https://github.com/zhaobuterry/Random-Survival-Forest-for-Predicting-the-Combined-Effects-of-Multiple-Physiological-Risk-Factors).

References

1. Brown, D. W., Giles, W. H. & Greenlund, K. J. Blood pressure parameters and risk of fatal stroke, NHANES II mortality study. Am. J. Hypertens. 20(3), 338–341 (2007).

2. Beauchamp, A. et al. Inequalities in cardiovascular disease mortality: The role of behavioural, physiological and social risk factors. J. Epidemiol. Commun. Health 64(6), 542–548 (2010).

3. Richard, A. et al. Effects of leisure-time and occupational physical activity on total mortality risk in NHANES III according to sex, ethnicity, central obesity, and age. J. Phys. Act. Health 12(2), 184–192 (2015).

4. Odden, M. C. et al. Uric acid levels, kidney function, and cardiovascular mortality in US adults: National Health and Nutrition Examination Survey (NHANES) 1988–1994 and 1999–2002. Am. J. Kidney Dis. 64(4), 550–557 (2014).

5. Nguyen, V. K. et al. Characterising the relationships between physiological indicators and all-cause mortality (NHANES): A population-based cohort study. Lancet Healthy Longevity 2(10), e651–e662 (2021).

6. McDonald, G. C. Ridge regression. Wiley Interdiscipl. Rev.: Comput. Stat. 1(1), 93–100 (2009).

7. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Methodol.) 58(1), 267–288 (1996).

8. Jung, S. Y. et al. Breast cancer risk and insulin resistance: Post genome-wide gene-environment interaction study using a random survival forest. Cancer Res. 79(10), 2784–2794 (2019).

9. Ishwaran, H. & Lu, M. Random survival forests. In Wiley StatsRef: Statistics Reference Online 1–13 (2014).

10. Qiu, W. et al. Interpretable machine learning prediction of all-cause mortality. Commun. Med. 2(1), 125 (2022).

11. Hamidi, O. et al. Identifying important risk factors for survival in kidney graft failure patients using random survival forests. Iran. J. Public Health 45(1), 27 (2016).

12. Paluszyńska, A. Understanding random forests with randomForestExplainer. In The Comprehensive R Archive Network (2023).

13. Ehrlinger, J. ggRandomForests: Exploring random forest survival. arXiv:1612.08974 (2016).

14. Ehrlinger, J. & Blackstone, E. H. ggRandomForests: Survival with Random Forests (Springer, 2019).

15. Ehrlinger, J. ggrandomforests: Visually exploring a random forest for regression. arXiv:1501.07196 (2015).

16. Ehrlinger, J. ggRandomForests: Random forests for regression. arXiv:1501.07196 (2016).

17. Benowitz, N. L. et al. Optimal serum cotinine levels for distinguishing cigarette smokers and nonsmokers within different racial/ethnic groups in the United States between 1999 and 2004. Am. J. Epidemiol. 169(2), 236–248 (2009).

18. Kim, S. Overview of cotinine cutoff values for smoking status classification. Neurosci. Nicotine 2019, 419–431 (2019).

19. Tresca, A. J. Normal White Blood Cell (WBC) Count (2022, accessed 10 Jun 2023). https://www.verywellhealth.com/white-blood-cell-wbc-count-1942660

20. Higuera, V. What Is a White Blood Cell (WBC) Count? (2022, accessed 10 Jun 2023). https://www.healthline.com/health/wbc-count

21. Wongvibulsin, S., Wu, K. C. & Zeger, S. L. Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Med. Res. Methodol. 20(1), 1–14 (2020).

22. Huang, H.-X. et al. Associations of plasma glucagon levels with estimated glomerular filtration rate, albuminuria and diabetic kidney disease in patients with type 2 diabetes mellitus. Diabetes Metabol. J. 45(6), 868–879 (2021).

23. Nguyen, V. K. et al. Harmonized US National Health and Nutrition Examination Survey 1988–2018 for high throughput exposome-health discovery. MedRxiv 2023, 896 (2023).

24. Gordon, L. & Olshen, R. A. Tree-structured survival analysis. Cancer Treatment Rep. 69(10), 1065–1069 (1985).

25. Korn, E. L., Graubard, B. I. & Midthune, D. Time-to-event analysis of longitudinal follow-up of a survey: Choice of the time-scale. Am. J. Epidemiol. 145(1), 72–80 (1997).

26. Thiébaut, A. C. & Bénichou, J. Choice of time-scale in Cox's model analysis of epidemiologic cohort data: A simulation study. Stat. Med. 23(24), 3803–3820 (2004).

27. Pencina, M. J., Larson, M. G. & D'Agostino, R. B. Choice of time scale and its effect on significance of predictors in longitudinal studies. Stat. Med. 26(6), 1343–1359 (2007).

28. Toloşi, L. & Lengauer, T. Classification with correlated features: Unreliability of feature ranking and solutions. Bioinformatics 27(14), 1986–1994 (2011).

29. Strobl, C. et al. Conditional variable importance for random forests. BMC Bioinform. 9(1), 1–11 (2008).

30. Gregorutti, B., Michel, B. & Saint-Pierre, P. Correlation and variable importance in random forests. Stat. Comput. 27(3), 659–678 (2017).

31. Darst, B. F., Malecki, K. C. & Engelman, C. D. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet. 19(1), 65 (2018).

Funding

Work performed at the Technical University of Denmark was supported by the start package grant NNF22OC0075778 of the Novo Nordisk Foundation. JAC and VKN were supported by a grant from the National Institutes of Health (R01ES028802).

Author information

Authors and Affiliations

School for Environment and Sustainability, University of Michigan, Ann Arbor, MI, USA
Bu Zhao

Department of Environmental Health Sciences, School of Public Health, University of Michigan, Ann Arbor, MI, USA
Vy Kim Nguyen, Justin A. Colacino & Olivier Jolliet

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Vy Kim Nguyen

School of Environment, Tsinghua University, Beijing, China
Ming Xu

Quantitative Sustainability Assessment, Department of Environmental and Resource Engineering, Technical University of Denmark, Kongens Lyngby, Denmark
Olivier Jolliet

Contributions

Olivier Jolliet and Bu Zhao designed research; Vy Nguyen conducted the data collection and preprocessing; Bu Zhao analyzed the data; Olivier Jolliet, Justin Colacino, and Ming Xu conducted data interpretation; Bu Zhao, Vy Nguyen, Ming Xu, Justin Colacino, and Olivier Jolliet wrote the paper.

Corresponding authors

Correspondence to Bu Zhao or Olivier Jolliet.

Ethics declarations

Competing interests


The authors declare no competing interests.


Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.


The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhao, B., Nguyen, V.K., Xu, M. et al. Random survival forest for predicting the combined effects of multiple physiological risk factors on all-cause mortality. Sci Rep 14, 15566 (2024). https://doi.org/10.1038/s41598-024-66261-0

Received: 08 February 2024

Accepted: 01 July 2024

Published: 06 July 2024

DOI: https://doi.org/10.1038/s41598-024-66261-0

Keywords

Random survival forests; Survival tree; All-cause mortality; Physiological factors; Risk visualization
