Abstract
The volume of public proteomics data is rapidly increasing, causing a computational challenge for large-scale reanalysis. Here, we introduce quantms (https://quantms.org/), an open-source cloud-based pipeline for massively parallel proteomics data analysis. We used quantms to reanalyze 83 public ProteomeXchange datasets, comprising 29,354 instrument files from 13,132 human samples, to quantify 16,599 proteins based on 1.03 million unique peptides. quantms is based on standard file formats, improving the reproducibility, submission and dissemination of the data to ProteomeXchange.
Main
In recent years, the field of proteomics has seen unprecedented growth in publicly available datasets, with a trend toward studies that analyze larger numbers of samples. As of December 2023, the number of public datasets stored in the PRIDE database [1] exceeded 25,000, including a remarkable increase in large datasets containing more than 100 instrument files, from 100 in 2014 to 4,435 submissions in 2024.
In parallel, a range of transformative improvements in proteomic data processing software has enabled a deeper and more precise look into the proteome. Reprocessing old data with such new tools, therefore, yields additional biological and biomedical insights [2,3]. However, the increased size of individual datasets presents a significant computational bottleneck, making it challenging to reanalyze large experiments on conventional workstations.
The automated analysis of publicly accessible quantitative proteomics data is further limited by the lack of metadata that characterizes the phenotypes, the samples and the instrument operation. Although some of these challenges are tackled in earlier studies [4,5,6], many research groups still cannot perform automated large-scale quantitative analysis in the cloud and on distributed architectures.
To address this challenge, the field requires scalable bioinformatics solutions that leverage sample metadata to automatically quantify peptides and proteins, perform absolute or differential-expression analysis and provide extensive quality control output.
Here we introduce quantms (https://quantms.org), an open-source cloud-based pipeline for massively parallel proteomic data reanalysis.
It supports three major types of experiment: data-dependent acquisition label-free (DDA-LFQ), isobaric tandem mass tag (TMT)-based (DDA-plex) and label-free data-independent acquisition (LFQ-DIA).
Data availability
The datasets reanalyzed in the present study can be searched on the quantms web page (https://quantms.org/datasets). In addition, all the results can be found in the PRIDE database FTP (http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/). Source data are provided with this paper.
Code availability
All software, algorithms and tools are available on GitHub: quantms at https://github.com/bigbio/quantms and pmultiqc at https://github.com/bigbio/pmultiqc. The full documentation of quantms is available at https://quantms.readthedocs.io/en/latest/.
References
Perez-Riverol, Y. et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 50, D543–D552 (2022).
Levitsky, L. I. et al. Massive proteogenomic reanalysis of publicly available proteomic datasets of human tissues in search for protein recoding via adenosine-to-inosine RNA editing. J. Proteome Res. 22, 1695–1711 (2023).
Jarnuczak, A. F. et al. An integrated landscape of protein expression in human cancer. Sci. Data 8, 115 (2021).
Feng, J. et al. Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis. Nat. Biotechnol. 35, 409–412 (2017).
Choi, M. et al. MassIVE.quant: a community resource of quantitative mass spectrometry-based proteomics datasets. Nat. Methods 17, 981–984 (2020).
Vaudel, M. et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol. 33, 22–24 (2015).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Ewels, P. A. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat. Biotechnol. 38, 276–278 (2020).
Dai, C. et al. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat. Commun. 12, 5854 (2021).
Wang, L. H. et al. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun. Mass Spectrom. 21, 2985–2991 (2007).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Savitski, M. M., Wilhelm, M., Hahne, H., Kuster, B. & Bantscheff, M. A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol. Cell Proteom. 14, 2394–2404 (2015).
Choi, M. et al. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 30, 2524–2526 (2014).
Pfeuffer, J. et al. OpenMS 3 enables reproducible analysis of large-scale mass spectrometry data. Nat. Methods 21, 365–367 (2024).
Fermin, D., Avtonomov, D., Choi, H. & Nesvizhskii, A. I. LuciPHOr2: site localization of generic post-translational modifications from tandem mass spectrometry data. Bioinformatics 31, 1141–1143 (2015).
Lazear, M. R. Sage: an open-source tool for fast proteomics searching and quantification at scale. J. Proteome Res. 22, 3652–3659 (2023).
Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods 17, 41–44 (2020).
Bai, M. et al. LFQ-based peptide and protein intensity differential expression analysis. J. Proteome Res. 22, 2114–2123 (2023).
Lautenbacher, L. et al. ProteomicsDB: toward a FAIR open-source resource for life-science research. Nucleic Acids Res. 50, D1541–D1552 (2022).
Wang, M., Herrmann, C. J., Simonovic, M., Szklarczyk, D. & von Mering, C. Version 4.0 of PaxDb: protein abundance data, integrated across model organisms, tissues, and cell-lines. Proteomics 15, 3163–3168 (2015).
Acknowledgements
Y.P.-R. was funded by the EU H2020 project EPIC-XS (grant no. 823839), Wellcome grants (nos 208391/Z/17/Z, 223745/Z/21/Z) and EMBL core funding. M.B. and C.D. were funded by the National Key Research and Development Program of China (grant no. 2018YFA0507504). V.D. was supported by the Federal Ministry of Education and Research (BMBF), as part of the National Research Initiatives for Mass Spectrometry in Systems Medicine (’MSCoreSy’), under grant agreement no. 161L0221.

Funding
Open access funding provided by European Molecular Biology Laboratory (EMBL).

Author information
Author notes
These authors contributed equally: Chengxin Dai, Julianus Pfeuffer.

Authors and Affiliations
Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
Chengxin Dai, Hong Wang, Ping Zheng & Mingze Bai

State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China
Chengxin Dai & Mingze Bai

Algorithmic Bioinformatics, Freie Universität Berlin, Berlin, Germany
Julianus Pfeuffer

Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Stockholm, Sweden
Lukas Käll

Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
Timo Sachsenberg & Oliver Kohlbacher

Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
Timo Sachsenberg & Oliver Kohlbacher

Charité – Universitätsmedizin Berlin, Berlin, Germany
Vadim Demichev

Institute for Translational Bioinformatics, University Hospital Tübingen, Tübingen, Germany
Oliver Kohlbacher

European Molecular.
Contributions
C.D., J.P. and Y.P.-R. developed the quantms workflow. H.W., C.D., J.P. and Y.P.-R. developed the pmultiqc library and web application. P.Z., H.W., C.D. and Y.P.-R. developed the quantms.org web page to present the results of the quantitative analysis. T.S. and J.P. developed the algorithms and tools in OpenMS for the DDA-plex and LFQ-DDA workflows. V.D. contributed to the development of the algorithm parallelization of the DIA-NN tool and the LFQ-DIA workflow. C.D., J.P. and Y.P.-R. performed the annotations of the datasets and the data analysis. L.K. designed and developed the stand-in decoy. C.D., J.P., T.S., V.D., M.B., O.K. and Y.P.-R. wrote the paper and contributed to the design of the workflow and quantms.org project.

Corresponding author
Correspondence to Yasset Perez-Riverol.

Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Allison Doerr, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information
Supplementary Information
Supplementary Notes 1–10, Figs. 1–24 and Tables 1–10.
Reporting Summary
Peer Review File
Supplementary Data 1
Source data for the figures shown in the Supplementary Notes.

Source data
Source Data Fig. 2
Statistical source data.

Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dai, C., Pfeuffer, J., Wang, H. et al. quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02343-1

Received: 12 May 2023
Accepted: 03 June 2024
Published: 04 July 2024
DOI: https://doi.org/10.1038/s41592-024-02343-1