New CZI Virtual Cell Model Combines Language, Gene Expression Training

Published by GenomeWeb and other sources, 2024-12-13 10:20


NEW YORK – Researchers at the Chan Zuckerberg Initiative earlier this week launched a new artificial intelligence-based virtual cell model that has been trained on textual descriptions of gene networks as well as single-cell transcriptomics data.

The model, dubbed scGenePT, combines approaches taken by two existing foundational virtual cell models: scGPT, developed by Bo Wang's group at the University of Toronto, and GenePT, from researchers at Stanford University. Ana-Maria Istrate, senior research scientist at CZI, led the team that came up with scGenePT, and the researchers posted a preprint of their work to bioRxiv in October.

The CZI researchers began by training a model on single-cell gene expression data in the manner of scGPT, which can provide a basis for predictions about cell type annotations and help normalize data. They also added text-based data through National Center for Biotechnology Information (NCBI) gene card and UniProt protein summaries, an approach also taken by GenePT, and added gene function annotations from the UniProt Gene Ontology.
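
As a rough illustration of what combining the two kinds of information could look like, the sketch below adds a projected text embedding of a gene's description to an embedding learned from expression data. The article does not say how scGenePT merges the two, so the element-wise addition, the function names, and the toy numbers are all assumptions made for illustration, not the CZI team's implementation.

```python
def project(vector, matrix):
    """Map a text embedding (length d_text) into model space.

    `matrix` is a d_text x d_model list of rows; the result has length d_model.
    """
    if len(vector) != len(matrix):
        raise ValueError("vector length must match the number of matrix rows")
    d_model = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(d_model)]


def combine(expr_emb, text_emb, projection):
    """Add the projected text embedding to the expression-derived embedding.

    Element-wise addition is an assumption; other fusion schemes are possible.
    """
    projected = project(text_emb, projection)
    if len(projected) != len(expr_emb):
        raise ValueError("projected text embedding must match the model dimension")
    return [e + t for e, t in zip(expr_emb, projected)]


# Toy values (hypothetical): a 2-dimensional expression embedding and a
# 3-dimensional text embedding for a single gene.
expr_emb = [0.5, -1.0]
text_emb = [1.0, 0.0, 2.0]
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

combined = combine(expr_emb, text_emb, projection)
print(combined)
```

In a trained model the projection would be learned along with everything else. It is fixed here only to keep the example small.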

'A lot of foundation models use one modality,' Istrate said, namely gene expression counts from single-cell RNA-seq. 'But there's a whole other realm of info you have about genes, published in the research literature. The question we had was, "Can you use that?" We found that, yes, it's possible … incorporating this prior knowledge might help us improve performance,' she said, suggesting that the ceiling for performance on particular tasks could be higher than previously thought.

ScGenePT joins a growing list of so-called 'foundational' AI models that are trained on large volumes of biological data and can then be used to generate predictions. Tools like scGPT and Geneformer have been trained on millions of single-cell gene expression profiles. When fed new data, they can use that training to perform various tasks, such as annotating cell types or simulating the effects of gene knockout on transcriptome-wide expression.

GenePT has previously tried using text to predict cellular gene expression patterns, with a concept similar to ChatGPT. ScGenePT's algorithm incorporates this into the gene expression-based model as prior knowledge, said Christina Theodoris, a researcher at the Gladstone Institutes who developed the Geneformer AI model, which is similar to scGPT.

'This allows the model to start from a baseline that is informed by prior research on gene functions.'

For in silico perturbation experiments, Istrate's team found that text alone was not as powerful as single-cell gene expression data alone but that including it helped AI models outperform other models that had 'hard-coded' biological knowledge. They benchmarked scGenePT against GEARS (graph-enhanced gene activation and repression simulator) from Jure Leskovec's lab at Stanford, a deep-learning model for predicting gene perturbation that is based on gene regulatory network graphs.

Specifically, language helps most in cases where the AI model has to predict the effect of two gene perturbations where neither of the genes had been seen during training.
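
To make that test scenario concrete, the sketch below sorts two-gene perturbations by how many of the two genes appear in the training set, so that the 'neither gene seen' cases can be scored separately. The gene names and helper functions are hypothetical placeholders, and the grouping is only one plausible way to set up such an evaluation, not necessarily the procedure used in the preprint.

```python
def count_seen_genes(perturbation, training_genes):
    """Return how many genes in a perturbation were perturbed during training."""
    return sum(1 for gene in perturbation if gene in training_genes)


def group_by_seen(perturbations, training_genes):
    """Group two-gene perturbations by the number of genes seen in training (0, 1, or 2)."""
    groups = {0: [], 1: [], 2: []}
    for perturbation in perturbations:
        groups[count_seen_genes(perturbation, training_genes)].append(perturbation)
    return groups


# Hypothetical gene names, used only to illustrate the grouping.
training_genes = {"GENE_A", "GENE_B", "GENE_C"}
test_perturbations = [
    ("GENE_A", "GENE_B"),  # both genes seen in training
    ("GENE_A", "GENE_X"),  # one gene seen
    ("GENE_X", "GENE_Y"),  # neither gene seen: the hardest case
]

groups = group_by_seen(test_perturbations, training_genes)
for n_seen, members in groups.items():
    print(n_seen, members)
```

Reporting prediction error per group rather than as a single average would show whether text-derived knowledge helps most where the model has no expression-based training signal for either gene.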

ScGenePT is available for researchers to use through CZI's Virtual Cell platform, launched earlier this week. It includes AI cell models developed by Istrate and other CZI researchers, as well as other leading models, including scGPT. 'Researchers can use the initial scGenePT and other models for biological tasks, such as predicting protein localization, annotating cell types, and integrating multiple batches of data,' CZI said in a statement.

CZI also issued a request for proposals to build new foundational AI models using its graphics processing unit cluster, which Istrate used to build and train scGenePT.

With language proving to be helpful to gene expression data in building better cell models, Istrate suggested that other data types could also boost performance. 'We haven't done experiments with this, but you could include protein information for protein-coding genes' or even add imaging data, she said.

'If you can get a representation of a gene from a specific modality, whether images or protein, you can think about incorporating it,' she said.
