NEW YORK – Alexander Bick was initially skeptical that using artificial intelligence algorithms in his rare disease research was going to work but felt he had little choice but to try it out.
One of the diseases he studies is RUNX1 familial platelet disorder, a blood condition observed in just a couple hundred people in the US. Pathogenic mutations in the transcription factor RUNX1 affect hematopoietic stem cell differentiation resulting in a higher risk for blood cancers, among other symptoms such as prolonged bleeding.
His lab at Vanderbilt University Medical Center and their collaborators had recently turned to single-cell studies of the disease, looking for differentially expressed genes to suggest drug targets.
However, not only were there very few patients to draw samples from, but those samples could only be obtained by a painful bone marrow biopsy. 'When you use [differential gene expression] for a small number of samples, the results are just not reliable,' Bick said. 'The differences you see are not what you're interested in.'
Enter: Geneformer. At a November 2023 meeting hosted by the Chan Zuckerberg Initiative, which was funding his work on RUNX1, Bick saw a presentation by Christina Theodoris, a researcher at the Gladstone Institutes who had created one of a new breed of AI tools that were fed massive amounts of biological data, in the same way ChatGPT has 'read' all of the internet.
Geneformer had been trained on every public human single-cell gene expression dataset that Theodoris had access to in 2021 — from about 30 million cells. CZI then further trained, or 'fine-tuned,' the AI model with a subset of data from its CZ CellxGene Census, a package of tens of millions of single-cell transcriptomes.
Among other capabilities, Geneformer can distinguish between cell states, including healthy and diseased. Moreover, it can also simulate the effects of up- or down-regulation, or even knockout, of a particular gene and predict whether it makes a diseased cell look more like a healthy one or vice versa.
In a May 2023 Nature paper introducing the model, Theodoris showed how in silico experiments using the tool were able to identify hundreds of genes whose loss was predicted to cause a shift from healthy to diseased states in cardiomyocytes.
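The in silico deletion screen described above can be caricatured in a few lines. The sketch below is a toy stand-in, not Geneformer's actual API: it borrows the model's rank-value encoding idea (genes ordered by expression normalized to corpus-wide medians) but substitutes random vectors for the learned transformer, and the gene names, medians, and state centroids are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

GENES = [f"G{i}" for i in range(50)]  # hypothetical gene names

def rank_value_encode(expression, gene_medians):
    """A cell becomes an ordered list of gene names, highest
    median-normalized expression first (rank-value encoding)."""
    normalized = expression / gene_medians
    return [GENES[i] for i in np.argsort(-normalized)]

def embed(ranked_genes, gene_vectors):
    """Toy cell embedding: rank-weighted average of per-gene vectors.
    (A real transformer learns this mapping; this is a stand-in.)"""
    weights = np.linspace(1.0, 0.1, num=len(ranked_genes))
    vecs = np.stack([gene_vectors[g] for g in ranked_genes])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented data: random gene vectors, one diseased cell, two centroids.
gene_vectors = {g: rng.normal(size=16) for g in GENES}
medians = rng.uniform(0.5, 2.0, size=len(GENES))
cell = rng.uniform(0.0, 10.0, size=len(GENES))
healthy_centroid = rng.normal(size=16)
diseased_centroid = embed(rank_value_encode(cell, medians), gene_vectors)

def in_silico_delete(expression, gene_index):
    """Simulate a knockout by zeroing one gene before re-encoding."""
    perturbed = expression.copy()
    perturbed[gene_index] = 0.0
    return perturbed

# Score every gene by how much its deletion moves the cell's embedding
# toward the healthy centroid and away from the diseased one.
shifts = []
for i in range(len(GENES)):
    emb = embed(rank_value_encode(in_silico_delete(cell, i), medians),
                gene_vectors)
    shifts.append(cosine(emb, healthy_centroid) - cosine(emb, diseased_centroid))

best = GENES[int(np.argmax(shifts))]
print(f"top candidate knockout: {best}")
```

A genome-wide version of this loop, run with a real model on GPUs, is what Bick's lab compresses into roughly 24 hours.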
Bick and his collaborators considered what Geneformer could do to help them study hematopoietic stem cells and decided to try it out. 'The field of single-cell computational methods is moving and developing so rapidly that I am just generally skeptical of all new methods until we try them in our hands,' he said.
'So, it was a general sense of 'How much could this tool actually help me execute my science?''
A year later, the researchers are able to run in silico genome-wide perturbation studies in about 24 hours with the help of graphics processing units (GPUs), hardware that accelerates the use of the AI model. 'To do that in hematopoietic stem cells, which are not easy to come by, would be hundreds of thousands of dollars and many, many, many months,' Bick said.
Using CZI's refined Geneformer model was free, on the other hand, and not just because he is a CZI grantee.
Not only is Bick's team getting lists of genes to target in wet lab experiments, 'we're finding things that are different from if we'd just looked at differential gene expression,' he said, and the predictions are looking pretty good.
'We're starting to see really exciting results,' he said. 'The success rate is not 100 percent, not even 50 percent, but for every five genes, one seems to be working in an experimental system.' Collaborators are taking existing drugs that target those genes and finding that they can change the state of an in vitro hematopoietic cell.
'These are genes we wouldn't have thought to test without these models,' he said.
'Get on the train'
Simulated perturbation experiments are just one of the many uses of Geneformer — it can also annotate cell types and predict genes central to a gene network — and Bick is just one of more than 30,000 users who have downloaded the model. Geneformer itself is just one of many new 'foundational' AI models trained on heaps of single-cell gene expression data, while researchers train other foundational models across genomics.
The AI era has definitively arrived in the life sciences, and some scientists are hoping it will herald a grand unifying vision of cellular and molecular biology, perhaps a squishier version of what particle physicists enjoy with their Standard Model. In a preprint posted to arXiv last month, a host of high-profile genomics researchers including Head of Genentech Research Aviv Regev, CZI Head of Science Stephen Quake, and the University of Washington's Jay Shendure, joined by AI tool pioneers such as Theodoris, outlined their vision for how AI could help build 'virtual cells' that could generate 'universal representations of biological entities across scales … facilitating interpretable in silico experiments to predict and understand their behavior using virtual instruments.'
'I think of this like learning to speak 'computation with a biology accent,' or 'biology with a computational accent,'' Regev told GenomeWeb. 'This trend is already naturally occurring in early-career researchers, and as the traditional boundaries between fields continue to erode, we will have more creativity and discovery.'
The use of AI will even change how biological science is conducted, from hypothesis generation to data analysis. 'Bottom line: Over the next decade, we'll see biology change from being 90 percent experimental and 10 percent computational to 80 percent computational and 20 percent experimental,' Quake said.
There are potential drawbacks, though. 'We may have to forgo our ability to build fully mechanistic models,' the arXiv preprint authors wrote, noting that such models have been 'one of the hallmarks of scientific discovery in biology.'
'There are so many things for which a cell avatar and cell oracle can be useful and impactful, even if it lacks in other ways,' Regev said. 'Just like running genetic screens with cells and animals does not give a direct mechanism but tells us a lot about biology, a well-performing virtual cell can teach us a lot, and then for other purposes, we can use other approaches.'
To fully realize the promise of AI, even more data on cellular behavior of all types — epigenetic, functional, interactional — are needed, experts say. Whether those data can be taken from researcher-directed, hypothesis-driven studies, or if they need to be purpose-generated to optimize their utility for AI models, isn't clear.
And, as in other fields, AI threatens to take over some of the mid-level work assigned to researchers in training. However, the potential rewards may be irresistible.
'Woe to those who ignore it,' said Garry Nolan, a researcher at Stanford University who has used AI tools in his own lab. He has cofounded a startup, Cellformatica, that uses ChatGPT-like AI to generate hypotheses based on uploaded data, including outlining the experiments one might need to test them — until now, the task of human scientists.
'It's inevitable. I don't know what else to say, except get on the train before you're left at the station,' he said. 'And it's moving so fast. Every other week, I feel like the work we've done has been enabled with another tool.'
Other AI-based cell models that Bick and his team could have used include scGPT from Bo Wang's Lab at the University of Toronto; scBERT, a model from researchers at China's Tencent AI lab and Shanghai Jiao Tong University, which takes the same fundamental approach as Google's bidirectional encoder representations from transformers (BERT) model; single-cell Variational Inference (scVI), a model developed by Nir Yosef's lab at the University of California, Berkeley; and Universal Cell Embeddings (UCE), developed in Jure Leskovec's and Quake's respective Stanford labs in collaboration with CZI.
The emergence of these models represents the 'biological data revolution and AI model revolution coming together,' Quake said. 'It couldn't have been done very long ago.' It also means that what's happening with virtual cells isn't much different from what's going on elsewhere in the world. 'It's very much riding on the coattails of the AI revolution,' he said, such as text analysis and image generation.
The transformer AI architecture that large language models (LLMs) like ChatGPT are based on was introduced in 2017 by researchers at Google, leading directly to BERT, OpenAI's GPT-4, and other 'foundation' AI models, which are trained on a broad set of data and can respond to a wide range of queries.
Moreover, they can be refined with additional data to take on narrower tasks.
These have enabled ChatGPT and other 'generative' AI tools that have driven news headlines over the past couple of years and provided punchlines and wacky illustrations for scientific conference presentations. Geneformer is also based on a transformer architecture and is a generative AI; however, instead of a list of questions to ask at a panel discussion, for example, it might produce a gene expression profile for a cell without a key gene.
这些工具使得ChatGPT和其他“生成性”人工智能工具在过去几年中成为新闻头条,并为科学会议演示文稿提供了妙语和古怪的插图。Geneformer也基于变压器架构,是一种生成AI;然而,例如,它可能会产生一个没有关键基因的细胞的基因表达谱,而不是在小组讨论中提出一系列问题。
Broadly speaking, LLMs are well suited for genomics, said George Vacek, global head of genomics alliances at Nvidia, whose GPUs are often used to make training and use of AI models faster. 'DNA is the language of life, with nucleotides encoding information, so LLMs can use an analogous approach for studying biological problems,' he said.
LLMs, foundational models, and generative AI all fall under deep learning models, a subset of machine-learning methods that is distinct from those used in classifier models such as random forests, which are currently applied in diagnostics and other clinical fields.
'We've seen amazing success in proteins with LLMs,' Quake said. 'There's been seminal, huge impact on protein design and understanding structure. It has raised our expectations that hopefully we can do something in the world of cells.' Earlier this month, two Google DeepMind researchers won shares of the Nobel Prize for chemistry for their work on AlphaFold, an AI algorithm that predicts protein structure based on amino acid sequence.
DNABert is another LLM for genomics, which has been trained on the human genome reference sequence. 'It really understands genetic sequence,' Vacek said, adding that it's helpful for tasks such as identifying functional variants. Like virtual cell models, there are many flavors of DNA LLMs, including Grover, from Anna Poetsch's lab at the Dresden University of Technology in Germany, and regLM, developed by Genentech.
Applications include sequence design, such as promoters and enhancers, and predicting the fitness of a variant.
As key as transformer models and GPU acceleration have been to developing foundation AI models, they're not powerful unless they have enormous amounts of data to train on. Geneformer is a special case as it was trained 'from scratch' on data from 30 million single cells. 'That was all the publicly available data we could identify at that time,' Theodoris said.
More recently, she has retrained the model on approximately 100 million cells. With a broad training base, others can now use smaller amounts of their own data to fine-tune the model for specific applications.
The fine-tuning can be done multiple times. The specific tool Bick used was CZI's model of Geneformer, which was further trained on CZ CellxGene data using the Census, a collaboration with bioinformatics firm TileDB to fit the data together in a way that makes it easier to port data over to AI models.
Bick also fine-tuned the CZI Geneformer model with gene expression data from 10,000 cells that his team had analyzed. CZI has trained other AI models with the 70 million-cell Census, including scGPT, UCE, and scVI, and provides access to these tools for free to interested researchers as part of its philanthropic mission.
Some models, such as UCE, are designed to work without fine-tuning, an approach called zero-shot learning. 'You can embed whatever data you want,' Quake said. That means it can handle cell types from organisms that it has never seen before — say, octopus — and still perform reasonably well. 'Hopefully, that sets the standard for other people making models going forward,' Quake said.
Once a foundational AI model has been trained with the relevant data, it needs a task to perform. In silico perturbation experiments are one powerful use case for virtual cell models, but they can do many useful things.
Cell typing is a major strength of Geneformer, UCE, and scVI, according to Ambrose Carr, director of product management for data at CZI. 'It's helpful to say, 'this is a lymphocyte' versus 'this is a fibroblast,'' he said. Predictions 'are usually not perfect,' Carr said, 'but a reasonable prediction of what kind of cells and biology you're seeing in your sample is really helpful and expedites the process of understanding what your data are saying.'
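One simple way to picture the cell-typing task Carr describes is nearest-centroid matching in an embedding space: label a query cell by whichever annotated reference centroid its embedding most resembles. This is a toy numpy sketch with invented centroids and dimensions, not the interface of Geneformer, UCE, or scVI.

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_centroid_label(embedding, centroids):
    """Assign the label whose reference centroid has the highest
    cosine similarity to the query cell's embedding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda label: cos(embedding, centroids[label]))

# Toy reference atlas: one made-up centroid per annotated cell type.
dim = 32
centroids = {
    "lymphocyte": rng.normal(size=dim),
    "fibroblast": rng.normal(size=dim),
    "cardiomyocyte": rng.normal(size=dim),
}

# A query cell that is a noisy copy of the lymphocyte profile should
# land on the lymphocyte label.
query = centroids["lymphocyte"] + 0.1 * rng.normal(size=dim)
print(nearest_centroid_label(query, centroids))
```

As Carr notes, real predictions are usually not perfect; in practice the payoff is a fast, reasonable first pass over a sample rather than a definitive annotation.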
Data normalization, such as eliminating batch effects, and multimodal integration of data are two more uses. Broadly speaking, 'simulation is one of the great strengths' of AI models, Nvidia's Vacek said, not just of virtual cell models.
'Generative AI does a much better job of simulating the true complexity of the human genome properly' than previous approaches, he said, especially regarding structural variants, which are harder to simulate than SNPs and indels. 'Conversely, it would be better at calling structural variants, as well,' he said.
In addition to free access through CZI, the AI tools mentioned in this article can be downloaded directly off GitHub and run on a laptop computer, though that might be an excruciatingly slow process. To grease the wheels, some companies have already begun commercializing them, from startups to public companies like Nvidia and Ginkgo Bioworks.
Capitalizing on commercial opportunities
Given generative AI's ability to create novel protein sequences and perform in silico perturbation experiments, drug discovery is a fertile area for companies to apply these tools, and several companies are releasing ones they've created for public use.
Nvidia offers BioNemo, an AI platform for building and training models for drug discovery, including 3D protein structure prediction, de novo protein and small molecule design, and molecular docking, as well as genomics with the models Geneformer and DNABert. Numerous companies in the drug discovery, sequencing, and infectious disease fields are using BioNemo to build and use generative AI models, Vacek said.
Startups are all over this space, including UK-based Shift Bioscience, which raised $16 million in seed funding this month, and UK-based Phenomic AI, which has developed a modified version of scVI to look for unique targets expressed in cancer tissue versus normal tissue. Last month, Phenomic released a free version of its tool that contains its data from normal tissues but not its cancer sample data, which it considers proprietary.
Like other similar models, Phenomic AI's tool can do cell typing and data normalization, said Sam Cooper, the firm's cofounder and chief technology officer.
'You can train a machine-learning model to go from English to French without having any paired data. So, you can train a model just to read English and just to read French, and it'll figure out a rough translation approach that works surprisingly well,' he said. Phenomic AI uses that approach for getting rid of technical batch effects between single-cell datasets generated by different assays, namely 10x Genomics' Chromium assays and InDrop, another droplet-based method that is mostly defunct.
'The most exciting thing is, we think we can take the same approach to mapping spatial RNA and bulk RNA, as well,' he said. 'We can create a unified model of different sorts of RNA expression technologies.'
This translation approach could help with multimodal data integration, such as single-cell ATAC-seq and methylation data. Current methods often use single-cell gene expression as a way to bridge the datasets, often from co-assays. 'It's not as good as having massive amounts of paired data, but there's not that much paired data in biology compared to the amount of unpaired data,' Cooper said.
'And the differences between the technologies and modalities are way smaller than they are between English and French.'
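Cooper's unpaired-translation analogy can be caricatured with the crudest possible 'batch translation': shift and scale each gene in one assay so its per-gene moments match the other, without ever seeing paired cells. Real models learn a shared latent space instead; this sketch, with invented data standing in for two platforms, only illustrates what 'making batches comparable without paired data' means.

```python
import numpy as np

rng = np.random.default_rng(2)

def align_batches(source, target):
    """Crude unpaired 'translation': per-gene z-score the source batch,
    then rescale it to the target batch's per-gene mean and variance."""
    src_mu, src_sd = source.mean(axis=0), source.std(axis=0) + 1e-8
    tgt_mu, tgt_sd = target.mean(axis=0), target.std(axis=0)
    return (source - src_mu) / src_sd * tgt_sd + tgt_mu

# Toy data: the same biology measured on two platforms, with a
# per-gene offset and gain standing in for the batch effect.
biology = rng.normal(size=(200, 10))
batch_a = biology + rng.normal(size=10)   # platform 1: per-gene offset
batch_b = biology * 2.0 + 5.0             # platform 2: gain and shift

corrected_b = align_batches(batch_b, batch_a)
print(np.abs(corrected_b.mean(axis=0) - batch_a.mean(axis=0)).max())  # ≈ 0
```

Moment matching destroys any batch-specific biology along with the batch effect, which is exactly why learned translation models are worth the trouble; the point of the sketch is only the unpaired setup.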
In late September, Ginkgo Bioworks began selling access to AA-0, a protein LLM built in collaboration with Google through an application programming interface. The model is built on Ginkgo's proprietary data on protein structures and interactions. It's one of several AI models in development at Ginkgo and part of a broader strategy of offering its proprietary technologies to customers.
The firm is also selling data for others to train AI models on.
To start, Ginkgo is offering two uses of AA-0. The first allows customers to 'mask' a particular section of the input, say, a variable region in an antibody, and the model will fill in what's missing. The second is an 'embedding calculation,' an intermediate step in protein classification that determines, for example, whether a protein is a kinase or how many proteins in a dataset are kinases.
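The mask-and-fill interface described above can be illustrated with a deliberately dumb stand-in: fill each masked residue by a column vote over a few related sequences. AA-0 conditions on the whole sequence with a learned model; only the input/output shape here reflects the real service, and the sequences and mask character are invented.

```python
from collections import Counter

def fill_masks(masked, related_sequences, mask_char="?"):
    """Toy stand-in for an LM's mask filling: replace each masked
    position with the most common residue at that column across a
    handful of related sequences of the same length."""
    filled = list(masked)
    for i, ch in enumerate(filled):
        if ch == mask_char:
            column = Counter(seq[i] for seq in related_sequences)
            filled[i] = column.most_common(1)[0][0]
    return "".join(filled)

# Invented toy 'antibody region': mask two residues, let the vote fill them.
related = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"]
print(fill_masks("MKTA??AK", related))  # → MKTAYIAK
```

A real protein LLM would propose context-dependent completions with per-residue likelihoods rather than a majority vote, but the contract is the same: masked input in, completed sequence out.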
'For a protein with around 500 amino acids, users should be able to get predictions on 2,000 sequences for roughly 20 cents,' said Ankit Gupta, general manager of Ginkgo AI, adding that there will also be a free tier of access to the model.
For interested researchers, the barrier to entering the brave new world of AI isn't high. 'Researchers comfortable running computational tools will not find [Geneformer] so different,' Bick said. 'Two graduate students can pick it up over the course of a week.'
Other tools also have good documentation, making them relatively easy to pick up, said Neda Mehdiabadi, a rare disease researcher at Australia's Murdoch Children's Research Institute, who has tried out both Geneformer and scFoundation, a model developed by researchers at China's Tsinghua University.
'I could understand both of them,' she said. 'I didn't need to have direct input from the authors. The only reason I decided to go with Geneformer was the recent update to make the model larger.'
But is bigger necessarily better? Benchmarking foundational AI models against each other is an emerging challenge, as is comparing them to methods already in use — including human intuition.
'It's really important to have benchmarking on biologically meaningful tasks, as well as a diverse panel of those tasks, to confirm that the model has learned generalizable knowledge and to ensure that we consider how the ground truth was established,' Theodoris said. 'Because in some cases, it might not be very clear.'
Though LLMs are prone to hallucinate in ways that could be detrimental to science, like making up citations, models like Geneformer don't suffer from this in the same way because of the specificity of the data they were trained on. Moreover, by using them as hypothesis generators rather than the final word, it's only a question of good or bad hypotheses and not real versus imagined results.
'A hallucination could just be a bad prediction. It's something we're still learning about,' Bick said, noting that his team is trying to do more systematic benchmarking of Geneformer and its predictions. 'Some of the questions we're trying to answer are, 'What is our comparator group?' and 'What's our null hypothesis?' Is it a random set of genes pulled out of a hat? Is it some researcher saying, 'Here are six genes I think are cool?''
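One of the comparators Bick mentions, a random set of genes pulled out of a hat, is easy to simulate. The numbers below are invented for illustration (a 20,000-gene genome in which 2 percent of genes would validate); the point is that a random five-gene panel rarely contains a single hit, so a model hitting one gene in five clears this particular null comfortably.

```python
import random

random.seed(3)

def random_panel_hit_rate(n_genes, n_true_hits, panel_size, n_trials=20_000):
    """Null model for 'genes pulled out of a hat': how often does a
    random panel of `panel_size` genes contain at least one true hit?"""
    genes = list(range(n_genes))
    hits = set(range(n_true_hits))  # pretend the first few genes are real hits
    successes = 0
    for _ in range(n_trials):
        panel = random.sample(genes, panel_size)
        if any(g in hits for g in panel):
            successes += 1
    return successes / n_trials

null_rate = random_panel_hit_rate(n_genes=20_000, n_true_hits=400, panel_size=5)
print(f"chance a random 5-gene panel contains a hit: {null_rate:.3f}")
```

Analytically the null is 1 − (1 − 0.02)^5, a bit under 10 percent per panel, versus an observed 20 percent validation rate per gene in Bick's account; the simulation just makes the comparator concrete.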
Theodoris further suggested 'there could be a lot of questions that we're able to answer with simpler approaches, where we don't necessarily need these larger models. We really want to understand where they are able to push our knowledge and make the predictions that the other approaches are not able to.'
But the way things are going, it may be hard to imagine that AI won't work its way into every aspect of science.
Cleaning up a mess
Over the course of more than three decades as a scientist, Nolan has generated heaps of data. As a cofounder of Akoya Biosciences, IonPath, and Scale Biosciences, he has helped others create heaps more. Lately, he has become frustrated, feeling he has helped drown the field in more data than researchers could hope to analyze.
But with a new startup founded last year, called Cellformatica, he's hoping to 'clean up the mess my lab helped to create.' The firm uses an LLM trained on 38 million PubMed abstracts, 6 million full-text articles, and 17 structured biological datasets to generate novel research ideas when provided with data — a mass spectrometry signature, a list of target genes — and a context, such as head and neck cancer.
'What we've got behind the scenes is a Ph.D.-level scientist doing six months of work for you in an hour,' Nolan claimed.
'It gives you hypotheses, many of which you could come up with yourself, but why should you?' he asked. In addition, it will outline the validation experiments needed to test the hypotheses. One can even tell it not to exceed a certain cost with experiments or to exclude results from a particular lab in its analysis.
'It isn't creative by nature,' Nolan said, but it does have a huge advantage over a human — it can analyze the entirety of the scientific literature at blazing fast speeds. 'It basically does the hard part of the legwork of going into the literature and finding answers and summarizing them for you in ways that you wouldn't have thought of doing before,' he said.
Cellformatica also has a module that looks for connections between genes or cellular processes in the context of a particular disease, going as far as building 'causality maps' that can show how a cancer progresses.
Nolan was able to use Cellformatica to create hypotheses for which immune cell events were associated with a response to immune checkpoint blockade in an analysis of the tumor microenvironment in Merkel-cell carcinoma. It also provided a list of targets that could be used to test the hypotheses, some of which could be drugged.
Still, Nolan considers Cellformatica to be 'relatively primitive' in comparison to what's possible, and a new development in AI models could have untold consequences for biology, he said.
Last month, OpenAI released a new model that promises the ability to perform multistep reasoning, something that its ChatGPT doesn't do. In a blog post, the firm said the new 'o1' model outperformed a Ph.D.-level human on a benchmarking battery of physics, biology, and chemistry problems, the first model AI to do so.
'These results do not imply that o1 is more capable than a Ph.D. in all respects — only that the model is more proficient in solving some problems that a Ph.D. would be expected to solve,' the firm said, noting that it can even be used 'to annotate cell sequencing data.' OpenAI did not respond to requests for comment.
These so-called 'chain-of-thought' models could help researchers determine the questions they need to ask. And such a model would take the Cellformatica approach one step further. 'We can probably expect it to be more incisive in how it answers. If we wanted to provide details on how to do an experiment, it would reason through the questions better.'
'It really behooves us to take advantage of whatever technology might be available,' he said. 'If using large language models allow us to better understand what cancer means, faster, we'd be fools not to take advantage of it, if it's sitting right there on offer.'