Vision Mamba and xLSTM-UNet for medical image segmentation


Abstract

Deep learning-based medical image segmentation methods are generally divided into convolutional neural networks (CNNs) and Transformer-based models. Traditional CNNs are limited by their receptive field, making it challenging to capture long-range dependencies. While Transformers excel at modeling global information, their high computational complexity restricts their practical application in clinical scenarios.

To address these limitations, this study introduces VMAXL-UNet, a novel segmentation network that integrates Structured State Space Models (SSM) and lightweight LSTMs (xLSTM). The network incorporates Visual State Space (VSS) and ViL modules in the encoder to efficiently fuse local boundary details with global semantic context.

The VSS module leverages SSM to capture long-range dependencies and extract critical features from distant regions. Meanwhile, the ViL module employs a gating mechanism to enhance the integration of local and global features, thereby improving segmentation accuracy and robustness. Experiments on datasets such as ISIC17, ISIC18, CVC-ClinicDB, and Kvasir demonstrate that VMAXL-UNet significantly outperforms traditional CNNs and Transformer-based models in capturing lesion boundaries and their distant correlations.

These results highlight the model’s superior performance and provide a promising approach for efficient segmentation in complex medical imaging scenarios.

Introduction

High-resolution medical images are crucial in modern medicine, but their complexity and diversity pose significant challenges to traditional segmentation methods [1]. For instance, retinal vessel segmentation often suffers from low contrast, high noise, and brightness variations, leading to insufficient segmentation accuracy. Additionally, the vast amount of data further increases the time and cost of manual annotation.

With the rapid development of computer and artificial intelligence technologies, convolutional neural networks (CNNs) have demonstrated powerful modeling capabilities and gained widespread attention. Among them, UNet [2], a deep learning architecture designed specifically for biomedical image segmentation, features a symmetric “U”-shaped structure comprising an encoder and a decoder. This design enables UNet to achieve outstanding performance in handling the complex structures in medical images. Following this direction, various UNet variants such as UNet++ [3], U-Net V2 [4], and UNet 3+ [5] have been developed and successfully applied to image and volume segmentation across diverse medical imaging modalities. However, due to the inherent limitations of convolution operations, CNN-based networks struggle to effectively capture global context and long-range dependencies, limiting their performance in complex semantic modeling tasks.

To overcome these limitations, researchers have drawn inspiration from the success of Transformers in natural language processing and introduced Transformer-based models into computer vision. Vision Transformer (ViT) [6], one of the first models fully based on multi-head self-attention, efficiently captures long-range dependencies and encodes complex shape information, demonstrating remarkable modeling capabilities. Building on this, improved models such as Swin Transformer [7], which incorporates local window self-attention to reduce computational complexity, and DeiT [8], which optimizes training strategies for smaller datasets, have been proposed. While these models achieve notable performance improvements, pure Transformer architectures remain limited in capturing local details and suffer from high computational complexity, especially when processing high-resolution images, posing challenges for clinical applications.

To address these issues, hybrid models combining CNNs and Transformers have been explored to enhance network performance. For example, TransUNet [9] integrates Transformers with the UNet architecture, significantly improving medical image segmentation by enhancing global feature extraction. However, these methods have yet to strike an ideal balance between improving performance and reducing computational cost, necessitating further optimization.

Recently, Mamba [10] has garnered attention for its powerful sequence modeling capabilities. Particularly in medical image segmentation, integrating Mamba into classic UNet architectures effectively captures long-range dependencies. As a structured state space model (SSM), Mamba offers linear computational complexity and efficient global feature modeling, making it an ideal solution for processing global context information in segmentation models.

Similarly, the recently proposed xLSTM [11], an extended version of traditional LSTM, enhances long-range dependency modeling through optimized gating mechanisms. Compared to Transformers, xLSTM not only achieves linear computational complexity but also excels in capturing sequence details. Although initially applied to natural language processing and image classification, xLSTM’s potential in medical image segmentation is worth exploring.

The key to achieving accurate medical image segmentation lies in effectively capturing and integrating local features and long-range correlations. While combining CNNs and Transformers offers a promising direction, alternative approaches may also provide effective solutions. Inspired by VMamba [12] and xLSTM, this paper proposes VMAXL-UNet, a novel segmentation model designed to overcome the limitations of existing methods and further improve segmentation accuracy.

VMAXL-UNet inherits the classic UNet design elements, such as the encoder-decoder architecture and skip connections, while integrating the strengths of SSM and xLSTM to enhance its ability to model complex dynamic systems and long-range dependencies. To further boost performance, VMAXL-UNet employs a four-layer encoder structure, where the first three layers consist of VSS blocks and BasicConv blocks, and the fourth layer combines VSS blocks with ViL blocks, incorporating patch merging for down-sampling to enhance feature extraction.

The decoder mirrors this design with four layers, each composed of two BasicConv blocks, and employs patch expansion for up-sampling to restore the segmentation output’s resolution. Skip connections integrate features via additive operations, further improving segmentation performance.

Extensive experiments were conducted on multiple medical image datasets, including ISIC17 [13], ISIC18 [14], Kvasir-SEG [15], and ClinicDB [16], to validate the proposed model. The results demonstrate that VMAXL-UNet achieves superior segmentation performance across these datasets, showcasing its strong capability and potential in medical image segmentation. The primary contributions of this work are as follows:

1. Introduction of the ViL module for medical image segmentation, enhancing the model’s ability to handle complex lesion morphology and blurred boundaries, thereby improving segmentation accuracy, especially in fine-structure recognition and boundary clarity.

2. Proposal of VMAXL-UNet, an encoder-decoder architecture combining SSM and xLSTM, tailored for medical image segmentation tasks.

3. Comprehensive experiments conducted on four datasets, demonstrating the competitive performance and broad applicability of VMAXL-UNet in medical image segmentation.

Related Work

Applications of CNNs and transformers in medical image segmentation

Over the past few decades, Convolutional Neural Networks (CNNs) have been a core component of artificial neural networks, achieving remarkable success in deep learning and computer vision, and have found widespread application in the medical imaging field. Medical image segmentation, as a crucial task in image processing, witnessed a historic breakthrough with the introduction of U-Net.

U-Net effectively performs pixel-level classification through its encoder-decoder symmetric structure, significantly improving segmentation accuracy. Inspired by U-Net, UNet++ [3] was developed by incorporating dense skip connections to address the semantic gap during feature fusion. Subsequent research introduced techniques such as attention mechanisms [17], image pyramids [18], and residual networks [19], further enhancing the performance of CNN-based segmentation methods. The Transformer architecture, initially proposed by Vaswani et al. [20] in 2017 for natural language processing, gained significant attention for its superior ability to handle long-range dependencies and complex contextual relationships. In 2020, Dosovitskiy et al. [6] introduced the Vision Transformer, applying the Transformer architecture to image classification tasks in computer vision. Since then, Transformer-based methods for medical image segmentation have emerged, with notable works such as Swin-Unet [21], which utilizes the Swin Transformer to enhance feature representation, marking the first Transformer-based U-Net architecture. Although both CNNs and Transformers have demonstrated impressive capabilities in image segmentation tasks, each has its limitations. As a result, many studies have begun to explore ways to combine the strengths of both CNNs and Transformers.

For instance, UCTransNet [22] replaces the skip connections in U-Net with Transformer-based modules to enhance the fusion of global and local features. However, despite progress in the fusion of local and global features, these methods still face challenges in meeting the high-accuracy segmentation requirements in medical imaging.

Applications of Mamba and LSTM in medical image segmentation

In recent years, State Space Models (SSMs) have been introduced into deep learning for sequence modeling, with their parameters or mappings learned through gradient descent [23]. SSMs essentially serve as a sequence transformation method that can be effectively integrated into deep neural networks. However, due to the computational and storage demands of state representations, SSMs have not been widely adopted in practical applications. This situation changed with the introduction of the Structured State Space Model (S4), which addresses the computational and storage limitations of traditional SSMs by reparameterizing the state matrix [24]. Subsequently, Mamba [10], one of the most successful variants of SSMs, emerged, significantly enhancing the application of SSMs. Mamba not only retains modeling performance comparable to Transformers but also exhibits linear scalability, enabling it to efficiently handle long sequence data. This makes SSMs a strong competitor to Transformers.

The advantages of Mamba have led to its rapid prominence in sequence modeling and demonstrated substantial potential in medical image processing, especially when combined with U-Net, further improving medical image segmentation accuracy. For example, models like Mamba-UNet [25] and VM-UNet [26] introduced the Visual Mamba module, constructing U-Net-like architectures that significantly enhanced multi-scale feature extraction capabilities and optimized segmentation performance. The successful application of these models not only highlights Mamba’s potential in medical image segmentation but also provides new insights for the further development of deep learning models in medical image processing.

Long Short-Term Memory (LSTM) networks [27], proposed by Hochreiter and Schmidhuber in 1997, were designed to address the vanishing gradient problem encountered by traditional Recurrent Neural Networks (RNNs) when processing long sequence data. LSTMs are capable of effectively processing sequential data while retaining information from earlier steps in the sequence, making them particularly well-suited for tasks involving long-term dependencies [28]. This ability allows LSTMs to learn and remember long-term dependencies, where information from earlier time steps is crucial for predicting later steps. In the field of medical imaging, LSTMs have gradually gained significant attention and application. For example, Salehin [29] proposed an LSTM-based method for medical image classification, MedvLSTM, which shows great potential in improving the accuracy and efficiency of medical image classification. Shahzadi et al. [30], addressing the limitations of traditional Convolutional Neural Networks (CNNs) in 3D medical image classification, particularly the challenges in optimizing 3D volume classification tasks, proposed a cascaded model combining CNNs and LSTMs, CNN-LSTM, which was applied to classify brain tumor MRI images and effectively distinguish between high-grade and low-grade gliomas.

Despite the success of LSTM in medical image processing, its inherent limitations, such as limited storage capacity, inability to correct storage decisions, and lack of parallel processing capability, still constrain its further application. To overcome these issues, xLSTM [11] was introduced, addressing the shortcomings of traditional LSTMs in terms of flexibility in information storage and parallel processing of long sequences, breathing new life into LSTMs in modern AI applications. For example, the xLSTM-UNet [31] model, by incorporating xLSTM into the U-Net architecture, enhanced the model’s ability to capture sequential data and significantly improved medical image segmentation accuracy. This innovative combination not only optimized LSTM performance but also boosted the accuracy and efficiency of medical image analysis.

Methods

Architecture overview

The overall architecture of the model is shown in Fig. 1, which consists of three main components: the encoder, decoder, and skip connections. In the encoder, the input medical image is first divided into non-overlapping patches of size \(4 \times 4\), transforming the input into a sequence of embeddings. The input image has dimensions of \(H \times W \times 3\), and after passing through the patch embedding layer, the image is mapped to C channels, resulting in an embedded image of size \(H/4 \times W/4 \times C\). This embedded image is then passed into the encoder for feature extraction. The encoder consists of four stages, with patch merging operations applied at the end of the first three stages to gradually reduce spatial resolution and increase the number of channels, thereby enhancing the model’s ability to extract image features.

Specifically, the first three stages are composed of VSS blocks and BasicConv blocks, which help capture both local features and long-range dependencies in the image. The final stage consists of VSS and ViL blocks, further improving feature representation and strengthening the modeling of long-range dependencies.

The number of channels at each stage is [C, 2C, 4C, 8C], enabling the model to effectively extract and fuse features at different levels.
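As a minimal illustration of the patch embedding described above, the following PyTorch sketch (not the authors’ implementation; the base width C = 96 and the use of a stride-4 convolution are assumptions) maps an H × W × 3 image to an H/4 × W/4 × C embedding and lists the per-stage channel widths:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and embed them into C channels."""
    def __init__(self, in_ch: int = 3, embed_dim: int = 96):
        super().__init__()
        # A kernel-4, stride-4 convolution is equivalent to flattening each 4x4 patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)              # (B, C, H/4, W/4)
        x = x.permute(0, 2, 3, 1)     # (B, H/4, W/4, C)
        return self.norm(x)

C = 96                                # assumed base embedding width
stage_channels = [C, 2 * C, 4 * C, 8 * C]
feats = PatchEmbed(embed_dim=C)(torch.randn(1, 3, 256, 256))
print(feats.shape, stage_channels)    # torch.Size([1, 64, 64, 96]) [96, 192, 384, 768]
```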

Fig. 1 Overall structure of the proposed VMAXL-UNet.

The decoder is also divided into four stages. In the last three stages, patch expanding operations are applied, gradually reducing the number of feature channels and increasing the spatial resolution of the feature maps, thereby restoring image details. The number of channels in the BasicConv blocks used in the decoder stages is [8C, 4C, 2C, C], and the number of BasicConv layers at each stage is [2, 2, 2, 2].

This design helps the decoder progressively restore the image’s spatial resolution, ensuring that the output dimensions match the input image’s size before the final output. After the decoder, the Final Projection layer adjusts the feature dimensions to the appropriate size for the segmentation task, ultimately generating the segmentation result.

Skip connections play a critical role throughout the model by directly fusing features between the encoder and decoder through addition, which helps retain more detailed information and, consequently, improves segmentation accuracy.

VSS block

The VSS block is derived from VMamba [12], and its module structure is illustrated in Fig. 1. In the VSS block, the input data is first processed by Layer Normalization and then split into two streams. In the first stream, the data passes through a linear layer followed by a nonlinear transformation via an activation function. In the second stream, the data is also processed through a linear layer, followed by a \(3 \times 3\) convolutional layer and an activation function to further extract local features. The output from the second stream is then fed into the 2D Selective Scanning (SS2D) module for deeper feature extraction. The features processed by the SS2D module undergo Layer Normalization to ensure consistency in the feature distribution.

The normalized features are then element-wise multiplied with the output of the first stream, merging the information from both paths. Finally, the merged features are passed through a linear layer and combined with the original input data via a residual connection, forming the final output of the VSS block.
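The two-stream structure just described can be sketched as follows. This is an illustrative reconstruction rather than the authors’ code: the SS2D module is injected as a black box (defaulting to an identity placeholder here), and the 3 × 3 convolution is written as a depthwise convolution to match the DWConv of Eq. (14).

```python
import torch
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    """Illustrative VSS block: LayerNorm, two streams, SS2D, gating, linear projection, residual."""
    def __init__(self, dim: int, ss2d: nn.Module = None):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.gate_proj = nn.Linear(dim, dim)                      # stream 1: linear + activation
        self.in_proj = nn.Linear(dim, dim)                        # stream 2: linear + conv + activation + SS2D
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.act = nn.SiLU()
        self.ss2d = ss2d if ss2d is not None else nn.Identity()   # stand-in for the selective-scan module
        self.norm_out = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (B, H, W, C)
        z = self.norm_in(x)
        gate = self.act(self.gate_proj(z))                         # first stream
        y = self.in_proj(z).permute(0, 3, 1, 2)                    # second stream -> (B, C, H, W)
        y = self.act(self.dwconv(y)).permute(0, 2, 3, 1)           # back to (B, H, W, C)
        y = self.norm_out(self.ss2d(y))                            # SS2D output, then LayerNorm
        return self.out_proj(gate * y) + x                         # merge streams, project, residual

out = VSSBlockSketch(dim=96)(torch.randn(1, 16, 16, 96))
print(out.shape)  # torch.Size([1, 16, 16, 96])
```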

The SS2D module consists of three components: the scanning expansion operation, the S6 module, and the scanning merging operation. The scanning expansion operation, as shown in Fig. 2a, unfolds the input image into sequences along four different directions. These sequences are then processed by the S6 module for feature extraction, ensuring that information from all directions is fully scanned, thereby capturing diverse features. Subsequently, the scanning merging operation, as depicted in Fig. 2b, sums and merges the sequences from the four directions, restoring the output image to the same size as the input image. Specifically, given the input feature \(w\), the output feature \(\bar{w}\) of SS2D can be expressed as:

$$\begin{aligned} w_z &= \text{Expand}(w, z) \end{aligned}$$

(1)

$$\begin{aligned} \bar{w}_z &= S6(w_z) \end{aligned}$$

(2)

$$\begin{aligned} \bar{w} &= \text{Merge}(\bar{w}_1, \bar{w}_2, \bar{w}_3, \bar{w}_4) \end{aligned}$$

(3)

where \(z \in V = \{1, 2, 3, 4\}\) represents the four different scanning directions (as shown in Fig. 2), \(\text{Expand}(\cdot)\) and \(\text{Merge}(\cdot)\) correspond to the scanning expansion and scanning merging operations, respectively, and \(S6(\cdot)\) denotes the output after passing through the S6 module. The Selective Scanning State Space Sequence Model (S6) in Equation (2) is the core of the VSS block, responsible for processing the input sequence through a series of linear transformations and discretization processes. For more details on S6, please refer to [10].
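To make the Expand and Merge operations of Eqs. (1)–(3) concrete, the sketch below unfolds a feature map into four directional sequences (row-major, column-major, and their reverses), applies a stand-in sequence operator in place of the S6 selective scan, and merges the results by summation. The choice of these four orders and the identity stand-in for S6 are illustrative assumptions.

```python
import torch

def ss2d_sketch(w: torch.Tensor, s6) -> torch.Tensor:
    """w: (B, H, W, C). Expand into four scan directions, apply s6 to each, merge by summation."""
    B, H, W, C = w.shape
    row = w.reshape(B, H * W, C)                       # direction 1: row-major scan
    col = w.permute(0, 2, 1, 3).reshape(B, H * W, C)   # direction 2: column-major scan
    seqs = [row, row.flip(1), col, col.flip(1)]        # directions 3/4: reversed scans (Eq. 1)
    outs = [s6(s) for s in seqs]                       # Eq. (2): per-direction processing

    def col_to_row(s):                                 # undo the column-major ordering
        return s.reshape(B, W, H, C).permute(0, 2, 1, 3).reshape(B, H * W, C)

    merged = outs[0] + outs[1].flip(1) + col_to_row(outs[2]) + col_to_row(outs[3].flip(1))  # Eq. (3)
    return merged.reshape(B, H, W, C)

# Example with a trivial stand-in for S6; a real implementation would use Mamba's selective scan.
print(ss2d_sketch(torch.randn(2, 8, 8, 4), s6=lambda s: s).shape)  # torch.Size([2, 8, 8, 4])
```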

Fig. 2 The left part is the SS2D scan expanding operation, and the right part is the SS2D scan merging operation.

ViL Block

For the ViL module (as shown in Fig. 1), the input information first undergoes Layer Normalization to standardize the data distribution and stabilize the training process. The input is then split into two streams. In the first stream, the data is passed through a linear layer for linear transformation, followed by a SiLU activation function for nonlinear processing, enhancing the model’s expressive power.

In the second stream, the data is also processed through a linear layer, followed by a convolutional layer to capture local features, with an activation function further enriching the feature representations. Subsequently, the output from the second stream is fed into the mLSTM module to capture long-range dependencies and further extract features.

After processing by the mLSTM module, the extracted features undergo Layer Normalization again to ensure consistency in feature distribution. The normalized features are then element-wise multiplied with the output from the first stream, effectively integrating the information from both paths. Finally, the merged features are processed through a linear layer and combined with the original input data via a residual connection, resulting in the final output of the ViL module.

This design not only captures local features but also effectively models long-range dependencies, thereby enhancing the overall feature representation capability.

mLSTM (Matrix LSTM), derived from xLSTM [11], significantly enhances the model’s memory and parallel processing capabilities by extending the vector operations in traditional LSTM to matrix operations. In mLSTM, each state is no longer represented by a single vector but by a matrix. This design allows it to capture more complex data relationships and patterns within a single time step.

Additionally, mLSTM employs the FlashAttention mechanism, which dynamically guides the updating process of cell states and normalized states through the interaction of queries, keys, and values, ultimately generating the final hidden layer output. This design not only improves the model’s ability to model complex data patterns but also significantly boosts computational efficiency, as illustrated in Fig. 3. Specifically, the mLSTM layer first performs linear projections on the query, key, and value vectors:

$$\begin{aligned} \text{Query Input: } q_t &= W_q x_t + b_q \end{aligned}$$

(4)

$$\begin{aligned} \text{Key Input: } k_t &= \frac{1}{\sqrt{d}} W_k x_t + b_k \end{aligned}$$

(5)

$$\begin{aligned} \text{Value Input: } v_t &= W_v x_t + b_v \end{aligned}$$

(6)

where \(x_t\) denotes the input vector, \(W_q\), \(W_k\), and \(W_v\) are the corresponding mapping (or weight) matrices, and \(b_q\), \(b_k\), and \(b_v\) are the corresponding bias terms.

Fig. 3 mLSTM model diagram.

mLSTM uses input gates and forget gates to control memory updates, and employs exponential gating (with a sigmoid forget gate as an alternative, as in Eq. (8)) to facilitate matrix memory computations:

$$\begin{aligned} \text{Input Gate: } i_t &= \exp(\tilde{i}_t), \quad \tilde{i}_t = w_i^T x_t + b_i \end{aligned}$$

(7)

$$\begin{aligned} \text{Forget Gate: } f_t &= \sigma(\tilde{f}_t) \text{ OR } \exp(\tilde{f}_t), \quad \tilde{f}_t = w_f^T x_t + b_f \end{aligned}$$

(8)

where \(w_i^T\), \(w_f^T\), \(b_i\), and \(b_f\) denote the weight vectors and bias terms corresponding to the input gate and forget gate, respectively, \(\sigma\) represents the activation function, and \(\exp(\cdot)\) denotes the exponential operation.

mLSTM extends the memory cell to a matrix and combines the update mechanism of LSTM with the information retrieval scheme from Transformer, introducing an integrated attention-based cell state and hidden state update mechanism, allowing memory extraction from different time steps:

$$\begin{aligned} \text{Cell State: } C_t &= f_t C_{t-1} + i_t v_t k_t^T \end{aligned}$$

(9)

$$\begin{aligned} \text{Normalizer State: } n_t &= f_t n_{t-1} + i_t k_t \end{aligned}$$

(10)

$$\begin{aligned} \text{Output Gate: } o_t &= \sigma(\tilde{o}_t), \quad \tilde{o}_t = W_o x_t + b_o \end{aligned}$$

(11)

$$\begin{aligned} \text{Hidden State: } h_t &= o_t \odot \tilde{h}_t, \quad \tilde{h}_t = \frac{C_t q_t}{\max\{|n_t^T q_t|, 1\}} \end{aligned}$$

(12)

The cell state is updated using a weighted sum according to the ratio, where the forget gate corresponds to the weighted proportion of memory, and the input gate corresponds to the weighted proportion of key-value pairs, satisfying the covariance-based update rule. mLSTM adopts a normalizer that weights the key vectors.

Finally, through normalization and weighted processing controlled by the output gate, the network’s hidden state \(h_t\) is obtained.

For more details on mLSTM, please refer to [11].
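A purely sequential reading of Eqs. (4)–(12) is sketched below for a single time step; the real xLSTM implementation is chunked and parallelized, uses multiple heads, and adds numerical-stabilization terms, all of which are omitted here, and the weight names are illustrative only.

```python
import torch

def mlstm_step(x_t, state, p):
    """One mLSTM recurrence step following Eqs. (4)-(12). state = (C, n); p holds weights and biases."""
    C_prev, n_prev = state
    d = p["W_q"].shape[0]

    q = p["W_q"] @ x_t + p["b_q"]                              # Eq. (4)
    k = (p["W_k"] @ x_t) / d ** 0.5 + p["b_k"]                 # Eq. (5)
    v = p["W_v"] @ x_t + p["b_v"]                              # Eq. (6)

    i = torch.exp(p["w_i"] @ x_t + p["b_i"])                   # Eq. (7): exponential input gate
    f = torch.sigmoid(p["w_f"] @ x_t + p["b_f"])               # Eq. (8): sigmoid variant of the forget gate
    o = torch.sigmoid(p["W_o"] @ x_t + p["b_o"])               # Eq. (11)

    C = f * C_prev + i * torch.outer(v, k)                     # Eq. (9): matrix cell state
    n = f * n_prev + i * k                                     # Eq. (10): normalizer state
    h_tilde = (C @ q) / torch.clamp((n @ q).abs(), min=1.0)    # Eq. (12)
    return o * h_tilde, (C, n)

d = 8
p = {name: torch.randn(d, d) for name in ("W_q", "W_k", "W_v", "W_o")}
p.update({name: torch.randn(d) for name in ("b_q", "b_k", "b_v", "b_o", "w_i", "w_f")})
p.update({"b_i": torch.tensor(0.0), "b_f": torch.tensor(0.0)})
h, state = mlstm_step(torch.randn(d), (torch.zeros(d, d), torch.zeros(d)), p)
print(h.shape)  # torch.Size([8])
```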

BasicConv block

Each basic residual block consists of a convolutional layer (Conv), instance normalization (Instance Normalization), and a ReLU activation function, as shown in Fig. 1. This paper uses a \(3 \times 3\) convolution kernel, stride of 1, and padding of 1. Each sub-module adopts residual connections, adding the input directly to the output, thereby forming residual learning. Formally, given an input image \(W_0 = I \in \mathbb{R}^{H_0 \times W_0 \times C_0}\), the final output is:

$$\begin{aligned} Y_2 = W_0 + \text{Conv}(\text{ReLU}(\text{IN}(W_0))) + \text{Conv}(\text{ReLU}(\text{IN}(Y_1))) \end{aligned}$$

(13)

where \(Y_1\) represents the output of the first submodule, \(Y_2\) is the output feature map of the second submodule, \(\text{Conv}(\cdot)\) denotes the convolution operation output, \(\text{IN}(\cdot)\) indicates the Instance Normalization operation, and \(\text{ReLU}(\cdot)\) refers to the Rectified Linear Unit (ReLU) activation function.

In the decoder, the upsampled feature maps undergo convolution operations, with residual connections preserving the input features. This process further refines the feature information, aids in recovering high-quality details, and helps mitigate the potential vanishing gradient problem during information propagation.
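Following Eq. (13), a minimal PyTorch sketch of the BasicConv block could look like the following; the pre-activation ordering (IN, then ReLU, then 3 × 3 Conv) follows the equation, and the exact channel handling is an assumption.

```python
import torch
import torch.nn as nn

class BasicConvBlock(nn.Module):
    """Two residual sub-modules, each IN -> ReLU -> 3x3 Conv, matching Eq. (13)."""
    def __init__(self, channels: int):
        super().__init__()
        def submodule():
            return nn.Sequential(
                nn.InstanceNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            )
        self.sub1 = submodule()
        self.sub2 = submodule()

    def forward(self, w0: torch.Tensor) -> torch.Tensor:
        y1 = w0 + self.sub1(w0)   # first residual sub-module
        y2 = y1 + self.sub2(y1)   # expands to Eq. (13): W0 + Conv(ReLU(IN(W0))) + Conv(ReLU(IN(Y1)))
        return y2

print(BasicConvBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```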

Encoder

Given an input image \(Z_0 \in \mathbb{R}^{H \times W \times 3}\), the encoder progressively compresses the input image information through multi-level feature extraction and downsampling operations. It achieves efficient fusion of global and local information via the VSS and ViL modules. Specifically, the VSS module utilizes the Structured State Space Model (SSM) for global information modeling, and incorporates depthwise separable convolutions (DW-CNN) to further optimize the features. The output of the VSS block is then given by:

$$\begin{aligned} F_{\text{VSS}} = \text{Linear}(\text{SS2D}(\text{DWConv}(Z))) \end{aligned}$$

(14)

where \(Z\) represents the input image, \(\text{DWConv}(\cdot)\) denotes the output of the depthwise separable convolution (DW-CNN) operation, \(\text{SS2D}(\cdot)\) represents the output after passing through the 2D selective scanning (SS2D) module, and \(\text{Linear}(\cdot)\) indicates a linear transformation operation used to map and reorganize the channel dimension.

The ViL module enhances the modeling ability of local details and edge information through mLSTM, making it particularly suitable for capturing lesion boundary information. The input data is first normalized as follows:

$$\begin{aligned} Z_1 = \text{LayerNorm}(Z) \end{aligned}$$

(15)

The features from the two paths are fused as follows:

$$\begin{aligned} P_{\text{fusion}} = \text{SiLU}(\text{Linear}(Z_1)) \odot \text{LayerNorm}(\text{mLSTM}(\text{Conv}(\text{Linear}(Z_1)))) \end{aligned}$$

(16)

Finally, the output of the ViL block is:

$$\begin{aligned} F_{\text{ViL}} = \text{Linear}(P_{\text{fusion}}) + Z \end{aligned}$$

(17)

where \(Z\) represents the input image, \(\text{LayerNorm}(\cdot)\) denotes the normalization of feature maps, \(\text{SiLU}(\cdot)\) represents the nonlinear activation function, and \(\text{Linear}(\cdot)\) refers to the linear transformation operation applied to map and reorganize the channel dimensions. The symbol \(\odot\) denotes element-wise multiplication, \(\text{Conv}(\cdot)\) represents the output of the convolution operation, and \(\text{mLSTM}(\cdot)\) represents the output after processing through the mLSTM module.
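A compact sketch of this ViL computation (Eqs. (15)–(17)) is given below. The mLSTM is treated as an injected sequence module operating on (B, L, C) tensors (an identity placeholder by default), and the 1D convolution along the token sequence is an assumed realization of the Conv term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViLBlockSketch(nn.Module):
    """Illustrative ViL block: Eq. (15) LayerNorm, Eq. (16) two-stream fusion, Eq. (17) residual output."""
    def __init__(self, dim: int, mlstm: nn.Module = None):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gate_proj = nn.Linear(dim, dim)
        self.in_proj = nn.Linear(dim, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)      # local features along the sequence
        self.mlstm = mlstm if mlstm is not None else nn.Identity()     # stand-in for the matrix LSTM
        self.norm_m = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:                # z: (B, L, C)
        z1 = self.norm(z)                                              # Eq. (15)
        gate = F.silu(self.gate_proj(z1))
        y = self.conv(self.in_proj(z1).transpose(1, 2)).transpose(1, 2)
        y = self.norm_m(self.mlstm(y))
        p_fusion = gate * y                                            # Eq. (16)
        return self.out_proj(p_fusion) + z                             # Eq. (17)

print(ViLBlockSketch(dim=32)(torch.randn(2, 64, 32)).shape)  # torch.Size([2, 64, 32])
```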

In the final layer of the encoder, a gating mechanism is employed to adaptively fuse the VSS and ViL features, thereby balancing global and local information. The gating mechanism is defined as:

$$\begin{aligned} g = \sigma(W_g [F_{\text{VSS}}, F_{\text{ViL}}] + b_g) \end{aligned}$$

(18)

where \(\sigma\) is the Sigmoid function, and \(W_g\) and \(b_g\) are learnable parameters. Thus, the fused feature representation is given by:

$$\begin{aligned} F_{\text{final}} = g \odot F_{\text{VSS}} + (1 - g) \odot F_{\text{ViL}} \end{aligned}$$

(19)

where \(\odot\) denotes element-wise multiplication.
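In code, the gate of Eqs. (18)–(19) reduces to a channel-wise sigmoid over the concatenated VSS and ViL features; a minimal sketch, assuming both feature maps share the shape (B, H, W, C), is:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptive fusion of VSS and ViL features, Eqs. (18)-(19)."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_g = nn.Linear(2 * dim, dim)   # W_g with learnable bias b_g

    def forward(self, f_vss: torch.Tensor, f_vil: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.w_g(torch.cat([f_vss, f_vil], dim=-1)))   # Eq. (18)
        return g * f_vss + (1.0 - g) * f_vil                              # Eq. (19)

fused = GatedFusion(96)(torch.randn(1, 8, 8, 96), torch.randn(1, 8, 8, 96))
print(fused.shape)  # torch.Size([1, 8, 8, 96])
```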

Decoder

The decoder progressively restores the multi-level features extracted by the encoder to the original image resolution through upsampling and skip connections, enabling precise segmentation. Each layer of the decoder receives multi-scale features from the encoder through skip connections, enhancing the model’s ability to recover segmentation details.

The effective transmission and reconstruction of features are ensured through the BasicConv block:

$$\begin{aligned} F_{\text{decode}} = \text{BasicConv}(\text{PatchExpand}(F_{\text{up}}) + F_{\text{skip}}) \end{aligned}$$

(20)

where \(F_{\text{skip}}\) represents the skip connection feature map from the encoder, \(\text{PatchExpand}(\cdot)\) denotes the upsampling operation applied to the input feature map to increase its spatial resolution, \(F_{\text{up}}\) is the upsampled feature map from the previous decoder layer, and \(\text{BasicConv}(\cdot)\) represents the output after passing through the BasicConv block.
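One decoder stage of Eq. (20) can be sketched as follows. The patch-expanding operator is approximated with a 1 × 1 convolution followed by PixelShuffle (one common way to realize 2× up-sampling, assumed here), and a small convolutional stand-in replaces the full BasicConv block sketched earlier.

```python
import torch
import torch.nn as nn

class PatchExpandSketch(nn.Module):
    """2x spatial up-sampling via channel expansion + PixelShuffle (an assumed PatchExpand realization)."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=1)   # 2*in_ch = (in_ch // 2) * 2 * 2
        self.shuffle = nn.PixelShuffle(2)                         # -> (B, in_ch // 2, 2H, 2W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.proj(x))

class DecoderStageSketch(nn.Module):
    """Eq. (20): F_decode = BasicConv(PatchExpand(F_up) + F_skip)."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.expand = PatchExpandSketch(in_ch)
        self.conv = nn.Sequential(                                # stand-in for the BasicConv block
            nn.InstanceNorm2d(in_ch // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // 2, in_ch // 2, kernel_size=3, padding=1),
        )

    def forward(self, f_up: torch.Tensor, f_skip: torch.Tensor) -> torch.Tensor:
        return self.conv(self.expand(f_up) + f_skip)

out = DecoderStageSketch(in_ch=192)(torch.randn(1, 192, 16, 16), torch.randn(1, 96, 32, 32))
print(out.shape)  # torch.Size([1, 96, 32, 32])
```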

Experiments and results

Datasets

ISIC dataset

In this study, we selected two skin lesion segmentation datasets, namely ISIC17 [13] and ISIC18 [14]. These datasets contain a large number of high-quality annotated skin lesion images, with 2150 and 2694 labeled images, respectively. We split the data into training and testing sets at a ratio of 7:3. Specifically, for the ISIC17 dataset, 1505 images were used for training, and 645 images were used for testing. For the ISIC18 dataset, 1886 images were allocated to the training set, while 808 images were used for testing.

Polyp segmentation dataset

For the polyp segmentation task, we utilized two common endoscopic image datasets, Kvasir-SEG [15] and ClinicDB [16]. These datasets consist of high-definition endoscopic images primarily obtained from colonoscopy and gastroscopy procedures. The Kvasir-SEG dataset contains 1000 labeled images, while the ClinicDB dataset includes 612 labeled images. Following the experimental setup of PraNet [32], we adopted a separate training and testing strategy: 900 images from the Kvasir-SEG dataset and 550 images from the ClinicDB dataset were used for training, with the remaining images allocated to the testing set.

Experimental details

The experiment was carried out on an Ubuntu 22.04 system, utilizing an environment equipped with Python 3.10.4, PyTorch 2.3.0, and CUDA 11.8. All experimental tasks were executed on a single NVIDIA A10 GPU. We resized all images in the datasets to \(256 \times 256\) pixels and employed data augmentation techniques such as random flipping and random rotation to prevent overfitting. Regarding operational parameters, the batch size was set to 32, and the optimizer used was AdamW [33] with an initial learning rate of 2.3e-4. We also utilized CosineAnnealingLR [34] as the learning rate scheduler, with a maximum of 50 epochs and a minimum learning rate of 1e-5, and the training cycle was set to 500 iterations. During model training, the weights of the encoder were initialized using VMamba-S [12] weights pretrained on ImageNet-1k.
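For reference, the optimizer and scheduler settings above translate directly into standard PyTorch calls; the sketch below uses a placeholder model and omits the data pipeline, so it is a configuration illustration rather than the authors’ training script.

```python
import torch
from torch import nn, optim

model = nn.Conv2d(3, 1, kernel_size=1)   # placeholder; the actual network is VMAXL-UNet

optimizer = optim.AdamW(model.parameters(), lr=2.3e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)

for epoch in range(50):                              # one cosine annealing cycle
    batch = torch.randn(32, 3, 256, 256)             # batch size 32, 256x256 inputs
    loss = model(batch).mean()                       # placeholder loss; the paper uses BCE + Dice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # anneal the learning rate once per epoch
```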

Loss function

Based on the characteristics of binary cross-entropy and the Dice similarity coefficient, and considering that all our dataset masks consist of two classes (target and background), we designed a hybrid loss function with \(\beta_1 = 1\) and \(\beta_2 = 1\). This ensures that the loss function effectively distinguishes between the target and background while maintaining a balanced treatment of both classes. The formula is as follows:

$$\begin{aligned} L_{\text{BceDice}} &= \beta_1 L_{\text{Bce}} + \beta_2 L_{\text{Dice}} \end{aligned}$$

(21)

$$\begin{aligned} L_{\text{Bce}} &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \end{aligned}$$

(22)

$$\begin{aligned} L_{\text{Dice}} &= 1 - \frac{2 \sum_{i=1}^{N} y_i p_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i} \end{aligned}$$

(23)

where \(N\) represents the total number of samples, and \(y_i\) and \(p_i\) denote the true label and the predicted value for pixel \(i\), respectively. \(L_{\text{Bce}}\) represents the binary cross-entropy loss, which measures the difference between the model’s predictions and the true labels. \(L_{\text{Dice}}\) represents the Dice loss, used to assess the overlap between the model’s predicted segmentation and the true label. \(\beta_1\) and \(\beta_2\) are two weighting factors that control the relative importance of the binary cross-entropy loss (\(L_{\text{Bce}}\)) and the Dice loss (\(L_{\text{Dice}}\)).
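A direct PyTorch rendering of Eqs. (21)–(23) with β1 = β2 = 1 is sketched below; the small epsilon added to the Dice denominator is a numerical-safety choice not stated in the text.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor,
                  beta1: float = 1.0, beta2: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Hybrid loss of Eq. (21): beta1 * BCE (Eq. 22) + beta2 * Dice (Eq. 23)."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy(prob, target)                        # Eq. (22), averaged over pixels
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter) / (prob.sum() + target.sum() + eps)    # Eq. (23)
    return beta1 * bce + beta2 * dice

loss = bce_dice_loss(torch.randn(4, 1, 256, 256),
                     torch.randint(0, 2, (4, 1, 256, 256)).float())
print(loss.item())
```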

Results

We compare the proposed VMAXL-UNet model with several state-of-the-art methods, including CNN-based UNet and EGEUNet models, Transformer-based Swin-Unet, Mamba-based VM-UNet, the recently proposed xLSTM-based xLSTM-UNet, and MLP-based U-KAN. In the experiments, we use Intersection over Union (IoU) and Dice coefficient as evaluation metrics.

The evaluation results on the ISIC datasets are shown in Table 1, while the results on the selected polyp segmentation datasets are presented in Table 2. The results demonstrate that our VMAXL-UNet model outperforms other methods across all four datasets. Notably, on the Kvasir-SEG and ClinicDB datasets, VMAXL-UNet achieves significant advantages. These two datasets contain many targets with blurry boundaries, making it difficult to distinguish them from the background, which highlights VMAXL-UNet’s ability to effectively capture long-range dependencies to enhance segmentation performance.

In addition to the accuracy advantage, we further demonstrate the efficiency of our model as a network baseline. As shown in Table 3, we present the computational complexity (FLOPs) and the number of parameters (Params), along with the average segmentation accuracy, across the four datasets. The experimental results indicate that our proposed model not only surpasses most segmentation methods in terms of accuracy but also demonstrates considerable efficiency. Overall, in the trade-off between segmentation accuracy and efficiency, our method exhibits the best performance.

Table 1 Performance comparison of different models on ISIC17 and ISIC18 datasets.

Table 2 Performance comparison of different models on Kvasir-SEG and ClinicDB datasets.

Table 3 Performance and efficiency comparison of different models.

To further validate the segmentation performance of our model, we conducted a visual analysis of the segmentation results on the ISIC17, ISIC18, Kvasir-SEG, and ClinicDB datasets, as shown in Figs. 4 and 5. From the visual results, it is evident that the proposed VMAXL-UNet model significantly outperforms the other comparative models in terms of segmentation quality. Even when processing small objects, VMAXL-UNet not only accurately localizes the target regions but also generates coherent and clear boundaries.

These results strongly demonstrate that VMAXL-UNet excels in both local feature extraction and global context modeling. The model’s outstanding performance in fine-grained segmentation tasks further validates its practicality and robustness in complex medical image segmentation scenarios.

Fig. 4 Comparison of visualized experimental results on the Kvasir-SEG and ClinicDB data sets.

Fig. 5 Comparison of visualized experimental results on the ISIC17 and ISIC18 data sets.

Ablation studies

To investigate the impact of various factors on model performance and to validate the effectiveness of the proposed model, a comprehensive ablation study was conducted on the ISIC17 dataset. As shown in Tables 4 and 5, two experiments were designed and implemented.

Table 4 Ablation study results across different modules.

Table 5 Ablation study results with different numbers of blocks.

In the first experiment, we modified the encoder of the model and constructed three variants: Model 1 consists only of BasicConv blocks; Model 2 is composed of BasicConv and VSS blocks; and Model 3 is the original VMAXL-UNet. The experimental results show that Model 1, which uses only BasicConv blocks, already outperforms the traditional U-Net, demonstrating the inherent advantage of the BasicConv block. Further incorporating the VSS and ViL modules (Model 3) significantly enhances overall performance and achieves the best segmentation accuracy.

This result fully confirms the substantial contribution of the VSS and ViL modules to model performance.

In the second experiment, we further explored the impact of the number of VSS+ViL module blocks in the encoder on model performance. Three scenarios were tested, with the number of VSS+ViL blocks set to 0, 1, and 2, respectively. When the number of VSS+ViL blocks was 0, the encoder was composed solely of BasicConv and VSS blocks.

The experimental results showed that model performance does not increase monotonically with the number of VSS+ViL blocks. Specifically, when the number of VSS+ViL blocks was 1, the model achieved higher segmentation accuracy, indicating that a moderate number of VSS+ViL blocks can effectively enhance model performance.

The results of these experiments not only validate the importance of the VSS and ViL modules but also provide valuable insights for determining the optimal configuration of module quantities in model design.

Attention visualization for comparative analysis

To further validate the design rationale and practical application potential of VMAXL-UNet, we present the attention heatmaps of VMAXL-UNet and its two variant models (Model 1 and Model 2) on the medical image segmentation task in Fig. 6. These heatmaps visually reflect the model’s attention to key regions in the input images. The input images contain complex intestinal structures and lesion areas, indicating a high level of difficulty in the segmentation task; the ground truth represents manually annotated lesion regions, serving as the standard reference for the segmentation results.

The high-response regions (indicated in red) of VMAXL-UNet precisely cover the lesion areas, showing a high degree of consistency with the ground truth. This is attributed to the integration of the SSM and xLSTM modules in the encoder: the former enhances the model’s attention to detail regions by efficiently fusing multi-scale features, while the latter improves the model’s ability to capture global context information.

In contrast, the high-response regions of the variant Model 1 (which does not integrate the SSM and xLSTM modules in the encoder, instead using BasicConv blocks) exhibit some diffusion, making it difficult to accurately focus on the lesion boundaries, potentially leading to blurry boundaries in the segmentation results.

The attention distribution of variant Model 2 (which integrates only the SSM module in the encoder) is more dispersed, with some high-response areas deviating from the lesion regions, indicating insufficient adaptability in complex scenarios. These comparative results demonstrate that VMAXL-UNet outperforms the variant models in capturing lesion details and boundary information, thereby fully validating its design effectiveness and practical potential in medical image segmentation tasks.

Fig. 6 Attention heatmap comparison of VMAXL-UNet and its variants.

Conclusions

This study presents VMAXL-UNet, a novel UNet variant based on the State Space Model (SSM) and an enhanced Long Short-Term Memory network (xLSTM). By introducing the Visual State Space (VSS) module and the ViL module, VMAXL-UNet demonstrates significant performance improvements in medical image segmentation tasks.

Specifically, the VSS module leverages visual state space modeling to enable the model to focus more effectively on critical features, while the ViL module enhances the modeling of sequential dependencies in complex image structures. Additionally, the incorporation of VMamba pre-trained weights accelerates model convergence and improves initial performance.

Experiments on several medical image datasets, such as dermatological and polyp segmentation tasks, demonstrate that VMAXL-UNet outperforms or matches existing state-of-the-art models across multiple evaluation metrics. However, there is still room for optimization in terms of computational efficiency.

Future work could explore more lightweight network architectures or improved training strategies to reduce computational costs. Additionally, the design of VMAXL-UNet is highly versatile, and further research will investigate its potential in 3D medical image segmentation and other image processing tasks, such as organ segmentation and tumor detection, to assess its applicability and generalization capabilities.

Data availability

The data used in this study are publicly accessible. The ISIC17 and ISIC18 datasets are sourced from the ISIC Challenge and can be accessed at https://challenge.isic-archive.com/data/. The Kvasir-SEG and ClinicDB datasets can be accessed through their official websites at https://datasets.simula.no/kvasir-seg/ and https://opendatalab.org.cn/OpenDataLab/CVC-ClinicDB, respectively.

References

1. Muksimova, S., Umirzakova, S., Mardieva, S. & Cho, Y.-I. Enhancing medical image denoising with innovative teacher-student model-based approaches for precision diagnostics. Sensors 23, 9502 (2023).

2. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 234–241 (Springer, 2015).

3. Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA 2018 / ML-CDS 2018), 3–11 (Springer, 2018).

4. Peng, Y., Sonka, M. & Chen, D. Z. U-Net v2: Rethinking the skip connections of U-Net for medical image segmentation. Preprint at arXiv:2311.17791 (2023).

5. Huang, H. et al. UNet 3+: A full-scale connected UNet for medical image segmentation. In ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1055–1059 (IEEE, 2020).

6. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at arXiv:2010.11929 (2020).

7. Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proc. IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).

8. Touvron, H., Cord, M. & Jégou, H. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, 516–533 (Springer, 2022).

9. Chen, J. et al. TransUNet: Transformers make strong encoders for medical image segmentation. Preprint at arXiv:2102.04306 (2021).

10. Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Preprint at arXiv:2312.00752 (2023).

11. Beck, M. et al. xLSTM: Extended long short-term memory. Preprint at arXiv:2405.04517 (2024).

12. Zhu, L. et al. Vision Mamba: Efficient visual representation learning with bidirectional state space model. Preprint at arXiv:2401.09417 (2024).

13. Berseth, M. ISIC 2017 – Skin lesion analysis towards melanoma detection. Preprint at arXiv:1703.00523 (2017).

14. Codella, N. et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). Preprint at arXiv:1902.03368 (2019).

15. Jha, D. et al. Kvasir-SEG: A segmented polyp dataset. In MultiMedia Modeling – MMM 2020, 451–462 (Springer, 2020).

16. Bernal, J. et al. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43, 99–111 (2015).

17. Oktay, O. et al. Attention U-Net: Learning where to look for the pancreas. Preprint at arXiv:1804.03999 (2018).

18. Iqbal, A. & Sharif, M. UNet: A semi-supervised method for segmentation of breast tumor images using a U-shaped pyramid-dilated network. Expert Syst. Appl. 221, 119718 (2023).

19. Heinrich, M. P., Stille, M. & Buzug, T. M. Residual U-Net convolutional neural network architecture for low-dose CT denoising. Curr. Direct. Biomed. Eng. 4, 297–300 (2018).

20. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. (2017).

21. Cao, H. et al. Swin-UNet: UNet-like pure Transformer for medical image segmentation. In European Conference on Computer Vision, 205–218 (Springer, 2022).

22. Wang, H., Cao, P., Wang, J. & Zaiane, O. R. UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with Transformer. Proc. AAAI Conf. Artif. Intell. 36, 2441–2449 (2022).

23. Gu, A. et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 34, 572–585 (2021).

24. Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. Preprint at arXiv:2111.00396 (2021).

25. Wang, Z., Zheng, J.-Q., Zhang, Y., Cui, G. & Li, L. Mamba-UNet: UNet-like pure visual Mamba for medical image segmentation. Preprint at arXiv:2402.05079 (2024).

26. Ruan, J. & Xiang, S. VM-UNet: Vision Mamba UNet for medical image segmentation. Preprint at arXiv:2402.02491 (2024).

27. Hochreiter, S. & Schmidhuber, J. LSTM can solve hard long time lag problems. Adv. Neural Inf. Process. Syst. 9 (1996).

28. Al-Selwi, S. M., Hassan, M. F., Abdulkadir, S. J. & Muneer, A. LSTM inefficiency in long-term dependencies regression problems. J. Adv. Res. Appl. Sci. Eng. Technol. 30, 16–31 (2023).

29. Salehin, I. et al. Real-time medical image classification with ML framework and dedicated CNN-LSTM architecture. J. Sensors 2023, 3717035 (2023).

30. Shahzadi, I., Tang, T. B., Meriadeau, F. & Quyyum, A. CNN-LSTM: Cascaded framework for brain tumour classification. In 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), 633–637 (IEEE, 2018).

31. Chen, T. et al. xLSTM-UNet can be an effective 2D & 3D medical image segmentation backbone with Vision-LSTM (ViL) better than its Mamba counterpart. Preprint at arXiv:2407.01530 (2024).

32. Fan, D.-P. et al. PraNet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 263–273 (Springer, 2020).

33. Loshchilov, I. Decoupled weight decay regularization. Preprint at arXiv:1711.05101 (2017).

34. Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. Preprint at arXiv:1608.03983 (2016).

35. Li, C. et al. U-KAN makes strong backbone for medical image segmentation and generation. Preprint at arXiv:2406.02918 (2024).

36. Ruan, J., Xie, M., Gao, J., Liu, T. & Fu, Y. EGE-UNet: An efficient group enhanced UNet for skin lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 481–490 (Springer, 2023).

Funding

This work was supported by the Funds for Central-Guided Local Science & Technology Development (Grant No. 202407AC110005), "Key Technologies for the Construction of a Whole-Process Intelligent Service System for Neuroendocrine Neoplasm".

Author information

Authors and Affiliations

School of Information Science and Engineering, Yunnan University, Yunnan, 650504, China

Xin Zhong, Gehao Lu & Hao Li

Contributions

Methodology, Software, Validation: X.Z.; Funding acquisition, Resources, Supervision: G.L.; Methodology, Writing-Review: H.L.

Corresponding author

Correspondence to Gehao Lu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access

This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material.

You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhong, X., Lu, G. & Li, H. Vision Mamba and xLSTM-UNet for medical image segmentation. Sci Rep 15, 8163 (2025). https://doi.org/10.1038/s41598-025-88967-5

Received: 12 November 2024

Accepted: 03 February 2025

Published: 10 March 2025

DOI: https://doi.org/10.1038/s41598-025-88967-5

Keywords

Deep Learning

Medical Image Segmentation

SSM

xLSTM