Table of Contents
Samba: A Unified Mamba-based Framework for General Salient Object Detection
Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
Exploiting Temporal State Space Sharing for Video Semantic Segmentation
Event-based Video Super-Resolution via State Space Models
LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation
Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation
MambaOut: Do We Really Need Mamba for Vision?
M3amba: Memory Mamba is All You Need for Whole Slide Image Classification
TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image
Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
Mamba-Reg: Vision Mamba Also Needs Registers
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Models
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
GlobalMamba: Global Image Serialization for Vision Mamba
Vision Mamba: A Comprehensive Survey and Taxonomy
Samba: A Unified Mamba-based Framework for General Salient Object Detection
Abstract:
Existing salient object detection (SOD) models primarily resort to convolutional neural networks (CNNs) and Transformers. However, the limited receptive fields of CNNs and the quadratic computational complexity of Transformers both constrain the performance of current models in discovering attention-grabbing objects. The emerging state space model, namely Mamba, has demonstrated its potential to balance global receptive fields and computational complexity. Therefore, we propose a novel unified framework based on the pure Mamba architecture, dubbed saliency Mamba (Samba), to flexibly handle general SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), and RGB-D VSOD. Specifically, we rethink Mamba's scanning strategy from the perspective of SOD, and identify the importance of maintaining spatial continuity of salient patches within scanning sequences. Based on this, we propose a saliency-guided Mamba block (SGMB), incorporating a spatial neighboring scanning (SNS) algorithm to preserve the spatial continuity of salient patches. Additionally, we propose a context-aware upsampling (CAU) method to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. Experimental results show that our Samba outperforms existing methods across five SOD tasks on 21 datasets with lower computational cost, confirming the superiority of introducing Mamba to the SOD field. Our code will be made publicly available.
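The SNS algorithm itself is not spelled out in the abstract, so the snippet below only illustrates the underlying idea of a spatially continuous scan: a simple snake (boustrophedon) ordering keeps every pair of consecutive tokens spatially adjacent, unlike plain row-major flattening. This is a minimal sketch, not Samba's actual saliency-guided scan.

```python
import torch

def snake_scan_order(h: int, w: int) -> torch.Tensor:
    """Return a permutation of patch indices in which consecutive tokens
    are always spatially adjacent (boustrophedon scan). Only an
    illustration of 'spatial continuity'; the actual SNS algorithm in
    Samba is saliency-guided and more elaborate."""
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)   # reverse every other row
    return idx.flatten()

# Usage: reorder patch tokens (B, H*W, C) before feeding a Mamba block.
B, H, W, C = 2, 8, 8, 96
tokens = torch.randn(B, H * W, C)
order = snake_scan_order(H, W)
scanned = tokens[:, order]           # spatially continuous sequence
restored = torch.empty_like(scanned)
restored[:, order] = scanned         # undo the permutation afterwards
```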
Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
Abstract:
Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves superior perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
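The contrastive objective for degradation-insensitive features is only described at a high level; below is a minimal InfoNCE-style sketch under the assumption that two differently degraded views of the same LR clip form a positive pair and the other clips in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Minimal InfoNCE loss: z1[i] and z2[i] are embeddings of two
    differently degraded views of the same LR clip (positives); the rest
    of the batch acts as negatives. Hypothetical stand-in for the
    paper's contrastive objective."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# Usage: encode two random degradations of each clip with the same encoder.
B, D = 8, 256
loss = info_nce(torch.randn(B, D), torch.randn(B, D))
```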
Exploiting Temporal State Space Sharing for Video Semantic Segmentation
Abstract:
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code will be released.
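The selective gating that shares state across frames can be pictured with a toy recurrent block: each spatial patch carries a single state vector forward in time, and a learned gate decides how much of it to overwrite per frame. This is a hypothetical minimal sketch, not the actual TV3S block.

```python
import torch
import torch.nn as nn

class TemporalStateShare(nn.Module):
    """Toy sketch of per-patch temporal state sharing with a selective
    gate (not the exact TV3S design): the state is carried from frame to
    frame, so no feature pool of past frames needs to be stored."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, B, N, C) = time, batch, patches, channels
        state = torch.zeros_like(frames[0])
        outputs = []
        for x in frames:                              # sequential over time only
            g = torch.sigmoid(self.gate(torch.cat([x, state], dim=-1)))
            state = g * torch.tanh(self.update(torch.cat([x, state], dim=-1))) + (1 - g) * state
            outputs.append(state)
        return torch.stack(outputs)

# Usage
y = TemporalStateShare(64)(torch.randn(4, 2, 196, 64))  # (T=4, B=2, N=196, C=64)
```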
Event-based Video Super-Resolution via State Space Models
Abstract:
Exploiting temporal correlations is crucial for video super-resolution (VSR). Recent approaches enhance this by incorporating event cameras. In this paper, we introduce MamEVSR, a Mamba-based network for event-based VSR that leverages the selective state space model, Mamba. MamEVSR stands out by offering global receptive field coverage with linear computational complexity, thus addressing the limitations of convolutional neural networks and Transformers. The key components of MamEVSR include: (1) The interleaved Mamba (iMamba) block, which interleaves tokens from adjacent frames and applies multi-directional selective state space modeling, enabling efficient feature fusion and propagation across bi-directional frames while maintaining linear complexity. (2) The cross-modality Mamba (cMamba) block facilitates further interaction and aggregation between event information and the output from the iMamba block. The cMamba block can leverage complementary spatio-temporal information from both modalities and allows MamEVSR to capture finer motion details. Experimental results show that the proposed MamEVSR achieves superior performance on various datasets quantitatively and qualitatively.
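The exact token layout of the iMamba block is not given in the abstract; the helper below shows one plausible interleaving of two adjacent frames so that a 1D scan alternates between them, which is the general idea being described.

```python
import torch

def interleave_frames(prev: torch.Tensor, curr: torch.Tensor) -> torch.Tensor:
    """Interleave tokens of two adjacent frames along the sequence axis:
    [p0, c0, p1, c1, ...]. A hypothetical illustration of the token
    layout an interleaved (iMamba-style) scan could consume."""
    B, N, C = prev.shape
    mixed = torch.stack([prev, curr], dim=2)   # (B, N, 2, C)
    return mixed.reshape(B, 2 * N, C)

prev, curr = torch.randn(1, 4, 8), torch.randn(1, 4, 8)
seq = interleave_frames(prev, curr)            # (1, 8, 8), tokens alternate frames
```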
LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation
In this paper, we propose LC-Mamba, a Mamba-based model that captures fine-grained spatiotemporal information in video frames, addressing limitations in current interpolation methods and enhancing performance. The main contributions are as follows: First, we apply a shifted local window technique to reduce historical decay and enhance local spatial features, allowing multi-scale capture of detailed motion between frames. Second, we introduce a Hilbert curve-based selective state scan to maintain continuity across window boundaries, preserving spatial correlations both within and between windows. Third, we extend the Hilbert curve to enable voxel-level scanning to effectively capture spatiotemporal characteristics between frames. The proposed LC-Mamba achieves competitive results, with a PSNR of 36.53 dB on Vimeo-90k, outperforming prior models by +0.03 dB. The code and models are publicly available at https://anonymous.4open.science/r/LC-Mamba-FE7C
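A generic Hilbert-curve ordering already conveys why such a scan preserves locality: spatially close patches stay close in the 1D sequence. The helper below is the classic xy-to-index mapping for an n x n grid, used here to order window tokens; it is not the paper's extended voxel-level variant.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along the Hilbert curve filling an n x n
    grid (n a power of two). Classic iterative xy-to-d mapping."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the 64 patches of an 8x8 window along the Hilbert curve.
n = 8
order = sorted(range(n * n), key=lambda i: hilbert_index(n, i % n, i // n))
```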
Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
Unlike modern native digital videos, the restoration of old films requires addressing specific degradations inherent to analog sources. However, existing specialized methods still fall short compared to general video restoration techniques. In this work, we propose a new baseline to re-examine the challenges in old film restoration. First, we develop an improved Mamba-based framework, dubbed MambaOFR, which can dynamically adjust the degradation removal patterns by generating degradation-aware prompts to tackle the complex and composite degradations present in old films. Second, we introduce a flow-guided mask deformable alignment module to mitigate the propagation of structured defect features in the temporal domain. Third, we introduce the first benchmark dataset that includes both synthetic and real-world old film clips. Extensive experiments show that the proposed method achieves state-of-the-art performance, outperforming existing advanced approaches in old film restoration. The implementation and model will be released.
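The degradation-aware prompts are described only at a high level; the module below is a hypothetical sketch in the spirit of prompt-based restoration: pooled frame features select a soft mixture from a learnable prompt bank, which is then injected back into the features as a conditioning signal.

```python
import torch
import torch.nn as nn

class DegradationPrompt(nn.Module):
    """Hypothetical degradation-aware prompt module (not the paper's exact
    design): a learnable prompt bank is mixed according to the input's
    global statistics and added back as a conditioning signal."""
    def __init__(self, dim: int, num_prompts: int = 8):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.selector = nn.Linear(dim, num_prompts)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) degraded-frame features
        pooled = feat.mean(dim=(2, 3))                      # (B, C)
        weights = self.selector(pooled).softmax(dim=-1)     # (B, K)
        prompt = weights @ self.bank                        # (B, C)
        return feat + prompt[:, :, None, None]              # broadcast over H, W

out = DegradationPrompt(64)(torch.randn(2, 64, 32, 32))
```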
MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation
Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets—Breakfast, 50Salads, and Assembly101—while also significantly improving computational and memory efficiency.
MambaOut: Do We Really Need Mamba for Vision?
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification on ImageNet does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks on COCO or ADE20K are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks.
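Reading "Mamba blocks with the SSM token mixer removed" literally, what remains is essentially a gated convolutional block. The sketch below shows one such SSM-free block under that reading; the official MambaOut code should be consulted for the exact design.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """A Mamba-style block with the SSM token mixer removed: what is left
    is a gated (convolutional) MLP. Sketch of the MambaOut idea; see the
    official code for the exact block."""
    def __init__(self, dim: int, expand: int = 2, kernel: int = 7):
        super().__init__()
        hidden = expand * dim
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * hidden)            # value and gate paths
        self.dwconv = nn.Conv2d(hidden, hidden, kernel, padding=kernel // 2, groups=hidden)
        self.out_proj = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), channels-last as in many ConvNeXt-style blocks
        shortcut = x
        v, g = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        v = self.dwconv(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # depthwise token mixing
        return shortcut + self.out_proj(nn.functional.gelu(g) * v)  # gating instead of SSM

y = GatedConvBlock(64)(torch.randn(2, 14, 14, 64))
```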
M3amba: Memory Mamba is All You Need for Whole Slide Image Classification
Multi-instance learning (MIL) has demonstrated impressive performance in whole slide image (WSI) analysis. However, existing approaches struggle with undesirable results and unbearable computational overhead due to the quadratic complexity of Transformers. Recently, Mamba has offered a feasible solution for modeling long-range dependencies with linear complexity. However, vanilla Mamba inherently suffers from contextual forgetting issues, making it ill-suited for capturing global dependencies across instances in large-scale WSIs. To address this, we propose a memory-driven Mamba network, dubbed M3amba, to fully explore the global latent relations among instances. Specifically, M3amba retains and iteratively updates historical information with a dynamic memory bank (DMB), thus overcoming the catastrophic forgetting defects of Mamba for long-term context representation. For better feature representation, M3amba involves an intra-group bidirectional Mamba (BiMamba) block to refine local interactions within groups. Meanwhile, we additionally perform cross-attention fusion to incorporate relevant historical information across groups, facilitating richer inter-group connections. The joint learning of inter- and intra-group representations with memory merits enables M3amba with a more powerful capability for achieving accurate and comprehensive WSI representation. Extensive experiments on four datasets demonstrate that M3amba outperforms the state-of-the-art by 6.2% and 7.0% in accuracy on the TCGA BRAC and TCGA Lung datasets while maintaining low computational costs.
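The dynamic memory bank (DMB) is the key mechanism; the snippet below is a hypothetical minimal version in which a fixed set of memory slots cross-attends to each incoming instance group and is blended with the read-out, so context survives across many groups without storing them all.

```python
import torch
import torch.nn as nn

class DynamicMemoryBank(nn.Module):
    """Hypothetical sketch of a dynamic memory bank for WSI instance groups
    (not the paper's exact DMB): memory slots attend over each incoming
    group and are blended with the result, so long-range context is kept
    without a growing feature pool."""
    def __init__(self, dim: int, slots: int = 16, momentum: float = 0.9):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.momentum = momentum

    def forward(self, groups: list) -> torch.Tensor:
        # groups: list of (B, N_i, C) instance-feature groups from one slide
        B = groups[0].size(0)
        mem = self.memory.unsqueeze(0).expand(B, -1, -1)
        for g in groups:
            read, _ = self.attn(mem, g, g)                  # memory queries the group
            mem = self.momentum * mem + (1 - self.momentum) * read
        return mem                                          # (B, slots, C) slide summary

mem = DynamicMemoryBank(128)([torch.randn(2, 50, 128) for _ in range(4)])
```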
TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image
Recently, Mamba-based frameworks have achieved substantial advancements across diverse computer vision and NLP tasks, particularly in their capacity for reasoning over long-range information with linear complexity. However, the fixed 2D-to-1D scanning pattern overlooks the local structures of an image, limiting its effectiveness in aggregating 2D spatial information. While stacking additional Mamba layers can partially address this issue, it increases the parameter count and hinders real-time application. In this work, we reconsider the local optimal scanning path in Mamba, enhancing the rigid and uniform 1D scan through the local shortest path theory, thus creating a structure-aware Mamba suited for lightweight single-image super-resolution. Specifically, we draw inspiration from the Traveling Salesman Problem (TSP) to establish a local optimal scanning path for improved structural 2D information utilization. Here, local patch aggregation occurs in a content-adaptive manner with minimal propagation cost. TSP-Mamba demonstrates substantial improvements over existing Mamba-based and Transformer-based architectures. For example, TSP-Mamba surpasses MambaIR by up to 0.7 dB in lightweight SISR, with comparable parameters and only slightly higher computational demands (1-2 GFLOPs for 720P images).
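Solving a true TSP per image is impractical, and the paper's exact local formulation is not given in the abstract; the sketch below uses a greedy nearest-neighbour tour in feature space as a rough, hypothetical stand-in for a content-adaptive scan order within a local window.

```python
import torch

def greedy_tour(feats: torch.Tensor) -> torch.Tensor:
    """Greedy nearest-neighbour tour over patch features (N, C): start at
    patch 0, then repeatedly hop to the closest unvisited patch. A crude,
    hypothetical stand-in for a TSP-style content-adaptive scan order."""
    N = feats.size(0)
    dist = torch.cdist(feats, feats)                 # (N, N) pairwise distances
    visited = torch.zeros(N, dtype=torch.bool)
    order, cur = [0], 0
    visited[0] = True
    for _ in range(N - 1):
        d = dist[cur].masked_fill(visited, float("inf"))
        cur = int(d.argmin())
        visited[cur] = True
        order.append(cur)
    return torch.tensor(order)

order = greedy_tour(torch.randn(16, 32))             # scan order for a 4x4 patch window
```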
Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
Recent State Space Models (SSM), especially Mamba, have demonstrated impressive performance in visual modeling and possess superior model efficiency. However, the application of Mamba to visual tasks suffers from inferior performance due to three main constraints of the sequential model: 1) Causal computing is incapable of accessing global context; 2) Long-range forgetting when computing the current hidden states; 3) Weak spatial structural modeling due to the transformed sequential input. To address these issues, we investigate a simple yet powerful vision task adaptor for Mamba models, which consists of two functional modules: Adaptor-T and Adaptor-S. When solving the hidden states for the SSM, we apply a causal prediction module, Adaptor-T, to select a set of learnable locations as memory augmentation feature states to ease the long-range forgetting issue. Moreover, we leverage Adaptor-S, composed of multi-scale dilated convolutional kernels, to enhance spatial modeling and introduce the image inductive bias into the feature output. Both modules can enlarge the context modeling in causal computing, as the output is enhanced by the otherwise inaccessible features. We explore three usages of Mamba-Adaptor: a general visual backbone for various vision tasks; a booster module to raise the performance of pretrained backbones; a highly efficient fine-tuning module that adapts the base model for transfer learning tasks. Extensive experiments verify the effectiveness of Mamba-Adaptor in the three settings. Notably, our Mamba-Adaptor achieves state-of-the-art results on the ImageNet and COCO benchmarks. The code will be released publicly.
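Adaptor-S is described as multi-scale dilated convolutional kernels added to the feature output; below is a minimal, hypothetical sketch of such a module: parallel depthwise convolutions with increasing dilation, fused and added to the residual path.

```python
import torch
import torch.nn as nn

class AdaptorS(nn.Module):
    """Sketch of a spatial adaptor: parallel depthwise convolutions with
    increasing dilation add multi-scale 2D structure to a sequence model's
    feature map. Illustrative only; not the paper's exact Adaptor-S."""
    def __init__(self, dim: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim) for d in dilations
        )
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features reshaped back from the token sequence
        return x + self.fuse(sum(b(x) for b in self.branches))

y = AdaptorS(96)(torch.randn(2, 96, 14, 14))
```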
Mamba-Reg: Vision Mamba Also Needs Registers
Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba---they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture MambaReg. Qualitative observations suggest, compared to vanilla Vision Mamba, MambaReg's feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, MambaReg attains stronger performance and scales better. For example, on the ImageNet benchmark, our MambaReg-B attains 83.0% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 340M parameters), attaining a competitive accuracy of 83.6% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports MambaReg's efficacy.
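The two modifications are concrete enough to sketch: registers are inserted evenly across the patch sequence and recycled at the end for prediction. The helper below shows one way to do the interleaving and recycling; the exact spacing and head design in the paper may differ.

```python
import torch

def insert_registers(patches: torch.Tensor, registers: torch.Tensor):
    """Evenly insert R register tokens into a sequence of N patch tokens,
    returning the mixed sequence and the register positions so they can be
    recycled for the final prediction. Sketch of the MambaReg recipe."""
    B, N, C = patches.shape
    R = registers.size(0)
    chunks, positions, pos = [], [], 0
    split = torch.chunk(patches, R, dim=1)            # R roughly equal segments
    for seg, reg in zip(split, registers):
        chunks.append(reg.expand(B, 1, C))            # one register before each segment
        positions.append(pos)
        pos += 1 + seg.size(1)
        chunks.append(seg)
    return torch.cat(chunks, dim=1), positions

patches = torch.randn(2, 196, 192)
registers = torch.nn.Parameter(torch.zeros(12, 192))
seq, reg_pos = insert_registers(patches, registers)
cls_feat = seq[:, reg_pos].mean(dim=1)                # recycle registers for prediction
```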
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Deconvolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, while being up to ×21 faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.
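The WTE-Mamba branch relies on a wavelet transform to expose high-frequency detail; the helper below is a minimal single-level 2D Haar decomposition (not the paper's exact module) that produces the low-frequency band plus three high-frequency bands a Mamba branch could operate on.

```python
import torch

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar wavelet transform of (B, C, H, W) features
    (H, W even). Returns (LL, LH, HL, HH), each of shape (B, C, H/2, W/2).
    Minimal illustration of a wavelet-enhanced branch."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency approximation
    lh = (a - b + c - d) / 2   # horizontal detail
    hl = (a + b - c - d) / 2   # vertical detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_dwt2d(torch.randn(1, 16, 32, 32))
```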
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT and Vim, Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with a training throughput of 216 images/s, which is 3.8x and 6.2x faster than Vim and DeiT, respectively, in reaching the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images.
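Both designs are simple to sketch: a global average-pool token prepended to the sequence and a flip of the patch order between every two layers. The loop below shows how they could slot into a stack of causal blocks; the placeholder layers and the choice to classify from the pooling token are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def adventurer_forward(patch_tokens: torch.Tensor, blocks: nn.ModuleList) -> torch.Tensor:
    """Sketch of the Adventurer-style recipe: prepend a global pooling token,
    run uni-directional blocks, and flip the patch order every two layers.
    `blocks` stands in for causal Mamba-style layers (any seq-to-seq module)."""
    pool = patch_tokens.mean(dim=1, keepdim=True)       # (B, 1, C) global pooling token
    x = torch.cat([pool, patch_tokens], dim=1)
    for i, blk in enumerate(blocks):
        x = blk(x)
        if i % 2 == 1:                                  # flip patch tokens, keep pool first
            x = torch.cat([x[:, :1], x[:, 1:].flip(1)], dim=1)
    return x[:, 0]                                      # pooled token used for prediction

blocks = nn.ModuleList(nn.Linear(192, 192) for _ in range(4))   # placeholder layers
feat = adventurer_forward(torch.randn(2, 196, 192), blocks)
```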
A few papers from 2024
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Models
Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models.
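The bidirectional scan at the heart of Vim can be illustrated with a toy linear recurrence run forward and over the reversed sequence, then summed. This is only a conceptual sketch; the real block uses the hardware-aware selective scan with input-dependent parameters.

```python
import torch
import torch.nn as nn

class ToyBiSSM(nn.Module):
    """Toy bidirectional state-space mixer: a simple linear recurrence
    h_t = a * h_{t-1} + B x_t run forward and backward over the token
    sequence, then summed. Illustrates the bidirectional scan in Vim,
    not the real hardware-aware selective-scan kernel."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.a = nn.Parameter(torch.full((state,), 0.9))
        self.B = nn.Linear(dim, state, bias=False)
        self.C = nn.Linear(state, dim, bias=False)

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros(x.size(0), self.a.numel(), device=x.device)
        ys = []
        for t in range(x.size(1)):
            h = self.a * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scan(x) + self.scan(x.flip(1)).flip(1)   # forward + backward passes

y = ToyBiSSM(64)(torch.randn(2, 196, 64))
```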
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
https://arxiv.org/pdf/2407.08083
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://anonymous.4open.science/r/mamba_vision-D073
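The reported finding, that self-attention blocks in the final layers of a stage restore long-range spatial dependencies, can be sketched at the level of a single stage layout; the Mamba mixer below is a placeholder, not a real selective-scan block.

```python
import torch
import torch.nn as nn

class HybridStage(nn.Module):
    """Sketch of a MambaVision-style stage layout: mostly Mamba-style mixer
    blocks, with self-attention blocks only in the last layers of the stage.
    The Mamba mixer here is a placeholder MLP, not a real selective scan."""
    def __init__(self, dim: int, depth: int = 8, attn_layers: int = 2, heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(depth):
            if i >= depth - attn_layers:       # self-attention at the end of the stage
                self.blocks.append(nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True))
            else:                              # stand-in for a Mamba mixer block
                self.blocks.append(nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x

y = HybridStage(128)(torch.randn(2, 196, 128))
```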
GlobalMamba: Global Image Serialization for Vision Mamba
https://arxiv.org/abs/2410.10316
Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten them into 1D sequences for causal processing, which ignore the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.
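The serialization itself is concrete: take the DCT of the image, split the coefficients into frequency bands from low to high, and invert each band back to the spatial domain to obtain a causal, global-to-local sequence of images before tokenization. The sketch below follows that description with SciPy's DCT; the band boundaries are an assumption.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_band_images(img: np.ndarray, num_bands: int = 4) -> list:
    """Split an image (H, W) into images reconstructed from successive DCT
    frequency bands, ordered low to high, following the GlobalMamba-style
    serialization idea. Band boundaries (equal-width in u+v) are assumed."""
    H, W = img.shape
    coeffs = dctn(img, norm="ortho")
    u, v = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    freq = u + v                                          # simple frequency index
    edges = np.linspace(0, freq.max() + 1, num_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freq >= lo) & (freq < hi)
        bands.append(idctn(coeffs * mask, norm="ortho"))  # band back to spatial domain
    return bands                                          # each band is then tokenized

bands = frequency_band_images(np.random.rand(64, 64))
```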
Vision Mamba: A Comprehensive Survey and Taxonomy
https://arxiv.org/pdf/2405.04404
State Space Model (SSM) is a mathematical model used to describe and analyze the behavior of dynamic systems. This model has witnessed numerous applications in several fields, including control theory, signal processing, economics and machine learning. In the field of deep learning, state space models are used to process sequence data, such as time series analysis, natural language processing (NLP) and video understanding. By mapping sequence data to state space, long-term dependencies in the data can be better captured. In particular, modern SSMs have shown strong representational capabilities in NLP, especially in long sequence modeling, while maintaining linear time complexity. Notably, based on the latest state-space models, Mamba merges time-varying parameters into SSMs and formulates a hardware-aware algorithm for efficient training and inference. Given its impressive efficiency and strong long-range dependency modeling capability, Mamba is expected to become a new AI architecture that may outperform Transformer. Recently, a number of works have attempted to study the potential of Mamba in various fields, such as general vision, multi-modal, medical image analysis and remote sensing image analysis, by extending Mamba from natural language domain to visual domain. To fully understand Mamba in the visual domain, we conduct a comprehensive survey and present a taxonomy study. This survey focuses on Mamba's application to a variety of visual tasks and data types, and discusses its predecessors, recent advances and far-reaching impact on a wide range of domains. Since Mamba is now on an upward trend, please actively notify us if you have new findings, and new progress on Mamba will be included in this survey in a timely manner and updated on the Mamba project at this https URL.
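As a common reference point for all of the papers above, the discretized state-space recurrence that Mamba builds on fits in a few lines; Mamba additionally makes the parameters input-dependent (selective) and uses a hardware-aware parallel scan, neither of which this toy version attempts.

```python
import numpy as np

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Discretized linear state-space recurrence underlying Mamba-style layers:
        h_t = A h_{t-1} + B x_t,    y_t = C h_t.
    In Mamba itself, A, B, C (and the step size) are input-dependent
    ('selective'); here they are fixed for clarity."""
    L, d_in = x.shape
    N = A.shape[0]
    h = np.zeros(N)
    y = np.zeros((L, C.shape[0]))
    for t in range(L):
        h = A @ h + B @ x[t]       # state update: O(1) memory w.r.t. sequence length
        y[t] = C @ h               # read-out
    return y

L, d, N = 100, 4, 16
y = ssm_scan(np.random.randn(L, d), 0.95 * np.eye(N),
             np.random.randn(N, d) * 0.1, np.random.randn(d, N) * 0.1)
```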