MA Li, XIN Jiangbo, WANG Lu, DAI Xinguan, SONG Shuang. Mamba-YOWO: An Efficient Spatio-Temporal Representation Framework for Action Detection[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251124

Mamba-YOWO: An Efficient Spatio-Temporal Representation Framework for Action Detection

doi: 10.11999/JEIT251124 cstr: 32379.14.JEIT251124
Funds:  The National Natural Science Foundation of China (52574269)
  • Received Date: 2025-10-27
  • Accepted Date: 2026-02-05
  • Rev Recd Date: 2026-02-05
  • Available Online: 2026-02-13
Objective  Spatio-temporal action detection aims to localize and recognize action instances in untrimmed videos, a task essential for applications such as intelligent surveillance and human–computer interaction. Existing methods, particularly those based on 3D Convolutional Neural Networks (3D CNNs) or Transformers, often struggle to balance computational cost against the ability to model long-range temporal dependencies. The YOWO series provides efficient detection but relies on 3D convolutions with limited receptive fields. The Mamba architecture, built on a Selective State Space Model (SSM) with linear computational complexity, has shown strong capability for long-sequence modeling. This study integrates Mamba into the YOWO framework to improve temporal modeling efficiency and representation ability while reducing computational cost, addressing the limited application of Mamba to spatio-temporal action detection.

Methods  The proposed Mamba-YOWO framework is built on the lightweight YOWOv3 architecture and adopts a dual-branch heterogeneous design for feature extraction. The 2D branch, derived from YOLOv8 with CSPDarknet and PANet structures, processes keyframes to extract multi-scale spatial features. The temporal branch replaces conventional 3D convolutions with a hierarchical architecture composed of a Stem layer and three stages (Stage1–Stage3). Stage1 and Stage2 apply Patch Merging for spatial downsampling and stack Decomposed Bidirectionally Fractal Mamba (DBFM) blocks. The DBFM block employs a bidirectional Mamba structure to capture temporal dependencies in both past-to-future and future-to-past directions. A Spatio-Temporal Interleaved Scan (STIS) strategy is introduced within the DBFM block; it combines bidirectional temporal scanning with spatial Hilbert quad-directional scanning, enabling serialized video representation while preserving spatial locality and temporal consistency.
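The paper does not spell out the STIS serialization here, but the core idea can be sketched in a few lines: order the tokens of each frame along a Hilbert curve (so sequence neighbours stay spatial neighbours), traverse frames past-to-future for one scan direction, and reverse the path for the other. The sketch below is a minimal, self-contained illustration under that reading; `hilbert_d2xy` and `stis_orders` are illustrative names, not the authors' code, and the single Hilbert order stands in for the paper's quad-directional variant.

```python
def hilbert_d2xy(n, d):
    """Map step d of an n x n Hilbert curve (n a power of two) to grid
    coordinates (x, y), using the standard iterative conversion."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def stis_orders(T, n):
    """Serialize a T x n x n token grid two ways: a Hilbert scan inside each
    frame traversed past-to-future, and the same path reversed for the
    future-to-past branch of a bidirectional scan."""
    spatial = [hilbert_d2xy(n, d) for d in range(n * n)]
    forward = [(t, x, y) for t in range(T) for (x, y) in spatial]
    return forward, forward[::-1]

# Example: 4 frames of 8 x 8 tokens, serialized in both directions.
forward, backward = stis_orders(4, 8)
```

Because consecutive Hilbert steps land on grid-adjacent cells, neighbouring tokens in the serialized sequence remain spatial neighbours within a frame, which is the locality property the STIS strategy relies on.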
Stage3 applies 3D average pooling to compress temporal features. An Efficient Multi-scale Spatio-Temporal Fusion (EMSTF) module integrates features from the 2D and temporal branches: group convolution–guided hierarchical interaction performs preliminary fusion, and a parallel dual-branch structure performs refined fusion, generating an adaptive spatio-temporal attention map. A lightweight detection head with decoupled classification and regression subnetworks produces the final action tubes.

Results and Discussions  Extensive experiments were conducted on the UCF101-24 and JHMDB datasets. Compared with the YOWOv3/L baseline on UCF101-24, Mamba-YOWO achieved a Frame-mAP of 90.24% and a Video-mAP@0.5 of 60.32%, improvements of 2.1% and 6.0%, respectively (Table 1), while reducing model parameters by 7.3% and computational cost (GFLOPs) by 5.4%. On JHMDB, Mamba-YOWO achieved a Frame-mAP of 83.2% and a Video-mAP@0.5 of 86.7% (Table 2). Ablation experiments verified the contribution of each key component. The optimal number of DBFM blocks in Stage2 was four; additional blocks reduced performance, likely due to overfitting (Table 3). The proposed STIS strategy achieved higher accuracy than 1D-Scan, Selective 2D-Scan, and Continuous 2D-Scan (Table 4), indicating that jointly modeling temporal consistency and spatial structure improves representation quality. The EMSTF module also outperformed other fusion methods, including CFAM, EAG, and EMA (Table 5), showing a stronger ability to integrate heterogeneous features. These results indicate that the Mamba-based temporal branch effectively models long-range dependencies with linear complexity, while the EMSTF module improves multi-scale spatio-temporal feature integration.
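The EMSTF internals are only summarized above. As a rough illustration of the attention-gated fusion it describes, the NumPy sketch below implements one plausible reading: a grouped 1×1 convolution mixes the concatenated branch features, and a sigmoid attention map then gates between the two branches. Every name, shape, and weight here is an assumption made for illustration, not the paper's module.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grouped_pointwise(x, w, groups):
    """1x1 grouped convolution on a (C, H, W) map, written as per-group matmuls
    (the same computation a grouped conv layer performs at kernel size 1)."""
    C, H, W = x.shape
    cg = C // groups
    out = np.empty_like(x)
    for g in range(groups):
        xs = x[g * cg:(g + 1) * cg].reshape(cg, H * W)       # (cg, H*W)
        out[g * cg:(g + 1) * cg] = (w[g] @ xs).reshape(cg, H, W)
    return out

def emstf_fuse(f2d, f3d, w_group, w_att, groups=4):
    """Hypothetical EMSTF-style fusion: grouped mixing of the concatenated
    branches, then a sigmoid attention map that reweights 2D vs. temporal
    features at every spatial position and channel."""
    f = np.concatenate([f2d, f3d], axis=0)                   # (2C, H, W)
    mixed = np.maximum(grouped_pointwise(f, w_group, groups), 0.0)  # ReLU
    att = sigmoid(np.einsum('oc,chw->ohw', w_att, mixed))    # (C, H, W) in (0,1)
    return f2d * att + f3d * (1.0 - att)

# Toy shapes: C=8 channels per branch on a 4 x 4 feature map, 4 groups.
C, H, W, groups = 8, 4, 4, 4
cg = 2 * C // groups
w_group = rng.standard_normal((groups, cg, cg)) * 0.1
w_att = rng.standard_normal((C, 2 * C)) * 0.1
f2d = rng.standard_normal((C, H, W))
f3d = rng.standard_normal((C, H, W))
fused = emstf_fuse(f2d, f3d, w_group, w_att, groups)
```

Because the attention map lies strictly in (0, 1), each fused activation is a convex combination of the corresponding 2D and temporal features, which is the "adaptive gating" behaviour the module description implies.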
Conclusions  This study proposes Mamba-YOWO, an efficient spatio-temporal action detection framework that integrates the Mamba architecture into YOWOv3. The model replaces conventional 3D convolutions with a DBFM-based temporal branch incorporating the STIS scanning strategy, improving long-range temporal modeling at linear computational complexity. The EMSTF module further improves feature representation through group convolution and dynamic gating mechanisms. Experimental results on UCF101-24 and JHMDB show that Mamba-YOWO achieves higher detection accuracy (e.g., 90.24% Frame-mAP on UCF101-24) while reducing model parameters and computational cost. Future work will examine the theoretical mechanism of Mamba for temporal modeling, extend its capability to longer video sequences, and support lightweight deployment on edge devices.

    Figures(7)  / Tables(3)
