Citation: ZOU Minrui, LI Yuxuan, DAI Yimian, LI Xiang, CHENG Mingming. UMM-Det: A Unified Object Detection Framework for Heterogeneous Multi-Modal Remote Sensing Imagery[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250933

UMM-Det: A Unified Object Detection Framework for Heterogeneous Multi-Modal Remote Sensing Imagery

doi: 10.11999/JEIT250933 cstr: 32379.14.JEIT250933
Funds:  The National Science Fund for Distinguished Young Scholars (62225604), The National Natural Science Foundation of China (62301261, 62206134), The General Program of Shenzhen Natural Science Foundation (JCYJ20240813114237048), The Natural Science Foundation of Tianjin (25JCQNJC01370), The Supercomputing Center of Nankai University (NKSC)
  • Accepted Date: 2026-01-04
  • Rev Recd Date: 2026-01-04
  • Available Online: 2026-01-10
Objective  With the increasing demand for space-based situational awareness, object detection across multiple modalities has become a fundamental yet challenging task. Current large-scale multimodal detection models for space-based remote sensing operate primarily on single-frame images from the visible light, synthetic aperture radar (SAR), and infrared modalities. Although these models achieve acceptable performance in conventional detection, they neglect the crucial role of infrared video sequences in improving the detection of weak and small targets. The temporal information inherent in sequential infrared data provides discriminative cues for separating dynamic targets from complex clutter, cues that single-frame detectors cannot capture. To address this limitation, this study proposes UMM-Det, a unified detection model tailored to infrared sequences. The proposed model not only extends the capability of existing space-based multimodal frameworks to sequential data but also demonstrates that exploiting temporal dynamics is indispensable for next-generation high-precision space-based sensing systems.

Methods  UMM-Det builds upon the unified multimodal detection framework SM3Det but introduces three key innovations. First, the ConvNeXt backbone is replaced with InternImage, a state-of-the-art architecture featuring dynamic sampling and large-receptive-field modeling. This modification improves the robustness of feature extraction against the multi-scale variations and low-contrast appearances typical of weak and small targets. Second, a spatiotemporal visual prompting module is designed specifically for the infrared branch. This module generates high-contrast motion features through a refined frame-difference enhancement strategy; the resulting temporal priors guide the backbone to attend to dynamic target regions, mitigating the confusion introduced by static background noise. Third, to counter the imbalance between positive and negative samples during training, the Probabilistic Anchor Assignment (PAA) strategy is incorporated into the infrared detection head, improving the reliability of anchor selection and the precision of small target detection under highly skewed data distributions. The overall pipeline is illustrated in Fig. 1, and the schematic of the spatiotemporal visual prompting module is shown in Fig. 2.
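The frame-difference enhancement is only summarized above; the following Python sketch illustrates one plausible form of such a temporal prompt, in which motion energy from neighbouring frames reweights the centre frame. The function name temporal_prompt, the window size k, and the min-max normalisation are illustrative assumptions, not the authors' implementation.

```python
import torch

def temporal_prompt(frames: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Hypothetical frame-difference prompt; frames has shape (T, C, H, W).

    Absolute differences between the centre frame and up to k temporal
    neighbours on each side are averaged into a motion-energy map, then
    normalised so that moving small targets stand out from static clutter.
    """
    t = frames.shape[0] // 2                       # index of the centre frame
    centre = frames[t]
    diffs = []
    for dt in range(1, k + 1):
        if t - dt >= 0:
            diffs.append((centre - frames[t - dt]).abs())
        if t + dt < frames.shape[0]:
            diffs.append((centre - frames[t + dt]).abs())
    motion = torch.stack(diffs).mean(dim=0)        # average motion energy
    # Min-max normalise to [0, 1] so the map acts as a soft attention prompt.
    motion = (motion - motion.amin()) / (motion.amax() - motion.amin() + 1e-6)
    return centre * (1.0 + motion)                 # motion-weighted enhancement

# Example: a 5-frame single-channel infrared clip at 256x256 resolution.
clip = torch.rand(5, 1, 256, 256)
print(temporal_prompt(clip).shape)  # torch.Size([1, 256, 256])
```

Because static background pixels change little between frames, their motion energy stays near zero and the prompt leaves them almost untouched, while moving weak targets are amplified before the backbone ever sees them.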
Results and Discussions  Extensive experiments are conducted on three public benchmarks: SatVideoIRSTD for infrared sequence detection, SARDet-50K for SAR target detection, and DOTA for visible light remote sensing detection. UMM-Det consistently outperforms the baseline SM3Det across all modalities. In infrared sequence small target detection, UMM-Det improves detection accuracy by 2.54% over SM3Det (Table 2, Fig. 5), validating the effectiveness of incorporating temporal priors. In SAR target detection (Table 2, Fig. 3), the model achieves an improvement of 2.40% in mAP@0.5:0.95, while in visible light detection (Table 2, Fig. 4), a gain of 1.77% is observed. These improvements highlight the generalizability of the proposed framework across heterogeneous modalities. Despite these gains, UMM-Det reduces the number of parameters by more than 50% compared with SM3Det (Table 2), ensuring the efficiency and lightweight deployment suitability required by space-based systems. Qualitative comparisons in Fig. 4 and Fig. 5 show that UMM-Det detects low-contrast and dynamic weak targets missed by the baseline. The discussion emphasizes three major findings. First, the spatiotemporal visual prompting strategy effectively transforms frame-to-frame variations into salient motion-aware cues, which are critical for distinguishing small dynamic targets from clutter in complex infrared environments. Second, the integration of InternImage as the backbone substantially strengthens multi-scale representation, ensuring robustness to targets of varying sizes and contrast levels. Third, the probabilistic anchor assignment strategy significantly alleviates the training imbalance problem, leading to more stable optimization and more reliable detection. Together, these components act synergistically, yielding superior performance not only on sequential infrared data but also on static SAR and visible light modalities.

Conclusions  This study proposes UMM-Det, the first space-based multimodal detection model explicitly designed to incorporate infrared sequence information into a unified detection framework. By leveraging InternImage for feature extraction, a spatiotemporal visual prompting module for motion-aware enhancement, and probabilistic anchor assignment for balanced training, UMM-Det delivers significant gains in detection accuracy while reducing computational cost by more than half. Experimental results on SatVideoIRSTD, SARDet-50K, and DOTA demonstrate state-of-the-art performance across the infrared, SAR, and visible light modalities, with improvements of 2.54%, 2.40%, and 1.77%, respectively. Beyond its technical contributions, UMM-Det provides an effective pathway toward next-generation high-performance space-based situational awareness systems, where precision, efficiency, and lightweight design are simultaneously critical. Future research may extend this framework to multi-satellite collaborative sensing and real-time onboard deployment.
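The PAA strategy named in the Methods separates candidate anchors by fitting a two-component Gaussian mixture to per-anchor quality scores and keeping the higher-scoring component as positives. Below is a schematic Python sketch of that idea under the assumption of a one-dimensional quality score per anchor; paa_select and the synthetic scores are hypothetical, not the paper's code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def paa_select(scores: np.ndarray) -> np.ndarray:
    """Schematic positive/negative anchor split in the spirit of PAA.

    scores holds one quality value per candidate anchor for a single
    ground-truth object (e.g. a combined classification + IoU score).
    A two-component 1-D Gaussian mixture models the score distribution;
    anchors assigned to the higher-mean component become positives.
    """
    x = scores.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    positive_component = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(x) == positive_component

# Example: 50 candidate anchors, most of low quality plus a few strong ones.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.0, 0.3, 45), rng.uniform(0.7, 0.9, 5)])
print(paa_select(scores).sum())  # expect roughly 5 positives
```

Under the skewed score distributions typical of small targets, a fixed IoU threshold either starves the positives or admits noisy anchors; letting the mixture adapt the decision boundary per object is what makes the assignment probabilistic.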
    [1]
    AN Chengjin, YANG Jungang, LIANG Zhengyu, et al. Closely spaced objects super-resolution method using array camera images[J]. Journal of Electronics & Information Technology, 2023, 45(11): 4050–4059. doi: 10.11999/JEIT230810.
    [2]
    YANG Jungang, LIU Ting, LIU Yongxian, et al. Infrared small target detection method based on nonconvex low-rank Tucker decomposition[J]. Journal of Infrared and Millimeter Waves, 2025, 44(2): 311–325. doi: 10.11972/j.issn.1001-9014.2025.02.018.
    [3]
    LIN Zaiping, LUO Yihang, LI Boyang, et al. Gradient-aware channel attention network for infrared small target image denoising before detection[J]. Journal of Infrared and Millimeter Waves, 2024, 43(2): 254–260. doi: 10.11972/j.issn.1001-9014.2024.02.015.
    [4]
    SHI Qian, HE Da, LIU Zhengyu, et al. Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping[J]. Journal of Remote Sensing, 2023, 3: 0078. doi: 10.34133/remotesensing.0078.
    [5]
    TIAN Jiaqi, ZHU Xiaolin, SHEN Miaogen, et al. Effectiveness of spatiotemporal data fusion in fine-scale land surface phenology monitoring: A simulation study[J]. Journal of Remote Sensing, 2024, 4: 0118. doi: 10.34133/remotesensing.0118.
    [6]
    LIU Shuaijun, LIU Jia, TAN Xiaoyue, et al. A hybrid spatiotemporal fusion method for high spatial resolution imagery: Fusion of gaofen-1 and sentinel-2 over agricultural landscapes[J]. Journal of Remote Sensing, 2024, 4: 0159. doi: 10.34133/remotesensing.0159.
    [7]
    MEI Shaohui, LIAN Jiawei, WANG Xiaofei, et al. A comprehensive study on the robustness of deep learning-based image classification and object detection in remote sensing: Surveying and benchmarking[J]. Journal of Remote Sensing, 2024, 4: 0219. doi: 10.34133/remotesensing.0219.
    [8]
    GUO Xin, LAO Jiangwei, DANG Bo, et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 27662–27673. doi: 10.1109/CVPR52733.2024.02613.
    [9]
    ZHANG Yingying, RU Lixiang, WU Kang, et al. SkySense V2: A unified foundation model for multi-modal remote sensing[J]. arXiv: 2507.13812, 2025. doi: 10.48550/arXiv.2507.13812.
    [10]
    BI Hanbo, FENG Yingchao, TONG Boyuan, et al. RingMoE: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025: 1–18. doi: 10.1109/TPAMI.2025.3643453.
    [11]
    LI Xuyang, LI Chenyu, GHAMISI P, et al. FlexiMo: A flexible remote sensing foundation model[J]. arXiv: 2503.23844, 2025. doi: 10.48550/arXiv.2503.23844.
    [12]
    YAO Kelu, XU Nuo, YANG Rong, et al. Falcon: A remote sensing vision-language foundation model (Technical Report)[J]. arXiv: 2503.11070, 2025. doi: 10.48550/arXiv.2503.11070.
    [13]
    QIN Xiaolei, WANG Di, ZHANG Jing, et al. TiMo: Spatiotemporal foundation model for satellite image time series[J]. arXiv: 2505.08723, 2025. doi: 10.48550/arXiv.2505.08723.
    [14]
    YAO Liang, LIU Fan, CHEN Delong, et al. RemoteSAM: Towards segment anything for earth observation[C]. Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 2025: 3027–3036. doi: 10.1145/3746027.3754950.
    [15]
    LI Yuxuan, LI Xiang, LI Yunheng, et al. SM3Det: A unified model for multi-modal remote sensing object detection[J]. arXiv: 2412.20665, 2024. doi: 10.48550/arXiv.2412.20665.
    [16]
    WANG Wenhai, DAI Jifeng, CHEN Zhe, et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 14408–14419. doi: 10.1109/CVPR52729.2023.01385.
    [17]
    LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
    [18]
    KIM K and LEE H S. Probabilistic anchor assignment with IoU prediction for object detection[C]. Proceedings of 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, 2020: 355–371. doi: 10.1007/978-3-030-58595-2_22.
    [19]
    LI Yuxuan, LI Xiang, LI Weijie, et al. SARDet-100K: Towards open-source benchmark and toolkit for large-scale SAR object detection[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 4079.
    [20]
    XIA Guisong, BAI Xiang, DING Jian, et al. DOTA: A large-scale dataset for object detection in aerial images[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3974–3983. doi: 10.1109/CVPR.2018.00418.
    [21]
    LI Ruojing, AN Wei, YING Xinyi, et al. Probing deep into temporal profile makes the infrared small target detector much better[J]. arXiv: 2506.12766, 2025. doi: 10.48550/arXiv.2506.12766.
    [22]
    LI Zhaoxu, XU Qingyu, AN Wei, et al. A lightweight dark object detection network for infrared images[J]. Journal of Infrared and Millimeter Waves, 2025, 44(2): 299–310. doi: 10.11972/j.issn.1001-9014.2025.02.017.
    [23]
    YING Xinyi, LIU Li, LIN Zaiping, et al. Infrared small target detection in satellite videos: A new dataset and a novel recurrent feature refinement framework[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5002818. doi: 10.1109/TGRS.2025.3542368.
    [24]
    GUO Menghao, LU Chengze, LIU Zhengning, et al. Visual attention network[J]. Computational Visual Media, 2023, 9(4): 733–752. doi: 10.1007/s41095-023-0364-2.
    [25]
    LI Yuxuan, HOU Qibin, ZHENG Zhaohui, et al. Large selective kernel network for remote sensing object detection[C]. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 16748–16759. doi: 10.1109/ICCV51070.2023.01540.
    [26]
    WANG Wenhai, XIE Enze, LI Xiang, et al. PVT v2: Improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8(3): 415–424. doi: 10.1007/s41095-022-0274-8.
    [27]
    LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017: 2999–3007. doi: 10.1109/ICCV.2017.324.
    [28]
    REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031.
    [29]
    CAI Zhaowei and VASCONCELOS N. Cascade R-CNN: Delving into high quality object detection[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6154–6162. doi: 10.1109/CVPR.2018.00644.
    [30]
    LI Xiang, WANG Wenhai, WU Lijun, et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection[C]. Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 1763.
    [31]
    DING Jian, XUE Nan, LONG Yang, et al. Learning RoI transformer for oriented object detection in aerial images[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019: 2844–2853. doi: 10.1109/CVPR.2019.00296.
    [32]
    LIU Yujie, SUN Xiaorui, SHAO Wenbin, et al. S2ANet: Combining local spectral and spatial point grouping for point cloud processing[J]. Virtual Reality & Intelligent Hardware, 2024, 6(4): 267–279. doi: 10.1016/j.vrih.2023.06.005.