Citation: YU Haoyang, LI Yansheng, XIAO Lingli, ZHOU Jiyuan. A Lightweight Semantic Visual Simultaneous Localization and Mapping Framework for Inspection Robots in Dynamic Environments[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250301

A Lightweight Semantic Visual Simultaneous Localization and Mapping Framework for Inspection Robots in Dynamic Environments

doi: 10.11999/JEIT250301 cstr: 32379.14.JEIT250301
Funds:  The National Natural Science Foundation of China (52575102), Chongqing Urban Administration Bureau Project (Urban Management Science Section 2022-34), Chongqing Natural Science Foundation (CSTB2022NSCQ-MSX0340)
  • Received Date: 2025-04-25
  • Rev Recd Date: 2025-07-28
  • Available Online: 2025-08-04
  •   Objective  In complex dynamic environments such as industrial parks and urban roads, inspection robots depend heavily on visual Simultaneous Localization And Mapping (SLAM) systems. However, the presence of moving objects often causes feature drift and map degradation, reducing SLAM performance. Furthermore, conventional semantic segmentation models typically require extensive computational resources, rendering them unsuitable for embedded platforms with limited processing capabilities, thereby constraining SLAM deployment in autonomous inspection tasks. To address these challenges, this study proposes a lightweight semantic visual SLAM framework designed for inspection robots operating in dynamic environments. The framework incorporates a semantic segmentation-based dynamic feature rejection method to achieve real-time identification of dynamic regions at low computational cost, thereby improving SLAM robustness and mapping accuracy.  Methods  Building upon the 11th generation lightweight YOLO segmentation model (YOLOv11n-seg), a systematic lightweight redesign is implemented to enhance performance under constrained computational resources. First, the original neck is replaced with DyCANet, a lightweight multi-scale feature fusion module that integrates dynamic point sampling and channel attention to improve semantic representation and boundary segmentation. DyCANet combines DySample, a dynamic upsampling operator that performs content-aware spatial sampling with minimal overhead, and ChannelAttention_HSFPN, a hierarchical attention structure that strengthens multi-scale integration and highlights critical semantic cues, particularly for small or occluded objects in complex scenes. Second, a Dynamic Convolution module (DynamicConv) is embedded into all C3k2 modules to enhance the adaptability and efficiency of feature extraction. Inspired by the Mixture-of-Experts framework, DynamicConv applies a conditional computation mechanism that dynamically adjusts kernel weights based on the input feature characteristics. This design allows the network to extract features more effectively across varying object scales and motion patterns, improving robustness against dynamic disturbances at low computational cost. Third, the original segmentation head is replaced by the Reused and Shared Convolutional Segmentation Head (RSCS Head), which enables decoder structure sharing across multi-scale branches. RSCS reduces redundant computation by reusing convolutional layers and optimizing feature decoding paths, further improving overall model efficiency while maintaining segmentation accuracy. These architectural modifications result in DHSR-YOLOSeg, a lightweight semantic segmentation model that significantly reduces parameter count and computational cost while preserving performance. DHSR-YOLOSeg is integrated into the tracking thread of ORB-SLAM3 to provide real-time semantic information. This enables dynamic object detection and the removal of unstable feature points during localization, thereby enhancing the robustness and trajectory consistency of SLAM in complex dynamic environments.  Results and Discussions  Ablation experiments on the COCO dataset demonstrate that, compared with the baseline YOLOv11n-seg, the proposed DHSR-YOLOSeg achieves a 13.8% reduction in parameter count, a 23.1% decrease in Giga Floating Point Operations (GFLOPs), and an approximate 2% increase in mean Average Precision at IoU 0.5 (mAP50) (Table 1).
On the KITTI dataset, DHSR-YOLOSeg reaches an inference speed of 60.19 frame/s, which is 2.14% faster than YOLOv11n-seg and 275% faster than the widely used Mask R-CNN (Table 2). For trajectory accuracy evaluation on KITTI sequences 00–10, DHSR-YOLOSeg outperforms ORB-SLAM3 in 8 out of 10 sequences, achieving a maximum Root Mean Square Error (RMSE) reduction of 16.76% and an average reduction of 8.78% (Table 3). Compared with DynaSLAM and DS-SLAM, the proposed framework exhibits more consistent error suppression across sequences, improving both trajectory accuracy and stability. In terms of runtime efficiency, DHSR-YOLOSeg achieves an average per-frame processing time of 48.86 ms on the KITTI dataset, 18.44% and 41.38% lower than DS-SLAM and DynaSLAM, respectively (Table 4). The per-sequence processing time ranges from 41 to 55 ms, which is comparable to the 35.64 ms of ORB-SLAM3, indicating that the integration of semantic segmentation introduces only a modest computational overhead.  Conclusions  This study addresses the challenge of achieving robust localization for inspection robots operating in dynamic environments, particularly in urban road settings characterized by frequent interference from pedestrians, vehicles, and other moving objects. To this end, a semantic-enhanced visual SLAM framework is proposed, in which a lightweight semantic segmentation model, DHSR-YOLOSeg, is integrated into the stereo-based ORB-SLAM3 pipeline. This integration enables real-time identification of dynamic objects and removal of their associated feature points, thereby improving localization robustness and trajectory consistency. The DHSR-YOLOSeg model incorporates three key architectural components: DyCANet for feature fusion, DynamicConv for adaptive convolution, and the RSCS Head for efficient multi-scale decoding. Together, these components reduce the model's size and computational cost while preserving segmentation performance, providing an efficient and deployable perception solution for resource-constrained platforms. Experimental findings show that: (1) ablation tests on the COCO dataset confirm substantial reductions in complexity with preserved accuracy, supporting embedded deployment; (2) frame rate comparisons on the KITTI dataset demonstrate superior performance over both lightweight and standard semantic segmentation methods, meeting real-time SLAM requirements; (3) trajectory evaluations indicate that the dynamic feature rejection strategy effectively mitigates localization errors in dynamic scenes; and (4) the overall system maintains high runtime efficiency, ensuring a practical balance between semantic segmentation and real-time localization performance. However, the current experiments are conducted under ideal conditions using standardized datasets, without fully reflecting real-world challenges such as multi-sensor interference or unstructured environments. Moreover, the trade-offs between model complexity and accuracy for each lightweight module have not been systematically assessed. Future work will focus on multimodal sensor fusion and adaptive dynamic perception strategies to enhance the robustness and applicability of the proposed system in real-world autonomous inspection scenarios. Minimal code sketches illustrating the conditional convolution idea, the dynamic feature rejection step, and the trajectory-error metric are given after the abstract.
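
The Methods describe DynamicConv as a Mixture-of-Experts-style conditional convolution whose kernel weights are adjusted from the input features. The following PyTorch snippet is a minimal sketch of that general idea only; the class name SimpleDynamicConv, the number of experts, and the routing-from-global-pooling design are illustrative assumptions, not the authors' exact module inside C3k2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDynamicConv(nn.Module):
    """Illustrative conditional convolution: a bank of expert kernels is mixed
    per sample with weights predicted from globally pooled input features."""

    def __init__(self, in_ch, out_ch, k=3, num_experts=4):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        # Expert kernel bank: (E, out_ch, in_ch, k, k)
        self.experts = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Linear(in_ch, num_experts)  # routing head

    def forward(self, x):
        b, c, h, w = x.shape
        # Per-sample mixing weights from global average pooling
        gates = torch.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)  # (B, E)
        # Mix expert kernels into one kernel set per sample: (B, out, in, k, k)
        kernels = torch.einsum('be,eoihw->boihw', gates, self.experts)
        # Apply the per-sample kernels via a grouped-convolution trick
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernels.reshape(b * self.out_ch, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.out_ch, h, w)

# Example: a 64-channel feature map passed through the dynamic convolution
if __name__ == "__main__":
    feat = torch.randn(2, 64, 80, 80)
    print(SimpleDynamicConv(64, 128)(feat).shape)  # torch.Size([2, 128, 80, 80])
```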
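The integration into the ORB-SLAM3 tracking thread amounts to masking out feature points that fall on segmented dynamic objects before pose estimation. The authors implement this inside the C++ tracking pipeline; the Python/OpenCV sketch below only illustrates the filtering step, assuming a binary dynamic_mask produced by the segmentation model for classes such as pedestrians and vehicles, with an arbitrarily chosen dilation margin.

```python
import cv2
import numpy as np

def reject_dynamic_features(gray, dynamic_mask, n_features=2000, dilate_px=5):
    """Detect ORB keypoints and drop those lying on segmented dynamic regions.

    gray         : HxW uint8 grayscale frame
    dynamic_mask : HxW boolean mask, True where the segmentation model labelled
                   a dynamic class (e.g. pedestrian, car)
    """
    # Dilate the mask slightly so keypoints on object boundaries are also removed
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    mask = cv2.dilate(dynamic_mask.astype(np.uint8), kernel) > 0

    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints = orb.detect(gray, None)

    # Keep only keypoints whose pixel location falls on a static region
    h, w = mask.shape
    static = [kp for kp in keypoints
              if not mask[min(int(round(kp.pt[1])), h - 1),
                          min(int(round(kp.pt[0])), w - 1)]]
    return orb.compute(gray, static)  # (keypoints, descriptors) on static areas
```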
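The trajectory results in Table 3 are reported as the RMSE of the Absolute Trajectory Error (ATE) and as percentage reductions relative to ORB-SLAM3. Below is a short sketch of these two computations, assuming the estimated and ground-truth KITTI trajectories have already been time-associated and aligned (e.g. with a standard SE(3)/Sim(3) alignment); it is an evaluation aid, not the authors' evaluation script.

```python
import numpy as np

def ate_rmse(est_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """RMSE of the Absolute Trajectory Error over N aligned 3-D positions (m)."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)  # per-pose translation error
    return float(np.sqrt(np.mean(err ** 2)))

def rmse_reduction(rmse_ours: float, rmse_baseline: float) -> float:
    """Percentage reduction of RMSE relative to a baseline (e.g. ORB-SLAM3)."""
    return 100.0 * (rmse_baseline - rmse_ours) / rmse_baseline

# Example with synthetic numbers: a 0.9 m RMSE against a 1.0 m baseline
if __name__ == "__main__":
    print(rmse_reduction(0.9, 1.0))  # 10.0 (%)
```

As a sanity check on the reported timings, a segmentation throughput of 60.19 frame/s corresponds to roughly 1000/60.19 ≈ 16.6 ms per segmented frame, which is consistent with the overall average of 48.86 ms per SLAM frame reported in Table 4.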
  • [1]
    HALDER S and AFSARI K. Robots in inspection and monitoring of buildings and infrastructure: A systematic review[J]. Applied Sciences, 2023, 13(4): 2304. doi: 10.3390/app13042304.
    [2]
    LI Yuhao, FU Chengguo, YANG Hui, et al. Design of a closed piggery environmental monitoring and control system based on a track inspection robot[J]. Agriculture, 2023, 13(8): 1501. doi: 10.3390/agriculture13081501.
    [3]
    LUO Zhaoyang, ZHANG Rongfen, LIU Yuhong, et al. Pedestrian intent semantic VSLAM in automatic driving scenarios[J]. Computer Engineering and Applications, 2024, 60(17): 107–116. doi: 10.3778/j.issn.1002-8331.2306-0159.
    [4]
    LI Guofeng, TAN Rong, and CAO Yuanyuan. Handheld SLAM: Emerging techniques and practical implementations in urban surveying[J]. Bulletin of Surveying and Mapping, 2024(S2): 255–259. doi: 10.13474/j.cnki.11-2246.2024.S253.
    [5]
    ZHANG Tianzhe and DAI Jun. Electric power intelligent inspection robot: A review[J]. Journal of Physics: Conference Series, 2021, 1750(1): 012023. doi: 10.1088/1742-6596/1750/1/012023.
    [6]
    MUR-ARTAL R, MONTIEL J M M, and TARDÓS J D. ORB-SLAM: A versatile and accurate monocular SLAM system[J]. IEEE Transactions on Robotics, 2015, 31(5): 1147–1163. doi: 10.1109/TRO.2015.2463671.
    [7]
    MUR-ARTAL R and TARDÓS J D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras[J]. IEEE Transactions on Robotics, 2017, 33(5): 1255–1262. doi: 10.1109/TRO.2017.2705103.
    [8]
    CAMPOS C, ELVIRA R, RODRÍGUEZ J J G, et al. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM[J]. IEEE Transactions on Robotics, 2021, 37(6): 1874–1890. doi: 10.1109/TRO.2021.3075644.
    [9]
    QIN Tong, LI Peiliang, and SHEN Shaojie. VINS-Mono: A robust and versatile monocular visual-inertial state estimator[J]. IEEE Transactions on Robotics, 2018, 34(4): 1004–1020. doi: 10.1109/TRO.2018.2853729.
    [10]
    ZANG Qiuyu, ZHANG Kehua, WANG Ling, et al. An adaptive ORB-SLAM3 system for outdoor dynamic environments[J]. Sensors, 2023, 23(3): 1359. doi: 10.3390/s23031359.
    [11]
    WU Hangbin, ZHAN Shihao, SHAO Xiaohang, et al. SLG-SLAM: An integrated SLAM framework to improve accuracy using semantic information, laser and GNSS data[J]. International Journal of Applied Earth Observation and Geoinformation, 2024, 133: 104110. doi: 10.1016/j.jag.2024.104110.
    [12]
    BESCOS B, FÁCIL J M, CIVERA J, et al. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4): 4076–4083. doi: 10.1109/LRA.2018.2860039.
    [13]
    VINCENT J, LABBÉ M, LAUZON J S, et al. Dynamic object tracking and masking for visual SLAM[C]. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, USA, 2020: 4974–4979. doi: 10.1109/IROS45743.2020.9340958.
    [14]
    KHANAM R and HUSSAIN M. YOLOv11: An overview of the key architectural enhancements[EB/OL]. https://arxiv.org/abs/2410.17725, 2024.
    [15]
    XU Ziheng, NIU Jianwei, LI Qingfeng, et al. NID-SLAM: Neural implicit representation-based RGB-D SLAM in dynamic environments[C]. 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, Canada, 2024: 1–6. doi: 10.1109/ICME57554.2024.10687512.
    [16]
    GONG Can, SUN Ying, ZOU Chunlong, et al. Real-time visual SLAM based YOLO-fastest for dynamic scenes[J]. Measurement Science and Technology, 2024, 35(5): 056305. doi: 10.1088/1361-6501/ad2669.
    [17]
    WU Peiyi, TONG Pengfei, ZHOU Xin, et al. Dyn-DarkSLAM: YOLO-based visual SLAM in low-light conditions[C]. 2024 IEEE 25th China Conference on System Simulation Technology and its Application (CCSSTA), Tianjin, China, 2024: 346–351. doi: 10.1109/CCSSTA62096.2024.10691775.
    [18]
    ZHANG Ruidong and ZHANG Xinguang. Geometric constraint-based and improved YOLOv5 semantic SLAM for dynamic scenes[J]. ISPRS International Journal of Geo-Information, 2023, 12(6): 211. doi: 10.3390/ijgi12060211.
    [19]
    YANG Tingting, JIA Shuwen, YU Ying, et al. Enhancing visual SLAM in dynamic environments with improved YOLOv8[C]. The Sixteenth International Conference on Digital Image Processing (ICDIP), Haikou, China, 2024: 132741Y. doi: 10.1117/12.3037734.
    [20]
    HAN Kai, WANG Yunhe, GUO Jianyuan, et al. ParameterNet: Parameters are all you need for large-scale visual pretraining of mobile networks[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 15751–15761. doi: 10.1109/CVPR52733.2024.01491.
    [21]
    YU Jiazuo, ZHUGE Yunzhi, ZHANG Lu, et al. Boosting continual learning of vision-language models via mixture-of-experts adapters[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 23219–23230. doi: 10.1109/CVPR52733.2024.02191.
    [22]
    LIU Wenze, LU Hao, FU Hongtao, et al. Learning to upsample by learning to sample[C]. The IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 6004–6014. doi: 10.1109/ICCV51070.2023.00554.
    [23]
    HUANGFU Yi, HUANG Zhonghao, YANG Xiaogang, et al. HHS-RT-DETR: A method for the detection of citrus greening disease[J]. Agronomy, 2024, 14(12): 2900. doi: 10.3390/agronomy14122900.
    [24]
    CAO Qi, CHEN Hang, WANG Shang, et al. LH-YOLO: A lightweight and high-precision SAR ship detection model based on the improved YOLOv8n[J]. Remote Sensing, 2024, 16(22): 4340. doi: 10.3390/rs16224340.
    [25]
    YU Chao, LIU Zuxin, LIU Xinjun, et al. DS-SLAM: A semantic visual SLAM towards dynamic environments[C]. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018: 1168–1174. doi: 10.1109/IROS.2018.8593691.