From Touch to Semantics: A Cross-Modal Framework for Zero-Shot Spiking Tactile Object Recognition
-
Abstract: Tactile perception is a fundamental basis for robots to understand the physical properties of objects and to perform fine-grained interaction, and it plays an irreplaceable role in visually constrained or cluttered scenes. Compared with continuous tactile signals, event-driven spiking tactile sensing encodes contact dynamics in an asynchronous and sparse manner, offering higher temporal resolution and better energy efficiency. However, constrained by high acquisition cost and small sample sizes, existing supervised tactile recognition methods face a significant generalization bottleneck in open-world scenarios. Although zero-shot learning can alleviate data scarcity, existing methods mostly rely on visual assistance or coarse-grained manually defined semantics, which makes them ill-suited to the complex spatiotemporal dynamics of spiking tactile signals and leaves them without a stable cross-modal semantic alignment mechanism. To address these issues, this paper proposes a zero-shot object recognition method for spiking tactile signals. First, a biomimetic spiking graph neural network is constructed to model the spatiotemporal evolution of spiking tactile signals and to extract discriminative, biologically interpretable tactile representations. Second, a large language model is introduced to generate a fine-grained, structured tactile semantic space, and semantic prototypes of consistent dimensionality are built for each category. On this basis, a forward tactile-to-semantic mapping and a reverse semantic-to-tactile reconstruction are designed, forming a cycle-consistent cross-modal alignment framework that improves alignment stability and generalization to unseen categories. Systematic experiments are conducted on spiking tactile datasets under the standard zero-shot setting; the results show that the proposed method outperforms existing approaches in mean class accuracy, Top-k accuracy, and semantic alignment consistency, verifying its effectiveness and robustness for zero-shot spiking tactile recognition. This study provides a new technical pathway for semantic understanding of spiking tactile signals and for robotic perception in open environments.
Objective  Tactile perception is essential for robots to understand object properties and enable dexterous interaction. However, tactile data acquisition is costly and difficult to scale, which limits the applicability of conventional supervised learning in open-world scenarios. Zero-shot learning (ZSL) offers a promising alternative by transferring knowledge from seen to unseen categories via semantic representations. Existing tactile ZSL methods, however, either rely on auxiliary visual information or depend on manually designed attributes, which are often subjective and generalize poorly. Meanwhile, event-based tactile sensors produce sparse, asynchronous spiking signals with rich spatiotemporal dynamics, posing additional challenges for semantic modeling, and systematic studies on zero-shot recognition of such data remain limited. To address these issues, we propose a zero-shot object recognition framework for event-based spiking tactile perception that bridges low-level tactile dynamics and high-level semantics in a scalable manner.

Methods  The proposed framework consists of three key components (Fig. 1): spiking tactile feature extraction, semantic prototype construction, and cross-modal tactile-semantic alignment. First, a biomimetic spiking graph neural network models the raw event-based tactile signals. By combining leaky integrate-and-fire (LIF) neurons with graph-based message passing, it captures both the temporal firing dynamics and the spatial relationships among tactile sensing units, producing discriminative and biologically interpretable high-level tactile embeddings (a minimal illustrative sketch of such a layer is given after this abstract). Second, instead of relying on manually annotated attributes, large language models (LLMs) generate structured, fine-grained, and extensible tactile attribute descriptions for each object category. These descriptions are encoded into continuous semantic vectors, forming class-level semantic prototypes of consistent dimensionality across categories; this strategy enables flexible semantic expansion and avoids labor-intensive attribute engineering. Third, a bidirectional tactile-semantic alignment mechanism enhances generalization to unseen categories: a forward mapping projects tactile embeddings into the semantic space for classification, a reverse mapping reconstructs tactile features from semantic representations, and a cycle-consistency constraint between the two enforces structural coherence and semantic stability across modalities. The framework is trained on seen categories only, and zero-shot inference matches tactile embeddings of unseen samples against their semantic prototypes in the shared embedding space.

Results and Discussions  The proposed method is evaluated under a strict zero-shot setting on event-based spiking tactile datasets with disjoint seen and unseen class sets. Performance is measured by mean class accuracy, Top-k accuracy, and a semantic alignment score. The framework consistently outperforms state-of-the-art tactile ZSL baselines on all metrics.
Ablation studies validate each component: removing the spiking graph neural network causes notable performance degradation, confirming the importance of explicitly modeling spatiotemporal tactile dynamics, while replacing LLM-generated semantics with manually defined attributes reduces generalization, highlighting the advantage of structured, semantically rich language-driven representations. t-SNE visualization shows that cycle-consistent alignment yields more compact intra-class clusters and clearer inter-class boundaries for unseen categories, and the bidirectional alignment mechanism improves semantic stability and reduces projection bias. These results indicate that combining biologically inspired spiking models with language-extended semantics offers a robust solution for open-set tactile perception.

Conclusions  This paper presents a zero-shot object recognition framework for spiking tactile perception that integrates spiking graph neural networks with LLM-generated semantic representations. The proposed method addresses the limitations of existing tactile ZSL approaches by avoiding reliance on visual data and manual attribute design while explicitly modeling the spatiotemporal dynamics of spiking tactile signals. Experimental results demonstrate superior performance under strict zero-shot settings, confirming the effectiveness and robustness of the approach. This work establishes a strong baseline for zero-shot spiking tactile recognition and offers a principled pathway toward open-world tactile cognition in robotic systems. Future work will explore multimodal extensions and real-world robotic deployment under noisy and dynamic sensing conditions.
-
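The abstract gives no implementation details, but to fix ideas, the following is a minimal PyTorch sketch of a spiking graph layer of the kind described in the Methods: graph message passing over the tactile sensor layout followed by leaky integrate-and-fire dynamics. All names, the adjacency construction, and the hyperparameters below are illustrative assumptions, not the paper's actual implementation (training such a layer would additionally require a surrogate gradient for the spike threshold).

```python
import torch
import torch.nn as nn

class SpikingGraphLayer(nn.Module):
    """One graph message-passing step followed by LIF neuron dynamics (sketch only)."""

    def __init__(self, in_dim, out_dim, tau=2.0, v_th=1.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)  # per-taxel feature transform
        self.tau = tau                        # membrane time constant (leak)
        self.v_th = v_th                      # firing threshold

    def forward(self, spikes, adj):
        # spikes: (T, N, in_dim) spike trains over T time steps for N taxels
        # adj:    (N, N) normalized adjacency of the tactile sensor layout
        T = spikes.shape[0]
        v = torch.zeros(spikes.shape[1], self.fc.out_features)  # membrane potentials
        out = []
        for t in range(T):
            msg = adj @ self.fc(spikes[t])     # spatial message passing among taxels
            v = v + (msg - v) / self.tau       # leaky integration of incoming current
            s = (v >= self.v_th).float()       # emit a spike where the threshold is crossed
            v = v * (1.0 - s)                  # hard reset of neurons that fired
            out.append(s)
        return torch.stack(out)                # (T, N, out_dim) output spike trains

# Toy usage: 39 taxels and 20 time steps with random sparse input spikes.
layer = SpikingGraphLayer(in_dim=2, out_dim=16)
spikes = (torch.rand(20, 39, 2) < 0.1).float()
adj = torch.eye(39)                            # placeholder adjacency (identity)
print(layer(spikes, adj).shape)                # torch.Size([20, 39, 16])
```

Stacking several such layers and pooling the output spike trains over time could then yield the high-level tactile embedding that is aligned with the semantic prototypes.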
Table 1  Performance comparison of different methods on the zero-shot spiking tactile recognition task under the standard ZSL setting (%)
Method           Baseline type                       MCA            Top-1 Acc      Top-2 Acc
ALE [28]         Embedding-based ZSL (language)      55.92 ± 5.17   47.36 ± 6.84   72.41 ± 8.12
f-CLSWGAN [31]   Generative ZSL (language)           58.74 ± 6.23   50.21 ± 7.55   76.89 ± 7.33
zsl4ttr [11]     Generative ZSL (image + language)   61.35 ± 5.88   52.63 ± 6.92   79.05 ± 6.74
UniT [32]        Tactile representation learning     63.41 ± 4.96   54.80 ± 6.40   81.22 ± 5.91
Octopi [33]      Tactile-language large model        65.08 ± 6.40   56.14 ± 7.82   83.57 ± 6.15
Ours             Embedding-based ZSL (language)      73.48 ± 4.35   62.68 ± 7.81   88.75 ± 6.47
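For reference, the metrics reported in Table 1 follow their standard definitions: mean class accuracy (MCA) averages per-class accuracies so that every category is weighted equally, and Top-k accuracy counts a sample as correct if its true class is among the k highest-scoring classes. A minimal NumPy sketch of these definitions follows; the paper may additionally average over random seen/unseen splits, which is what the ± values suggest.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies: each class contributes equally to the score."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(accs))

def top_k_accuracy(y_true, scores, k=2):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best-scoring classes
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))

# Toy example with 3 classes and 5 samples of random class scores.
y_true = np.array([0, 1, 2, 2, 1])
scores = np.random.rand(5, 3)
print(mean_class_accuracy(y_true, scores.argmax(axis=1)))
print(top_k_accuracy(y_true, scores, k=2))
```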
Table 2  Ablation analysis of key modules (%); SAS denotes the semantic alignment score
LLM semantics   Bidirectional mapping   Cycle consistency   MCA     SAS
✗               ✗                       ✗                   31.57   30.24
✗               ✓                       ✓                   56.12   52.78
✓               ✗                       ✓                   54.98   51.46
✓               ✓                       ✗                   71.35   78.92
✓               ✓                       ✓                   73.48   86.17
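The modules ablated in Table 2 correspond to the LLM-generated semantics and the bidirectional, cycle-consistent mapping described in the Methods. The sketch below shows one rough way the forward mapping, reverse reconstruction, and cycle-consistency term might fit together; the loss forms (cosine-similarity classification plus an MSE cycle term), network depths, and dimensions are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalAligner(nn.Module):
    """Forward tactile->semantic mapping, reverse semantic->tactile mapping,
    and a cycle-consistency term (illustrative sketch only)."""

    def __init__(self, tac_dim, sem_dim):
        super().__init__()
        self.fwd = nn.Linear(tac_dim, sem_dim)   # tactile embedding -> semantic space
        self.rev = nn.Linear(sem_dim, tac_dim)   # semantic vector  -> tactile space

    def losses(self, tac_emb, sem_proto, labels):
        sem_pred = self.fwd(tac_emb)                         # project into semantic space
        logits = F.normalize(sem_pred, dim=-1) @ F.normalize(sem_proto, dim=-1).T
        cls_loss = F.cross_entropy(logits / 0.1, labels)     # prototype classification
        tac_rec = self.rev(sem_pred)                         # reconstruct tactile features
        cyc_loss = F.mse_loss(tac_rec, tac_emb)              # cycle-consistency constraint
        return cls_loss, cyc_loss

    @torch.no_grad()
    def zero_shot_predict(self, tac_emb, unseen_protos):
        sim = F.normalize(self.fwd(tac_emb), dim=-1) @ F.normalize(unseen_protos, dim=-1).T
        return sim.argmax(dim=-1)                            # nearest unseen prototype

# Toy usage with random tensors standing in for real embeddings and prototypes.
aligner = BidirectionalAligner(tac_dim=128, sem_dim=384)
tac = torch.randn(8, 128)                        # batch of tactile embeddings
protos = torch.randn(10, 384)                    # 10 seen-class semantic prototypes
cls_l, cyc_l = aligner.losses(tac, protos, torch.randint(0, 10, (8,)))
pred = aligner.zero_shot_predict(tac, torch.randn(4, 384))   # 4 unseen classes
```

At test time, zero_shot_predict compares projected tactile embeddings only against unseen-class prototypes, consistent with the strict ZSL protocol described in the abstract.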
Table 3  Ablation analysis of prompting strategies (%)
Strategy   Prompting scheme           MCA     Top-1 Acc   Top-2 Acc
A          Class-name baseline        56.12   47.53       71.84
B          Free-form description      68.94   58.26       81.33
C          Structured, 3 attributes   71.51   60.17       84.60
D          Structured, 5 attributes   73.48   62.68       88.75
-
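Table 3 compares prompting strategies for building the semantic prototypes, from a bare class name to structured prompts with three or five attributes. The sketch below shows one plausible way a structured five-attribute description could be encoded into a fixed-dimensional prototype; the attribute set, prompt wording, and the text encoder (a generic SentenceTransformer stand-in) are all hypothetical, not the paper's actual choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in text encoder

# Hypothetical structured prompt with five tactile attributes (cf. strategy D);
# the actual attribute set and wording used in the paper are not given here.
TEMPLATE = ("A {name} feels {hardness} and {texture}, with {elasticity} elasticity, "
            "{friction} surface friction, and a {thermal} thermal feel when touched.")

def build_prototype(name, attrs, encoder):
    """Encode one class's structured attribute description into a semantic prototype."""
    desc = TEMPLATE.format(name=name, **attrs)
    vec = encoder.encode([desc])[0]            # fixed-dimensional sentence embedding
    return vec / np.linalg.norm(vec)           # L2-normalize for cosine matching

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder; not the paper's encoder
proto = build_prototype(
    "tennis ball",
    {"hardness": "soft", "texture": "fuzzy", "elasticity": "high",
     "friction": "moderate", "thermal": "neutral"},
    encoder,
)
print(proto.shape)   # e.g. (384,)
```

In the framework described above, the attribute values themselves would be produced by querying an LLM for each category rather than written by hand.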
[1] LI Baojiang, LI Liang, WANG Haiyan, et al. TVT-transformer: A tactile-visual-textual fusion network for object recognition[J]. Information Fusion, 2025, 118: 102943. doi: 10.1016/j.inffus.2025.102943.
[2] LUO Shan, LEPORA N F, YUAN Wenzhen, et al. Tactile robotics: An outlook[J]. IEEE Transactions on Robotics, 2025, 41: 5564–5583. doi: 10.1109/TRO.2025.3608686.
[3] ZHANG Yupo, LI Xiaoyu, FANG Senlin, et al. Multi-branch multi-scale channel fusion graph convolutional networks with transfer cost for robotic tactile recognition tasks[J]. IEEE Transactions on Automation Science and Engineering, 2025, 22: 11856–11868. doi: 10.1109/TASE.2025.3541339.
[4] LI Liang, QIU Shengjie, LI Baojiang, et al. Object recognition based on tactile information: A generalized recognition network combining wavelet transform and transformer model for small sample datasets[J]. Information Sciences, 2025, 719: 122464. doi: 10.1016/j.ins.2025.122464.
[5] UEDA S, HASHIMOTO A, HAMAYA M, et al. Visuo-tactile zero-shot object recognition with vision-language model[C]. 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 2024: 7243–7250. doi: 10.1109/IROS58592.2024.10801766.
[6] TAUNYAZOV T, SNG W, LIM B, et al. Event-driven visual-tactile sensing and learning for robots[C]. Robotics: Science and Systems XVI, Corvalis, USA, 2020. doi: 10.15607/RSS.2020.XVI.020.
[7] ZHANG Tielin, LI Chengyu, WANG Gang, et al. Research advances and new paradigms for biology-inspired spiking neural networks[J]. Journal of Electronics & Information Technology, 2023, 45(8): 2675–2688. doi: 10.11999/JEIT221459. (in Chinese)
[8] YANG Jing, LIU Tingqing, REN Yaping, et al. AM-SGCN: Tactile object recognition for adaptive multichannel spiking graph convolutional neural networks[J]. IEEE Sensors Journal, 2023, 23(24): 30805–30820. doi: 10.1109/JSEN.2023.3329559.
[9] NAG S, ZHU Xiatian, SONG Yizhe, et al. Zero-shot temporal action detection via vision-language prompting[C]. 17th European Conference on Computer Vision–ECCV, Tel Aviv, Israel, 2022: 681–697. doi: 10.1007/978-3-031-20062-5_39.
[10] YANG Fengyu, FENG Chao, CHEN Ziyang, et al. Binding touch to everything: Learning unified multimodal tactile representations[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 26330–26343. doi: 10.1109/CVPR52733.2024.02488.
[11] CAO Guanqun, JIANG Jiaqi, BOLLEGALA D, et al. Multimodal zero-shot learning for tactile texture recognition[J]. Robotics and Autonomous Systems, 2024, 176: 104688. doi: 10.1016/j.robot.2024.104688.
[12] MERIBOUT M, TAKELE N A, DEREGE O, et al. Tactile sensors: A review[J]. Measurement, 2024, 238: 115332. doi: 10.1016/j.measurement.2024.115332.
[13] GUO Fangming, YU Fangwen, LI Mingyan, et al. Event-driven tactile sensing with dense spiking graph neural networks[J]. IEEE Transactions on Instrumentation and Measurement, 2025, 74: 2508113. doi: 10.1109/TIM.2025.3541787.
[14] HU Jiarui, ZHOU Yanmin, WANG Zhipeng, et al. X-Tacformer: Spatio-temporal attention model for tactile recognition[C]. 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024: 9638–9644. doi: 10.1109/ICRA57147.2024.10610365.
[15] YANG Jing, JI Xiaoyang, LI Shaobo, et al. Spiking neural network robot tactile object recognition method with regularization constraints[J]. Journal of Electronics & Information Technology, 2023, 45(7): 2595–2604. doi: 10.11999/JEIT220711. (in Chinese)
[16] POURPANAH F, ABDAR M, LUO Yuxuan, et al. A review of generalized zero-shot learning methods[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4051–4070. doi: 10.1109/TPAMI.2022.3191696.
[17] CAO Weipeng, WU Yuhao, SUN Yixuan, et al. A review on multimodal zero-shot learning[J]. WIREs Data Mining and Knowledge Discovery, 2023, 13(2): e1488. doi: 10.1002/widm.1488.
[18] FOTEINOPOULOU N M and PATRAS I. EmoCLIP: A vision-language method for zero-shot video facial expression recognition[C]. 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkiye, 2024: 1–10. doi: 10.1109/FG59268.2024.10581982.
[19] ZHONG Shaohong, ALBINI A, MAIOLINO P, et al. TactGen: Tactile sensory data generation via zero-shot sim-to-real transfer[J]. IEEE Transactions on Robotics, 2025, 41: 1316–1328. doi: 10.1109/TRO.2024.3521967.
[20] LIU Huaping, SUN Fuchun, FANG Bin, et al. Cross-modal zero-shot-learning for tactile object recognition[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(7): 2466–2474. doi: 10.1109/TSMC.2018.2818184.
[21] BI T, SFERRAZZA C, and D'ANDREA R. Zero-shot sim-to-real transfer of tactile control policies for aggressive swing-up manipulation[J]. IEEE Robotics and Automation Letters, 2021, 6(3): 5761–5768. doi: 10.1109/LRA.2021.3084880.
[22] MIRZA M J, KARLINSKY L, LIN Wei, et al. Meta-prompting for automating zero-shot visual recognition with LLMs[C]. 18th European Conference on Computer Vision–ECCV, Milan, Italy, 2025: 370–387. doi: 10.1007/978-3-031-72627-9_21.
[23] NAGAR A, JAISWAL S, and TAN C. Zero-shot visual reasoning by vision-language models: Benchmarking and analysis[C]. 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024: 1–8. doi: 10.1109/IJCNN60899.2024.10650020.
[24] LIU Cheng, WANG Chao, PENG Yan, et al. ZVQAF: Zero-shot visual question answering with feedback from large language models[J]. Neurocomputing, 2024, 580: 127505. doi: 10.1016/j.neucom.2024.127505.
[25] YELENOV A, UMURBEKOV I, KENZHEBEK D, et al. Zero-shot reasoning with haptic and visual feedback in vision-language-action robotic manipulation[C]. Proceedings of the CLAWAR 2025 Conference on AI Enabled Robotic Loco-Manipulation, Yokohama, Japan, 2025: 205–216. doi: 10.1007/978-3-032-09427-8_20.
[26] BOULENGER V, MARTEL M, BOUVET C, et al. Feeling better: Tactile verbs speed up tactile detection[J]. Brain and Cognition, 2020, 142: 105582. doi: 10.1016/j.bandc.2020.105582.
[27] MILLER T M, BLANKENBURG F, and PULVERMÜLLER F. Language, but not music, shapes tactile perception[J]. Language and Cognition, 2025, 17: e53. doi: 10.1017/langcog.2025.10006.
[28] AKATA Z, PERRONNIN F, HARCHAOUI Z, et al. Label-embedding for attribute-based classification[C]. 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA, 2013: 819–826. doi: 10.1109/CVPR.2013.111.
[29] CHI Wei, ZHANG Ying, ZHANG Xiaolu, et al. MA-SGNN: A multi-view adaptive spiking graph neural network for event-based tactile recognition[C]. 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Wuhan, China, 2025: 2088–2093. doi: 10.1109/BIBM66473.2025.11355976.
[30] CAO Yi, WU Weiguan, LI Ping, et al. Skeleton action recognition based on spatio-temporal feature enhanced graph convolutional network[J]. Journal of Electronics & Information Technology, 2023, 45(8): 3022–3031. doi: 10.11999/JEIT220749. (in Chinese)
[31] XIAN Yongqin, LORENZ T, SCHIELE B, et al. Feature generating networks for zero-shot learning[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 5542–5551. doi: 10.1109/CVPR.2018.00581.
[32] XU Zhengtong, UPPULURI R, ZHANG Xinwei, et al. UniT: Data efficient tactile representation with generalization to unseen objects[J]. IEEE Robotics and Automation Letters, 2025, 10(6): 5481–5488. doi: 10.1109/LRA.2025.3559835.
[33] YU S, LIN K, XIAO Anxing, et al. Octopi: Object property reasoning with large tactile-language models[C]. Robotics: Science and Systems, Delft, Netherlands, 2024.
-