From Touch to Semantics: A Cross-Modal Framework for Zero-Shot Spiking Tactile Object Recognition

CHI Wei; XU Jin

doi:10.11999/JEIT260158


CHI Wei, XU Jin. From Touch to Semantics: A Cross-Modal Framework for Zero-Shot Spiking Tactile Object Recognition[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260158



doi: 10.11999/JEIT260158 cstr: 32379.14.JEIT260158

CHI Wei, XU Jin

School of Computer Science, Peking University, Beijing 100871, China

Funds: The National Major Scientific Research Instrument Development Project (62427811), The National Natural Science Foundation General Project (62572008), The National Natural Science Foundation Youth Project (62403011, 62502025), The National Natural Science Foundation Key Project (62332006)

Received Date: 2026-02-06
Accepted Date: 2026-04-17
Rev Recd Date: 2026-04-17

Available Online: 2026-04-30

Abstract


Objective: Tactile perception enables robots to understand object properties and perform dexterous interactions. However, tactile data are costly to collect and difficult to scale, which limits conventional supervised learning in open-world scenarios. Zero-Shot Learning (ZSL) provides a promising solution by transferring knowledge from seen to unseen categories through semantic representations. Existing tactile ZSL methods either rely on auxiliary visual information or use manually designed attributes, which are often subjective and limited in generalization. Event-based spiking tactile signals are sparse and asynchronous, with rich spatiotemporal dynamics. These properties make semantic modeling more challenging. Systematic studies on zero-shot recognition for such data remain limited. To address these issues, this paper proposes a zero-shot object recognition framework for spiking tactile perception. The framework aims to bridge low-level tactile dynamics and high-level semantics in a scalable manner.

Methods: The proposed framework consists of three components (Fig. 1): spiking tactile feature extraction, semantic prototype construction, and cross-modal tactile-semantic alignment. First, a biomimetic Spiking Graph Neural Network (SGNN) is used to model raw event-based spiking tactile signals. By integrating Leaky Integrate-and-Fire (LIF) neurons with graph-based message passing, the SGNN captures temporal firing dynamics and spatial relationships among tactile sensing units. It then generates discriminative and biologically interpretable high-level tactile embeddings. Second, instead of using manually annotated attributes, a Large Language Model (LLM) is used to generate structured, fine-grained, and extensible tactile attribute descriptions for each object category. These textual descriptions are encoded as continuous semantic vectors to form class-level semantic prototypes with consistent dimensionality across categories. This strategy supports flexible semantic expansion and avoids labor-intensive attribute engineering. Third, a bidirectional tactile-semantic alignment mechanism is designed to improve generalization to unseen categories. A forward mapping projects tactile embeddings into the semantic space for classification, whereas a reverse mapping reconstructs tactile features from semantic representations. A cycle-consistency constraint is imposed between the two mappings to preserve structural coherence and semantic stability across modalities. The overall framework is trained only on seen categories. During zero-shot inference, tactile embeddings of unseen samples are matched with their corresponding semantic prototypes in the shared embedding space.

Results and Discussions: The proposed method is evaluated on the Ev-Object event-based tactile dataset under a strict zero-shot setting, with disjoint seen and unseen category sets. Performance is assessed using Mean Class Accuracy (MCA), Top-k accuracy, and the Semantic Alignment Score (SAS). The proposed framework consistently outperforms representative tactile ZSL baselines across all metrics. It achieves an MCA of 73.48%, a Top-1 accuracy of 62.68%, and a Top-2 accuracy of 88.75%. Ablation studies show that removing the LLM semantic module, bidirectional mapping, or cycle-consistency constraint reduces recognition performance and semantic alignment quality. Removing the LLM semantic module causes a substantial decrease in MCA, which confirms the role of structured LLM-generated tactile semantics in knowledge transfer. Removing the bidirectional mapping or the cycle-consistency constraint also reduces performance, indicating that both components help maintain stable cross-modal alignment. The t-SNE visualization further shows that cycle-consistent alignment yields more compact intra-class clusters and clearer inter-class separation for unseen categories. Semantic prototypes are also better located near the centers of tactile feature clusters. These results indicate that combining biologically inspired spiking models with LLM-generated tactile semantics provides an effective solution for open-world tactile perception.

Conclusions: This paper presents a zero-shot object recognition framework for spiking tactile perception by integrating SGNN-based tactile representation with semantic prototypes. The proposed method addresses key limitations of existing tactile ZSL approaches by avoiding visual data and manual attribute design while effectively modeling the spatiotemporal dynamics of event-based spiking tactile signals. Experimental results under strict zero-shot settings confirm the effectiveness and robustness of the proposed framework. This work provides a strong baseline for zero-shot spiking tactile recognition and offers a principled path toward open-world tactile cognition in robotic systems. Future work will explore generalized zero-shot tactile perception, multimodal extensions, and real-world robotic deployment under noisy and dynamic sensing conditions.
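To make the first component of the Methods more concrete, the following is a minimal NumPy sketch of how LIF neurons can be combined with graph-based message passing over tactile sensing units. It is an illustration under stated assumptions, not the SGNN used in the paper: the function names, the row-normalized adjacency, the time constant `tau`, the threshold `v_th`, and the hard reset are all hypothetical choices made here.

```python
import numpy as np

def normalized_adjacency(edges, n):
    """Build a row-normalized adjacency matrix with self-loops (assumed graph form)."""
    a = np.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a / a.sum(axis=1, keepdims=True)

def lif_graph_layer(spikes, adj, weight, tau=2.0, v_th=1.0):
    """Run one LIF graph layer over T time steps.

    spikes: (T, N, F_in) binary input spikes for N sensing units.
    adj:    (N, N) normalized adjacency between sensing units.
    weight: (F_in, F_out) feature projection.
    Returns (T, N, F_out) binary output spikes.
    """
    T, N, _ = spikes.shape
    v = np.zeros((N, weight.shape[1]))           # membrane potentials
    out = np.zeros((T, N, weight.shape[1]))
    for t in range(T):
        current = adj @ spikes[t] @ weight       # aggregate neighbor spikes
        v = v + (current - v) / tau              # leaky integration
        fired = v >= v_th                        # threshold crossing
        out[t] = fired
        v = np.where(fired, 0.0, v)              # hard reset after a spike
    return out

# Usage: three sensing units in a chain, constant input spikes for 4 steps.
adj = normalized_adjacency([(0, 1), (1, 2)], n=3)
x = np.ones((4, 3, 1))
out_spikes = lif_graph_layer(x, adj, np.array([[3.0]]))
embedding = out_spikes.mean(axis=0)              # firing rate as a simple embedding
```

Averaging output spikes over time is only one simple way to obtain a fixed-size embedding; the paper does not state which readout it uses.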
Keywords:
- Spiking tactile perception
- Zero-shot learning
- Spiking graph neural network
- Cross-modal alignment
- Large language models
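The bidirectional alignment, zero-shot matching, and Mean Class Accuracy described in the abstract can be sketched as follows. This is a hedged illustration only: the linear maps `w_fwd` and `w_rev`, the cosine-similarity classifier, the mean-squared cycle term, and the weight `lam` are assumptions introduced here, and the paper's actual mappings and loss may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_loss(tactile, prototypes, labels, w_fwd, w_rev, lam=1.0):
    """Classification loss in semantic space plus a cycle-consistency term."""
    sem = tactile @ w_fwd                              # forward: tactile -> semantic
    logits = l2_normalize(sem) @ l2_normalize(prototypes).T
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cls = -log_prob[np.arange(len(labels)), labels].mean()
    recon = sem @ w_rev                                # reverse: semantic -> tactile
    cycle = np.mean((recon - tactile) ** 2)            # tactile should be recovered
    return cls + lam * cycle

def zero_shot_predict(tactile, unseen_prototypes, w_fwd):
    """Assign each sample to the most similar unseen-class prototype."""
    sims = l2_normalize(tactile @ w_fwd) @ l2_normalize(unseen_prototypes).T
    return sims.argmax(axis=1)

def mean_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so every class counts equally."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

# Usage with toy 2-D data; identity maps stand in for trained projections.
protos = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([[0.9, 0.1], [0.2, 0.8]])
eye = np.eye(2)
pred = zero_shot_predict(x, protos, eye)
mca = mean_class_accuracy(np.array([0, 1]), pred)
```

In training, only seen-class prototypes would enter `alignment_loss`; at inference, `zero_shot_predict` is given the prototypes of the unseen classes, which matches the strict zero-shot setting with disjoint category sets.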

