Volume 47 Issue 7
Jul. 2025
Citation: YANG Jing, LI Xiaoyong, RUAN Xiaoli, LI Shaobo, TANG Xianghong, XU Ji. An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer[J]. Journal of Electronics & Information Technology, 2025, 47(7): 2375-2384. doi: 10.11999/JEIT241090

An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer

doi: 10.11999/JEIT241090 cstr: 32379.14.JEIT241090v
Funds: The National Natural Science Foundation of China (62441608, 62166005), The Science and Technology Project of Guizhou Province (QKHZC[2023]368), The Developing Objects and Projects of Scientific and Technological Talents in Guiyang City (ZKHT[2023]48-8), Guizhou University Basic Research Fund ([2024]08), The Open Project of State Key Laboratory of Public Big Data (PBD2023-16)
  • Received Date: 2024-12-10
  • Rev Recd Date: 2025-04-09
  • Available Online: 2025-04-29
  • Publish Date: 2025-07-22
  •   Objective  Audio-visual Generalized Zero-Shot Learning (GZSL) integrates audio and visual signals in videos to enable the classification of seen classes and the effective recognition of unseen classes. Most existing approaches prioritize the alignment of audio-visual and textual label embeddings but overlook the interdependence between audio and video and the mismatch between model outputs and target distributions. This study proposes an audio-visual GZSL method based on a Multimodal Fusion Transformer (MFT) to address these limitations.
  •   Methods  The MFT employs a transformer-based multi-head attention mechanism to enable effective cross-modal interaction between visual and audio features. To optimize the output probability distribution, the Kullback-Leibler (KL) divergence between the predicted and target distributions is minimized, aligning predictions more closely with the true distribution. This optimization also reduces overfitting and improves generalization to unseen classes. In addition, a cosine similarity loss is applied to measure the similarity of learned representations within the same class, promoting feature consistency and improving discriminability.
  •   Results and Discussions  The experiments cover both GZSL and Zero-Shot Learning (ZSL) tasks. The ZSL task requires classification of unseen classes only, whereas the GZSL task addresses both unseen and seen classes to mitigate catastrophic forgetting. To evaluate the proposed method, experiments are conducted on three benchmark datasets: VGGSound-GZSLcls, UCF-GZSLcls, and ActivityNet-GZSLcls (Table 1). MFT is quantitatively compared with five ZSL methods and nine GZSL methods (Table 2). The results show that the proposed method achieves state-of-the-art performance on all three datasets; for example, on ActivityNet-GZSLcls, MFT exceeds the previous best ClipClap-GZSL method by 14.6%. This confirms the effectiveness of MFT in modeling cross-modal dependencies, aligning predicted and target distributions, and achieving semantic consistency between audio and visual features. Ablation studies (Tables 3–5) further support the contribution of each module in the proposed framework.
  •   Conclusions  This study proposes a transformer-based audio-visual GZSL method that uses a multi-head self-attention mechanism to extract intrinsic information from audio and video data and to enhance cross-modal interaction. This design enables more accurate capture of semantic consistency between modalities, improving the quality of cross-modal feature representations. To align the predicted and target distributions and reinforce intra-class consistency, KL divergence and cosine similarity losses are incorporated during training: the KL divergence improves the match between predicted and true distributions, while the cosine similarity loss enhances discriminability within each class. Extensive experiments demonstrate the effectiveness of the proposed method.
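The abstract names three technical ingredients: multi-head attention for audio-visual fusion, a KL-divergence term that aligns predicted and target label distributions, and a cosine similarity term that encourages intra-class consistency. The sketch below (PyTorch) is a minimal illustration of how these pieces could fit together; it is not the authors' implementation. The module layout, the label-smoothed target distribution, the loss weighting, and all dimensions are assumptions made for illustration.

```python
# Illustrative sketch only: multi-head attention fusion of audio and visual
# embeddings, a KL-divergence loss against an assumed (label-smoothed) target
# distribution, and an intra-class cosine similarity loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Fuse per-clip audio and visual embeddings with multi-head attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_classes: int = 100):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # Stack the two modality tokens into a length-2 sequence per sample,
        # then let each token attend to the other (cross-modal interaction).
        tokens = torch.stack([audio, video], dim=1)      # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)     # (B, 2, dim)
        fused = self.norm(fused + tokens).mean(dim=1)    # (B, dim)
        return fused, self.classifier(fused)             # features, logits


def kl_to_target(logits: torch.Tensor, labels: torch.Tensor, smoothing: float = 0.1):
    """KL divergence between predictions and an assumed smoothed target distribution."""
    num_classes = logits.size(-1)
    target = torch.full_like(logits, smoothing / (num_classes - 1))
    target.scatter_(1, labels.unsqueeze(1), 1.0 - smoothing)
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")


def intra_class_cosine_loss(features: torch.Tensor, labels: torch.Tensor):
    """Penalize low cosine similarity between samples that share a label."""
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t()                               # pairwise cosine similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    same.fill_diagonal_(0.0)                              # ignore self-pairs
    if same.sum() == 0:
        return sim.new_zeros(())
    return ((1.0 - sim) * same).sum() / same.sum()


if __name__ == "__main__":
    model = AudioVisualFusion()
    audio = torch.randn(8, 512)              # placeholder audio embeddings
    video = torch.randn(8, 512)              # placeholder visual embeddings
    labels = torch.randint(0, 100, (8,))
    feats, logits = model(audio, video)
    # Loss weight 0.5 is an arbitrary illustrative choice.
    loss = kl_to_target(logits, labels) + 0.5 * intra_class_cosine_loss(feats, labels)
    loss.backward()
```

In a full audio-visual GZSL pipeline the plain classifier head would typically be replaced by a projection into a shared text-label embedding space so that unseen classes can be scored by nearest-label matching; the sketch keeps a classifier only so the two loss terms remain self-contained.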