Citation: YANG Jing, LI Xiaoyong, RUAN Xiaoli, LI Shaobo, TANG Xianghong, XU Ji. An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer[J]. Journal of Electronics & Information Technology, 2025, 47(7): 2375–2384. doi: 10.11999/JEIT241090.
[1] AN Hongchao, YANG Jing, ZHANG Xiuhua, et al. A class-incremental learning approach for learning feature-compatible embeddings[J]. Neural Networks, 2024, 180: 106685. doi: 10.1016/j.neunet.2024.106685.
[2] LI Qinglang, YANG Jing, RUAN Xiaoli, et al. SPIRF-CTA: Selection of parameter importance levels for reasonable forgetting in continuous task adaptation[J]. Knowledge-Based Systems, 2024, 305: 112575. doi: 10.1016/j.knosys.2024.112575.
[3] MA Xingjiang, YANG Jing, LIN Jiacheng, et al. LVAR-CZSL: Learning visual attributes representation for compositional zero-shot learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(12): 13311–13323. doi: 10.1109/TCSVT.2024.3444782.
[4] DING Shujie, RUAN Xiaoli, YANG Jing, et al. LRDTN: Spectral-spatial convolutional fusion long-range dependence transformer network for hyperspectral image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5500821. doi: 10.1109/TGRS.2024.3510625.
[5] AFOURAS T, OWENS A, CHUNG J S, et al. Self-supervised learning of audio-visual objects from video[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 208–224. doi: 10.1007/978-3-030-58523-5_13.
[6] ASANO Y M, PATRICK M, RUPPRECHT C, et al. Labelling unlabelled videos from scratch with multi-modal self-supervision[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 4660–4671.
[7] MITTAL H, MORGADO P, JAIN U, et al. Learning state-aware visual representations from audible interactions[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 23765–23779.
[8] MORGADO P, MISRA I, and VASCONCELOS N. Robust audio-visual instance discrimination[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 12929–12940. doi: 10.1109/CVPR46437.2021.01274.
[9] XU Hu, GHOSH G, HUANG Poyao, et al. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding[C]. The 2021 Conference on Empirical Methods in Natural Language Processing, Online, 2021: 6787–6800. doi: 10.18653/v1/2021.emnlp-main.544.
[10] LIN K Q, WANG A J, SOLDAN M, et al. Egocentric video-language pretraining[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 7575–7586.
[11] LIN K Q, ZHANG Pengchuan, CHEN J, et al. UniVTG: Towards unified video-language temporal grounding[C]. The 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 2782–2792. doi: 10.1109/ICCV51070.2023.00262.
[12] MIECH A, ALAYRAC J B, SMAIRA L, et al. End-to-end learning of visual representations from uncurated instructional videos[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9876–9886. doi: 10.1109/CVPR42600.2020.00990.
[13] AFOURAS T, ASANO Y M, FAGAN F, et al. Self-supervised object detection from audio-visual correspondence[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10565–10576. doi: 10.1109/CVPR52688.2022.01032.
[14] CHEN Honglie, XIE Weidi, AFOURAS T, et al. Localizing visual sounds the hard way[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16862–16871. doi: 10.1109/CVPR46437.2021.01659.
[15] QIAN Rui, HU Di, DINKEL H, et al. Multiple sound sources localization from coarse to fine[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 292–308. doi: 10.1007/978-3-030-58565-5_18.
[16] XU Haoming, ZENG Runhao, WU Qingyao, et al. Cross-modal relation-aware networks for audio-visual event localization[C]. The 28th ACM International Conference on Multimedia, Seattle, USA, 2020: 3893–3901. doi: 10.1145/3394171.3413581.
[17] PARIDA K K, MATIYALI N, GUHA T, et al. Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos[C]. The 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, USA, 2020: 3240–3249. doi: 10.1109/WACV45572.2020.9093438.
[18] MAZUMDER P, SINGH P, PARIDA K K, et al. AVGZSLNet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings[C]. The 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3089–3098. doi: 10.1109/WACV48630.2021.00313.
[19] MERCEA O B, RIESCH L, KOEPKE A S, et al. Audiovisual generalised zero-shot learning with cross-modal attention and language[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10543–10553. doi: 10.1109/CVPR52688.2022.01030.
[20] MERCEA O B, HUMMEL T, KOEPKE A S, et al. Temporal and cross-modal attention for audio-visual zero-shot learning[C]. 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 488–505. doi: 10.1007/978-3-031-20044-1_28.
[21] ZHENG Qichen, HONG Jie, and FARAZI M. A generative approach to audio-visual generalized zero-shot learning: Combining contrastive and discriminative techniques[C]. 2023 International Joint Conference on Neural Networks, Gold Coast, Australia, 2023: 1–8. doi: 10.1109/IJCNN54540.2023.10191705.
[22] HONG Jie, HAYDER Z, HAN Junlin, et al. Hyperbolic audio-visual zero-shot learning[C]. The 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 7839–7849. doi: 10.1109/ICCV51070.2023.00724.
[23] CHEN Haoxing, LI Yaohui, HONG Yan, et al. Boosting audio-visual zero-shot learning with large language models[J]. arXiv: 2311.12268, 2023. doi: 10.48550/arXiv.2311.12268.
[24] KURZENDÖRFER D, MERCEA O B, KOEPKE A S, et al. Audio-visual generalized zero-shot learning using pre-trained large multi-modal models[C]. The 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2024: 2627–2638. doi: 10.1109/CVPRW63382.2024.00269.
[25] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, Online, 2021: 8748–8763.
[26] MEI Xinhao, MENG Chutong, LIU Haohe, et al. WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3339–3354. doi: 10.1109/TASLP.2024.3419446.
[27] CHEN Honglie, XIE Weidi, VEDALDI A, et al. VGGSound: A large-scale audio-visual dataset[C]. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 2020: 721–725. doi: 10.1109/ICASSP40776.2020.9053174.
[28] SOOMRO K, ZAMIR A R, and SHAH M. UCF101: A dataset of 101 human action classes from videos in the wild[J]. Center for Research in Computer Vision, 2012, 2(11): 1–7.
[29] HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: A large-scale video benchmark for human activity understanding[C]. The 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 961–970. doi: 10.1109/CVPR.2015.7298698.
[30] CHAO Weilun, CHANGPINYO S, GONG Boqing, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild[C]. 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 52–68. doi: 10.1007/978-3-319-46475-6_4.
[31] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: A deep visual-semantic embedding model[C]. The 27th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2013: 2121–2129.
[32] AKATA Z, PERRONNIN F, HARCHAOUI Z, et al. Label-embedding for image classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(7): 1425–1438. doi: 10.1109/TPAMI.2015.2487986.
[33] AKATA Z, REED S, WALTER D, et al. Evaluation of output embeddings for fine-grained image classification[C]. The 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 2927–2936. doi: 10.1109/CVPR.2015.7298911.
[34] XU Wenjia, XIAN Yongqin, WANG Jiuniu, et al. Attribute prototype network for any-shot learning[J]. International Journal of Computer Vision, 2022, 130(7): 1735–1753. doi: 10.1007/s11263-022-01613-9.
[35] XIAN Yongqin, SHARMA S, SCHIELE B, et al. F-VAEGAN-D2: A feature generating framework for any-shot learning[C]. The 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 10267–10276. doi: 10.1109/CVPR.2019.01052.
[36] LI Wenrui, WANG Penghong, XIONG Ruiqin, et al. Spiking tucker fusion transformer for audio-visual zero-shot learning[J]. IEEE Transactions on Image Processing, 2024, 33: 4840–4852. doi: 10.1109/TIP.2024.3430080.
[37] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. The 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497. doi: 10.1109/ICCV.2015.510.
[38] HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 131–135. doi: 10.1109/ICASSP.2017.7952132.