An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer

YANG Jing, LI Xiaoyong, RUAN Xiaoli, LI Shaobo, TANG Xianghong, XU Ji

Citation: YANG Jing, LI Xiaoyong, RUAN Xiaoli, LI Shaobo, TANG Xianghong, XU Ji. An Audio-visual Generalized Zero-Shot Learning Method Based on Multimodal Fusion Transformer[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT241090

doi: 10.11999/JEIT241090
Funds: The National Natural Science Foundation of China (62441608, 62166005), The Science and Technology Project of Guizhou Province (QKHZC[2023]368), The Developing Objects and Projects of Scientific and Technological Talents in Guiyang City (ZKHT[2023]48-8), The Guizhou University Basic Research Fund ([2024]08), The Open Project of the State Key Laboratory of Public Big Data (PBD2023-16)
    About the authors:

    YANG Jing: male, Associate Professor; research interest: visual computing

    LI Xiaoyong: male, master's student; research interest: audio-visual zero-shot learning

    RUAN Xiaoli: female, Associate Professor; research interest: multimodal big data fusion and analysis

    LI Shaobo: male, Professor; research interest: big data

    TANG Xianghong: male, Professor; research interest: artificial intelligence

    XU Ji: male, Professor; research interest: artificial intelligence theory

    Corresponding author:

    LI Xiaoyong, gs.xiaoyongli22@gzu.edu.cn

  • CLC number: TN919.81; TP183

  • Abstract: Audio-visual zero-shot learning requires understanding the relationship between audio and visual information in order to reason about unseen categories. Although much effort has been devoted to this field and significant progress has been made, existing methods tend to focus on learning powerful representations while overlooking the dependencies between audio and video and the mismatch between the output distribution and the target distribution. This paper therefore proposes a Transformer-based audio-visual generalized zero-shot learning method, referred to as MFT. Specifically, an attention mechanism is used to learn the internal information of the data and to strengthen the interaction between the two modalities, so as to capture the semantic consistency between audio and visual data. To measure the discrepancy between probability distributions and the consistency between categories, Kullback-Leibler (KL) divergence and cosine similarity losses are introduced. The proposed method is evaluated on three benchmark datasets: VGGSound-GZSLcls, UCF-GZSLcls, and ActivityNet-GZSLcls. Extensive experimental results show that it achieves state-of-the-art performance on all three datasets.
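
    This page describes the attention-based fusion only at a high level. As a reading aid, the following is a minimal PyTorch sketch of intra-modal self-attention plus cross-modal attention over audio and visual features; it is not the authors' implementation, and the class name CrossModalFusion, the feature dimension, the mean pooling, and the final projection are all assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of attention-based audio-visual fusion: self-attention models
    the internal structure of each modality, cross-attention lets each
    modality attend to the other (one way to encourage the semantic
    consistency described in the abstract)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.self_attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_av = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_va = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, visual):
        # audio, visual: (batch, seq_len, dim) clip-level feature tokens
        a, _ = self.self_attn_a(audio, audio, audio)    # intra-modal information
        v, _ = self.self_attn_v(visual, visual, visual)
        a2v, _ = self.cross_av(a, v, v)                 # audio queries visual
        v2a, _ = self.cross_va(v, a, a)                 # visual queries audio
        fused = torch.cat([a2v.mean(dim=1), v2a.mean(dim=1)], dim=-1)
        return self.proj(fused)                         # joint audio-visual embedding

# Toy usage with random features (dimensions are illustrative only).
fusion = CrossModalFusion()
audio = torch.randn(4, 10, 512)
visual = torch.randn(4, 10, 512)
print(fusion(audio, visual).shape)  # torch.Size([4, 512])
```
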
  • Figure 1  Illustration of the MFT method

    Figure 2  Overall framework

    Table 1  Dataset splits

    Dataset             | C   | CS  | CU  | TrS (stage 1) | ValS | ValU | TrS (stage 2) | TeS | TeU
    UCF-GZSLcls         | 51  | 30  | 21  | 30            | 30   | 12   | 42            | 42  | 9
    VGGSound-GZSLcls    | 276 | 138 | 138 | 138           | 138  | 69   | 207           | 207 | 69
    ActivityNet-GZSLcls | 200 | 99  | 101 | 99            | 99   | 51   | 150           | 150 | 50

    C, CS, and CU denote the total, seen, and unseen class counts; the stage 1 columns give the training/validation split and the stage 2 columns the training/test split.

    Table 2  Performance comparison of different methods

    Method                       | VGGSound-GZSLcls              | UCF-GZSLcls                   | ActivityNet-GZSLcls
                                 | S / U / HM / ZSL              | S / U / HM / ZSL              | S / U / HM / ZSL
    DeViSE[31] NeurIPS'13        | 36.22 / 1.07 / 2.08 / 5.59    | 55.59 / 14.94 / 23.56 / 16.09 | 3.45 / 8.53 / 4.91 / 8.53
    ALE[32] T-PAMI'15            | 0.28 / 5.48 / 0.53 / 5.48     | 57.59 / 14.89 / 23.66 / 16.32 | 2.63 / 7.87 / 3.94 / 7.90
    SJE[33] CVPR'15              | 48.33 / 1.10 / 2.15 / 4.06    | 63.10 / 16.77 / 26.50 / 18.93 | 4.61 / 7.04 / 5.57 / 7.08
    f-VAEGAN-D2[35] CVPR'19      | 12.77 / 0.95 / 1.77 / 1.91    | 17.29 / 8.47 / 11.37 / 11.11  | 4.36 / 2.14 / 2.87 / 2.40
    APN[34] IJCV'22              | 7.48 / 3.88 / 5.11 / 4.49     | 28.46 / 16.16 / 20.61 / 16.44 | 9.84 / 5.76 / 7.27 / 6.34
    †CJME[17] WACV'20            | 11.96 / 5.41 / 7.45 / 6.84    | 48.18 / 17.68 / 25.87 / 20.46 | 16.06 / 9.13 / 11.64 / 9.92
    †AVGZSLNet[18] WACV'21       | 13.02 / 2.88 / 4.71 / 5.44    | 56.26 / 34.37 / 42.67 / 35.66 | 14.81 / 11.11 / 12.70 / 12.39
    †AVCA[19] CVPR'22            | 32.47 / 6.81 / 11.26 / 8.16   | 34.90 / 38.67 / 36.69 / 38.67 | 24.04 / 19.88 / 21.76 / 20.88
    TCAF[20] ECCV'22             | 12.63 / 6.72 / 8.77 / 7.41    | 67.14 / 40.83 / 50.78 / 44.64 | 30.12 / 7.65 / 12.20 / 7.96
    AVFS[21] IJCNN'23            | 15.62 / 6.00 / 8.67 / 7.31    | 54.57 / 36.94 / 44.06 / 41.55 | 14.41 / 8.91 / 11.01 / 9.15
    †Hyper-multiple[22] ICCV'23  | 21.99 / 8.12 / 11.87 / 8.47   | 43.52 / 39.77 / 41.56 / 40.28 | 20.52 / 21.30 / 20.90 / 22.18
    KDA[23] arXiv'23             | 13.30 / 7.74 / 9.78 / 8.32    | 75.88 / 42.97 / 54.84 / 52.66 | 37.55 / 10.25 / 17.95 / 11.85
    STFT[36] TIP'24              | 11.74 / 8.83 / 10.08 / 8.79   | 61.42 / 43.81 / 51.14 / 49.74 | 25.12 / 9.83 / 14.13 / 9.46
    †ClipClap-GZSL[24] CVPR'24   | 29.68 / 11.12 / 16.18 / 11.53 | 77.14 / 43.91 / 55.97 / 46.96 | 45.98 / 20.06 / 27.93 / 22.76
    †MFT (Ours)                  | 31.63 / 13.37 / 18.80 / 13.97 | 72.52 / 47.12 / 57.12 / 48.56 | 42.58 / 25.66 / 32.02 / 27.88
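
    For reference, the HM columns in Tables 2–5 are consistent with the usual generalized zero-shot metric of [30], the harmonic mean of the seen-class accuracy S and unseen-class accuracy U; a worked check against the MFT row for UCF-GZSLcls in Table 2:

    $$ \mathrm{HM} = \frac{2\,S\,U}{S+U}, \qquad \frac{2 \times 72.52 \times 47.12}{72.52 + 47.12} \approx 57.12 . $$
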

    Table 3  Effect of the attention mechanism

    Model   | VGGSound-GZSLcls              | UCF-GZSLcls                   | ActivityNet-GZSLcls
            | S / U / HM / ZSL              | S / U / HM / ZSL              | S / U / HM / ZSL
    w/o att | 25.87 / 10.47 / 14.91 / 10.69 | 74.36 / 43.89 / 55.20 / 45.28 | 40.85 / 20.97 / 27.71 / 22.51
    MFT     | 31.63 / 13.37 / 18.80 / 13.97 | 72.52 / 47.12 / 57.12 / 48.56 | 42.58 / 25.66 / 32.02 / 27.88

    Table 4  Effect of using the full loss $l$ versus removing the components $l_{\cos}$ and $l_{kl}$

    Model           | VGGSound-GZSLcls              | UCF-GZSLcls                   | ActivityNet-GZSLcls
                    | S / U / HM / ZSL              | S / U / HM / ZSL              | S / U / HM / ZSL
    $l - l_{\cos}$  | 34.45 / 11.84 / 17.63 / 12.84 | 56.88 / 38.61 / 46.00 / 39.79 | 32.88 / 24.32 / 27.96 / 24.92
    $l - l_{kl}$    | 31.04 / 11.59 / 16.88 / 11.91 | 78.42 / 35.46 / 48.84 / 37.27 | 40.58 / 24.24 / 30.35 / 26.05
    $l$             | 31.63 / 13.37 / 18.80 / 13.97 | 72.52 / 47.12 / 57.12 / 48.56 | 42.58 / 25.66 / 32.02 / 27.88
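
    The abstract and Table 4 name a cosine similarity term $l_{\cos}$ and a KL divergence term $l_{kl}$ on top of a base objective $l$, but this page does not give their exact form. The PyTorch sketch below is one plausible reading only: the KL term aligns the model's output distribution with a target distribution, the cosine term pulls the fused embedding toward its class embedding, and the cross-entropy base term, the weights lambda_kl / lambda_cos, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def kl_alignment_loss(output_logits, target_logits, temperature=1.0):
    # KL divergence between the model's output distribution and a target
    # distribution (e.g. one induced by class/text embeddings).
    log_p = F.log_softmax(output_logits / temperature, dim=-1)
    q = F.softmax(target_logits / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

def cosine_consistency_loss(fused_emb, class_emb, labels):
    # 1 - cosine similarity between each fused audio-visual embedding and
    # the embedding of its ground-truth class.
    target = class_emb[labels]                           # (batch, dim)
    cos = F.cosine_similarity(fused_emb, target, dim=-1)
    return (1.0 - cos).mean()

def total_loss(logits, target_logits, fused_emb, class_emb, labels,
               lambda_kl=1.0, lambda_cos=1.0):
    # Base classification term plus the two auxiliary terms ablated in
    # Table 4: removing either l_cos or l_kl lowers HM on all benchmarks.
    l_base = F.cross_entropy(logits, labels)
    l_kl = kl_alignment_loss(logits, target_logits)
    l_cos = cosine_consistency_loss(fused_emb, class_emb, labels)
    return l_base + lambda_kl * l_kl + lambda_cos * l_cos

# Toy usage with random tensors (shapes are illustrative only).
batch, num_classes, dim = 8, 42, 512
logits = torch.randn(batch, num_classes)
target_logits = torch.randn(batch, num_classes)
fused_emb = torch.randn(batch, dim)
class_emb = torch.randn(num_classes, dim)
labels = torch.randint(0, num_classes, (batch,))
print(total_loss(logits, target_logits, fused_emb, class_emb, labels))
```
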

    Table 5  Effect of using different modality inputs

    Model  | VGGSound-GZSLcls              | UCF-GZSLcls                   | ActivityNet-GZSLcls
           | S / U / HM / ZSL              | S / U / HM / ZSL              | S / U / HM / ZSL
    Audio  | 16.24 / 9.78 / 12.21 / 9.91   | 55.23 / 34.10 / 42.25 / 39.02 | 10.71 / 7.72 / 8.97 / 8.31
    Visual | 15.51 / 6.81 / 9.47 / 7.91    | 54.92 / 43.23 / 48.38 / 46.17 | 33.22 / 23.52 / 27.54 / 24.21
    Both   | 31.63 / 13.37 / 18.80 / 13.97 | 72.52 / 47.12 / 57.12 / 48.56 | 42.58 / 25.66 / 32.02 / 27.88
  • [1] AN Hongchao, YANG Jing, ZHANG Xiuhua, et al. A class-incremental learning approach for learning feature-compatible embeddings[J]. Neural Networks, 2024, 180: 106685. doi: 10.1016/j.neunet.2024.106685.
    [2] LI Qinglang, YANG Jing, RUAN Xiaoli, et al. SPIRF-CTA: Selection of parameter importance levels for reasonable forgetting in continuous task adaptation[J]. Knowledge-Based Systems, 2024, 305: 112575. doi: 10.1016/j.knosys.2024.112575.
    [3] MA Xingjiang, YANG Jing, LIN Jiacheng, et al. LVAR-CZSL: Learning visual attributes representation for compositional zero-shot learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(12): 13311–13323. doi: 10.1109/tcsvt.2024.3444782.
    [4] DING Shujie, RUAN Xiaoli, YANG Jing, et al. LRDTN: Spectral-spatial convolutional fusion long-range dependence transformer network for hyperspectral image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5500821. doi: 10.1109/tgrs.2024.3510625.
    [5] AFOURAS T, OWENS A, CHUNG J S, et al. Self-supervised learning of audio-visual objects from video[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 208–224. doi: 10.1007/978-3-030-58523-5_13.
    [6] ASANO Y M, PATRICK M, RUPPRECHT C, et al. Labelling unlabelled videos from scratch with multi-modal self-supervision[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 4660–4671.
    [7] MITTAL H, MORGADO P, JAIN U, et al. Learning state-aware visual representations from audible interactions[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 23765–23779.
    [8] MORGADO P, MISRA I, and VASCONCELOS N. Robust audio-visual instance discrimination[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 12929–12940. doi: 10.1109/cvpr46437.2021.01274.
    [9] XU Hu, GHOSH G, HUANG Poyao, et al. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding[C]. The 2021 Conference on Empirical Methods in Natural Language Processing, Online, 2021: 6787–6800. doi: 10.18653/v1/2021.emnlp-main.544.
    [10] LIN K Q, WANG A J, SOLDAN M, et al. Egocentric video-language pretraining[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 7575–7586.
    [11] LIN K Q, ZHANG Pengchuan, CHEN J, et al. UniVTG: Towards unified video-language temporal grounding[C]. The 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 2782–2792. doi: 10.1109/iccv51070.2023.00262.
    [12] MIECH A, ALAYRAC J B, SMAIRA L, et al. End-to-end learning of visual representations from uncurated instructional videos[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9876–9886. doi: 10.1109/cvpr42600.2020.00990.
    [13] AFOURAS T, ASANO Y M, FAGAN F, et al. Self-supervised object detection from audio-visual correspondence[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10565–10576. doi: 10.1109/cvpr52688.2022.01032.
    [14] CHEN Honglie, XIE Weidi, AFOURAS T, et al. Localizing visual sounds the hard way[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16862–16871. doi: 10.1109/cvpr46437.2021.01659.
    [15] QIAN Rui, HU Di, DINKEL H, et al. Multiple sound sources localization from coarse to fine[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 292–308. doi: 10.1007/978-3-030-58565-5_18.
    [16] XU Haoming, ZENG Runhao, WU Qingyao, et al. Cross-modal relation-aware networks for audio-visual event localization[C]. The 28th ACM International Conference on Multimedia, Seattle, USA, 2020: 3893–3901. doi: 10.1145/3394171.3413581.
    [17] PARIDA K K, MATIYALI N, GUHA T, et al. Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos[C]. The 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, USA, 2020: 3240–3249. doi: 10.1109/wacv45572.2020.9093438.
    [18] MAZUMDER P, SINGH P, PARIDA K K, et al. AVGZSLNet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings[C]. The 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3089–3098. doi: 10.1109/wacv48630.2021.00313.
    [19] MERCEA O B, RIESCH L, KOEPKE A S, et al. Audiovisual generalised zero-shot learning with cross-modal attention and language[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10543–10553. doi: 10.1109/cvpr52688.2022.01030.
    [20] MERCEA O B, HUMMEL T, KOEPKE A S, et al. Temporal and cross-modal attention for audio-visual zero-shot learning[C]. 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 488–505. doi: 10.1007/978-3-031-20044-1_28.
    [21] ZHENG Qichen, HONG Jie, and FARAZI M. A generative approach to audio-visual generalized zero-shot learning: Combining contrastive and discriminative techniques[C]. 2023 International Joint Conference on Neural Networks, Gold Coast, Australia, 2023: 1–8. doi: 10.1109/ijcnn54540.2023.10191705.
    [22] HONG Jie, HAYDER Z, HAN Junlin, et al. Hyperbolic audio-visual zero-shot learning[C]. The 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 7839–7849. doi: 10.1109/iccv51070.2023.00724.
    [23] CHEN Haoxing, LI Yaohui, HONG Yan, et al. Boosting audio-visual zero-shot learning with large language models[J]. arXiv: 2311.12268, 2023. doi: 10.48550/arXiv.2311.12268.
    [24] KURZENDÖRFER D, MERCEA O B, KOEPKE A S, et al. Audio-visual generalized zero-shot learning using pre-trained large multi-modal models[C]. The 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2024: 2627–2638. doi: 10.1109/cvprw63382.2024.00269.
    [25] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
    [26] MEI Xinhao, MENG Chutong, LIU Haohe, et al. WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3339–3354. doi: 10.1109/TASLP.2024.3419446.
    [27] CHEN Honglie, XIE Weidi, VEDALDI A, et al. Vggsound: A large-scale audio-visual dataset[C]. ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 2020: 721–725. doi: 10.1109/icassp40776.2020.9053174.
    [28] SOOMRO K, ZAMIR A R, and SHAH M. A dataset of 101 human action classes from videos in the wild[J]. Center for Research in Computer Vision, 2012, 2(11): 1–7.
    [29] HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: A large-scale video benchmark for human activity understanding[C]. The 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 961–970. doi: 10.1109/cvpr.2015.7298698.
    [30] CHAO Weilun, CHANGPINYO S, GONG Boqing, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild[C]. 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 52–68. doi: 10.1007/978-3-319-46475-6_4.
    [31] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: A deep visual-semantic embedding model[C]. The 27th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, 2013: 2121–2129.
    [32] AKATA Z, PERRONNIN F, HARCHAOUI Z, et al. Label-embedding for image classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(7): 1425–1438. doi: 10.1109/TPAMI.2015.2487986.
    [33] AKATA Z, REED S, WALTER D, et al. Evaluation of output embeddings for fine-grained image classification[C]. The 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 2927–2936. doi: 10.1109/cvpr.2015.7298911.
    [34] XU Wenjia, XIAN Yongqin, WANG Jiuniu, et al. Attribute prototype network for any-shot learning[J]. International Journal of Computer Vision, 2022, 130(7): 1735–1753. doi: 10.1007/s11263-022-01613-9.
    [35] XIAN Yongqin, SHARMA S, SCHIELE B, et al. F-VAEGAN-D2: A feature generating framework for any-shot learning[C]. The 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 10267–10276. doi: 10.1109/cvpr.2019.01052.
    [36] LI Wenrui, WANG Penghong, XIONG Ruiqin, et al. Spiking tucker fusion transformer for audio-visual zero-shot learning[J]. IEEE Transactions on Image Processing, 2024, 33: 4840–4852. doi: 10.1109/tip.2024.3430080.
    [37] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. The 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497. doi: 10.1109/iccv.2015.510.
    [38] HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 131–135. doi: 10.1109/icassp.2017.7952132.
Publication history
  • Received: 2024-12-10
  • Revised: 2025-04-09
  • Published online: 2025-04-29
