
Lipreading Method Based on Multi-Scale Spatiotemporal Convolution

YE Hong, WEI Jinsong, JIA Zhaohong, ZHENG Hui, LIANG Dong, TANG Jun

Citation: YE Hong, WEI Jinsong, JIA Zhaohong, ZHENG Hui, LIANG Dong, TANG Jun. Lipreading Method Based on Multi-Scale Spatiotemporal Convolution[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT240161


doi: 10.11999/JEIT240161
Funds: The National Natural Science Foundation of China (71971002, 62273001), The Provincial Natural Science Foundation of Anhui (2108085QA35), Anhui Provincial Key Research and Development Project (202004a07020050), Anhui Provincial Major Science and Technology Project (202003A06020016), The Excellent Research and Innovation Teams in Anhui Province's Universities (2022AH010005)
Details
    About the authors:

    YE Hong: Male, Master's supervisor. His research interests include deep learning, artificial intelligence, architecture optimization, and parallel computing

    WEI Jinsong: Male, Master's student. His research interest is computer vision

    JIA Zhaohong: Female, Professor. Her research interests include artificial intelligence, decision support, and multi-objective optimization

    ZHENG Hui: Male, Lecturer. His research interests include multimodal perception and computer vision

    LIANG Dong: Male, Professor. His research interests include computer vision and pattern recognition, signal processing, and intelligent systems

    TANG Jun: Male, Professor. His research interests include computer vision and machine learning

    Corresponding author:

    ZHENG Hui, huizheng@ahu.edu.cn

  • CLC number: TN911.73; TP391.41

  • Abstract: Most existing lipreading models combine a single 3D convolutional layer with a 2D convolutional neural network to mine joint spatiotemporal features from lip video sequences. However, a single 3D convolutional layer cannot extract temporal information well, and 2D convolutional networks have limited capacity for mining fine-grained lip features. This paper therefore proposes a Multi-Scale Lipreading Network (MS-LipNet) to improve lipreading. Within the Res2Net backbone, 3D spatiotemporal convolutions replace the conventional 2D convolutions to better extract joint spatiotemporal features, and a SpatioTemporal Coordinate Attention (STCA) module is proposed so that the network focuses on task-relevant salient regions. Experiments on the LRW and LRW-1000 datasets verify the effectiveness of the proposed method.
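The paper's code is not reproduced on this page, so as a rough illustration of the two ideas the abstract names (Res2Net-style multi-scale feature splitting [15], with 3D spatiotemporal convolutions in place of the usual 2D ones), here is a minimal PyTorch sketch. The class name, scale count, and kernel sizes are assumptions, not the authors' implementation.

```python
# Minimal sketch of a Res2Net-style block [15] using 3D spatiotemporal
# convolutions, as the abstract describes. Names and hyperparameters are
# assumptions; this is not the authors' MS-LipNet code.
import torch
import torch.nn as nn

class STRes2NetBlock(nn.Module):
    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        # One 3x3x3 spatiotemporal conv per scale branch (the first split is
        # passed through), replacing Res2Net's 2D 3x3 convolutions.
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(width, width, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm3d(width),
                nn.ReLU(inplace=True),
            )
            for _ in range(scales - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        splits = torch.chunk(x, self.scales, dim=1)
        out = [splits[0]]  # first split is passed through unchanged
        prev = None
        for i, conv in enumerate(self.convs):
            # Hierarchical residual connection: each branch also receives the
            # previous branch's output, enlarging the receptive field.
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = conv(inp)
            out.append(prev)
        return torch.cat(out, dim=1) + x  # residual connection

# Example: a batch of 29-frame feature clips with 64 channels at 22x22.
feats = torch.randn(2, 64, 29, 22, 22)
print(STRes2NetBlock(64)(feats).shape)  # torch.Size([2, 64, 29, 22, 22])
```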
  • Figure 1  Overall framework of MS-LipNet

    Figure 2  Structure of the STCA attention module

    Figure 3  Structure of the STCA sub-modules

    Figure 4  Structure of ST-Res2Net

    Figure 5  Comparison of saliency maps across models
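Figures 2 and 3 describe the STCA module, which per the abstract adapts coordinate attention [17] to spatiotemporal features. The sketch below shows one plausible reading of that idea: pool a 5D feature map along each of the T/H/W axes, derive a sigmoid gate per axis, and rescale the input. The exact module is defined in the paper; this code is an assumption, not the published design.

```python
# Hedged sketch of "spatiotemporal coordinate attention": extend coordinate
# attention (Hou et al. [17]) from the H/W axes to T/H/W. This illustrates
# the idea only; it is not the STCA module from the paper.
import torch
import torch.nn as nn

class SpatioTemporalCoordAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 8)
        # Shared bottleneck transform, as in coordinate attention [17].
        self.shared = nn.Sequential(
            nn.Conv3d(channels, mid, kernel_size=1), nn.ReLU(inplace=True)
        )
        # One gate per axis, restoring the channel dimension.
        self.gate_t = nn.Conv3d(mid, channels, kernel_size=1)
        self.gate_h = nn.Conv3d(mid, channels, kernel_size=1)
        self.gate_w = nn.Conv3d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W). Average-pool over the two axes orthogonal to
        # each coordinate, giving one direction-aware descriptor per axis.
        t = x.mean(dim=(3, 4), keepdim=True)  # (B, C, T, 1, 1)
        h = x.mean(dim=(2, 4), keepdim=True)  # (B, C, 1, H, 1)
        w = x.mean(dim=(2, 3), keepdim=True)  # (B, C, 1, 1, W)
        # Per-axis sigmoid gates; broadcasting recombines them over (T, H, W).
        a_t = torch.sigmoid(self.gate_t(self.shared(t)))
        a_h = torch.sigmoid(self.gate_h(self.shared(h)))
        a_w = torch.sigmoid(self.gate_w(self.shared(w)))
        return x * a_t * a_h * a_w

x = torch.randn(2, 64, 29, 22, 22)
print(SpatioTemporalCoordAttention(64)(x).shape)  # (2, 64, 29, 22, 22)
```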

    Table 1  Recognition accuracy (%) of different methods on the LRW and LRW-1000 datasets

    Method                                  LRW     LRW-1000
    Two-Stream ResNet18+BiLSTM [19]         84.07   –
    2×ResNet18+BiGRU [20]                   84.13   41.93
    ResNet18+3×BiGRU+MI [32]                84.41   38.79
    ResNet18+MS-TCN [9]                     85.30   41.40
    SE-ResNet18+BiGRU [22]                  85.00   48.00
    3D-ResNet18+BiGRU+TSM [21]              86.23   44.60
    ResNet18+HPConv+self-attention [20]     86.83   –
    WPCL+APFF [33]                          88.30   49.40
    ResNet-18+DC-TCN [10]                   88.36   43.65
    2DCNN+BiGRU+Lip Segmentation [34]       90.38   –
    ResNet18+DC-TCN+TimeMask [35]           90.40   –
    MS-LipNet                               91.56   50.68

    Table 2  Ablation results (%) for different components of MS-LipNet (Mixup and Cutout are data augmentation; STCA is the attention module)

    Model       Mixup   Cutout  STCA    LRW     LRW-1000
    MS-LipNet   ×       ×       ×       90.95   50.12
    MS-LipNet   ✓       ×       ×       91.06   50.40
    MS-LipNet   ×       ✓       ×       91.01   50.42
    MS-LipNet   ×       ×       ✓       91.21   50.25
    MS-LipNet   ✓       ✓       ×       91.18   50.06
    MS-LipNet   ✓       ×       ✓       91.39   50.56
    MS-LipNet   ×       ✓       ✓       91.48   50.50
    MS-LipNet   ✓       ✓       ✓       91.56   50.68
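Table 2 ablates two augmentations, Mixup [28] and Cutout [27], alongside STCA. For reference, Mixup blends pairs of training samples and their labels; below is a minimal sketch of the standard formulation applied to a video batch. The paper's exact training recipe, including the Beta parameter alpha, is not given on this page, so those details are assumptions.

```python
# Minimal sketch of mixup (Zhang et al. [28]) on a batch of video clips.
# alpha and the application to video are assumptions, not the paper's recipe.
import torch

def mixup(clips: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """clips: (B, C, T, H, W); labels: (B,) integer class indices."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1.0 - lam) * clips[perm]
    # The loss is blended with the same coefficient:
    # loss = lam * ce(logits, labels) + (1 - lam) * ce(logits, labels[perm])
    return mixed, labels, labels[perm], lam
```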

    Table 3  Effect of different Cutout settings on accuracy (%)

    n_holes     length  LRW     LRW-1000
    0           0       91.39   50.50
    1           11      91.41   50.51
    1           22      91.44   50.53
    1           44      91.42   50.55
    2           11      91.49   50.54
    2           22      91.56   50.68
    2           44      91.43   50.58
    3           11      91.50   50.65
    3           22      91.37   50.59
    3           44      90.72   50.51
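The n_holes and length columns follow Cutout's parameterization [27]: the number of masked squares per sample and their side length in pixels. A minimal sketch for video clips is below; zeroing the same regions in every frame is an assumption, as the paper may apply the mask differently.

```python
# Minimal sketch of Cutout (DeVries & Taylor [27]) with the n_holes/length
# parameters ablated in Table 3. Sharing the mask across frames is assumed.
import torch

def cutout(clip: torch.Tensor, n_holes: int = 2, length: int = 22) -> torch.Tensor:
    """clip: (C, T, H, W); zeroes n_holes squares of side `length`."""
    _, _, h, w = clip.shape
    for _ in range(n_holes):
        # Pick a random center, then clamp the square to the frame bounds.
        cy = torch.randint(h, (1,)).item()
        cx = torch.randint(w, (1,)).item()
        y0, y1 = max(cy - length // 2, 0), min(cy + length // 2, h)
        x0, x1 = max(cx - length // 2, 0), min(cx + length // 2, w)
        clip[:, :, y0:y1, x0:x1] = 0.0  # same mask across all frames
    return clip
```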
  • [1] TAYLOR S L, MAHLER M, THEOBALD B J, et al. Dynamic units of visual speech[C]. Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Lausanne, Switzerland, 2012: 275–284.
    [2] LI Dengshi, GAO Yu, ZHU Chenyi, et al. Improving speech recognition performance in noisy environments by enhancing lip reading accuracy[J]. Sensors, 2023, 23(4): 2053. doi: 10.3390/s23042053.
    [3] IVANKO D, RYUMIN D, and KARPOV A. Automatic lip-reading of hearing impaired people[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2019, XLII-2/W12: 97–101. doi: 10.5194/isprs-archives-XLII-2-W12-97-2019.
    [4] GONZALEZ-LOPEZ J A, GOMEZ-ALANIS A, DOÑAS J M M, et al. Silent speech interfaces for speech restoration: A review[J]. IEEE Access, 2020, 8: 177995–178021. doi: 10.1109/ACCESS.2020.3026579.
    [5] EZZ M, MOSTAFA A M, and NASR A A. A silent password recognition framework based on lip analysis[J]. IEEE Access, 2020, 8: 55354–55371. doi: 10.1109/ACCESS.2020.2982359.
    [6] WANG Changhai, XU Yuwei, and ZHANG Jianzhong. Hierarchical classification-based smartphone displacement free activity recognition[J]. Journal of Electronics & Information Technology, 2017, 39(1): 191–197. doi: 10.11999/JEIT160253. (in Chinese)
    [7] STAFYLAKIS T and TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[C]. 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017.
    [8] STAFYLAKIS T, KHAN M H, and TZIMIROPOULOS G. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs[J]. Computer Vision and Image Understanding, 2018, 176/177: 22–32. doi: 10.1016/j.cviu.2018.10.003.
    [9] MARTINEZ B, MA Pingchuan, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 6319–6323. doi: 10.1109/ICASSP40776.2020.9053841.
    [10] MA Pingchuan, WANG Yijiang, SHEN Jie, et al. Lip-reading with densely connected temporal convolutional networks[C]. Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, USA, 2021: 2856–2865. doi: 10.1109/WACV48630.2021.00290.
    [11] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6450–6459. doi: 10.1109/CVPR.2018.00675.
    [12] QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5534–5542. doi: 10.1109/ICCV.2017.590.
    [13] MARTINEZ B, MA Pingchuan, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 6319–6323. doi: 10.1109/ICASSP40776.2020.9053841. (Note: this entry duplicates Ref. [9].)
    [14] CHEN Hang, DU Jun, HU Yu, et al. Automatic lip-reading with hierarchical pyramidal convolution and self-attention for image sequences with no word boundaries[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 3001–3005.
    [15] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652–662. doi: 10.1109/TPAMI.2019.2938758.
    [16] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
    [17] HOU Qibin, ZHOU Daquan, and FENG Jiashi. Coordinate attention for efficient mobile network design[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13708–13717. doi: 10.1109/CVPR46437.2021.01350.
    [18] CHUNG J S and ZISSERMAN A. Lip reading in the wild[C]. 13th Asian Conference on Computer Vision, Taipei, China, 2017: 87–103. doi: 10.1007/978-3-319-54184-6_6.
    [19] WENG Xinshuo and KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading[C]. 30th British Machine Vision Conference 2019, Cardiff, UK, 2019.
    [20] XIAO Jingyun, YANG Shuang, ZHANG Yuanhang, et al. Deformation flow based two-stream network for lip reading[C]. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 2020: 364–370. doi: 10.1109/FG47880.2020.00132.
    [21] HAO Mingfeng, MAMUT M, YADIKAR N, et al. How to use time information effectively? Combining with time shift module for lipreading[C]. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021: 7988–7992. doi: 10.1109/ICASSP39728.2021.9414659.
    [22] CHEN Hang, DU Jun, HU Yu, et al. Automatic lip-reading with hierarchical pyramidal convolution and self-attention for image sequences with no word boundaries[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 3001–3005. (Note: this entry duplicates Ref. [14].)
    [23] REN Yongmei, YANG Jie, GUO Zhiqiang, et al. Self-adaptive entropy weighted decision fusion method for ship image classification based on multi-scale convolutional neural network[J]. Journal of Electronics & Information Technology, 2021, 43(5): 1424–1431. doi: 10.11999/JEIT200102. (in Chinese)
    [24] FENG Dalu, YANG Shuang, SHAN Shiguang, et al. Learn an effective lip reading model without pains[EB/OL]. https://arxiv.org/abs/2011.07557, 2020.
    [25] XUE Feng, YANG Tian, LIU Kang, et al. LCSNet: End-to-end lipreading with channel-aware feature selection[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(1s): 28. doi: 10.1145/3524620.
    [26] FU Yixian, LU Yuanyao, and NI Ran. Chinese lip-reading research based on ShuffleNet and CBAM[J]. Applied Sciences, 2023, 13(2): 1106. doi: 10.3390/app13021106.
    [27] DEVRIES T and TAYLOR G W. Improved regularization of convolutional neural networks with cutout[EB/OL]. https://arxiv.org/abs/1708.04552, 2017.
    [28] ZHANG Hongyi, CISSE M, DAUPHIN Y N, et al. mixup: Beyond empirical risk minimization[C]. 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
    [29] YANG Shuang, ZHANG Yuanhang, FENG Dalu, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild[C]. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 2019: 1–8. doi: 10.1109/FG.2019.8756582.
    [30] KING D E. Dlib-ml: A machine learning toolkit[J]. The Journal of Machine Learning Research, 2009, 10: 1755–1758.
    [31] LOSHCHILOV I and HUTTER F. Decoupled weight decay regularization[C]. 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [32] ZHAO Xing, YANG Shuang, SHAN Shiguang, et al. Mutual information maximization for effective lip reading[C]. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 2020: 420–427. doi: 10.1109/FG47880.2020.00133.
    [33] TIAN Weidong, ZHANG Housen, PENG Chen, et al. Lipreading model based on whole-part collaborative learning[C]. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022: 2425–2429. doi: 10.1109/ICASSP43922.2022.9747052.
    [34] MILED M, MESSAOUD M A B, and BOUZID A. Lip reading of words with lip segmentation and deep learning[J]. Multimedia Tools and Applications, 2023, 82(1): 551–571. doi: 10.1007/s11042-022-13321-0.
    [35] MA Pingchuan, WANG Yujiang, PETRIDIS S, et al. Training strategies for improved lip-reading[C]. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022: 8472–8476. doi: 10.1109/ICASSP43922.2022.9746706.
Publication history
  • Received: 2024-03-12
  • Revised: 2024-09-11
  • Published online: 2024-09-16
