Lipreading Method Based on Multi-Scale Spatiotemporal Convolution
-
Abstract: Most existing lipreading models combine a single 3D convolutional layer with a 2D convolutional neural network to extract joint spatiotemporal features from lip video sequences. However, a single 3D convolutional layer captures temporal information poorly, and 2D convolutional networks have limited ability to mine fine-grained lip features. This paper therefore proposes a Multi-Scale Lipreading Network (MS-LipNet). Within the Res2Net backbone, 3D spatiotemporal convolutions replace the conventional 2D convolutions to better extract joint spatiotemporal features, and a spatio-temporal coordinate attention (STCA) module is proposed so that the network focuses on task-relevant regions. Experiments on the LRW and LRW-1000 datasets verify the effectiveness of the proposed method.
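The multi-scale idea borrowed from Res2Net can be illustrated with a toy sketch: channels are split into groups, and each group after the first is processed together with the previous group's output, so later groups see a progressively larger receptive field. The code below is a rough illustration under stated assumptions, not the paper's implementation; the 3D spatiotemporal convolution is stood in for by a simple temporal smoothing step, and all function names are illustrative.

```python
import numpy as np

def toy_conv3d(x):
    # Stand-in for a 3x3x3 spatiotemporal convolution: average the
    # features with a copy shifted along the temporal axis (axis 1).
    return 0.5 * (x + np.roll(x, 1, axis=1))

def res2net_multiscale(x, scales=4):
    """Toy sketch of Res2Net-style hierarchical multi-scale mixing.

    x: feature map of shape (C, T, H, W). Channels are split into
    `scales` groups; the first group passes through unchanged, and
    each later group is processed together with the previous group's
    output, widening the effective receptive field step by step.
    """
    groups = np.split(x, scales, axis=0)
    outs = [groups[0]]          # first split: identity branch
    prev = None
    for g in groups[1:]:
        inp = g if prev is None else g + prev   # hierarchical reuse
        prev = toy_conv3d(inp)
        outs.append(prev)
    return np.concatenate(outs, axis=0)
```

In the real network each `toy_conv3d` would be a learned 3D convolution, but the hierarchical wiring — each split receiving the previous split's output — is what gives the block its multi-scale character.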
-
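The spatio-temporal coordinate attention module builds on coordinate attention (Hou et al., ref [16]), which factorizes attention into directional pooling. The paper's exact STCA design is not reproduced here, so the numpy sketch below is only an assumed illustration: it pools the feature map along each of the temporal, height, and width axes, turns the pooled descriptors into sigmoid gates, and reweights the features so that informative positions along each axis are emphasized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coord_attention_3d(x):
    """Illustrative coordinate-style attention extended to video features.

    x: feature map of shape (C, T, H, W). Average pooling over the two
    axes orthogonal to each direction yields per-axis descriptors; each
    descriptor becomes a multiplicative gate in (0, 1).
    """
    g_t = sigmoid(x.mean(axis=(2, 3), keepdims=True))  # (C, T, 1, 1) temporal gate
    g_h = sigmoid(x.mean(axis=(1, 3), keepdims=True))  # (C, 1, H, 1) height gate
    g_w = sigmoid(x.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1, W) width gate
    return x * g_t * g_h * g_w
```

A learned version would pass each pooled descriptor through small convolutions before the sigmoid, as in the original coordinate attention; the directional factorization is the point being illustrated.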
Table 1  Recognition accuracy (%) of different methods on the LRW and LRW-1000 datasets

Method                               LRW     LRW-1000
Two-Stream ResNet18+BiLSTM [18]      84.07   –
2×ResNet18+BiGRU [19]                84.13   41.93
ResNet18+3×BiGRU+MI [30]             84.41   38.79
ResNet18+MS-TCN [9]                  85.30   41.40
SE-ResNet18+BiGRU [13]               85.00   48.00
3D-ResNet18+BiGRU+TSM [20]           86.23   44.60
ResNet18+HPConv+self-attention [19]  86.83   –
WPCL+APFF [31]                       88.30   49.40
ResNet-18+DC-TCN [10]                88.36   43.65
2DCNN+BiGRU+Lip Segmentation [32]    90.38   –
ResNet18+DC-TCN+TimeMask [33]        90.40   –
MS-LipNet                            91.56   50.68

Table 2  Ablation results (%) for the data augmentation (Mixup, Cutout) and attention (STCA) components of MS-LipNet
Mixup  Cutout  STCA   LRW     LRW-1000
×      ×       ×      90.95   50.12
√      ×       ×      91.06   50.40
×      √       ×      91.01   50.42
×      ×       √      91.21   50.25
√      √       ×      91.18   50.06
√      ×       √      91.39   50.56
×      √       √      91.48   50.50
√      √       √      91.56   50.68

Table 3  Effect of different Cutout settings on the results (%)

n_holes  length  LRW     LRW-1000
0        0       91.39   50.50
1        11      91.41   50.51
1        22      91.44   50.53
1        44      91.42   50.55
2        11      91.49   50.54
2        22      91.56   50.68
2        44      91.43   50.58
3        11      91.50   50.65
3        22      91.37   50.59
3        44      90.72   50.51
-
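Table 3 sweeps Cutout's two hyper-parameters, the number of square patches (n_holes) and their side length in pixels (length), with n_holes=2 and length=22 performing best. A minimal sketch of Cutout (DeVries and Taylor, ref [25]) applied to a lip clip follows; sharing hole positions across all frames and the (T, H, W) grayscale layout are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def cutout(frames, n_holes=2, length=22, rng=None):
    """Zero out `n_holes` random square patches of side `length` in a clip.

    frames: array of shape (T, H, W). The same patch positions are
    applied to every frame so the occlusion is temporally consistent.
    """
    t, h, w = frames.shape
    out = frames.copy()
    rng = rng or np.random.default_rng(0)
    for _ in range(n_holes):
        cy, cx = rng.integers(0, h), rng.integers(0, w)   # patch center
        y1, y2 = max(cy - length // 2, 0), min(cy + length // 2, h)
        x1, x2 = max(cx - length // 2, 0), min(cx + length // 2, w)
        out[:, y1:y2, x1:x2] = 0.0                        # erase the patch
    return out
```

Setting n_holes=0 recovers the no-Cutout baseline in the first row of Table 3; overly large or numerous holes (e.g. n_holes=3, length=44) remove too much of the mouth region and hurt accuracy.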
[1] TAYLOR S L, MAHLER M, THEOBALD B J, et al. Dynamic units of visual speech[C]. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Lausanne, Switzerland, 2012: 275–284.
[2] LI Dengshi, GAO Yu, ZHU Chenyi, et al. Improving speech recognition performance in noisy environments by enhancing lip reading accuracy[J]. Sensors, 2023, 23(4): 2053. doi: 10.3390/s23042053.
[3] IVANKO D, RYUMIN D, and KARPOV A. Automatic lip-reading of hearing impaired people[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2019, XLII-2/W12: 97–101. doi: 10.5194/isprs-archives-XLII-2-W12-97-2019.
[4] GONZALEZ-LOPEZ J A, GOMEZ-ALANIS A, DOÑAS J M M, et al. Silent speech interfaces for speech restoration: A review[J]. IEEE Access, 2020, 8: 177995–178021. doi: 10.1109/ACCESS.2020.3026579.
[5] EZZ M, MOSTAFA A M, and NASR A A. A silent password recognition framework based on lip analysis[J]. IEEE Access, 2020, 8: 55354–55371. doi: 10.1109/ACCESS.2020.2982359.
[6] WANG Changhai, XU Yuwei, and ZHANG Jianzhong. Hierarchical classification-based smartphone displacement free activity recognition[J]. Journal of Electronics & Information Technology, 2017, 39(1): 191–197. doi: 10.11999/JEIT160253.
[7] STAFYLAKIS T and TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[C]. 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017.
[8] STAFYLAKIS T, KHAN M H, and TZIMIROPOULOS G. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs[J]. Computer Vision and Image Understanding, 2018, 176/177: 22–32. doi: 10.1016/j.cviu.2018.10.003.
[9] MARTINEZ B, MA Pingchuan, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 6319–6323. doi: 10.1109/ICASSP40776.2020.9053841.
[10] MA Pingchuan, WANG Yijiang, SHEN Jie, et al. Lip-reading with densely connected temporal convolutional networks[C]. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, USA, 2021: 2856–2865. doi: 10.1109/WACV48630.2021.00290.
[11] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6450–6459. doi: 10.1109/CVPR.2018.00675.
[12] QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5534–5542. doi: 10.1109/ICCV.2017.590.
[13] CHEN Hang, DU Jun, HU Yu, et al. Automatic lip-reading with hierarchical pyramidal convolution and self-attention for image sequences with no word boundaries[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 3001–3005.
[14] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652–662. doi: 10.1109/TPAMI.2019.2938758.
[15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
[16] HOU Qibin, ZHOU Daquan, and FENG Jiashi. Coordinate attention for efficient mobile network design[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13708–13717. doi: 10.1109/CVPR46437.2021.01350.
[17] CHUNG J S and ZISSERMAN A. Lip reading in the wild[C]. 13th Asian Conference on Computer Vision, Taipei, China, 2017: 87–103. doi: 10.1007/978-3-319-54184-6_6.
[18] WENG Xinshuo and KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading[C]. 30th British Machine Vision Conference 2019, Cardiff, UK, 2019.
[19] XIAO Jingyun, YANG Shuang, ZHANG Yuanhang, et al. Deformation flow based two-stream network for lip reading[C]. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 2020: 364–370. doi: 10.1109/FG47880.2020.00132.
[20] HAO Mingfeng, MAMUT M, YADIKAR N, et al. How to use time information effectively? Combining with time shift module for lipreading[C]. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021: 7988–7992. doi: 10.1109/ICASSP39728.2021.9414659.
[21] REN Yongmei, YANG Jie, GUO Zhiqiang, et al. Self-adaptive entropy weighted decision fusion method for ship image classification based on multi-scale convolutional neural network[J]. Journal of Electronics & Information Technology, 2021, 43(5): 1424–1431. doi: 10.11999/JEIT200102.
[22] FENG Dalu, YANG Shuang, SHAN Shiguang, et al. Learn an effective lip reading model without pains[EB/OL]. https://arxiv.org/abs/2011.07557, 2020.
[23] XUE Feng, YANG Tian, LIU Kang, et al. LCSNet: End-to-end lipreading with channel-aware feature selection[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(1s): 28. doi: 10.1145/3524620.
[24] FU Yixian, LU Yuanyao, and NI Ran. Chinese lip-reading research based on ShuffleNet and CBAM[J]. Applied Sciences, 2023, 13(2): 1106. doi: 10.3390/app13021106.
[25] DEVRIES T and TAYLOR G W. Improved regularization of convolutional neural networks with cutout[EB/OL]. https://arxiv.org/abs/1708.04552, 2017.
[26] ZHANG Hongyi, CISSE M, DAUPHIN Y N, et al. mixup: Beyond empirical risk minimization[C]. 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
[27] YANG Shuang, ZHANG Yuanhang, FENG Dalu, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild[C]. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 2019: 1–8. doi: 10.1109/FG.2019.8756582.
[28] KING D E. Dlib-ml: A machine learning toolkit[J]. The Journal of Machine Learning Research, 2009, 10: 1755–1758.
[29] LOSHCHILOV I and HUTTER F. Decoupled weight decay regularization[C]. 7th International Conference on Learning Representations, New Orleans, USA, 2019.
[30] ZHAO Xing, YANG Shuang, SHAN Shiguang, et al. Mutual information maximization for effective lip reading[C]. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 2020: 420–427. doi: 10.1109/FG47880.2020.00133.
[31] TIAN Weidong, ZHANG Housen, PENG Chen, et al. Lipreading model based on whole-part collaborative learning[C]. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022: 2425–2429. doi: 10.1109/ICASSP43922.2022.9747052.
[32] MILED M, MESSAOUD M A B, and BOUZID A. Lip reading of words with lip segmentation and deep learning[J]. Multimedia Tools and Applications, 2023, 82(1): 551–571. doi: 10.1007/s11042-022-13321-0.
[33] MA Pingchuan, WANG Yujiang, PETRIDIS S, et al. Training strategies for improved lip-reading[C]. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022: 8472–8476. doi: 10.1109/ICASSP43922.2022.9746706.