YE Hong, WEI Jinsong, JIA Zhaohong, ZHENG Hui, LIANG Dong, TANG Jun. Lipreading Method Based on Multi-Scale Spatiotemporal Convolution[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT240161
Lipreading Method Based on Multi-Scale Spatiotemporal Convolution

doi: 10.11999/JEIT240161
Funds:  The National Natural Science Foundation of China (71971002, 62273001), The Provincial Natural Science Foundation of Anhui (2108085QA35), Anhui Provincial Key Research and Development Project (202004a07020050), Anhui Provincial Major Science and Technology Project (202003A06020016), The Excellent Research and Innovation Teams in Anhui Province’s Universities (2022AH010005)
  • Received Date: 2024-03-12
  • Revised Date: 2024-09-10
  • Available Online: 2024-09-16
Abstract: Most existing lipreading models combine a single-layer 3D convolution with a 2D convolutional neural network to extract spatio-temporal joint features from lip video sequences. However, a single 3D convolutional layer captures only limited temporal information, and 2D convolutional networks are weak at mining fine-grained lip features. To address these limitations, a Multi-Scale Lipreading Network (MS-LipNet) is proposed. 3D spatio-temporal convolutions replace the conventional 2D convolutions in the Res2Net backbone to better extract spatio-temporal joint features, and a spatio-temporal coordinate attention module is proposed so that the network focuses on task-relevant regional features. Experiments on the LRW and LRW-1000 datasets verify the effectiveness of the proposed method.
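The two ideas in the abstract, a Res2Net-style hierarchical channel split applied with 3D (spatio-temporal) kernels, and coordinate-attention-style factorized pooling extended to the time axis, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all function names, shapes, and the single-kernel-per-group simplification are hypothetical, and the naive convolution loop is for clarity only.

```python
import numpy as np

def conv3d_same(x, kernel):
    """Naive 'same' 3D convolution of one channel x (T, H, W) with a
    (kt, kh, kw) kernel, using zero padding. Illustrative, not fast."""
    kt, kh, kw = kernel.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    T, H, W = x.shape
    for t in range(T):
        for h in range(H):
            for w in range(W):
                out[t, h, w] = np.sum(xp[t:t + kt, h:h + kh, w:w + kw] * kernel)
    return out

def res2net3d_block(x, kernels, scales=4):
    """Res2Net-style hierarchical split with 3D kernels.

    x: (C, T, H, W). Channels are split into `scales` groups; the first
    group passes through unchanged, and each later group is convolved
    after adding the previous group's output, so the effective
    spatio-temporal receptive field grows with each step (the
    multi-scale idea in MS-LipNet's backbone).
    """
    groups = np.array_split(x, scales, axis=0)
    outs = [groups[0]]
    prev = None
    for i in range(1, scales):
        g = groups[i] if prev is None else groups[i] + prev
        y = np.stack([conv3d_same(ch, kernels[i - 1]) for ch in g])
        outs.append(y)
        prev = y
    return np.concatenate(outs, axis=0)

def st_coordinate_pool(x):
    """Direction-wise pooling in the spirit of coordinate attention,
    extended to time: average over all axes except one, so attention
    weights can retain positional information along T, H, and W
    separately. x: (C, T, H, W)."""
    return x.mean(axis=(2, 3)), x.mean(axis=(1, 3)), x.mean(axis=(1, 2))
```

With a delta (identity) kernel the hierarchical connectivity is easy to check: group 0 passes through, group 2's output is `x[2] + x[1]`, and group 3's is `x[3] + x[2] + x[1]`, showing how later groups aggregate progressively larger contexts.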
