Citation: | LAN Chaofeng, JIANG Pengwei, CHEN Huan, HAN Chuang, GUO Xiaoxia. Cross-modal Audiovisual Separation Based on U-Net Network Combining Optical Flow Algorithm and Attention Mechanism[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3538-3546. doi: 10.11999/JEIT221500 |
[1] |
WANG Deliang and BROWN G J. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications[M]. Hoboken: Wiley, 2006: 1–14.
|
[2] |
SCHMIDT M N and OLSSON R K. Single-channel speech separation using sparse non-negative matrix factorization[C]. The INTERSPEECH 2006, Pittsburgh, USA, 2006.
|
[3] |
ZHOU Weili, ZHU Zhen, and LIANG Peiying. Speech denoising using Bayesian NMF with online base update[J]. Multimedia Tools and Applications, 2019, 78(11): 15647–15664. doi: 10.1007/s11042-018-6990-5
|
[4] |
SUN Lei, DU Jun, DAI Lirong, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement[C]. 2017 Hands-free Speech Communications and Microphone Arrays, San Francisco, USA, 2017: 136–140.
|
[5] |
HERSHEY J R, CHEN Zhuo, ROUX J L, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 2016: 31–35.
|
[6] |
YU Dong, KOLBÆK M, TAN Zhenghua, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 241–245.
|
[7] |
GOLUMBIC E Z, COGAN G B, SCHROEDER C E, et al. Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”[J]. The Journal of Neuroscience, 2013, 33(4): 1417–1426. doi: 10.1523/jneurosci.3675-12.2013
|
[8] |
SUSSMAN E S. Integration and segregation in auditory scene analysis[J]. The Journal of the Acoustical Society of America, 2005, 117(3): 1285–1298. doi: 10.1121/1.1854312
|
[9] |
TAO Ruijie, PAN Zexu, DAS R K, et al. Is someone speaking?: Exploring long-term temporal features for audio-visual active speaker detection[C]. The ACM Multimedia Conference, Chengdu, China, 2021: 3927–3935.
|
[10] |
LAKHAN A, MOHAMMED M A, KADRY S, et al. Federated Learning-Aware Multi-Objective Modeling and blockchain-enable system for IIoT applications[J]. Computers and Electrical Engineering, 2022, 100: 107839. doi: 10.1016/j.compeleceng.2022.107839
|
[11] |
MORGADO P, LI Yi, and VASCONCELOS N. Learning representations from audio-visual spatial alignment[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 397.
|
[12] |
GABBAY A, EPHRAT A, HALPERIN T, et al. Seeing through noise: Visually driven speaker separation and enhancement[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 3051–3055.
|
[13] |
AFOURAS T, CHUNG J S, and ZISSERMAN A. The conversation: Deep audio-visual speech enhancement[C]. The 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 3244–3248.
|
[14] |
EPHRAT A, MOSSERI I, LANG O, et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation[J]. ACM Transactions on Graphics, 2018, 37(4): 112. doi: 10.1145/3197517.3201357
|
[15] |
GAO Ruohan and GRAUMAN K. VisualVoice: Audio-visual speech separation with cross-modal consistency[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 15490–15500.
|
[16] |
XIONG Junwen, ZHANG Peng, XIE Lei, et al. Audio-visual speech separation based on joint feature representation with cross-modal attention[J]. arXiv preprint arVix: 2203.02655, 2022.
|
[17] |
杨子龙, 朱付平, 田金文, 等. 基于显著性与稠密光流的红外船只烟幕检测方法研究[J]. 红外与激光工程, 2021, 50(7): 20200496. doi: 10.3788/IRLA20200496
YANG Zilong, ZHU Fuping, TIAN Jinwen, et al. Ship smoke detection method based on saliency and dense optical flow[J]. Infrared and Laser Engineering, 2021, 50(7): 20200496. doi: 10.3788/IRLA20200496
|
[18] |
田佳莉. 基于多特征的考场异常行为识别研究[D]. [硕士论文], 沈阳工业大学, 2021.
TIAN Jiali. Research on abnormal behavior recognition of examination room based on multi-features[D]. [Master dissertation], Shenyang University of Technology, 2021.
|
[19] |
欧阳玉梅. 基于稠密光流算法的运动目标检测的Python实现[J]. 现代电子技术, 2021, 44(1): 78–82. doi: 10.16652/j.issn.1004-373x.2021.01.017
OUYANG Yumei. DOF algorithm based moving object detection programmed by Python[J]. Modern Electronics Technique, 2021, 44(1): 78–82. doi: 10.16652/j.issn.1004-373x.2021.01.017
|
[20] |
MA Ningning, ZHANG Xiangyu, ZHENG Haitao, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 116–131.
|
[21] |
CARREIRA J and ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4724–4733.
|
[22] |
RONNEBERGER O, FISCHER P, and BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]. The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241.
|
[23] |
PEREZ E, STRUB F, DE VRIES H, et al. FiLM: Visual reasoning with a general conditioning layer[C]. The Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 3942–3951.
|
[24] |
BAI Shaojie, KOLTER J Z, and KOLTUN V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling[J]. arXiv preprint arXiv: 1803.01271, 2018.
|
[25] |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
|
[26] |
THIEDE T, TREURNIET W, BITTO R, et al. PEAQ - The ITU standard for objective measurement of perceived audio quality[J]. Journal of the Audio Engineering Society, 2000, 48(1): 3–29.
|
[27] |
TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125–2136. doi: 10.1109/TASL.2011.2114881
|
[28] |
LE ROUX J, WISDOM S, ERDOGAN H, et al. SDR-half-baked or well done?[C]. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 2019: 626–630.
|
[29] |
LUO Y and MESGARANI N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256–1266. doi: 10.1109/TASLP.2019.2915167
|