残差网络在婴幼儿哭声识别中的应用

谢湘; 张立强; 王晶

doi:10.11999/JEIT180276

残差网络在婴幼儿哭声识别中的应用

doi: 10.11999/JEIT180276 cstr: 32379.14.JEIT180276

北京理工大学信息与电子学院北京 100081

基金项目: 国家自然科学基金(61473041, 11590772, 61571044)

详细信息

作者简介:
谢湘：男，1976年生，副教授，研究方向为语音识别

张立强：男，1995年生，硕士生，研究方向为语音人格感知

王晶：女，1980年生，副教授，研究方向为音频信号处理

通讯作者:
谢湘　xiexiang@bit.edu.cn

中图分类号: TP391.42
计量
- 文章访问数: 3193
- HTML全文浏览量: 1255
- PDF下载量: 108
- 被引次数: 0
出版历程
- 收稿日期: 2018-03-23
- 修回日期: 2018-09-04
- 网络出版日期: 2018-09-11
- 刊出日期: 2019-01-01

Application of Residual Network to Infant Crying Recognition

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China

Funds: The National Natural Science Foundation of China (61473041, 11590772, 61571044)

摘要

摘要:
该文使用语谱图结合残差网络的深度学习模型进行婴幼儿哭声的识别，使用婴幼儿哭声与非哭声样本比例均衡的语料库，经过五折交叉验证，与支持向量机(SVM)，卷积神经网络(CNN)，基于Gammatone滤波器的听觉谱残差网络(GT-Resnet)3种模型相比，基于语谱图的残差网络取得了最优结果，F1-score达到0.9965，满足实时性要求，证明了语谱图在婴幼儿哭声识别任务中能直观地反映声学特征，基于语谱图的残差网络是解决婴幼儿哭声识别任务的优秀方法。
- 婴儿哭声识别 /
- 深度学习 /
- 残差网络 /
- 语谱图
Abstract:
The deep learning model based on the residual network and the spectrogram is used to recognize infant crying. The corpus has balanced proportion of infant crying and non-crying samples. Finally, through the 5-fold cross validation, compared with three models of Support Vector Machine (SVM), Convolutional Neural Network (CNN) and the cochleagram residual network based on Gammatone filters (GT-Resnet), the spectrogram based residual network gets the best F1-score of 0.9965 and satisfies requirements of real time. It is proved that the spectrogram can react acoustics features intuitively and comprehensively in the recognition of infant crying. The residual network based on spectrogram is a good solution to infant crying recognition problem.
- Infant crying recognition /
- Deep learning /
- Residual network /
- Spectrogram

HTML全文

图 1 婴幼儿哭声，成人说话声和铃声语谱图对比

下载: 全尺寸图片幻灯片

图 2 残差模块

下载: 全尺寸图片幻灯片

图 3 CNN-5模型结构

下载: 全尺寸图片幻灯片

图 4 3种模型测试集F1-score对比

下载: 全尺寸图片幻灯片

图 5 3种层数残差网络测试集F1-score对比

下载: 全尺寸图片幻灯片

图 6 残差网络模型

下载: 全尺寸图片幻灯片

表 1 五折交叉验证数据集平均规模(条)

	婴幼儿哭声	非哭声	总计
训练集规模	1243	1148	2391
测试集规模	310	286	596

下载: 导出CSV

表 2 SVM实验特征提取

提取特征类型	统计处理方法	维数
MFCC及其1阶2阶差分	均值、方差	72
短时能量	均值、方差	2
基音频率	均值、方差、最大值、最小值、极差	5

下载: 导出CSV

表 3 SVM不同核函数性能比较

核函数类型	F1-score	参数
线性核函数	0.8717	c=0.68
多项式核函数	0.9316	c=0.30, g=0.35, r=–0.20, d=3.00
高斯核函数	0.9458	c=0.98, g=1.71
Sigmod核函数	0.8874	c=5.00, g=0.04, r=1.80

下载: 导出CSV

表 4 不同层数CNN性能对比

CNN模型	输入特征	F1-score
CNN-4-MEL	40×128Mel语谱图	0.9184
CNN-4-227	227×227语谱图	0.9233
CNN-4	128×128语谱图	0.9229
CNN-5-227	227×227语谱图	0.9482
CNN-5	128×128语谱图	0.9489
CNN-6	128×128语谱图	0.9365
CNN-7	128×128语谱图	0.9398

下载: 导出CSV

表 5 模型性能对比

模型	网络结构	输入特征	生成模型大小(MB)	平均测试时间(s)	F1-score
SVM	单层网络	统计特征	0.7	0.0910+0.0001	0.9458
CNN-5	4conv+1fc	语谱图	10	0.1251+0.0093	0.9489
Resnet15	3resblock+1fc	语谱图	48	0.1251+0.0281	0.9836
Resnet19	4resblock+1fc	语谱图	87	0.1251+0.0315	0.9965
Resnet27	6resblock+1fc	语谱图	171	0.1251+0.0355	0.9965
GT-Resnet15	3resblock+1fc	听觉谱	48	0.1933+0.0218	0.9803
GT-Resnet19	4resblock+1fc	听觉谱	87	0.1933+0.0237	0.9782
GT-Resnet27	6resblock+1fc	听觉谱	171	0.1933+0.0285	0.9719
注：平均测试时间=特征提取时间+模型预测时间

下载: 导出CSV

参考文献(19)

于洪志, 刘思思. 三个月婴儿啼哭声的声学分析[C]. 全国人机语音通讯学术会议, 西安, 2011: 1–4.

YU Hongzhi and LIU Sisi. Crying sound learning analysis of three months baby[C]. National Conference on Man-Machine Speech Communication, Xi’an, China, 2011: 1–4.

王之禹, 雷云珊. 婴儿啼哭声的声学特征[C]. 中国声学学会2006年全国声学学术会议, 厦门, 2006: 389–390.

WANG Zhiyu and LEI Yunshan. Acoustic characteristic of infant cries[C]. National Conference on Acoustics. Acoustical Society of China, Xiamen, China, 2006: 389–390.

ABDULAZIZ Y and AHMAD S M S. Infant cry recognition system: A comparison of system performance based on mel frequency and linear prediction cepstral coefficients[C]. International Conference on Information Retrieval & Knowledge Management, Shah Alam, Malaysia, 2010: 260–263. doi: 10.1109/INFRKM.2010.5466907.

COHEN R and LAVNER Y. Infant cry analysis and detection[C]. Electrical & Electronics Engineers in Israel, Eilat, Israel, 2012: 1–5.

LAVNER Y, COHEN R, RUINSKIY D, et al. Baby cry detection in domestic environment using deep learning[C]. 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE), Eilat, Israel, 2016: 1–5. doi: 10.1109/EEEI.2012.6376996.

TORRES R, BATTAGLINO D, and LEPAULOUX L. Baby cry sound detection: A comparison of hand crafted features and deep learning approach[C]. International Conference on Engineering Applications of Neural Networks. Springer, Cham, 2017: 168–179. doi: 10.1007/978-3-319-65172-9_15.

CHANG Chuanyu and LI Jiajing. Application of deep learning for recognizing infant cries[C]. IEEE International Conference on Consumer Electronics, Nantou, China, 2016: 1–2. doi: 10.1109/ICCE-TW.2016.7520947.

SHARAN R V and MOIR T J. Cochleagram image feature for improved robustness in sound recognition[C]. IEEE International Conference on Digital Signal Processing, Singapore, 2015: 441–444.

PATTERSON R D, NIMMO-SMITH I, HOLDSWORTH J, et al. An efficient auditory filterbank based on the gammatone function[C]. Proceedings of the 1987 Speech-Group Meeting of the Institute of Acoustics on Auditory Modelling, RSRE, Malvern, 1987: 2–18.

刘文举, 聂帅, 梁山, 等. 基于深度学习语音分离技术的研究现状与进展[J]. 自动化学报, 2016, 42(6): 819–833. doi: 10.16383/j.aas.2016.c150734

LIU Wenju, NIE Shuai, LIANG Shan, et al. Deep learning based speech separation technology and its developments[J]. Acta Automatica Sinica, 2016, 42(6): 819–833. doi: 10.16383/j.aas.2016.c150734

MITTAL V K. Discriminating features of infant cry acoustic signal for automated detection of cause of crying[C]. International Symposium on Chinese Spoken Language Processing, Tianjin, China, 2017: 1–5. doi: 10.1109/ISCSLP.2016.7918391.

RPSITA Y D and JUNAEDI H. Infant’s cry sound classification using Mel-Frequency Cepstrum Coefficients feature extraction and Backpropagation Neural Network[C]. International Conference on Science and Technology-Computer, Yogyakarta, Indonesia, 2017: 160–166. doi: 10.1109/ICSTC.2016.7877367.

雷云珊. 婴儿啼哭声分析与模式分类[D]. [硕士论文], 山东科技大学, 2006.

LEI Yunshan. Analysis and pattern classification of infants’ cry[D]. [Master dissertation], Shandong University of Science and Technology, 2006.

KRIZHEVAKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[C]. International Conference on Neural Information Processing Systems, Nevada, USA, 2012: 1097–1105.

HE Kaiming, ZHANG Xianyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Computer Vision and Pattern Recognition, Nevada, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.

GVERES. donateacry-corpus[OL]. https://github.com/gveres/donateacry-corpus, 2017.3.

彭天强, 栗芳. 基于深度卷积神经网络和二进制哈希学习的图像检索方法[J]. 电子与信息学报, 2016, 38(8): 2068–2075. doi: 10.11999/JEIT151346

PENG Tianqiang and LI Fang. Image retrieval based on deep convolutional neural networks and binary hashing learning[J]. Journal of Electronics &Information Technology, 2016, 38(8): 2068–2075. doi: 10.11999/JEIT151346

CHANG Chihchung and LIN Chihjen. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 1–27. doi: 10.1145/1961189.1961199

徐利强, 谢湘, 黄石磊, 等. 连续语音中的笑声检测研究与实现[C]. 全国声学学术会议, 武汉, 2016: 581–584.

XU Liqiang, XIE Xiang, HUANG Shilei, et al. Research and implementation of laughter detection in continuous speech[C]. National Conference on Acoustics. Acoustical Society of China, Wuhan, China, 2016: 581–584.

施引文献

资源附件(0)

访问统计