A Multiscale Feature Extraction Method for Text-independent Speaker Recognition

Zhigao CHEN; Peng LI; Runqiu XIAO; Ta LI; Wenchao WANG

doi:10.11999/JEIT200917

Volume 43 Issue 11

Nov. 2021

Zhigao CHEN, Peng LI, Runqiu XIAO, Ta LI, Wenchao WANG. A Multiscale Feature Extraction Method for Text-independent Speaker Recognition[J]. Journal of Electronics & Information Technology, 2021, 43(11): 3266-3271. doi: 10.11999/JEIT200917

Zhigao CHEN^{1, 2},
Peng LI^{3},
Runqiu XIAO^{1, 2},
Ta LI^{1},
Wenchao WANG^{1}

1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China

Funds: The National Natural Science Foundation of China (11590772, 11590774, 11590770)

Received Date: 2020-10-26
Revised Date: 2021-03-13

Available Online: 2021-03-25

Publish Date: 2021-11-23

Abstract

In recent speaker recognition work, various Convolutional Neural Networks (CNNs) have delivered consistent performance gains while showing increasingly strong multiscale representation abilities. However, most existing methods gain this strength by adding layers and deepening the network. This paper introduces Res2Net, a multiscale backbone architecture, to speaker recognition and modifies its blocks for evaluation. The architecture operates at a more granular level than most layer-wise networks: it combines many equivalent receptive fields, which yields a combination of features at different scales. Experimental results show that the architecture consistently achieves a 20% improvement in Equal Error Rate (EER) over the baseline without additional computational burden. Its effectiveness and robustness are also verified in different environments and tasks, such as VoxCeleb and Speakers In The Wild (SITW). The modified full-connection block makes fuller use of the available information and clearly improves performance on more complex tasks. The code is available at https://github.com/czg0326/Res2Net-Speaker-Recognition.
Keywords:
- Speaker recognition
- Multiscale features
- Robustness
- Efficiency
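
To illustrate how a Res2Net-style block can combine several receptive fields at a finer granularity than a whole layer, the following toy sketch may help. It follows the hierarchical split-and-add scheme of the original Res2Net formulation (Gao et al.), not the modified blocks evaluated in this paper, whose details are not given in the abstract. The helper names (`conv3`, `res2net_split`, `support_width`), the all-ones 3-tap kernel, and the one-channel-per-group 1-D setting are assumptions made purely for illustration; a real block would also wrap this stage in 1x1 convolutions and concatenate the group outputs.

```python
# Toy 1-D sketch of the hierarchical connections inside a Res2Net-style block.
# Assumption: each "group" is a single 1-D channel and every 3-tap filter is
# an all-ones kernel, so that receptive-field growth is easy to observe.


def conv3(seq, kernel=(1.0, 1.0, 1.0)):
    """Same-length 1-D filtering with a 3-tap kernel and zero padding."""
    padded = [0.0] + list(seq) + [0.0]
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(seq))
    ]


def res2net_split(groups):
    """Apply the hierarchical rule to a list of equal-length feature groups.

    y_1 = x_1
    y_2 = K(x_2)
    y_i = K(x_i + y_{i-1})  for i > 2
    """
    outputs = []
    prev = None
    for i, x in enumerate(groups):
        if i == 0:
            y = list(x)                      # first group passes through
        elif i == 1:
            y = conv3(x)                     # second group: one 3-tap filter
        else:
            mixed = [a + b for a, b in zip(x, prev)]
            y = conv3(mixed)                 # reuse the previous group's output
        outputs.append(y)
        prev = y
    return outputs


def support_width(seq):
    """Width of the non-zero span, used here as a receptive-field proxy."""
    nonzero = [i for i, v in enumerate(seq) if v != 0.0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0


# Feed a unit impulse into each of four groups and measure how far it spreads.
impulse = [0.0] * 11
impulse[5] = 1.0
groups = [list(impulse) for _ in range(4)]
outputs = res2net_split(groups)
widths = [support_width(y) for y in outputs]
print(widths)  # [1, 3, 5, 7]
```

Each successive group sees a wider span of the input, so the concatenated output of one block mixes several scales without stacking additional layers.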
