Citation: Zhigao CHEN, Peng LI, Runqiu XIAO, Ta LI, Wenchao WANG. A Multiscale Feature Extraction Method for Text-independent Speaker Recognition[J]. Journal of Electronics & Information Technology, 2021, 43(11): 3266-3271. doi: 10.11999/JEIT200917
[1] GUO Wu, DAI Lirong, and WANG Renhua. Speaker verification based on factor analysis and SVM[J]. Journal of Electronics & Information Technology, 2009, 31(2): 302–305. doi: 10.3724/SP.J.1146.2007.01289 (in Chinese)
[2] VARIANI E, LEI Xin, MCDERMOTT E, et al. Deep neural networks for small footprint text-dependent speaker verification[C]. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 2014: 4052–4056.
[3] SNYDER D, GARCIA-ROMERO D, POVEY D, et al. Deep neural network embeddings for text-independent speaker verification[C]. The Interspeech 2017, Stockholm, Sweden, 2017: 999–1003.
[4] WANG Wenchao and LI Ta. Research on deep speaker embeddings extraction based on multiple temporal scales[J]. Journal of Network New Media, 2019, 8(5): 21–26. (in Chinese)
[5] NAGRANI A, CHUNG J S, and ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[EB/OL]. https://arxiv.org/abs/1706.08612, 2017.
[6] HUANG Zili, WANG Shuai, and YU Kai. Angular softmax for short-duration text-independent speaker verification[C]. The Interspeech 2018, Hyderabad, India, 2018: 3623–3627.
[7] YADAV S and RAI A. Learning discriminative features for speaker identification and verification[C]. The Interspeech 2018, Hyderabad, India, 2018: 2237–2241.
[8] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
[9] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(2): 652–662.
[10] LIU Changyuan, WANG Qi, and BI Xiaojun. Research on rain removal method for single image based on multi-channel and multi-scale CNN[J]. Journal of Electronics & Information Technology, 2020, 42(9): 2285–2292. doi: 10.11999/JEIT190755 (in Chinese)
[11] CAI Weicheng, CHEN Jinkun, and LI Ming. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system[EB/OL]. https://arxiv.org/abs/1804.05160, 2018.
[12] HEO H S, JUNG J W, YANG I H, et al. End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification[EB/OL]. https://arxiv.org/abs/1902.02455, 2019.
[13] CHUNG J S, NAGRANI A, and ZISSERMAN A. VoxCeleb2: Deep speaker recognition[EB/OL]. https://arxiv.org/abs/1806.05622, 2018.
[14] ZAGORUYKO S and KOMODAKIS N. Wide residual networks[EB/OL]. https://arxiv.org/abs/1605.07146, 2016.
[15] XIE Saining, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1492–1500.
[16] MCLAREN M, FERRER L, CASTAN D, et al. The speakers in the wild (SITW) speaker recognition database[C]. The Interspeech 2016, San Francisco, USA, 2016: 818–822.
[17] ZEINALI H, WANG Shuai, SILNOVA A, et al. BUT system description to VoxCeleb speaker recognition challenge 2019[EB/OL]. https://arxiv.org/abs/1910.12592, 2019.
[18] OKABE K, KOSHINAKA T, and SHINODA K. Attentive statistics pooling for deep speaker embedding[EB/OL]. https://arxiv.org/abs/1803.10963, 2018.