| Citation: | CHEN Chen, YI Zhixin, LI Dongyuan, CHEN Deyun. Speaker Verification Based on Tide-Ripple Convolution Neural Network[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250713 |
| [1] |
WANG Wei, HAN Jiqing, ZHENG Tieran, et al. Speaker recognition based on Fisher discrimination dictionary learning[J]. Journal of Electronics & Information Technology, 2016, 38(2): 367–372. doi: 10.11999/JEIT150566.
|
| [2] |
NAGRANI A, CHUNG J S, and ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[C]. The 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017: 2616–2620. doi: 10.21437/Interspeech.2017-950.
|
| [3] |
NAGRANI A, CHUNG J S, XIE Weidi, et al. VoxCeleb: Large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027. doi: 10.1016/j.csl.2019.101027.
|
| [4] |
THIENPONDT J and DEMUYNCK K. ECAPA2: A hybrid neural network architecture and training strategy for robust speaker embeddings[C]. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, China, 2023: 1–8. doi: 10.1109/ASRU57964.2023.10389750.
|
| [5] |
SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-Vectors: Robust DNN embeddings for speaker recognition[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
|
| [6] |
THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. Integrating frequency translational invariance in TDNNs and frequency positional information in 2D ResNets to enhance speaker verification[C]. The 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 2302–2306.
|
| [7] |
DESPLANQUES B, THIENPONDT J, and DEMUYNCK K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification[C]. The 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 2020: 3830–3834. doi: 10.21437/Interspeech.2020-2650.
|
| [8] |
ZHAO Zhenduo, LI Zhuo, WANG Wenchao, et al. PCF: ECAPA-TDNN with progressive channel fusion for speaker verification[C]. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095051.
|
| [9] |
HEO H J, SHIN U H, LEE R, et al. NeXt-TDNN: Modernizing multi-scale temporal convolution backbone for speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 11186–11190. doi: 10.1109/ICASSP48485.2024.10447037.
|
| [10] |
WANG Rui, WEI Zhihua, DUAN Haoran, et al. EfficientTDNN: Efficient architecture search for speaker recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2267–2279. doi: 10.1109/TASLP.2022.3182856.
|
| [11] |
ABROL V, THAKUR A, GUPTA A, et al. Sampling rate adaptive speaker verification from raw waveforms[C]. The 27th International Conference on Pattern Recognition, Kolkata, India, 2024: 367–382. doi: 10.1007/978-3-031-78104-9_25.
|
| [12] |
MUN S H, JUNG J W, HAN M H, et al. Frequency and multi-scale selective kernel attention for speaker verification[C]. 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023: 548–554. doi: 10.1109/SLT54892.2023.10023305.
|
| [13] |
CHUNG J S, NAGRANI A, and ZISSERMAN A. VoxCeleb2: Deep speaker recognition[C]. The 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 1086–1090.
|
| [14] |
DENG Jiankang, GUO Jia, LIU Tongliang, et al. Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 741–757. doi: 10.1007/978-3-030-58621-8_43.
|
| [15] |
HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
|
| [16] |
ZHANG Yang, LÜ Zhiqiang, WU Haibin, et al. MFA-conformer: Multi-scale feature aggregation conformer for automatic speaker verification[C]. The 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 2022: 306–310.
|
| [17] |
ZHANG Ruiteng, WEI Jianguo, LU Xugang, et al. TMS: A temporal multi-scale backbone design for speaker embedding[EB/OL]. https://doi.org/10.48550/arXiv.2203.09098, 2022.
|
| [18] |
FINDER S E, AMOYAL R, TREISTER E, et al. Wavelet convolutions for large receptive fields[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 363–380. doi: 10.1007/978-3-031-72949-2_21.
|
| [19] |
LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
|
| [20] |
XIE Saining, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5987–5995. doi: 10.1109/CVPR.2017.634.
|
| [21] |
DENG Jiankang, GUO Jia, XUE Niannan, et al. ArcFace: Additive angular margin loss for deep face recognition[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 4685–4694. doi: 10.1109/CVPR.2019.00482.
|
| [22] |
THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. The IDLAB VoxSRC-20 submission: Large margin fine-tuning and quality-aware score calibration in DNN based speaker verification[C]. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 5814–5818. doi: 10.1109/ICASSP39728.2021.9414600.
|
| [23] |
KOUDOUNAS A, GIOBERGIA F, PASTOR E, et al. A contrastive learning approach to mitigate bias in speech models[C]. The 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 827–831. doi: 10.21437/Interspeech.2024-1219.
|
| [24] |
SUN Yifan, CHENG Changmao, ZHANG Yuhan, et al. Circle loss: A unified perspective of pair similarity optimization[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6397–6406. doi: 10.1109/CVPR42600.2020.00643.
|
| [25] |
SOHN K. Improved deep metric learning with multi-class N-pair loss objective[C]. The 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 1857–1865.
|
| [26] |
TRUONG D T, TAO Ruijie, YIP J Q, et al. Emphasized non-target speaker knowledge in knowledge distillation for automatic speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 10336–10340. doi: 10.1109/ICASSP48485.2024.10447160.
|
| [27] |
SNYDER D, CHEN Guoguo, and POVEY D. MUSAN: A music, speech, and noise corpus[EB/OL]. https://doi.org/10.48550/arXiv.1510.08484, 2015.
|
| [28] |
KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 5220–5224. doi: 10.1109/ICASSP.2017.7953152.
|
| [29] |
ZHENG Qiuyu, CHEN Zengzhao, WANG Zhifeng, et al. MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder[J]. Expert Systems with Applications, 2024, 244: 123004. doi: 10.1016/j.eswa.2023.123004.
|
| [30] |
LIU Tianchi, LEE K A, WANG Qiongqiong, et al. Golden Gemini is all you need: Finding the sweet spots for speaker verification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 2324–2337. doi: 10.1109/TASLP.2024.3385277.
|
| [31] |
CHEN Yafeng, ZHENG Siqi, WANG Hui, et al. ERes2NetV2: Boosting short-duration speaker verification performance with computational efficiency[C]. The 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3245–3249. doi: 10.21437/Interspeech.2024-742.
|
| [32] |
STAFYLAKIS T, SILNOVA A, ROHDIN J, et al. Challenging margin-based speaker embedding extractors by using the variational information bottleneck[C]. The 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3220–3224. doi: 10.21437/Interspeech.2024-2058.
|
| [33] |
CAI Danwei and LI Ming. Leveraging ASR pretrained conformers for speaker verification through transfer learning and knowledge distillation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3532–3545. doi: 10.1109/TASLP.2024.3419426.
|
| [34] |
KOBAK D and BERENS P. The art of using t-SNE for single-cell transcriptomics[J]. Nature Communications, 2019, 10(1): 5416. doi: 10.1038/s41467-019-13056-x.
|