Speaker Verification Based on Tide-Ripple Convolutional Neural Network
Abstract: In recent years, most state-of-the-art speaker verification models obtain a fixed receptive field at the cost of large parameter counts and heavy computation. Although speech signals carry rich, multi-layered information, describing such complex information with a highly self-selected, dynamic receptive field remains relatively unexplored, and there is no intuitive explanation of what constitutes best practice for an effective receptive field. A tidal bore appears as a steep wall of water at the front of the incoming tide that advances at high speed with a roaring sound; inspired by its non-linear coupling behavior, Tide-Ripple Convolution (TR-Conv) is proposed, which uses the Tide-Ripple Receptive Field (T-RRF) to obtain a more effective receptive field. First, a power-of-two interpolation operation constructs the primary/auxiliary receptive fields inside the window; then a scan-pooling mechanism extracts key information from outside the window, while an operator mechanism finely perceives difference information inside the window; finally, the three receptive fields are fused into a variable receptive field that is simultaneously multi-scale, dynamic, and effective. To comprehensively validate TR-Conv, a Tide-Ripple Convolutional Neural Network (TR-CNN) is built. In addition, to cope with mislabeled data, a total loss is proposed as the weighted fusion of a Non-Target loss with Dynamic Normalization (NTDN) and an Additive Angular Margin (Sub-Center AAM) loss variant with two sub-centers, further improving model performance. Experimental results show that, compared with ECAPA-TDNN (C=512), TR-CNN (C=512, n=1) reduces the Equal Error Rate (EER) and minimum Detection Cost Function (MinDCF) on the Vox1-O, Vox1-E, and Vox1-H test sets by relative margins of 4.95%/31.55%, 4.03%/17.14%, and 6.03%/17.42%, respectively, while using 32.7% fewer parameters and 23.5% fewer multiply-accumulate operations. Furthermore, TR-CNN (C=1024, n=1) achieves EER/MinDCF of 0.85%/0.0762, 1.10%/0.1048, and 2.05%/0.1739. The code is open-sourced at https://github.com/splab-HRBUST/TR-CNN.
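The abstract above only names the three ingredients of the Tide-Ripple Receptive Field (T-RRF): primary/auxiliary in-window fields built by power-of-two interpolation, a scan-pooling branch for out-of-window context, and an operator branch for in-window differences. As a rough illustration of how such a three-branch fusion might be wired up, here is a minimal PyTorch sketch; the dilated depthwise convolutions, the max-pool context branch, the local-mean difference, and the softmax fusion weights are all stand-in assumptions, not the authors' actual TR-Conv (see the open-source repository https://github.com/splab-HRBUST/TR-CNN for the real implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TRConvSketch(nn.Module):
    """Illustrative three-branch fusion in the spirit of TR-Conv.

    NOTE: the exact TR-Conv design (power-of-two interpolation, scan-pooling,
    the difference operator) is not specified in the abstract. Every concrete
    choice below is an assumption, made only to show how in-window, out-of-window
    and difference information could be fused into one variable receptive field.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # (1) "primary"/"auxiliary" in-window branches: depthwise convolutions
        #     at power-of-two dilations (stand-in for power-of-two interpolation)
        self.primary = nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size - 1, dilation=2,
                                 groups=channels)
        self.auxiliary = nn.Conv1d(channels, channels, kernel_size,
                                   padding=(kernel_size - 1) // 2, dilation=1,
                                   groups=channels)
        # (2) stand-in for "scan-pooling": a wide max-pool summarizing context
        #     outside the local window, followed by a pointwise projection
        self.context_proj = nn.Conv1d(channels, channels, 1)
        # (3) learnable weights fusing the three receptive-field components
        self.fuse = nn.Parameter(torch.zeros(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        in_window = self.primary(x) + self.auxiliary(x)
        context = self.context_proj(F.max_pool1d(x, kernel_size=9,
                                                 stride=1, padding=4))
        # fine-grained in-window differences: deviation from the local mean
        diff = x - F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
        w = torch.softmax(self.fuse, dim=0)
        return w[0] * in_window + w[1] * context + w[2] * diff


if __name__ == "__main__":
    feats = torch.randn(2, 512, 200)          # e.g. 512-channel frame features
    print(TRConvSketch(512)(feats).shape)     # torch.Size([2, 512, 200])
```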
Keywords: Speaker verification; Tide-ripple convolution; Lightweight network; Power-of-two interpolation; Non-target loss with dynamic normalization
Abstract:

Objective  State-of-the-art speaker verification models often achieve high performance with fixed receptive fields, at the expense of significant parameter counts and computational loads. Given the rich, multi-layered nature of speech, employing dynamic receptive fields to capture such complex information remains relatively unexplored, with little intuition on what constitutes an effective design.

Methods  Inspired by the non-linear coupling behavior of a tidal surge, this study proposes Tide-Ripple Convolution (TR-Conv), which uses a Tide-Ripple Receptive Field (T-RRF) to obtain a more effective receptive field. TR-Conv first constructs primary/auxiliary receptive fields within a window using power-of-two interpolation. It then employs a scan-pooling mechanism to extract key information from outside the window and an operator mechanism to perceive fine-grained differences within it. Fusing these three components yields a variable receptive field that is multi-scale, dynamic, and effective. A Tide-Ripple Convolutional Neural Network (TR-CNN) is established to validate this approach. To address label noise in the training data, a total loss is proposed as the weighted fusion of a Non-Target loss with Dynamic Normalization (NTDN) and a Sub-Center Additive Angular Margin (AAM) loss variant with two sub-centers, further enhancing model performance.

Results and Discussions  TR-CNN is systematically validated on the VoxCeleb1-O/E/H benchmarks, and the results confirm that it achieves a superior balance of accuracy, computation, and parameter efficiency (Table 1). Compared with the strong ECAPA-TDNN baseline, the TR-CNN (C=512, n=1) model yields relative EER reductions of 4.95%, 4.03%, and 6.03%, and relative MinDCF reductions of 31.55%, 17.14%, and 17.42% across the three test sets, while using 32.7% fewer parameters and 23.5% less computation (Table 2). The optimal TR-CNN (C=1024, n=1) model sets a new performance benchmark, achieving EERs of 0.85%, 1.10%, and 2.05% with MinDCFs of 0.0762, 0.1048, and 0.1739. Robustness is further enhanced by the proposed total loss, which provides stable EER and MinDCF improvements during fine-tuning (Table 3). Additional analyses, including ablation studies (Tables 5 and 6), component explorations (Fig. 3, Table 4), and t-SNE visualizations (Fig. 4), collectively validate the effectiveness and robustness of each module within the TR-CNN architecture.

Conclusions  This research proposes a simple and effective Tide-Ripple Convolution (TR-Conv) layer built upon the Tide-Ripple Receptive Field (T-RRF). Experiments demonstrate that this approach captures a more expressive and effective receptive field while significantly reducing both parameter count and computational cost. It outperforms conventional one-dimensional convolution in speech signal modeling and shows excellent lightweight characteristics and scalability. Furthermore, a total loss comprising the NTDN loss and a Sub-Center AAM loss variant is introduced, making the speaker embeddings produced by the network more discriminative and robust, especially in the presence of mislabeled data. Looking ahead, TR-Conv holds promise as a general-purpose module for integration into deeper and more complex neural network architectures.
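The Methods paragraph above describes the total loss only at a high level: a weighted fusion of a Sub-Center AAM variant (two sub-centers per speaker) with the proposed NTDN loss. The sketch below gives one plausible PyTorch reading. The SubCenterAAM head follows the standard sub-center ArcFace / AAM-softmax formulation [14, 21]; ntdn_like_term and the fusion weight alpha are illustrative placeholders, since the NTDN formulation is not stated here (the released code defines the actual loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubCenterAAM(nn.Module):
    """Sub-center additive angular margin softmax with K sub-centers per class,
    in the spirit of Sub-center ArcFace [14] and AAM-softmax [21]."""

    def __init__(self, emb_dim: int, num_classes: int, k: int = 2,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, k, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        emb = F.normalize(emb, dim=-1)                       # (B, D)
        w = F.normalize(self.weight, dim=-1)                 # (N, K, D)
        cos = torch.einsum("bd,nkd->bnk", emb, w).amax(-1)   # (B, N), max over sub-centers
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # additive angular margin on the target class only
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels), cos


def ntdn_like_term(cos, labels, tau: float = 0.1):
    """Placeholder for the NTDN loss (assumed behaviour, not the paper's formula):
    push down the strongest non-target similarity after normalizing it against
    the per-sample mean of the non-target scores."""
    mask = F.one_hot(labels, cos.size(1)).bool()
    non_target = cos.masked_fill(mask, float("-inf"))
    mean_nt = cos.masked_fill(mask, 0.0).sum(1) / (cos.size(1) - 1)
    return F.softplus((non_target.amax(1) - mean_nt) / tau).mean()


def total_loss(emb, labels, aam_head, alpha: float = 0.1):
    """Weighted fusion of the two terms; alpha is an illustrative weight."""
    aam, cos = aam_head(emb, labels)
    return aam + alpha * ntdn_like_term(cos, labels)


if __name__ == "__main__":
    head = SubCenterAAM(emb_dim=192, num_classes=5994, k=2)  # VoxCeleb2 has 5994 speakers
    emb = torch.randn(8, 192)
    labels = torch.randint(0, 5994, (8,))
    print(total_loss(emb, labels, head).item())
```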
Table 1  Performance comparison with mainstream networks

Network | MACs | Params | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
MEConformer [29] | 33.6964G | 167.52M | 3.86/0.1070 | 3.72/0.1036 | 5.95/0.1049
x-vector+KD [26] | 0.2650G | 4.61M | 1.32/0.1600 | 1.39/0.1550 | 2.44/0.2260
Gemini SD-ResNet38 [30] | 2.4850G | 6.72M | 1.09/0.0990 | 1.98/0.1850 | 1.13/0.1170
MFA-Conformer [16] | 1.3345G | 17.05M | 0.97/0.0910 | 1.14/0.1210 | 2.17/0.1990
ERes2NetV2 w/o (BL+BD) [31] | 10.2000G | 22.40M | 0.94/0.0930 | 1.05/0.1190 | 2.01/0.2050
PLDA+VIB_LN [32] | - | - | 0.88/0.0940 | 1.02/0.1150 | 1.82/0.1740
NEMO Small [33] | 1.1200G | 15.88M | 0.88/0.1367 | 1.08/0.1342 | 2.20/0.2245
TR-CNN (C=512, n=1) | 1.1998G | 4.17M | 0.96/0.0872 | 1.19/0.1175 | 2.18/0.1801
TR-CNN (C=1024, n=1) | 3.0145G | 10.42M | 0.85/0.0762 | 1.10/0.1048 | 2.05/0.1739

Table 2  Performance comparison under different channel configurations

Network | ACC | MACs | Params | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
ECAPA (C=512) + AAM | 73.54 | 1.5690G | 6.20M | 1.01/0.1274 | 1.24/0.1418 | 2.32/0.2181
ECAPA (C=1024) + AAM | 75.86 | 2.7123G | 14.73M | 0.87/0.1066 | 1.12/0.1318 | 2.12/0.2101
TR-CNN (C=512, n=1) | 74.04 | 1.1998G | 4.17M | 0.96/0.0872 | 1.19/0.1175 | 2.18/0.1801
TR-CNN (C=1024, n=1) | 75.08 | 3.0145G | 10.43M | 0.85/0.0762 | 1.10/0.1048 | 2.05/0.1739

Table 3  Performance comparison when the NTDN loss is introduced at different training stages

Training stage | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
ECAPA (C=512) + AAM | 1.01/0.1274 | 1.24/0.1418 | 2.32/0.2181
ECAPA (C=512) (3/3) | 0.99/0.0929 | 1.23/0.1187 | 2.25/0.1873
ECAPA (C=512) (1/3) | 1.12/0.0793 | 1.31/0.1075 | 2.45/0.1701

Table 4  Performance comparison of different fusion methods

Network | ACC | MACs | Params | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
$\boldsymbol{F}^{(C)}$ (C=512, n=1) | 73.28 | 0.8755G | 3.87M | 1.02/0.1058 | 1.28/0.1265 | 2.23/0.2091
$\boldsymbol{F}^{(C)}$ (C=512, n=2) | 72.36 | 0.9067G | 3.76M | 1.03/0.1074 | 1.28/0.1325 | 2.25/0.2167
$\boldsymbol{F}^{(C)}$ (C=512, n=3) | 71.92 | 0.9595G | 3.71M | 1.05/0.1147 | 1.31/0.1365 | 2.24/0.2267
$\boldsymbol{F}^{(C)}$ (C=1024, n=1) | 74.83 | 2.4080G | 9.84M | 0.91/0.0842 | 1.12/0.1086 | 2.07/0.1882
$\boldsymbol{F}^{(C)}$ (C=1024, n=2) | 73.80 | 2.4368G | 9.49M | 0.93/0.0896 | 1.13/0.1091 | 2.09/0.1902
$\boldsymbol{F}^{(C)}$ (C=1024, n=3) | 72.29 | 2.5958G | 9.32M | 0.93/0.0817 | 1.15/0.1159 | 2.12/0.1935
$\boldsymbol{F}^{(A)}$ (C=512, n=1) | 71.17 | 1.3325G | 4.54M | 1.02/0.1017 | 1.29/0.1241 | 2.21/0.2049
$\boldsymbol{F}^{(A)}$ (C=512, n=2) | 70.65 | 1.3528G | 4.32M | 1.03/0.1082 | 1.31/0.1278 | 2.25/0.2098
$\boldsymbol{F}^{(A)}$ (C=512, n=3) | 68.52 | 1.4585G | 4.21M | 1.02/0.1035 | 1.29/0.1252 | 2.23/0.2053

Table 5  Ablation study on the key primary receptive field of T-RRF

Ablation setting | MACs | Params | EER%/DCF0.01
F (C=512, n=1) | 1.1998G | 4.17M | 0.96/0.0872
-(1) | - | 4.17M | 0.98/0.0956
-(2) | - | 4.17M | 0.98/0.0922
-(3) | - | 4.17M | 0.97/0.0897

Table 6  Ablation study of different modules in $\boldsymbol{F}$ (C=512, n=1)

Ablation setting | MACs | Params | EER%/DCF0.01
F (C=512, n=1) | 1.1998G | 4.17M | 0.96/0.0872
-PUB | 1.1514G | 4.18M | 0.98/0.0932
-GFDL | 1.0839G | 5.86M | 0.95/0.0861
F (C=512, n=1) | 0.8755G | 3.87M | 1.02/0.1058
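The relative improvements quoted in the abstract follow directly from Table 2. Taking the Vox1-O column as a worked example:

$$\text{relative reduction}=\frac{\text{baseline}-\text{proposed}}{\text{baseline}}:\qquad \frac{1.01-0.96}{1.01}\approx 4.95\%\ (\text{EER}),\qquad \frac{0.1274-0.0872}{0.1274}\approx 31.55\%\ (\text{MinDCF}).$$

The same calculation over parameters and MACs gives $(6.20-4.17)/6.20\approx 32.7\%$ and $(1.5690-1.1998)/1.5690\approx 23.5\%$. Here DCF0.01 denotes the minimum detection cost function evaluated at a target prior of 0.01.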
References

[1] WANG Wei, HAN Jiqing, ZHENG Tieran, et al. Speaker recognition based on Fisher discrimination dictionary learning[J]. Journal of Electronics & Information Technology, 2016, 38(2): 367–372. doi: 10.11999/JEIT150566.
[2] NAGRANI A, CHUNG J S, and ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[C]. 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017: 2616–2620. doi: 10.21437/Interspeech.2017-950.
[3] NAGRANI A, CHUNG J S, XIE Weidi, et al. VoxCeleb: Large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027. doi: 10.1016/j.csl.2019.101027.
[4] THIENPONDT J and DEMUYNCK K. ECAPA2: A hybrid neural network architecture and training strategy for robust speaker embeddings[C]. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, China, 2023: 1–8. doi: 10.1109/ASRU57964.2023.10389750.
[5] SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: Robust DNN embeddings for speaker recognition[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
[6] THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. Integrating frequency translational invariance in TDNNs and frequency positional information in 2D ResNets to enhance speaker verification[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 2302–2306.
[7] DESPLANQUES B, THIENPONDT J, and DEMUYNCK K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification[C]. 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 2020: 3830–3834. doi: 10.21437/Interspeech.2020-2650.
[8] ZHAO Zhenduo, LI Zhuo, WANG Wenchao, et al. PCF: ECAPA-TDNN with progressive channel fusion for speaker verification[C]. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095051.
[9] HEO H J, SHIN U H, LEE R, et al. NeXt-TDNN: Modernizing multi-scale temporal convolution backbone for speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 11186–11190. doi: 10.1109/ICASSP48485.2024.10447037.
[10] WANG Rui, WEI Zhihua, DUAN Haoran, et al. EfficientTDNN: Efficient architecture search for speaker recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2267–2279. doi: 10.1109/TASLP.2022.3182856.
[11] ABROL V, THAKUR A, GUPTA A, et al. Sampling rate adaptive speaker verification from raw waveforms[C]. 27th International Conference on Pattern Recognition, Kolkata, India, 2024: 367–382. doi: 10.1007/978-3-031-78104-9_25.
[12] MUN S H, JUNG J W, HAN M H, et al. Frequency and multi-scale selective kernel attention for speaker verification[C]. 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023: 548–554. doi: 10.1109/SLT54892.2023.10023305.
[13] CHUNG J S, NAGRANI A, and ZISSERMAN A. VoxCeleb2: Deep speaker recognition[C]. 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 1086–1090.
[14] DENG Jiankang, GUO Jia, LIU Tongliang, et al. Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 741–757. doi: 10.1007/978-3-030-58621-8_43.
[15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
[16] ZHANG Yang, LV Zhiqiang, WU Haibin, et al. MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification[C]. 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 2022: 306–310.
[17] ZHANG Ruiteng, WEI Jianguo, LU Xugang, et al. TMS: A temporal multi-scale backbone design for speaker embedding[EB/OL]. https://doi.org/10.48550/arXiv.2203.09098, 2022.
[18] FINDER S E, AMOYAL R, TREISTER E, et al. Wavelet convolutions for large receptive fields[C]. 18th European Conference on Computer Vision, Milan, Italy, 2024: 363–380. doi: 10.1007/978-3-031-72949-2_21.
[19] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
[20] XIE Saining, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5987–5995. doi: 10.1109/CVPR.2017.634.
[21] DENG Jiankang, GUO Jia, XUE Niannan, et al. ArcFace: Additive angular margin loss for deep face recognition[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 4685–4694. doi: 10.1109/CVPR.2019.00482.
[22] THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. The IDLAB VoxSRC-20 submission: Large margin fine-tuning and quality-aware score calibration in DNN based speaker verification[C]. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 5814–5818. doi: 10.1109/ICASSP39728.2021.9414600.
[23] KOUDOUNAS A, GIOBERGIA F, PASTOR E, et al. A contrastive learning approach to mitigate bias in speech models[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 827–831. doi: 10.21437/Interspeech.2024-1219.
[24] SUN Yifan, CHENG Changmao, ZHANG Yuhan, et al. Circle loss: A unified perspective of pair similarity optimization[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6397–6406. doi: 10.1109/CVPR42600.2020.00643.
[25] SOHN K. Improved deep metric learning with multi-class N-pair loss objective[C]. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 1857–1865.
[26] TRUONG D T, TAO Ruijie, YIP J Q, et al. Emphasized non-target speaker knowledge in knowledge distillation for automatic speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 10336–10340. doi: 10.1109/ICASSP48485.2024.10447160.
[27] SNYDER D, CHEN Guoguo, and POVEY D. MUSAN: A music, speech, and noise corpus[EB/OL]. https://doi.org/10.48550/arXiv.1510.08484, 2015.
[28] KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 5220–5224. doi: 10.1109/ICASSP.2017.7953152.
[29] ZHENG Qiuyu, CHEN Zengzhao, WANG Zhifeng, et al. MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder[J]. Expert Systems with Applications, 2024, 244: 123004. doi: 10.1016/j.eswa.2023.123004.
[30] LIU Tianchi, LEE K A, WANG Qiongqiong, et al. Golden Gemini is all you need: Finding the sweet spots for speaker verification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 2324–2337. doi: 10.1109/TASLP.2024.3385277.
[31] CHEN Yafeng, ZHENG Siqi, WANG Hui, et al. ERes2NetV2: Boosting short-duration speaker verification performance with computational efficiency[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3245–3249. doi: 10.21437/Interspeech.2024-742.
[32] STAFYLAKIS T, SILNOVA A, ROHDIN J, et al. Challenging margin-based speaker embedding extractors by using the variational information bottleneck[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3220–3224. doi: 10.21437/Interspeech.2024-2058.
[33] CAI Danwei and LI Ming. Leveraging ASR pretrained conformers for speaker verification through transfer learning and knowledge distillation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3532–3545. doi: 10.1109/TASLP.2024.3419426.
[34] KOBAK D and BERENS P. The art of using t-SNE for single-cell transcriptomics[J]. Nature Communications, 2019, 10(1): 5416. doi: 10.1038/s41467-019-13056-x.