Speaker Verification Based on Tide-Ripple Convolutional Neural Network
Abstract: In recent years, most state-of-the-art speaker verification models obtain a fixed receptive field at the cost of large parameter counts and heavy computation. Although speech signals carry rich, multi-layered information, describing such complex information with a highly self-selected, dynamic receptive field remains relatively unexplored, and there is no intuitive explanation of what constitutes best practice for an effective receptive field. A tidal bore appears as a steep wall of water at the front of the incoming tide that advances at high speed with a roaring sound; inspired by its non-linear coupling behavior, Tide-Ripple Convolution (TR-Conv) is proposed, which uses the Tide-Ripple Receptive Field (T-RRF) to obtain a more effective receptive field. First, a power-of-two interpolation operation constructs the primary/auxiliary receptive fields inside the window; then a scan-pooling mechanism extracts key information from outside the window, while an operator mechanism finely perceives difference information inside the window; finally, the three receptive fields are fused into a variable receptive field that is simultaneously multi-scale, dynamic, and effective. To comprehensively validate TR-Conv, a Tide-Ripple Convolutional Neural Network (TR-CNN) is built. In addition, to cope with mislabeled data, a total loss is proposed as the weighted fusion of a Non-Target loss with Dynamic Normalization (NTDN) and an Additive Angular Margin (Sub-Center AAM) loss variant with two sub-centers, further improving model performance. Experimental results show that, compared with ECAPA-TDNN (C=512), TR-CNN (C=512, n=1) reduces the Equal Error Rate (EER) and minimum Detection Cost Function (MinDCF) on the Vox1-O, Vox1-E, and Vox1-H test sets by relative margins of 4.95%/31.55%, 4.03%/17.14%, and 6.03%/17.42%, respectively, while using 32.7% fewer parameters and 23.5% fewer multiply-accumulate operations. Furthermore, TR-CNN (C=1024, n=1) achieves EER/MinDCF of 0.85%/0.0762, 1.10%/0.1048, and 2.05%/0.1739. The code is open-sourced at https://github.com/splab-HRBUST/TR-CNN.
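The abstract above only names the three ingredients of the Tide-Ripple Receptive Field (T-RRF): primary/auxiliary in-window fields built by power-of-two interpolation, a scan-pooling branch for out-of-window context, and an operator branch for in-window differences. As a rough illustration of how such a three-branch fusion might be wired up, here is a minimal PyTorch sketch; the dilated depthwise convolutions, the max-pool context branch, the local-mean difference, and the softmax fusion weights are all stand-in assumptions, not the authors' actual TR-Conv (see the open-source repository https://github.com/splab-HRBUST/TR-CNN for the real implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TRConvSketch(nn.Module):
    """Illustrative three-branch fusion in the spirit of TR-Conv.

    NOTE: the exact TR-Conv design (power-of-two interpolation, scan-pooling,
    the difference operator) is not specified in the abstract. Every concrete
    choice below is an assumption, made only to show how in-window, out-of-window
    and difference information could be fused into one variable receptive field.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # (1) "primary"/"auxiliary" in-window branches: depthwise convolutions
        #     at power-of-two dilations (stand-in for power-of-two interpolation)
        self.primary = nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size - 1, dilation=2,
                                 groups=channels)
        self.auxiliary = nn.Conv1d(channels, channels, kernel_size,
                                   padding=(kernel_size - 1) // 2, dilation=1,
                                   groups=channels)
        # (2) stand-in for "scan-pooling": a wide max-pool summarizing context
        #     outside the local window, followed by a pointwise projection
        self.context_proj = nn.Conv1d(channels, channels, 1)
        # (3) learnable weights fusing the three receptive-field components
        self.fuse = nn.Parameter(torch.zeros(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        in_window = self.primary(x) + self.auxiliary(x)
        context = self.context_proj(F.max_pool1d(x, kernel_size=9,
                                                 stride=1, padding=4))
        # fine-grained in-window differences: deviation from the local mean
        diff = x - F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
        w = torch.softmax(self.fuse, dim=0)
        return w[0] * in_window + w[1] * context + w[2] * diff


if __name__ == "__main__":
    feats = torch.randn(2, 512, 200)          # e.g. 512-channel frame features
    print(TRConvSketch(512)(feats).shape)     # torch.Size([2, 512, 200])
```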
Keywords: Speaker verification; Tide-ripple convolution; Lightweight network; Power-of-two interpolation; Non-target loss with dynamic normalization
Abstract:

Objective  State-of-the-art speaker verification models often achieve high performance with fixed receptive fields, at the expense of significant parameter counts and computational loads. Given the rich, multi-layered nature of speech, employing dynamic receptive fields to capture such complex information remains relatively unexplored, with little intuition on what constitutes an effective design.

Methods  Inspired by the non-linear coupling behavior of a tidal surge, this study proposes Tide-Ripple Convolution (TR-Conv), which uses a Tide-Ripple Receptive Field (T-RRF) to obtain a more effective receptive field. TR-Conv first constructs primary/auxiliary receptive fields within a window using power-of-two interpolation. It then employs a scan-pooling mechanism to extract key information from outside the window and an operator mechanism to perceive fine-grained differences within it. Fusing these three components yields a variable receptive field that is multi-scale, dynamic, and effective. A Tide-Ripple Convolutional Neural Network (TR-CNN) is established to validate this approach. To address label noise in the training data, a total loss is proposed as the weighted fusion of a Non-Target loss with Dynamic Normalization (NTDN) and a Sub-Center Additive Angular Margin (AAM) loss variant with two sub-centers, further enhancing model performance.

Results and Discussions  TR-CNN is systematically validated on the VoxCeleb1-O/E/H benchmarks, and the results confirm that it achieves a superior balance of accuracy, computation, and parameter efficiency (Table 1). Compared with the strong ECAPA-TDNN baseline, the TR-CNN (C=512, n=1) model yields relative EER reductions of 4.95%, 4.03%, and 6.03%, and relative MinDCF reductions of 31.55%, 17.14%, and 17.42% across the three test sets, while using 32.7% fewer parameters and 23.5% less computation (Table 2). The optimal TR-CNN (C=1024, n=1) model sets a new performance benchmark, achieving EERs of 0.85%, 1.10%, and 2.05% with MinDCFs of 0.0762, 0.1048, and 0.1739. Robustness is further enhanced by the proposed total loss, which provides stable EER and MinDCF improvements during fine-tuning (Table 3). Additional analyses, including ablation studies (Tables 5 and 6), component explorations (Fig. 3, Table 4), and t-SNE visualizations (Fig. 4), collectively validate the effectiveness and robustness of each module within the TR-CNN architecture.

Conclusions  This research proposes a simple and effective Tide-Ripple Convolution (TR-Conv) layer built upon the Tide-Ripple Receptive Field (T-RRF). Experiments demonstrate that this approach captures a more expressive and effective receptive field while significantly reducing both parameter count and computational cost. It outperforms conventional one-dimensional convolution in speech signal modeling and shows excellent lightweight characteristics and scalability. Furthermore, a total loss comprising the NTDN loss and a Sub-Center AAM loss variant is introduced, making the speaker embeddings produced by the network more discriminative and robust, especially in the presence of mislabeled data. Looking ahead, TR-Conv holds promise as a general-purpose module for integration into deeper and more complex neural network architectures.
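The Methods paragraph above describes the total loss only at a high level: a weighted fusion of a Sub-Center AAM variant (two sub-centers per speaker) with the proposed NTDN loss. The sketch below gives one plausible PyTorch reading. The SubCenterAAM head follows the standard sub-center ArcFace / AAM-softmax formulation [14, 21]; ntdn_like_term and the fusion weight alpha are illustrative placeholders, since the NTDN formulation is not stated here (the released code defines the actual loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubCenterAAM(nn.Module):
    """Sub-center additive angular margin softmax with K sub-centers per class,
    in the spirit of Sub-center ArcFace [14] and AAM-softmax [21]."""

    def __init__(self, emb_dim: int, num_classes: int, k: int = 2,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, k, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        emb = F.normalize(emb, dim=-1)                       # (B, D)
        w = F.normalize(self.weight, dim=-1)                 # (N, K, D)
        cos = torch.einsum("bd,nkd->bnk", emb, w).amax(-1)   # (B, N), max over sub-centers
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # additive angular margin on the target class only
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels), cos


def ntdn_like_term(cos, labels, tau: float = 0.1):
    """Placeholder for the NTDN loss (assumed behaviour, not the paper's formula):
    push down the strongest non-target similarity after normalizing it against
    the per-sample mean of the non-target scores."""
    mask = F.one_hot(labels, cos.size(1)).bool()
    non_target = cos.masked_fill(mask, float("-inf"))
    mean_nt = cos.masked_fill(mask, 0.0).sum(1) / (cos.size(1) - 1)
    return F.softplus((non_target.amax(1) - mean_nt) / tau).mean()


def total_loss(emb, labels, aam_head, alpha: float = 0.1):
    """Weighted fusion of the two terms; alpha is an illustrative weight."""
    aam, cos = aam_head(emb, labels)
    return aam + alpha * ntdn_like_term(cos, labels)


if __name__ == "__main__":
    head = SubCenterAAM(emb_dim=192, num_classes=5994, k=2)  # VoxCeleb2 has 5994 speakers
    emb = torch.randn(8, 192)
    labels = torch.randint(0, 5994, (8,))
    print(total_loss(emb, labels, head).item())
```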
Table 1  Performance comparison with mainstream networks

Network | MACs | Params | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
MEConformer [29] | 33.6964G | 167.52M | 3.86/0.1070 | 3.72/0.1036 | 5.95/0.1049
x-vector+KD [26] | 0.2650G | 4.61M | 1.32/0.1600 | 1.39/0.1550 | 2.44/0.2260
Gemini SD-ResNet38 [30] | 2.4850G | 6.72M | 1.09/0.0990 | 1.98/0.1850 | 1.13/0.1170
MFA-Conformer [16] | 1.3345G | 17.05M | 0.97/0.0910 | 1.14/0.1210 | 2.17/0.1990
ERes2NetV2 w/o (BL+BD) [31] | 10.2000G | 22.40M | 0.94/0.0930 | 1.05/0.1190 | 2.01/0.2050
PLDA+VIB_LN [32] | - | - | 0.88/0.0940 | 1.02/0.1150 | 1.82/0.1740
NEMO Small [33] | 1.1200G | 15.88M | 0.88/0.1367 | 1.08/0.1342 | 2.20/0.2245
TR-CNN (C=512, n=1) | 1.1998G | 4.17M | 0.96/0.0872 | 1.19/0.1175 | 2.18/0.1801
TR-CNN (C=1024, n=1) | 3.0145G | 10.42M | 0.85/0.0762 | 1.10/0.1048 | 2.05/0.1739

Table 2  Performance comparison under different channel configurations

Network | ACC | MACs | Params | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
ECAPA (C=512) + AAM | 73.54 | 1.5690G | 6.20M | 1.01/0.1274 | 1.24/0.1418 | 2.32/0.2181
ECAPA (C=1024) + AAM | 75.86 | 2.7123G | 14.73M | 0.87/0.1066 | 1.12/0.1318 | 2.12/0.2101
TR-CNN (C=512, n=1) | 74.04 | 1.1998G | 4.17M | 0.96/0.0872 | 1.19/0.1175 | 2.18/0.1801
TR-CNN (C=1024, n=1) | 75.08 | 3.0145G | 10.43M | 0.85/0.0762 | 1.10/0.1048 | 2.05/0.1739

Table 3  Performance comparison when the NTDN loss is introduced at different training stages

Training stage | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
ECAPA (C=512) + AAM | 1.01/0.1274 | 1.24/0.1418 | 2.32/0.2181
ECAPA (C=512) (3/3) | 0.99/0.0929 | 1.23/0.1187 | 2.25/0.1873
ECAPA (C=512) (1/3) | 1.12/0.0793 | 1.31/0.1075 | 2.45/0.1701

Table 4  Performance comparison of different fusion methods

Network | ACC | MACs | Params | Vox1-O EER%/DCF0.01 | Vox1-E EER%/DCF0.01 | Vox1-H EER%/DCF0.01
$\boldsymbol{F}^{(C)}$ (C=512, n=1) | 73.28 | 0.8755G | 3.87M | 1.02/0.1058 | 1.28/0.1265 | 2.23/0.2091
$\boldsymbol{F}^{(C)}$ (C=512, n=2) | 72.36 | 0.9067G | 3.76M | 1.03/0.1074 | 1.28/0.1325 | 2.25/0.2167
$\boldsymbol{F}^{(C)}$ (C=512, n=3) | 71.92 | 0.9595G | 3.71M | 1.05/0.1147 | 1.31/0.1365 | 2.24/0.2267
$\boldsymbol{F}^{(C)}$ (C=1024, n=1) | 74.83 | 2.4080G | 9.84M | 0.91/0.0842 | 1.12/0.1086 | 2.07/0.1882
$\boldsymbol{F}^{(C)}$ (C=1024, n=2) | 73.80 | 2.4368G | 9.49M | 0.93/0.0896 | 1.13/0.1091 | 2.09/0.1902
$\boldsymbol{F}^{(C)}$ (C=1024, n=3) | 72.29 | 2.5958G | 9.32M | 0.93/0.0817 | 1.15/0.1159 | 2.12/0.1935
$\boldsymbol{F}^{(A)}$ (C=512, n=1) | 71.17 | 1.3325G | 4.54M | 1.02/0.1017 | 1.29/0.1241 | 2.21/0.2049
$\boldsymbol{F}^{(A)}$ (C=512, n=2) | 70.65 | 1.3528G | 4.32M | 1.03/0.1082 | 1.31/0.1278 | 2.25/0.2098
$\boldsymbol{F}^{(A)}$ (C=512, n=3) | 68.52 | 1.4585G | 4.21M | 1.02/0.1035 | 1.29/0.1252 | 2.23/0.2053

Table 5  Ablation study on the key primary receptive field of T-RRF

Ablation setting | MACs | Params | EER%/DCF0.01
F (C=512, n=1) | 1.1998G | 4.17M | 0.96/0.0872
-(1) | - | 4.17M | 0.98/0.0956
-(2) | - | 4.17M | 0.98/0.0922
-(3) | - | 4.17M | 0.97/0.0897

Table 6  Ablation study of different modules in $\boldsymbol{F}$ (C=512, n=1)

Ablation setting | MACs | Params | EER%/DCF0.01
F (C=512, n=1) | 1.1998G | 4.17M | 0.96/0.0872
-PUB | 1.1514G | 4.18M | 0.98/0.0932
-GFDL | 1.0839G | 5.86M | 0.95/0.0861
F (C=512, n=1) | 0.8755G | 3.87M | 1.02/0.1058
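The relative improvements quoted in the abstract follow directly from Table 2. Taking the Vox1-O column as a worked example:

$$\text{relative reduction}=\frac{\text{baseline}-\text{proposed}}{\text{baseline}}:\qquad \frac{1.01-0.96}{1.01}\approx 4.95\%\ (\text{EER}),\qquad \frac{0.1274-0.0872}{0.1274}\approx 31.55\%\ (\text{MinDCF}).$$

The same calculation over parameters and MACs gives $(6.20-4.17)/6.20\approx 32.7\%$ and $(1.5690-1.1998)/1.5690\approx 23.5\%$. Here DCF0.01 denotes the minimum detection cost function evaluated at a target prior of 0.01.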
References

[1] WANG Wei, HAN Jiqing, ZHENG Tieran, et al. Speaker recognition based on Fisher discrimination dictionary learning[J]. Journal of Electronics & Information Technology, 2016, 38(2): 367–372. doi: 10.11999/JEIT150566.
[2] NAGRANI A, CHUNG J S, and ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[C]. 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017: 2616–2620. doi: 10.21437/Interspeech.2017-950.
[3] NAGRANI A, CHUNG J S, XIE Weidi, et al. VoxCeleb: Large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027. doi: 10.1016/j.csl.2019.101027.
[4] THIENPONDT J and DEMUYNCK K. ECAPA2: A hybrid neural network architecture and training strategy for robust speaker embeddings[C]. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, China, 2023: 1–8. doi: 10.1109/ASRU57964.2023.10389750.
[5] SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: Robust DNN embeddings for speaker recognition[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
[6] THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. Integrating frequency translational invariance in TDNNs and frequency positional information in 2D ResNets to enhance speaker verification[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 2302–2306.
[7] DESPLANQUES B, THIENPONDT J, and DEMUYNCK K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification[C]. 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 2020: 3830–3834. doi: 10.21437/Interspeech.2020-2650.
[8] ZHAO Zhenduo, LI Zhuo, WANG Wenchao, et al. PCF: ECAPA-TDNN with progressive channel fusion for speaker verification[C]. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095051.
[9] HEO H J, SHIN U H, LEE R, et al. NeXt-TDNN: Modernizing multi-scale temporal convolution backbone for speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 11186–11190. doi: 10.1109/ICASSP48485.2024.10447037.
[10] WANG Rui, WEI Zhihua, DUAN Haoran, et al. EfficientTDNN: Efficient architecture search for speaker recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2267–2279. doi: 10.1109/TASLP.2022.3182856.
[11] ABROL V, THAKUR A, GUPTA A, et al. Sampling rate adaptive speaker verification from raw waveforms[C]. 27th International Conference on Pattern Recognition, Kolkata, India, 2024: 367–382. doi: 10.1007/978-3-031-78104-9_25.
[12] MUN S H, JUNG J W, HAN M H, et al. Frequency and multi-scale selective kernel attention for speaker verification[C]. 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023: 548–554. doi: 10.1109/SLT54892.2023.10023305.
[13] CHUNG J S, NAGRANI A, and ZISSERMAN A. VoxCeleb2: Deep speaker recognition[C]. 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 1086–1090.
[14] DENG Jiankang, GUO Jia, LIU Tongliang, et al. Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 741–757. doi: 10.1007/978-3-030-58621-8_43.
[15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
[16] ZHANG Yang, LV Zhiqiang, WU Haibin, et al. MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification[C]. 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 2022: 306–310.
[17] ZHANG Ruiteng, WEI Jianguo, LU Xugang, et al. TMS: A temporal multi-scale backbone design for speaker embedding[EB/OL]. https://doi.org/10.48550/arXiv.2203.09098, 2022.
[18] FINDER S E, AMOYAL R, TREISTER E, et al. Wavelet convolutions for large receptive fields[C]. 18th European Conference on Computer Vision, Milan, Italy, 2024: 363–380. doi: 10.1007/978-3-031-72949-2_21.
[19] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
[20] XIE Saining, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5987–5995. doi: 10.1109/CVPR.2017.634.
[21] DENG Jiankang, GUO Jia, XUE Niannan, et al. ArcFace: Additive angular margin loss for deep face recognition[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 4685–4694. doi: 10.1109/CVPR.2019.00482.
[22] THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. The IDLAB VoxSRC-20 submission: Large margin fine-tuning and quality-aware score calibration in DNN based speaker verification[C]. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 5814–5818. doi: 10.1109/ICASSP39728.2021.9414600.
[23] KOUDOUNAS A, GIOBERGIA F, PASTOR E, et al. A contrastive learning approach to mitigate bias in speech models[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 827–831. doi: 10.21437/Interspeech.2024-1219.
[24] SUN Yifan, CHENG Changmao, ZHANG Yuhan, et al. Circle loss: A unified perspective of pair similarity optimization[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6397–6406. doi: 10.1109/CVPR42600.2020.00643.
[25] SOHN K. Improved deep metric learning with multi-class N-pair loss objective[C]. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 1857–1865.
[26] TRUONG D T, TAO Ruijie, YIP J Q, et al. Emphasized non-target speaker knowledge in knowledge distillation for automatic speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 10336–10340. doi: 10.1109/ICASSP48485.2024.10447160.
[27] SNYDER D, CHEN Guoguo, and POVEY D. MUSAN: A music, speech, and noise corpus[EB/OL]. https://doi.org/10.48550/arXiv.1510.08484, 2015.
[28] KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 5220–5224. doi: 10.1109/ICASSP.2017.7953152.
[29] ZHENG Qiuyu, CHEN Zengzhao, WANG Zhifeng, et al. MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder[J]. Expert Systems with Applications, 2024, 244: 123004. doi: 10.1016/j.eswa.2023.123004.
[30] LIU Tianchi, LEE K A, WANG Qiongqiong, et al. Golden Gemini is all you need: Finding the sweet spots for speaker verification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 2324–2337. doi: 10.1109/TASLP.2024.3385277.
[31] CHEN Yafeng, ZHENG Siqi, WANG Hui, et al. ERes2NetV2: Boosting short-duration speaker verification performance with computational efficiency[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3245–3249. doi: 10.21437/Interspeech.2024-742.
[32] STAFYLAKIS T, SILNOVA A, ROHDIN J, et al. Challenging margin-based speaker embedding extractors by using the variational information bottleneck[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3220–3224. doi: 10.21437/Interspeech.2024-2058.
[33] CAI Danwei and LI Ming. Leveraging ASR pretrained conformers for speaker verification through transfer learning and knowledge distillation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3532–3545. doi: 10.1109/TASLP.2024.3419426.
[34] KOBAK D and BERENS P. The art of using t-SNE for single-cell transcriptomics[J]. Nature Communications, 2019, 10(1): 5416. doi: 10.1038/s41467-019-13056-x.