
Speaker Verification Based on Tide-Ripple Convolution Neural Network

CHEN Chen, YI Zhixin, LI Dongyuan, CHEN Deyun

Citation: CHEN Chen, YI Zhixin, LI Dongyuan, CHEN Deyun. Speaker Verification Based on Tide-Ripple Convolution Neural Network[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250713


doi: 10.11999/JEIT250713 cstr: 32379.14.JEIT250713
Funds: National Natural Science Foundation of China (62101163), Heilongjiang Provincial Natural Science Foundation (YQ2024F018), Key Research and Development Project of Heilongjiang Province (JD2023SJ20)
    About the authors:

    CHEN Chen: Female, Associate Professor, Ph.D. supervisor. Research interests: computer audition, speech signal processing, audio information analysis, and pattern recognition

    YI Zhixin: Male, M.S. candidate. Research interest: speaker recognition

    LI Dongyuan: Male, M.S. candidate. Research interest: speaker recognition

    CHEN Deyun: Male, Professor, Ph.D. supervisor. Research interests: machine learning, pattern recognition, and data mining

    Corresponding author:

    CHEN Chen, chenc@hrbust.edu.cn

  • CLC number: TN912.3

  • Abstract: In recent years, most state-of-the-art speaker verification models acquire fixed receptive fields at the cost of larger parameter counts and heavier computation. Given that speech signals carry rich, multi-level information, characterizing such complex information through highly self-selective dynamic receptive fields remains relatively unexplored, and there is still no intuitive explanation of what constitutes best practice for an effective receptive field. The tidal bore phenomenon manifests as a steep wall of water at the tide front that advances at high speed with a roar; inspired by its nonlinear coupling behavior, this paper proposes Tide-Ripple Convolution (TR-Conv), which uses a Tide-Ripple Receptive Field (T-RRF) to obtain a more effective receptive field. First, a power-of-two interpolation operation constructs the master/slave receptive fields inside a window; then, a scan-pooling mechanism extracts key information from outside the window while an operator mechanism finely perceives difference information inside the window; finally, the three receptive fields are fused into a variable receptive field that is simultaneously multi-scale, dynamic, and effective. To validate TR-Conv comprehensively, a Tide-Ripple Convolutional Neural Network (TR-CNN) is built on it. In addition, to counter mislabeled samples in the dataset, a total loss is proposed that weight-fuses a dynamically normalized Non-Target (NTDN) loss with a Sub-Center Additive Angular Margin (Sub-Center AAM) loss variant using two sub-centers, further improving performance. Experimental results show that, compared with ECAPA-TDNN (C=512), TR-CNN (C=512, n=1) relatively reduces the Equal Error Rate (EER) and minimum Detection Cost Function (MinDCF) on the Vox1-O, Vox1-E, and Vox1-H test sets by 4.95%/31.55%, 4.03%/17.14%, and 6.03%/17.42% respectively, while using 32.7% fewer parameters and 23.5% fewer multiply-accumulate operations. Going further, TR-CNN (C=1024, n=1) reaches EER/MinDCF of 0.85%/0.0762, 1.10%/0.1048, and 2.05%/0.1739 on the three sets. The code for this work is open-sourced at https://github.com/splab-HRBUST/TR-CNN.
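    For readers who want a concrete picture before opening the paper, the abstract's three-part receptive field (a wide "master" field built with power-of-two spacing, a fine-grained "slave" field, and scan-pooled context from outside the window) can be sketched in a few lines of PyTorch. This is only an illustrative guess at the general pattern, not the paper's TR-Conv: the module name, branch designs, and concatenation fusion are all assumptions, and the authoritative implementation lives at https://github.com/splab-HRBUST/TR-CNN.

```python
# Hypothetical sketch of fusing three receptive fields over 1-D speech features.
# Not the paper's TR-Conv; see https://github.com/splab-HRBUST/TR-CNN for the real code.
import torch
import torch.nn as nn

class MultiScaleRFSketch(nn.Module):
    def __init__(self, channels: int, n: int = 1):
        super().__init__()
        d = 2 ** n  # power-of-two spacing for the wide "master" view (assumption)
        self.master = nn.Conv1d(channels, channels, 3, dilation=d, padding=d)
        self.slave = nn.Conv1d(channels, channels, 3, padding=1)  # fine local view
        self.pool = nn.Sequential(nn.AvgPool1d(3, stride=1, padding=1),
                                  nn.Conv1d(channels, channels, 1))  # out-of-window summary
        self.fuse = nn.Conv1d(3 * channels, channels, 1)  # merge the three views

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); every branch preserves the frame length.
        views = [self.master(x), self.slave(x), self.pool(x)]
        return self.fuse(torch.cat(views, dim=1))

x = torch.randn(2, 512, 200)                  # 512 channels, 200 frames
print(MultiScaleRFSketch(512, n=1)(x).shape)  # torch.Size([2, 512, 200])
```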
  • Figure 1  Comparison of network structures

    Figure 2  Formation process of the master/slave receptive fields

    Figure 3  Structural diagrams of the two TR-Conv variants

    Figure 4  t-SNE visualization under different channel configurations

    Table 1  Performance comparison with mainstream networks

    | Network                   | MACs     | Params  | Vox1_O EER%/DCF0.01 | Vox1_E EER%/DCF0.01 | Vox1_H EER%/DCF0.01 |
    |---------------------------|----------|---------|---------------------|---------------------|---------------------|
    | MEConformer[29]           | 33.6964G | 167.52M | 3.86/0.1070         | 3.72/0.1036         | 5.95/0.1049         |
    | x-vector+KD[26]           | 0.2650G  | 4.61M   | 1.32/0.1600         | 1.39/0.1550         | 2.44/0.2260         |
    | Gemini SD-ResNet38[30]    | 2.4850G  | 6.72M   | 1.09/0.0990         | 1.98/0.1850         | 1.13/0.1170         |
    | MFA-Conformer[16]         | 1.3345G  | 17.05M  | 0.97/0.0910         | 1.14/0.1210         | 2.17/0.1990         |
    | ERes2NetV2 w/o(BL+BD)[31] | 10.2000G | 22.40M  | 0.94/0.0930         | 1.05/0.1190         | 2.01/0.2050         |
    | PLDA+VIB_LN[32]           | -        | -       | 0.88/0.0940         | 1.02/0.1150         | 1.82/0.1740         |
    | NEMO Small[33]            | 1.1200G  | 15.88M  | 0.88/0.1367         | 1.08/0.1342         | 2.20/0.2245         |
    | TR-CNN(C=512, n=1)        | 1.1998G  | 4.17M   | 0.96/0.0872         | 1.19/0.1175         | 2.18/0.1801         |
    | TR-CNN(C=1024, n=1)       | 3.0145G  | 10.42M  | 0.85/0.0762         | 1.10/0.1048         | 2.05/0.1739         |
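    Every results table reports EER%/DCF0.01, i.e., the Equal Error Rate and the minimum detection cost function evaluated at a target prior of 0.01. The EER is the operating point where the false-acceptance and false-rejection rates coincide; the sketch below computes it from raw trial scores in the standard way, and is generic code rather than anything taken from the paper.

```python
# Standard EER computation from speaker-verification trial scores (generic, not
# from the paper): find the threshold where FAR and FRR cross.
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: similarity per trial; labels: 1 = target, 0 = non-target."""
    order = np.argsort(scores)                  # sweep thresholds from low to high
    labels = labels[order].astype(float)
    frr = np.cumsum(labels) / labels.sum()      # targets rejected so far
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # non-targets still accepted
    i = np.argmin(np.abs(far - frr))            # crossing point of the two curves
    return float((far[i] + frr[i]) / 2)

scores = np.array([0.9, 0.8, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(f"EER = {compute_eer(scores, labels):.2%}")  # EER = 33.33%
```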

    Table 2  Performance comparison under different channel configurations

    | Network             | ACC(%) | MACs    | Params | Vox1_O EER%/DCF0.01 | Vox1_E EER%/DCF0.01 | Vox1_H EER%/DCF0.01 |
    |---------------------|--------|---------|--------|---------------------|---------------------|---------------------|
    | ECAPA(C=512)+AAM    | 73.54  | 1.5690G | 6.20M  | 1.01/0.1274         | 1.24/0.1418         | 2.32/0.2181         |
    | ECAPA(C=1024)+AAM   | 75.86  | 2.7123G | 14.73M | 0.87/0.1066         | 1.12/0.1318         | 2.12/0.2101         |
    | TR-CNN(C=512, n=1)  | 74.04  | 1.1998G | 4.17M  | 0.96/0.0872         | 1.19/0.1175         | 2.18/0.1801         |
    | TR-CNN(C=1024, n=1) | 75.08  | 3.0145G | 10.43M | 0.85/0.0762         | 1.10/0.1048         | 2.05/0.1739         |

    Table 3  Performance comparison of introducing the NTDN loss at different training stages

    | Training stage     | Vox1_O EER%/DCF0.01 | Vox1_E EER%/DCF0.01 | Vox1_H EER%/DCF0.01 |
    |--------------------|---------------------|---------------------|---------------------|
    | ECAPA(C=512)+AAM   | 1.01/0.1274         | 1.24/0.1418         | 2.32/0.2181         |
    | ECAPA(C=512)(3/3)  | 0.99/0.0929         | 1.23/0.1187         | 2.25/0.1873         |
    | ECAPA(C=512)(1/3)  | 1.12/0.0793         | 1.31/0.1075         | 2.45/0.1701         |
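    Table 3 varies the training stage at which the NTDN loss is introduced (the (3/3) and (1/3) notation is defined in the paper). The abstract describes the total loss as a weighted fusion of the NTDN loss with a two-sub-center AAM variant. The NTDN loss itself is not reproduced here; the sketch below implements only a standard Sub-Center AAM term in the spirit of sub-center ArcFace [14] with an additive angular margin [21], and fuses it with a placeholder term, so the embedding size, scale, margin, and fusion weight are all assumptions.

```python
# Hedged sketch: only the Sub-Center AAM part follows a published recipe
# (sub-center ArcFace [14] + additive angular margin [21]); the NTDN term and
# the fusion weight are placeholders, since their definitions are in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterAAM(nn.Module):
    def __init__(self, dim: int, n_classes: int, k: int = 2,
                 s: float = 30.0, m: float = 0.2):  # scale/margin are assumptions
        super().__init__()
        self.k, self.s, self.m = k, s, m
        self.weight = nn.Parameter(torch.randn(n_classes * k, dim))

    def forward(self, emb: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # Cosine to every sub-center, keeping the best of the k sub-centers per class.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        cos = cos.view(emb.size(0), -1, self.k).max(dim=2).values
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(label, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)  # margin on target only
        return F.cross_entropy(self.s * logits, label)

aam = SubCenterAAM(dim=192, n_classes=5994, k=2)  # 5994 = VoxCeleb2 dev speakers
emb, label = torch.randn(8, 192), torch.randint(0, 5994, (8,))
loss_ntdn = torch.tensor(0.0)  # NTDN placeholder; the real loss is defined in the paper
lam = 0.5                      # fusion weight, an assumption
total = lam * loss_ntdn + (1 - lam) * aam(emb, label)
print(total.item())
```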

    Table 4  Performance comparison of different fusion methods

    | Network                                   | ACC(%) | MACs    | Params | Vox1_O EER%/DCF0.01 | Vox1_E EER%/DCF0.01 | Vox1_H EER%/DCF0.01 |
    |-------------------------------------------|--------|---------|--------|---------------------|---------------------|---------------------|
    | $ {{\boldsymbol{F}}^{(C)}} $(C=512, n=1)  | 73.28  | 0.8755G | 3.87M  | 1.02/0.1058         | 1.28/0.1265         | 2.23/0.2091         |
    | $ {{\boldsymbol{F}}^{(C)}} $(C=512, n=2)  | 72.36  | 0.9067G | 3.76M  | 1.03/0.1074         | 1.28/0.1325         | 2.25/0.2167         |
    | $ {{\boldsymbol{F}}^{(C)}} $(C=512, n=3)  | 71.92  | 0.9595G | 3.71M  | 1.05/0.1147         | 1.31/0.1365         | 2.24/0.2267         |
    | $ {{\boldsymbol{F}}^{(C)}} $(C=1024, n=1) | 74.83  | 2.4080G | 9.84M  | 0.91/0.0842         | 1.12/0.1086         | 2.07/0.1882         |
    | $ {{\boldsymbol{F}}^{(C)}} $(C=1024, n=2) | 73.80  | 2.4368G | 9.49M  | 0.93/0.0896         | 1.13/0.1091         | 2.09/0.1902         |
    | $ {{\boldsymbol{F}}^{(C)}} $(C=1024, n=3) | 72.29  | 2.5958G | 9.32M  | 0.93/0.0817         | 1.15/0.1159         | 2.12/0.1935         |
    | $ {{\boldsymbol{F}}^{(A)}} $(C=512, n=1)  | 71.17  | 1.3325G | 4.54M  | 1.02/0.1017         | 1.29/0.1241         | 2.21/0.2049         |
    | $ {{\boldsymbol{F}}^{(A)}} $(C=512, n=2)  | 70.65  | 1.3528G | 4.32M  | 1.03/0.1082         | 1.31/0.1278         | 2.25/0.2098         |
    | $ {{\boldsymbol{F}}^{(A)}} $(C=512, n=3)  | 68.52  | 1.4585G | 4.21M  | 1.02/0.1035         | 1.29/0.1252         | 2.23/0.2053         |
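    Table 4 compares the fusion variants $ {{\boldsymbol{F}}^{(C)}} $ and $ {{\boldsymbol{F}}^{(A)}} $, whose precise definitions are given in the paper body. For orientation only, two common ways of fusing multi-branch receptive-field outputs are concatenation followed by a 1x1 projection, and attention-weighted summation; the sketch below shows both generic patterns and makes no claim about which one corresponds to which superscript.

```python
# Generic fusion patterns for multi-branch 1-D feature maps; purely illustrative
# and not the paper's F^(C)/F^(A) definitions.
import torch
import torch.nn as nn

def fuse_concat(branches: list, proj: nn.Conv1d) -> torch.Tensor:
    """Concatenate along channels, then project back with a 1x1 convolution."""
    return proj(torch.cat(branches, dim=1))

class AttentiveFuse(nn.Module):
    """Attention-weighted sum: a learned softmax weight per branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, branches: list) -> torch.Tensor:
        stacked = torch.stack(branches, dim=1)                    # (B, n, C, T)
        weights = torch.softmax(self.score(stacked.mean(dim=3)), dim=1)  # (B, n, 1)
        return (stacked * weights.unsqueeze(-1)).sum(dim=1)       # (B, C, T)

branches = [torch.randn(2, 512, 200) for _ in range(3)]
proj = nn.Conv1d(3 * 512, 512, kernel_size=1)
print(fuse_concat(branches, proj).shape)   # torch.Size([2, 512, 200])
print(AttentiveFuse(512)(branches).shape)  # torch.Size([2, 512, 200])
```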

    Table 5  Ablation study of the key master receptive fields in T-RRF

    | Ablation      | MACs    | Params | EER%/DCF0.01 |
    |---------------|---------|--------|--------------|
    | F(C=512, n=1) | 1.1998G | 4.17M  | 0.96/0.0872  |
    | -(1)          | -       | 4.17M  | 0.98/0.0956  |
    | -(2)          | -       | 4.17M  | 0.98/0.0922  |
    | -(3)          | -       | 4.17M  | 0.97/0.0897  |

    Table 6  Ablation study of different modules in $ {\boldsymbol{F}} $(C=512, n=1)

    | Ablation                                 | MACs    | Params | EER%/DCF0.01 |
    |------------------------------------------|---------|--------|--------------|
    | F(C=512, n=1)                            | 1.1998G | 4.17M  | 0.96/0.0872  |
    | -PUB                                     | 1.1514G | 4.18M  | 0.98/0.0932  |
    | -GFDL                                    | 1.0839G | 5.86M  | 0.95/0.0861  |
    | $ {{\boldsymbol{F}}^{(C)}} $(C=512, n=1) | 0.8755G | 3.87M  | 1.02/0.1058  |
  • [1] WANG Wei, HAN Jiqing, ZHENG Tieran, et al. Speaker recognition based on Fisher discrimination dictionary learning[J]. Journal of Electronics & Information Technology, 2016, 38(2): 367–372. doi: 10.11999/JEIT150566.
    [2] NAGRANI A, CHUNG J S, and ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[C]. 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 2017: 2616–2620. doi: 10.21437/Interspeech.2017-950.
    [3] NAGRANI A, CHUNG J S, XIE Weidi, et al. VoxCeleb: Large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027. doi: 10.1016/j.csl.2019.101027.
    [4] THIENPONDT J and DEMUYNCK K. ECAPA2: A hybrid neural network architecture and training strategy for robust speaker embeddings[C]. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, China, 2023: 1–8. doi: 10.1109/ASRU57964.2023.10389750.
    [5] SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-Vectors: Robust DNN embeddings for speaker recognition[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
    [6] THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. Integrating frequency translational invariance in TDNNs and frequency positional information in 2D ResNets to enhance speaker verification[C]. 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 2021: 2302–2306.
    [7] DESPLANQUES B, THIENPONDT J, and DEMUYNCK K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification[C]. 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 2020: 3830–3834. doi: 10.21437/Interspeech.2020-2650.
    [8] ZHAO Zhenduo, LI Zhuo, WANG Wenchao, et al. PCF: ECAPA-TDNN with progressive channel fusion for speaker verification[C]. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095051.
    [9] HEO H J, SHIN U H, LEE R, et al. NeXt-TDNN: Modernizing Multi-Scale temporal convolution backbone for speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 11186–11190. doi: 10.1109/ICASSP48485.2024.10447037.
    [10] WANG Rui, WEI Zhihua, DUAN Haoran, et al. EfficientTDNN: Efficient architecture search for speaker recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2267–2279. doi: 10.1109/TASLP.2022.3182856.
    [11] ABROL V, THAKUR A, GUPTA A, et al. Sampling rate adaptive speaker verification from raw waveforms[C]. 27th International Conference on Pattern Recognition, Kolkata, India, 2024: 367–382. doi: 10.1007/978-3-031-78104-9_25.
    [12] MUN S H, JUNG J W, HAN M H, et al. Frequency and multi-scale selective kernel attention for speaker verification[C]. 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023: 548–554. doi: 10.1109/SLT54892.2023.10023305.
    [13] CHUNG J S, NAGRANI A, and ZISSERMAN A. VoxCeleb2: Deep speaker recognition[C]. 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 1086–1090.
    [14] DENG Jiankang, GUO Jia, LIU Tongliang, et al. Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 741–757. doi: 10.1007/978-3-030-58621-8_43.
    [15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
    [16] ZHANG Yang, LV Zhiqiang, WU Haibin, et al. MFA-conformer: Multi-scale feature aggregation conformer for automatic speaker verification[C]. 23rd Annual Conference of the International Speech Communication Association, Incheon, Republic of Korea, 2022: 306–310.
    [17] ZHANG Ruiteng, WEI Jianguo, LU Xugang, et al. TMS: A temporal multi-scale backbone design for speaker embedding[EB/OL]. https://doi.org/10.48550/arXiv.2203.09098, 2022.
    [18] FINDER S E, AMOYAL R, TREISTER E, et al. Wavelet convolutions for large receptive fields[C]. 18th European Conference on Computer Vision, Milan, Italy, 2024: 363–380. doi: 10.1007/978-3-031-72949-2_21.
    [19] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
    [20] XIE Saining, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5987–5995. doi: 10.1109/CVPR.2017.634.
    [21] DENG Jiankang, GUO Jia, XUE Niannan, et al. ArcFace: Additive angular margin loss for deep face recognition[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 4685–4694. doi: 10.1109/CVPR.2019.00482.
    [22] THIENPONDT J, DESPLANQUES B, and DEMUYNCK K. The IDLAB VoxSRC-20 submission: Large margin fine-tuning and quality-aware score calibration in DNN based speaker verification[C]. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 5814–5818. doi: 10.1109/ICASSP39728.2021.9414600.
    [23] KOUDOUNAS A, GIOBERGIA F, PASTOR E, et al. A contrastive learning approach to mitigate bias in speech models[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 827–831. doi: 10.21437/Interspeech.2024-1219.
    [24] SUN Yifan, CHENG Changmao, ZHANG Yuhan, et al. Circle loss: A unified perspective of pair similarity optimization[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6397–6406. doi: 10.1109/CVPR42600.2020.00643.
    [25] SOHN K. Improved deep metric learning with Multi-class N-pair loss objective[C]. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 1857–1865.
    [26] TRUONG D T, TAO Ruijie, YIP J Q, et al. Emphasized non-target speaker knowledge in knowledge distillation for automatic speaker verification[C]. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 2024: 10336–10340. doi: 10.1109/ICASSP48485.2024.10447160.
    [27] SNYDER D, CHEN Guoguo, and POVEY D. MUSAN: A music, speech, and noise corpus[EB/OL]. https://doi.org/10.48550/arXiv.1510.08484, 2015.
    [28] KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 5220–5224. doi: 10.1109/ICASSP.2017.7953152.
    [29] ZHENG Qiuyu, CHEN Zengzhao, WANG Zhifeng, et al. MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder[J]. Expert Systems with Applications, 2024, 244: 123004. doi: 10.1016/j.eswa.2023.123004.
    [30] LIU Tianchi, LEE K A, WANG Qiongqiong, et al. Golden Gemini is all you need: Finding the sweet spots for speaker verification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 2324–2337. doi: 10.1109/TASLP.2024.3385277.
    [31] CHEN Yafeng, ZHENG Siqi, WANG Hui, et al. ERes2NetV2: Boosting short-duration speaker verification performance with computational efficiency[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3245–3249. doi: 10.21437/Interspeech.2024-742.
    [32] STAFYLAKIS T, SILNOVA A, ROHDIN J, et al. Challenging margin-based speaker embedding extractors by using the variational information bottleneck[C]. 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024: 3220–3224. doi: 10.21437/Interspeech.2024-2058.
    [33] CAI Danwei and LI Ming. Leveraging ASR pretrained conformers for speaker verification through transfer learning and knowledge distillation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3532–3545. doi: 10.1109/TASLP.2024.3419426.
    [34] KOBAK D and BERENS P. The art of using t-SNE for single-cell transcriptomics[J]. Nature Communications, 2019, 10(1): 5416. doi: 10.1038/s41467-019-13056-x.
Publication history
  • Received: 2025-07-30
  • Revised: 2025-11-17
  • Accepted: 2025-11-17
  • Published online: 2025-11-25
