Citation: CHEN Chen, YI Zhixin, LI Dongyuan, CHEN Deyun. Speaker Verification Based on Tide-Ripple Convolution Neural Network[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250713

Speaker Verification Based on Tide-Ripple Convolution Neural Network

doi: 10.11999/JEIT250713 cstr: 32379.14.JEIT250713
Funds: The National Natural Science Foundation of China (62101163), the Heilongjiang Provincial Natural Science Foundation (YQ2024F018), and the Key Research and Development Project of Heilongjiang Province (JD2023SJ20)
  • Received Date: 2025-07-30
  • Accepted Date: 2025-11-17
  • Rev Recd Date: 2025-11-17
  • Available Online: 2025-11-25
Objective  State-of-the-art speaker verification models typically rely on fixed receptive fields, which limits their ability to represent multi-scale acoustic patterns while increasing parameter counts and computational loads. Speech contains layered temporal–spectral structures, yet the use of dynamic receptive fields to characterize these structures remains underexplored, and the design principles for effective dynamic receptive field mechanisms are still unclear.

Methods  Inspired by the non-linear coupling behavior of tidal surges, a Tide-Ripple Convolution (TR-Conv) layer is proposed to form a more effective receptive field. TR-Conv constructs primary and auxiliary receptive fields within a window by applying power-of-two interpolation. It then employs a scan-pooling mechanism to capture salient information outside the window and an operator mechanism to perceive fine-grained variations within it. Fusing these components produces a variable receptive field that is both multi-scale and dynamic. A Tide-Ripple Convolutional Neural Network (TR-CNN) is developed to validate this design. To mitigate label noise in training datasets, a total loss function is introduced that combines a Non-Target with Dynamic Normalization (NTDN) loss and a weighted Sub-center AAM loss variant, improving model robustness and performance.

Results and Discussions  TR-CNN is evaluated on the VoxCeleb1-O/E/H benchmarks. The results show that TR-CNN achieves a competitive balance of accuracy, computation, and parameter efficiency (Table 1). Compared with the strong ECAPA-TDNN baseline, the TR-CNN (C=512, n=1) model attains relative EER reductions of 4.95%, 4.03%, and 6.03%, and MinDCF reductions of 31.55%, 17.14%, and 17.42% across the three test sets, while requiring 32.7% fewer parameters and 23.5% less computation (Table 2). The optimal TR-CNN (C=1024, n=1) model further improves performance, achieving EERs of 0.85%, 1.10%, and 2.05%. Robustness is strengthened by the proposed total loss function, which yields consistent improvements in EER and MinDCF during fine-tuning (Table 3). Additional evaluations, including ablation studies (Tables 5 and 6), component analyses (Fig. 3 and Table 4), and t-SNE visualizations (Fig. 4), confirm the effectiveness and robustness of each module in the TR-CNN architecture.

Conclusions  This research proposes a simple and effective TR-Conv layer built on the Tide-Ripple Receptive Field (T-RRF) mechanism. Experimental results show that TR-Conv forms a more expressive and effective receptive field, reducing parameter count and computational cost while exceeding conventional one-dimensional convolution in speech feature modeling; it also exhibits strong lightweight characteristics and scalability. Furthermore, a total loss function combining the NTDN loss and a Sub-center AAM loss variant is proposed to enhance the discriminability and robustness of speaker embeddings, particularly under label noise. TR-Conv shows potential as a general-purpose module for integration into deeper and more complex network architectures.
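The abstract describes TR-Conv only at a high level (primary and auxiliary receptive fields built by power-of-two interpolation, scan-pooling outside the window, an operator mechanism inside it, and a fusion step). The PyTorch module below is a speculative sketch of that description, not the authors' implementation: the depthwise kernels, the dilation-2 auxiliary branch, the max-pool stand-in for scan-pooling, and the softmax fusion are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TideRippleConvSketch(nn.Module):
    """Speculative reading of the TR-Conv description; NOT the authors' code."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Primary receptive field: depthwise convolution over the base window.
        self.primary = nn.Conv1d(channels, channels, kernel_size,
                                 padding=pad, groups=channels)
        # Auxiliary field: same window with dilation doubled (the "power-of-two" guess).
        self.auxiliary = nn.Conv1d(channels, channels, kernel_size,
                                   padding=2 * pad, dilation=2, groups=channels)
        # Operator branch: pointwise convolution for fine-grained per-frame variation.
        self.operator = nn.Conv1d(channels, channels, 1)
        # Learned fusion weights over the three views.
        self.fuse = nn.Parameter(torch.zeros(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Scan-pooling" stand-in: salient values from a context wider than the
        # window, approximated by a stride-1 max-pool feeding the primary branch.
        scanned = F.max_pool1d(x, kernel_size=5, stride=1, padding=2)
        w = torch.softmax(self.fuse, dim=0)
        return (w[0] * self.primary(scanned)
                + w[1] * self.auxiliary(x)
                + w[2] * self.operator(x))

x = torch.randn(2, 80, 200)                 # (batch, mel channels, frames)
print(TideRippleConvSketch(80)(x).shape)    # torch.Size([2, 80, 200])
```

All three branches preserve sequence length, so a layer of this shape could drop into a TDNN-style backbone wherever an ordinary 1-D convolution would sit.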
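The total loss pairs the paper-specific NTDN loss with a weighted Sub-center AAM loss variant. Neither the NTDN term nor the weighting is specified in this abstract, so the sketch below shows only the standard Sub-center ArcFace (AAM) form the variant builds on; the sub-center count K, margin, and scale are illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubcenterAAMLoss(nn.Module):
    """Standard Sub-center ArcFace: K sub-centers per speaker absorb noisy labels."""
    def __init__(self, embed_dim: int, num_classes: int,
                 K: int = 3, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, K, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity of each embedding to every sub-center: (batch, classes, K).
        cos = torch.einsum("bd,ckd->bck",
                           F.normalize(emb, dim=1),
                           F.normalize(self.weight, dim=2))
        cos = cos.max(dim=2).values          # keep the closest sub-center per class
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # Additive angular margin applied to the target class only.
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```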
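The reported EER and MinDCF can be computed directly from trial scores. The NumPy sketch below is a minimal reference implementation; the p_target = 0.01 and unit-cost settings are the common NIST-style defaults and are assumptions, since the abstract does not state the paper's MinDCF operating point.

```python
import numpy as np

def eer_mindcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """EER and normalized MinDCF; labels: 1 = target trial, 0 = non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)[np.argsort(scores)]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Sweep the threshold upward: misses accumulate, false alarms fall away.
    p_miss = np.cumsum(labels) / n_tar
    p_fa = 1.0 - np.cumsum(1 - labels) / n_non
    i = np.argmin(np.abs(p_miss - p_fa))     # crossing point of the two error curves
    eer = (p_miss[i] + p_fa[i]) / 2
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return eer, dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

# Toy usage: three target and three non-target trials, perfectly separated.
print(eer_mindcf([0.9, 0.8, 0.3, 0.7, 0.2, 0.1], [1, 1, 0, 1, 0, 0]))  # (0.0, 0.0)
```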