TIAN Haoyuan, CHEN Yuxuan, CHEN Beijing, FU Zhangjie. Defeating Voice Conversion Forgery by Active Defense with Diffusion Reconstruction[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250709

Defeating Voice Conversion Forgery by Active Defense with Diffusion Reconstruction

doi: 10.11999/JEIT250709 cstr: 32379.14.JEIT250709
Funds:  The National Natural Science Foundation of China (62572251, U22B2062)
  • Received Date: 2025-07-30
  • Accepted Date: 2025-11-05
  • Revised Date: 2025-10-29
  • Available Online: 2025-11-14
Objective  Deep voice generation technology can produce perceptually realistic speech. Although it enriches entertainment and everyday applications, it is also exploited for voice forgery, creating risks to personal privacy and social security. Active defense techniques are a major line of protection against such forgery, yet existing methods remain limited in balancing defensive strength against the imperceptibility of the defensive speech examples, and in maintaining robustness.

Methods  An active defense method against voice conversion forgery based on diffusion reconstruction is proposed. The diffusion vocoder PriorGrad is used as the generator, and its gradual denoising process is guided by the diffusion prior of the target speech, so that the protected speech is reconstructed and defensive speech examples are obtained directly. A multi-scale auditory perceptual loss is further introduced to suppress perturbation amplitudes in the frequency bands to which the human auditory system is most sensitive, improving the imperceptibility of the defensive examples.

Results and Discussions  Defense experiments on four leading voice conversion models show that the proposed method preserves the imperceptibility of defensive speech examples while, with speaker verification accuracy as the evaluation metric, improving defense ability over the second-best method by about 32% on average in white-box scenarios and about 16% in black-box scenarios, achieving a better balance between defense ability and imperceptibility (Table 2). In robustness experiments, the method yields average improvements of about 29% (white-box) and about 18% (black-box) under three compression attacks (Table 3), and about 35% (white-box) and about 17% (black-box) under Gaussian filtering attack (Table 4). Ablation experiments further show that the multi-scale auditory perceptual loss improves defense ability by 5% to 10% over its single-scale counterpart (Table 5).

Conclusions  An active defense method against voice conversion forgery based on diffusion reconstruction is proposed. Defensive speech examples are reconstructed directly by a diffusion vocoder, so the generated audio better matches the distribution of the original target speech, and a multi-scale auditory perceptual loss improves the imperceptibility of the defensive speech. Experimental results show stronger defense performance than existing approaches in both white-box and black-box scenarios, and the method remains robust under compression coding and smoothing filtering. Its main limitation is computational efficiency; future work will explore diffusion generators that operate in a single or few time steps to improve efficiency while maintaining defense performance.
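The Methods section describes guiding PriorGrad's gradual denoising with a data-dependent prior derived from the target speech. The paper's exact update rule is not given in this abstract, so the following is only a rough, hypothetical sketch: a single DDPM-style reverse step in which the injected noise is drawn from an adaptive Gaussian prior whose per-sample standard deviation (the `prior_std` argument, an assumption here) would come from the target speech's spectral envelope.

```python
import numpy as np

def priorgrad_reverse_step(x_t, eps_pred, t, betas, prior_std):
    """One reverse-diffusion step with a data-dependent Gaussian prior.

    x_t: current noisy waveform sample, shape (n,)
    eps_pred: noise predicted by the denoiser network at step t
    betas: the noise schedule, shape (T,)
    prior_std: per-sample std of the adaptive prior (hypothetically
        derived from the target speech's spectral envelope), shape (n,)
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a_t, ab_t = alphas[t], alpha_bar[t]
    # Posterior mean: the same algebra as a standard DDPM step.
    mean = (x_t - (betas[t] / np.sqrt(1.0 - ab_t)) * eps_pred) / np.sqrt(a_t)
    if t > 0:
        # Noise is drawn from N(0, diag(prior_std**2)) instead of the
        # standard normal, which is the PriorGrad-style modification.
        z = prior_std * np.random.randn(*x_t.shape)
        mean = mean + np.sqrt(betas[t]) * z
    return mean
```

Iterating this step from t = T-1 down to 0, starting from noise sampled from the same adaptive prior, would yield the reconstructed (defensive) waveform.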
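The multi-scale auditory perceptual loss is likewise only named in the abstract. One plausible minimal form, assumed here rather than taken from the paper, is an auditory-weighted log-magnitude distance averaged over several STFT resolutions, using the standard A-weighting curve as a stand-in for an equal-loudness contour:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann-windowed sliding FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def a_weight(freqs):
    """A-weighting as a linear gain, normalized to peak at 1.0, to
    emphasize the bands the human ear is most sensitive to."""
    f2 = np.maximum(freqs, 1.0) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return ra / ra.max()

def multi_scale_perceptual_loss(x, y, sr=16000,
                                scales=((512, 128), (1024, 256), (2048, 512))):
    """Auditory-weighted log-magnitude distance between waveforms x and
    y, averaged over several (n_fft, hop) STFT resolutions."""
    total = 0.0
    for n_fft, hop in scales:
        fx, fy = stft_mag(x, n_fft, hop), stft_mag(y, n_fft, hop)
        w = a_weight(np.fft.rfftfreq(n_fft, 1.0 / sr))
        total += np.mean(w * np.abs(np.log1p(fx) - np.log1p(fy)))
    return total / len(scales)
```

Penalizing this distance during training would push reconstruction error out of perceptually sensitive bands, which matches the stated goal of keeping the defensive perturbation inaudible.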
