Citation: TIAN Haoyuan, CHEN Yuxuan, CHEN Beijing, FU Zhangjie. Defeating Voice Conversion Forgery by Active Defense with Diffusion Reconstruction[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250709

Defeating Voice Conversion Forgery by Active Defense with Diffusion Reconstruction

doi: 10.11999/JEIT250709 cstr: 32379.14.JEIT250709
Funds: The National Natural Science Foundation of China (62572251, U22B2062)
  • Received Date: 2025-07-30
  • Accepted Date: 2025-11-05
  • Rev Recd Date: 2025-11-05
  • Available Online: 2025-11-14
  •   Objective  Voice deep generation technology can now synthesize highly realistic speech. While it enriches entertainment and daily life, it is also easily abused by malicious actors for voice forgery, posing significant risks to personal privacy and social security. Active defense is one of the mainstream defense technologies against voice forgery, and existing active defense techniques have made notable progress; however, they still struggle to balance defense capability against the imperceptibility of defensive examples, and their robustness remains limited.
  •   Methods  This paper proposes an active defense method against voice conversion forgery based on diffusion reconstruction. The method employs the diffusion vocoder PriorGrad as a generator: guided by a diffusion prior derived from the speech to be protected, it runs the gradual denoising process to reconstruct that speech, directly producing defensive speech examples. In addition, the method introduces a multi-scale auditory perceptual loss that suppresses the perturbation amplitude in frequency bands to which the human auditory system is sensitive, thereby enhancing the imperceptibility of the defensive examples.
  •   Results and Discussions  Defense experiments on four leading voice conversion models show that, while preserving the imperceptibility of defensive speech examples and using speaker verification accuracy as the objective metric, the proposed method improves defense capability over the second-best method by about 32% on average in white-box scenarios and about 16% in black-box scenarios, achieving a better balance between defense capability and imperceptibility (Table 2). In the robustness experiments, compared with the second-best method, the proposed method achieves average improvements of about 29% (white-box) and about 18% (black-box) under three types of compression attacks (Table 3), and of about 35% (white-box) and about 17% (black-box) under Gaussian filtering attacks (Table 4). In the ablation experiments, the multi-scale auditory perceptual loss yields a 5% to 10% improvement in defense capability over a single-scale auditory perceptual loss (Table 5).
  •   Conclusions  This paper proposes an active defense method against voice conversion forgery based on diffusion reconstruction. The method directly reconstructs, through a diffusion vocoder, defensive speech examples that better approximate the distribution of the original target speech, and combines this with a multi-scale auditory perceptual loss to further enhance the imperceptibility of the defensive speech. Experimental results show that, compared with existing methods, the proposed method not only achieves superior defense performance in both white-box and black-box scenarios but is also robust against compression coding and smoothing filtering. Although the method attains strong defense performance and robustness, its computational efficiency still needs improvement; future work will therefore explore diffusion generators with a single time step or few time steps, improving efficiency while preserving defense performance as far as possible.
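
To make the Methods step concrete, the sketch below outlines the reconstruction-as-defense idea under stated assumptions: a PriorGrad-style reverse diffusion process that starts from a data-dependent Gaussian prior (standard deviation derived from frame-level mel energy) and denoises toward the protected speech, so the reconstruction itself is the defensive example. The network `DenoiserStub`, the noise schedule, and all hyperparameters are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

class DenoiserStub(nn.Module):
    """Placeholder for the trained conditional denoiser in a PriorGrad-style vocoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, x_t, mel, t):
        # A real model conditions on the mel-spectrogram and the step index t;
        # this stub only preserves the interface.
        return self.net(x_t)

def reconstruct_defensive_speech(wav, denoiser, sr=16000, n_steps=50):
    """Reconstruct `wav` through reverse diffusion; the output is the defensive example."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr)(wav)  # (B, n_mels, frames)
    # Data-dependent prior (the PriorGrad idea): frame-level mel energy sets the
    # standard deviation of the Gaussian the reverse process starts from.
    frame_std = mel.mean(dim=1, keepdim=True).clamp_min(1e-5).sqrt()
    sigma = F.interpolate(frame_std, size=wav.shape[-1], mode="linear",
                          align_corners=False).squeeze(1)            # (B, T)

    betas = torch.linspace(1e-4, 0.05, n_steps)   # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = sigma * torch.randn_like(wav)             # sample from the adaptive prior
    for t in reversed(range(n_steps)):            # standard DDPM reverse update
        eps = denoiser(x.unsqueeze(1), mel, t).squeeze(1)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                 # prior-scaled noise injection
            x = x + betas[t].sqrt() * sigma * torch.randn_like(x)
    return x

# Illustrative usage on a dummy one-second clip.
defended = reconstruct_defensive_speech(torch.randn(1, 16000), DenoiserStub())
```

Because the defensive example is generated directly by the vocoder rather than by optimizing an additive perturbation, it stays close to the natural speech distribution, which is what the abstract credits for the improved balance between defense capability and imperceptibility.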
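
The multi-scale auditory perceptual loss can likewise be sketched. The idea, following the abstract, is to compare defensive and original speech at several spectral resolutions while weighting each frequency bin by the ear's sensitivity, so perturbation energy is pushed into bands listeners barely notice. The FFT sizes and the A-weighting curve used here as a sensitivity proxy are assumptions for illustration; the paper's exact weighting (for example, one derived from equal-loudness contours) may differ.

```python
import torch

def a_weight(freqs_hz):
    """Rough A-weighting curve (linear gain) as a stand-in for auditory sensitivity."""
    f2 = freqs_hz.clamp_min(1.0) ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * torch.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return num / den

def multi_scale_perceptual_loss(x, y, sr=16000, fft_sizes=(512, 1024, 2048)):
    """Perceptually weighted multi-resolution STFT loss between waveforms x, y of shape (B, T)."""
    loss = 0.0
    for n_fft in fft_sizes:                      # several analysis scales
        win = torch.hann_window(n_fft)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        freqs = torch.linspace(0.0, sr / 2, X.shape[-2])
        w = a_weight(freqs).view(1, -1, 1)       # heavier weight on audible bins
        loss = loss + (w * (X - Y)).abs().mean()
    return loss

# Illustrative usage: penalize audible deviation of a defended clip from the original.
orig, defended = torch.randn(1, 16000), torch.randn(1, 16000)
print(multi_scale_perceptual_loss(defended, orig))
```

Averaging the weighted magnitude error over several FFT sizes constrains both the coarse spectral envelope and fine harmonic detail, which is consistent with the abstract's ablation finding that the multi-scale variant outperforms a single-scale loss by 5% to 10%.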