
Image Deraining Driven by CLIP Visual Embedding

SUN Jin, CUI Yuntong, TIAN Hongwei, HUANG Changcheng, WANG Jigang

Citation: SUN Jin, CUI Yuntong, TIAN Hongwei, HUANG Changcheng, WANG Jigang. Image Deraining Driven by CLIP Visual Embedding[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251066


doi: 10.11999/JEIT251066 cstr: 32379.14.JEIT251066
Funds: National Natural Science Foundation of China (61702260)
    Author information:

    SUN Jin: Female, Associate Professor. Research interests: computer vision, image processing and analysis

    CUI Yuntong: Male, M.S. candidate. Research interests: image enhancement and restoration

    TIAN Hongwei: Male, M.S. candidate. Research interests: image processing and analysis, object tracking

    HUANG Changcheng: Male, M.S. candidate. Research interests: image enhancement and restoration

    WANG Jigang: Male, M.S. candidate. Research interests: image enhancement and restoration

    Corresponding author:

    SUN Jin  sunjinly@nuaa.edu.cn

  • CLC number: TP391.4

  • Abstract: Image deraining is a fundamental task in computer vision, yet existing methods rely heavily on assumed rain models or synthetic rain datasets, so their performance generalizes poorly to real-world scenes. This paper finds that the image encoder of the CLIP model is robust to rain-streak interference, recasts deraining as a pixel-level regression problem guided by visual semantics, and proposes FCLIP-UNet, an image deraining model based on a Frozen Contrastive Language-Image Pretraining (FCLIP) strategy. The model adopts a symmetric encoder-decoder (U-Net) structure: the encoder truncates the first four stages of the CLIP-RN50 image encoder to automatically decouple rain streaks from image-content semantics; the decoder serially stacks ConvNeXt-T and UpDWBlock modules and embeds a layer-wise differentiated perturbation strategy in the skip connections, jointly strengthening detail recovery under high-level semantic guidance and generalization ability. Quantitative and qualitative experiments show that FCLIP-UNet achieves the best or competitive performance on public synthetic and real rain datasets, and generalizes well on multiple independent datasets containing real rain images.
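The layer-wise differentiated perturbation of skip features described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `perturb_skip_features` is hypothetical, and the per-level σ values follow the best-performing setting reported in Table 11 (σ increasing with depth).

```python
import numpy as np

def perturb_skip_features(feats, sigmas=(0.01, 0.03, 0.06, 0.1),
                          training=True, rng=None):
    """Add zero-mean Gaussian noise to each skip-connection feature map.

    feats  : list of 4 feature arrays, ordered shallow to deep.
    sigmas : per-level noise standard deviation; increasing with depth
             follows the best setting in Table 11 (hypothetical sketch).
    Noise is injected only during training; inference passes features through.
    """
    if not training:
        return [f.copy() for f in feats]
    rng = rng or np.random.default_rng()
    return [f + rng.normal(0.0, s, size=f.shape) for f, s in zip(feats, sigmas)]

# Toy skip features at 4 scales (C, H, W): 64x64 down to 8x8.
feats = [np.zeros((8, 64 >> i, 64 >> i)) for i in range(4)]
train_out = perturb_skip_features(feats, training=True,
                                  rng=np.random.default_rng(0))
eval_out = perturb_skip_features(feats, training=False)
```

During training the deepest (most semantic) skip path receives the strongest perturbation, which matches the ablation trend in Table 11 where increasing σ with depth gives the best PSNR/SSIM.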
  • Figure 1  FCLIP-UNet network architecture

    Figure 2  Similarity between image semantic features and text prompts (heat map) and classification probabilities (bar chart)

    Figure 3  Analysis of features extracted by five CLIP ResNet encoders under rain streaks of different densities

    Figure 4  Decoder structure

    Figure 5  Deraining results on sample images from the Test2800 test set

    Figure 6  Deraining results on sample images from the Test1200 test set

    Figure 7  Generalization comparison of different methods across multiple datasets

    Figure 8  Deraining results on real rain image example 1

    Figure 9  Deraining results on real rain image example 2

    Figure 10  Deraining results on real rain image example 3
    Table 1  Matching between Test1200 images and fine-grained text labels

    Image type                  light rain  moderate rain  heavy rain
    Low-density rain images     196         97             107
    Medium-density rain images  68          188            144
    High-density rain images    95          157            148

    Table 2  Composition of the Rain13K dataset

    Source            Rain800  Rain100H  Rain100L  Rain14000  Rain1200  Rain12  Total
    Training samples  700      1800      0         11200      0         12      13712
    Test samples      100      100       100       2800       1200      0       4300
    Test set name     Test100  Rain100H  Rain100L  Test2800   Test1200  -       -

    Table 3  Comparison results on the Rain13K dataset

    Algorithm Test100 Rain100H Rain100L Test2800 Test1200 Average
    PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM
    DerainNet[4] 22.77 0.810 14.92 0.592 27.03 0.884 24.31 0.861 23.38 0.835 22.48 0.796
    DID-MDN[24] 22.56 0.818 17.35 0.524 25.23 0.741 28.13 0.867 29.95 0.901 24.58 0.770
    RESCAN[26] 25.00 0.835 26.36 0.785 29.80 0.881 31.29 0.904 30.51 0.882 28.59 0.857
    MSPFN[27] 27.50 0.876 28.66 0.860 32.40 0.933 32.82 0.930 32.39 0.916 30.75 0.903
    MPRNet[21] 30.27 0.897 30.41 0.890 36.40 0.965 33.64 0.938 32.91 0.916 32.73 0.921
    Uformer-B[28] 29.90 0.906 30.31 0.900 36.86 0.972 33.53 0.939 29.45 0.903 32.01 0.924
    IDT[11] 29.69 0.905 29.95 0.898 37.01 0.971 33.38 0.937 31.38 0.908 32.28 0.924
    DCTR[29] 30.91 0.912 30.74 0.892 38.19 0.974 33.89 0.941 33.57 0.926 33.46 0.929
    AFENet[30] 30.51 0.918 31.22 0.901 37.66 0.978 33.13 0.925 33.82 0.944 33.27 0.933
    DPCNet[31] 30.59 0.914 30.73 0.899 37.96 0.974 33.23 0.928 33.87 0.941 33.28 0.931
    FCLIP-UNet(Ours) 31.23 0.924 30.82 0.903 38.06 0.972 34.03 0.943 33.27 0.928 33.48 0.934
    Note: in each column, bold marks the best value and underline the second best (the same convention applies to the other tables in this paper).
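The PSNR values reported above follow the standard definition PSNR = 10·log10(MAX²/MSE). A small self-contained sketch (the helper name `psnr` is ours, not from the paper):

```python
import numpy as np

def psnr(reference, restored, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((reference.astype(np.float64)
                   - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# A uniform error of 0.1 on a unit-range image gives MSE = 0.01 -> 20 dB.
ref = np.full((32, 32), 0.5)
out = ref + 0.1
print(round(psnr(ref, out), 2))  # -> 20.0
```

SSIM, the second metric in these tables, additionally compares local luminance, contrast, and structure rather than raw pixel error, which is why the two columns can rank methods differently.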

    Table 4  Cross-dataset comparison results

    Algorithm SPA-Data HQ-RAIN MPID
    PSNR SSIM PSNR SSIM PSNR SSIM
    DID-MDN 31.12 0.937 23.62 0.640 23.09 0.794
    RESCAN 34.57 0.958 23.79 0.519 26.74 0.823
    MSPFN 34.55 0.961 23.99 0.572 27.48 0.849
    MPRNet 35.16 0.954 26.36 0.681 31.53 0.896
    Uformer-B 35.03 0.948 26.67 0.685 31.47 0.893
    IDT 35.61 0.957 26.88 0.679 31.63 0.899
    DCTR 35.87 0.963 27.33 0.684 31.81 0.905
    DPCNet 35.64 0.958 28.56 0.786 31.59 0.894
    FCLIP-UNet(Ours) 36.39 0.967 30.36 0.858 31.96 0.913

    Table 5  No-reference image quality metrics on NTURain-R

    Algorithm Input DerainNet DID-MDN RESCAN MSPFN MPRNet Uformer-B IDT DCTR DPCNet Ours
    NIQE 6.211 6.432 5.988 5.745 4.873 4.834 4.473 4.352 4.533 4.378 4.286
    BRISQUE 33.156 31.167 30.896 31.766 29.866 30.651 28.768 27.378 26.245 26.509 25.676

    Table 6  Ablation results of different CLIP encoders on Rain100H

    Encoder        PSNR/dB  SSIM
    ResNet50       16.53    0.546
    CLIP-ViT-B/32  29.78    0.878
    CLIP-ViT-B/16  30.25    0.885
    CLIP-RN50      30.82    0.903

    Table 7  FLOPs and inference speed of different CLIP encoders

    CLIP encoder             RN50       ViT-B/32   ViT-B/16
    FLOPs                    2.36×10^10 2.43×10^11 9.76×10^11
    Inference speed (s/frame) 0.23      0.56       1.06

    Table 8  Ablation results on Rain100H

    Network  UpDWBlock  LDFPS  PSNR/dB  SSIM
    N1       -          -      30.02    0.879
    N2       ✓          -      30.42    0.893
    N3       -          ✓      30.29    0.887
    N4       ✓          ✓      30.82    0.903

    Table 9  Ablation of different loss functions

    Loss function  Lmse   L1     Lmse+Lp  L1+Lp
    PSNR/dB        29.44  29.76  29.39    30.82
    SSIM           0.880  0.892  0.885    0.903

    Table 10  Ablation of different λp values

    λp       0.1    1      10
    PSNR/dB  29.51  30.82  29.66
    SSIM     0.884  0.903  0.887
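Tables 9 and 10 point to a training objective of the form L = L1 + λp·Lp, with the L1 pixel term plus a perceptual term Lp and λp = 1 performing best. Below is a hedged sketch of such a composite loss; the feature network behind Lp is not specified on this page, so a fixed random projection stands in for it purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "feature extractor": a fixed random linear map over flattened
# 32x32 images. The paper's actual perceptual network for Lp is not given
# here; this projection is only a placeholder for a frozen feature space.
W = rng.normal(size=(64, 32 * 32))

def features(img):
    return W @ img.reshape(-1)

def total_loss(pred, target, lam_p=1.0):
    """L = L1 (pixel) + lam_p * Lp (feature-space), cf. Tables 9 and 10."""
    l1 = np.mean(np.abs(pred - target))
    lp = np.mean(np.abs(features(pred) - features(target)))
    return l1 + lam_p * lp

target = rng.random((32, 32))
loss_same = total_loss(target.copy(), target)      # identical images -> 0
loss_noisy = total_loss(target + 0.05, target)     # any error -> positive
```

With λp = 1 both terms contribute on comparable footing, consistent with Table 10 where both smaller (0.1) and larger (10) weights degrade PSNR and SSIM.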

    Table 11  Ablation of perturbation strategies with different strengths (σi)

    Perturbation strategy                                              PSNR/dB  SSIM
    Uniform low strength (σ1=σ2=σ3=σ4=0.01)                            30.63    0.892
    Uniform high strength (σ1=σ2=σ3=σ4=0.1)                            30.58    0.887
    Strength decreasing with depth (σ1=0.1, σ2=0.06, σ3=0.03, σ4=0.01) 30.11    0.880
    Strength increasing with depth (σ1=0.01, σ2=0.03, σ3=0.06, σ4=0.1) 30.82    0.903
  • [1] LI Yufeng, LU Jiyang, CHEN Hongming, et al. Dilated convolutional transformer for high-quality image deraining[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 4199–4207. doi: 10.1109/CVPRW59228.2023.00442.
    [2] KANG Liwei, LIN C W, and FU Y H. Automatic single-image-based rain streaks removal via image decomposition[J]. IEEE Transactions on Image Processing, 2012, 21(4): 1742–1755. doi: 10.1109/TIP.2011.2179057.
    [3] ZHU Lei, FU C W, LISCHINSKI D, et al. Joint Bi-layer optimization for single-image rain streak removal[C]. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2545–2553. doi: 10.1109/ICCV.2017.276.
    [4] FU Xueyang, HUANG Jiabin, DING Xinghao, et al. Clearing the skies: A deep network architecture for single-image rain removal[J]. IEEE Transactions on Image Processing, 2017, 26(6): 2944–2956. doi: 10.1109/TIP.2017.2691802.
    [5] FU Xueyang, HUANG Jiabin, ZENG Delu, et al. Removing rain from single images via a deep detail network[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1715–1723. doi: 10.1109/CVPR.2017.186.
    [6] MEI Tiancan, CAO Min, YANG Hong, et al. Two-stage rain image removal based on density guidance[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1383–1390. doi: 10.11999/JEIT220157.
    [7] REN Dongwei, ZUO Wangmeng, HU Qinghua, et al. Progressive image deraining networks: A better and simpler baseline[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3932–3941. doi: 10.1109/CVPR.2019.00406.
    [8] WEI Wei, MENG Deyu, ZHAO Qian, et al. Semi-supervised transfer learning for image rain removal[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3872–3881. doi: 10.1109/CVPR.2019.00400.
    [9] YASARLA R, SINDAGI V A, and PATEL V M. Syn2real transfer learning for image deraining using gaussian processes[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2723–2733. doi: 10.1109/CVPR42600.2020.00280.
    [10] JIANG Kui, WANG Zhongyuan, CHEN Chen, et al. Magic ELF: Image deraining meets association learning and transformer[C]. Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 827–836. doi: 10.1145/3503161.3547760.
    [11] XIAO Jie, FU Xueyang, LIU Aiping, et al. Image de-raining transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12978–12995. doi: 10.1109/TPAMI.2022.3183612.
    [12] CUI Yuning, REN Wenqi, CAO Xiaochun, et al. Revitalizing convolutional network for image restoration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 9423–9438. doi: 10.1109/TPAMI.2024.3419007.
    [13] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. Proceedings of the 38th International Conference on Machine Learning, 2021: 8748–8763.
    [14] MA Wenxin, ZHANG Xu, YAO Qingsong, et al. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 4744–4754. doi: 10.1109/CVPR52734.2025.00447.
    [15] SUN Zeyi, FANG Ye, WU Tong, et al. Alpha-CLIP: A CLIP model focusing on wherever you want[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13019–13029. doi: 10.1109/CVPR52733.2024.01237.
    [16] WANG Mengmeng, XING Jiazheng, JIANG Boyuan, et al. A multimodal, multi-task adapting framework for video action recognition[C]. Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada: AAAI, 2024: 5517–5525. doi: 10.1609/aaai.v38i6.28361.
    [17] LUO Ziwei, GUSTAFSSON F K, ZHAO Zheng, et al. Controlling vision-language models for multi-task image restoration[C]. Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.
    [18] LIN Jingbo, ZHANG Zhilu, WEI Yuxiang, et al. Improving image restoration through removing degradations in textual representations[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 2866–2878. doi: 10.1109/CVPR52733.2024.00277.
    [19] WEN Yuanbo, GAO Tao, AN Yisheng, et al. Weather-degraded image restoration based on visual prompt learning[J]. Chinese Journal of Computers, 2024, 47(10): 2401–2416. doi: 10.11897/SP.J.1016.2024.02401.
    [20] WANG Ruiyi, LI Wenhao, LIU Xiaohong, et al. HazeCLIP: Towards language guided real-world image dehazing[C]. ICASSP 2025–2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/ICASSP49660.2025.10889509.
    [21] CHENG Jun, LIANG Dong, and TAN Shan. Transfer CLIP for generalizable image denoising[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25974–25984. doi: 10.1109/CVPR52733.2024.02454.
    [22] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
    [23] ZAMIR S W, ARORA A, KHAN S, et al. Multi-stage progressive image restoration[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 14816–14826. doi: 10.1109/CVPR46437.2021.01458.
    [24] ZHANG He and PATEL V M. Density-aware single image de-raining using a multi-stream dense network[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 695–704. doi: 10.1109/CVPR.2018.00079.
    [25] ZHOU Tianfei, YUAN Ye, WANG Binglu, et al. Federated feature augmentation and alignment[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 11119–11135. doi: 10.1109/TPAMI.2024.3457751.
    [26] LI Xia, WU Jianlong, LIN Zhouchen, et al. Recurrent squeeze-and-excitation context aggregation net for single image deraining[C]. Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 262–277. doi: 10.1007/978-3-030-01234-2_16.
    [27] JIANG Kui, LIU Wenxuan, WANG Zheng, et al. DAWN: Direction-aware attention wavelet network for image deraining[C]. Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023: 7065–7074. doi: 10.1145/3581783.3611697.
    [28] WANG Zhendong, CUN Xiaodong, BAO Jianmin, et al. Uformer: A general U-shaped transformer for image restoration[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 17662–17672. doi: 10.1109/CVPR52688.2022.01716.
    [29] LI Yufeng, LU Jiyang, CHEN Hongming, et al. Dilated convolutional transformer for high-quality image deraining[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 4199–4207. doi: 10.1109/CVPRW59228.2023.00442.
    [30] YAN Fei, HE Yuhong, CHEN Keyu, et al. Adaptive frequency enhancement network for single image deraining[C]. 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Kuching, Malaysia, 2024: 4534–4541. doi: 10.1109/SMC54092.2024.10831025.
    [31] HE Yuhong, JIANG Aiwen, JIANG Lingfang, et al. Dual-path coupled image deraining network via spatial-frequency interaction[C]. 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024: 1452–1458. doi: 10.1109/ICIP51287.2024.10647753.
Figures (10) / Tables (11)
Publication history
  • Received: 2025-10-09
  • Revised: 2026-02-05
  • Accepted: 2026-02-05
  • Published online: 2026-02-13
