SUN Jin, CUI Yuntong, TIAN Hongwei, HUANG Changcheng, WANG Jigang. Image Deraining Driven by CLIP Visual Embedding[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251066

Image Deraining Driven by CLIP Visual Embedding

doi: 10.11999/JEIT251066 cstr: 32379.14.JEIT251066
Funds:  The National Natural Science Foundation of China (61702260)
  • Received Date: 2025-10-10
  • Accepted Date: 2026-02-05
  • Rev Recd Date: 2026-02-02
  • Available Online: 2026-02-13
Objective  Rain streaks introduce visual distortions that degrade image quality and significantly impair downstream vision tasks such as feature extraction and object detection. This work addresses single-image rain streak removal. Existing methods often rely heavily on restrictive priors or synthetic datasets, a dependence that limits robustness and generalization because such data differ from complex, unstructured real-world scenes. Contrastive Language-Image Pre-training (CLIP) demonstrates strong zero-shot generalization through large-scale image-text contrastive learning. Motivated by this property, this study proposes FCLIP-UNet, a visual-semantics-driven deraining architecture designed to improve rain removal and generalization in real-world rainy environments.

Methods  FCLIP-UNet adopts a U-Net encoder-decoder architecture and formulates deraining as pixel-level detail regression guided by high-level semantic features. During encoding, textual queries are omitted; instead, the first four layers of a frozen CLIP-RN50 extract robust features that are decoupled from the rain distribution, exploiting the semantic representation capability of CLIP to suppress diverse rain patterns. To guide accurate restoration, the decoder integrates ConvNeXt-T with an Upsampling DepthWise Convolution Block (UpDWBlock). ConvNeXt-T replaces conventional convolution modules to expand the receptive field and capture global contextual information, parsing rain streak patterns from the semantic priors extracted by the encoder. Under the constraint of these priors, the UpDWBlock reduces information loss during upsampling and reconstructs fine-grained image details, while multi-level skip connections compensate for information lost during encoding.
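The upsample-then-depthwise-filter idea behind the decoder can be illustrated at toy scale. The sketch below is not the actual UpDWBlock (whose exact layer composition the abstract does not specify); it only demonstrates the two assumed ingredients, 2× nearest-neighbour upsampling and a per-channel 3×3 convolution that mixes no information across channels.

```python
# Illustrative sketch only: nearest-neighbour upsampling followed by a
# depthwise (per-channel) 3x3 convolution, the assumed building blocks of
# an UpDWBlock-style layer.  A channel is a list of rows of floats.

def upsample2x(ch):
    """2x nearest-neighbour upsampling of one channel."""
    out = []
    for row in ch:
        wide = [v for v in row for _ in (0, 1)]   # duplicate each column
        out.extend([wide, list(wide)])            # duplicate each row
    return out

def depthwise3x3(ch, k):
    """3x3 convolution of a single channel with its own kernel k
    (no cross-channel mixing -- the defining property of depthwise conv)."""
    h, w = len(ch), len(ch[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:   # zero padding
                        s += ch[ii][jj] * k[di + 1][dj + 1]
            out[i][j] = s
    return out

feat = [[1.0, 2.0], [3.0, 4.0]]                  # one 2x2 feature channel
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
up = upsample2x(feat)                             # -> 4x4 channel
smoothed = depthwise3x3(up, identity)             # identity kernel: unchanged
print(len(up), len(up[0]))                        # 4 4
```

Because each channel keeps its own kernel, a depthwise step costs far fewer parameters than a full convolution, which is one common motivation for pairing it with upsampling in lightweight decoders.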
In addition, a Layer-wise Differentiated Feature Perturbation Strategy (LDFPS) is incorporated to enhance robustness and adaptability in complex real-world rainy scenes.

Results and Discussions  Comprehensive evaluations are conducted on the Rain13K composite dataset against ten state-of-the-art deraining algorithms. FCLIP-UNet shows consistently superior performance across all five test subsets of Rain13K. In particular, it outperforms the second-best approach on Test100 by 0.32 dB in Peak Signal-to-Noise Ratio (PSNR) and 0.06 in Structural Similarity Index Measure (SSIM), and on Test2800 by 0.14 dB and 0.002, respectively. On Rain100H and Rain100L, FCLIP-UNet achieves competitive results, including the best SSIM on Rain100H and comparable scores on the other metrics (Table 3). To evaluate generalization, the Rain13K-pretrained FCLIP-UNet is further tested on three datasets with different rainfall distribution characteristics: SPA-Data, HQ-RAIN, and MPID (Table 4, Fig. 7). Qualitative and quantitative evaluations are also conducted on the real-world NTURain-R dataset (Table 5, Figs. 8–10). These results consistently demonstrate the strong generalization capability of FCLIP-UNet. Ablation experiments on Rain100H validate the proposed encoder design and confirm the effectiveness of both UpDWBlock and LDFPS (Tables 6–8). Additional ablation studies show that LDFPS, combined with a 1:1 weighting between L1 loss and perceptual loss, yields the best performance for FCLIP-UNet (Tables 9–11).

Conclusions  This study proposes FCLIP-UNet, a deraining network designed for real-world generalization by leveraging the CLIP paradigm. Three main contributions are presented. First, image deraining is formulated as a pixel-level regression task that reconstructs rain-free images from high-level semantic features.
A frozen CLIP image encoder extracts representations that remain stable across different rain distributions, thereby reducing the domain shift caused by diverse rain models. Second, a decoder that integrates ConvNeXt-T with the UpDWBlock is designed, and LDFPS is proposed to improve robustness to unseen rain distributions. Third, a composite loss function jointly optimizes pixel-level accuracy and perceptual consistency. Experiments on both synthetic and real-world rainy datasets show that FCLIP-UNet effectively removes rain streaks, preserves fine image details, and achieves strong deraining performance with reliable generalization.
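A minimal sketch of a layer-wise differentiated perturbation, assuming LDFPS amounts to injecting zero-mean Gaussian noise into intermediate features with a per-layer strength; the decreasing noise schedule and the function name `perturb_features` are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical sketch of a layer-wise differentiated feature perturbation:
# each encoder level receives Gaussian noise of a different strength during
# training.  The decreasing schedule below is an assumption for illustration.

def perturb_features(features, sigmas, rng):
    """Add zero-mean Gaussian noise; features[i] is a flat list of values
    for layer i, perturbed with standard deviation sigmas[i]."""
    assert len(features) == len(sigmas)
    return [
        [v + rng.gauss(0.0, s) for v in layer]
        for layer, s in zip(features, sigmas)
    ]

rng = random.Random(0)                         # fixed seed for reproducibility
features = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
sigmas = [0.3, 0.2, 0.1]                       # shallower layers perturbed more
noisy = perturb_features(features, sigmas, rng)
print([len(layer) for layer in noisy])         # shapes preserved: [2, 2, 2]
```

Differentiating the noise level per layer lets training stress shallow, rain-sensitive features harder than deep semantic ones, which is one plausible way such a strategy improves robustness to unseen rain distributions.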
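The composite objective can be sketched as an equally weighted sum of a pixel-level L1 term and a perceptual term, matching the 1:1 ratio the ablations report as best. The toy `feat` extractor below is a placeholder for the real perceptual-feature network, which the abstract does not identify; images are flattened to lists for brevity.

```python
# Toy sketch of a composite loss  L = L1 + 1.0 * Lperc  (1:1 weighting, as in
# the reported ablations).  `feat` is a hypothetical stand-in for a fixed
# pretrained feature extractor; real perceptual losses compare deep features.

def l1(a, b):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def feat(img):
    """Placeholder 'perceptual' features: local pairwise averages."""
    return [(img[i] + img[i + 1]) / 2 for i in range(len(img) - 1)]

def composite_loss(pred, target, w_perc=1.0):
    """Pixel-level L1 plus weighted perceptual L1 on extracted features."""
    return l1(pred, target) + w_perc * l1(feat(pred), feat(target))

pred   = [0.1, 0.4, 0.8, 0.9]
target = [0.0, 0.5, 0.7, 1.0]
print(round(composite_loss(pred, target), 4))
```

The pixel term anchors per-pixel accuracy while the feature term penalizes structural drift, so jointly minimizing both tends to keep restored textures sharp without hallucinating detail.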
    Figures(10)  / Tables(11)
