Image Deraining Driven by CLIP Visual Embedding
-
Abstract: Image deraining is a fundamental task in computer vision. Existing methods rely heavily on assumed rain models or synthetic rain datasets, which limits their generalization to real-world scenes. This paper observes that the image encoder of the CLIP model is robust to rain-streak interference, reformulates deraining as a pixel-level regression problem guided by visual semantics, and proposes FCLIP-UNet, an image deraining model built on a Frozen Contrastive Language-Image Pretraining (FCLIP) strategy. The model adopts a symmetric encoder-decoder (U-Net) structure: the encoder truncates the first four stages of the CLIP-RN50 image encoder to automatically decouple rain streaks from image-content semantics; the decoder stacks ConvNeXt-T and UpDWBlock in series and embeds a layer-wise differentiated perturbation strategy in the skip connections, jointly strengthening detail restoration under high-level semantic guidance and generalization. Quantitative and qualitative experiments show that FCLIP-UNet achieves the best or competitive performance on public synthetic and real rain datasets, and generalizes well across several independent datasets containing real rain images.
-
Abstract:
Objective: Rain streaks introduce visual distortions that degrade image quality and significantly impair downstream vision tasks such as feature extraction and object detection. This work addresses single-image rain streak removal. Existing methods often rely heavily on restrictive priors or synthetic datasets, and the discrepancy between these assumptions and complex, unstructured real-world scenes limits their robustness and generalization. CLIP demonstrates remarkable zero-shot generalization through large-scale image-text cross-modal contrastive learning. Motivated by this, we propose FCLIP-UNet, a visual-semantic-driven deraining architecture, for enhanced rain removal and improved generalization in real-world rainy environments.
Methods: FCLIP-UNet adopts a U-Net encoder-decoder architecture and reformulates deraining as pixel-level detail regression guided by high-level semantic features. In the encoding stage, dispensing with textual queries, FCLIP-UNet employs the first four layers of the frozen CLIP-RN50 image encoder to extract robust features decoupled from the rain distribution, leveraging their semantic representation capacity to suppress diverse rain patterns. To guide restoration accurately, the decoding stage couples ConvNeXt-T with UpDWBlock. ConvNeXt-T replaces the traditional convolutional modules to expand the receptive field for capturing global contextual information, and parses rain-streak patterns by leveraging the semantic priors extracted by the encoder. Under the constraint of these priors, UpDWBlock reduces the information loss caused by upsampling and reconstructs fine-grained image details. Multi-level skip connections compensate for the information loss incurred in the encoding stage, and a layer-wise differentiated feature perturbation strategy is embedded in them to further enhance robustness and adaptability in complex real-world rainy scenes.
Results and Discussions: Comprehensive evaluations are conducted on the Rain13K composite dataset against ten state-of-the-art deraining algorithms. FCLIP-UNet demonstrates consistently strong performance across all five test subsets of Rain13K. Notably, it outperforms the second-best method on Test100 by 0.32 dB (PSNR) and 0.006 (SSIM), and on Test2800 by 0.14 dB and 0.002. On Rain100H and Rain100L it achieves competitive results, with the best SSIM on Rain100H and comparable performance on the other metrics (Table 3). To evaluate generalization, the Rain13K-pretrained FCLIP-UNet is quantitatively evaluated on three datasets with distinct rainfall distribution characteristics: SPA-Data, HQ-RAIN, and MPID (Table 4, Fig. 7). Qualitative and quantitative assessments are also conducted on the real-world NTURain-R dataset (Table 5, Figs. 8-10). Both sets of results consistently demonstrate FCLIP-UNet's robust generalization capability. Ablation experiments on Rain100H validate the proposed encoder architecture and the effectiveness of both UpDWBlock and LDFPS (Tables 6-8). Furthermore, ablation results show that employing LDFPS together with a 1:1 weighting between the L1 loss and the perceptual loss yields the best performance of FCLIP-UNet (Tables 9-11).
Conclusions: This work introduces FCLIP-UNet, a deraining network targeting real-world generalization, built on the contrastive language-image pre-training (CLIP) paradigm. The contributions are threefold. First, image deraining is reformulated as a pixel-level regression task that reconstructs rain-free images from high-level semantic features; a frozen CLIP image encoder extracts representations robust to rain-distribution variations, mitigating domain shifts induced by diverse rain models. Second, a decoder integrating ConvNeXt-T with an upsampling depthwise convolution block (UpDWBlock) is designed, and a layer-wise differentiated feature perturbation strategy (LDFPS) is introduced to enhance robustness against unseen rain distributions. Third, a composite loss function jointly optimizes pixel-wise accuracy and perceptual consistency. Quantitative and qualitative experiments on both synthetic and real-world rainy datasets demonstrate that FCLIP-UNet removes rain streaks effectively while preserving fine image details, and exhibits superior deraining performance and strong generalization capability.
-
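As a minimal sketch of the composite objective described above (an L1 pixel term plus a perceptual term, with the 1:1 weighting that the ablations favor), the combination could look like the following. The function names and the generic `feature_fn` stand-in for the perceptual feature extractor are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def l1_loss(pred, target):
    # Mean absolute error in pixel space.
    return float(np.mean(np.abs(pred - target)))

def perceptual_loss(pred, target, feature_fn):
    # Compare images in a feature space rather than pixel space.
    # feature_fn is a hypothetical stand-in for a pretrained feature extractor.
    return float(np.mean((feature_fn(pred) - feature_fn(target)) ** 2))

def composite_loss(pred, target, feature_fn, lambda_p=1.0):
    # Total objective: pixel accuracy plus lambda_p-weighted perceptual
    # consistency; lambda_p = 1 corresponds to the 1:1 ratio above.
    return l1_loss(pred, target) + lambda_p * perceptual_loss(pred, target, feature_fn)
```

In practice `feature_fn` would be a frozen pretrained network; the sketch only fixes the weighting structure of the objective.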
Key words:
- Image deraining /
- CLIP /
- Convolutional neural network (CNN) /
- U-Net /
- Generalization performance
-
Table 1 Matching between images and fine-grained text labels on the Test1200 dataset

| Image | light rain | moderate rain | heavy rain |
|---|---|---|---|
| Low-density rain-streak images | 196 | 97 | 107 |
| Medium-density rain-streak images | 68 | 188 | 144 |
| High-density rain-streak images | 95 | 157 | 148 |

Table 2 Composition of the Rain13K dataset

| Source | Rain800 | Rain100H | Rain100L | Rain14000 | Rain1200 | Rain12 | Total |
|---|---|---|---|---|---|---|---|
| Training samples | 700 | 1800 | 0 | 11200 | 0 | 12 | 13712 |
| Test samples | 100 | 100 | 100 | 2800 | 1200 | 0 | 4300 |
| Test set name | Test100 | Rain100H | Rain100L | Test2800 | Test1200 | - | - |

Table 3 Comparison results on the Rain13K dataset (PSNR/SSIM)

| Algorithm | Test100 | Rain100H | Rain100L | Test2800 | Test1200 | Average |
|---|---|---|---|---|---|---|
| DerainNet[4] | 22.77/0.810 | 14.92/0.592 | 27.03/0.884 | 24.31/0.861 | 23.38/0.835 | 22.48/0.796 |
| DID-MDN[24] | 22.56/0.818 | 17.35/0.524 | 25.23/0.741 | 28.13/0.867 | 29.95/0.901 | 24.58/0.770 |
| RESCAN[26] | 25.00/0.835 | 26.36/0.785 | 29.80/0.881 | 31.29/0.904 | 30.51/0.882 | 28.59/0.857 |
| MSPFN[27] | 27.50/0.876 | 28.66/0.860 | 32.40/0.933 | 32.82/0.930 | 32.39/0.916 | 30.75/0.903 |
| MPRNet[21] | 30.27/0.897 | 30.41/0.890 | 36.40/0.965 | 33.64/0.938 | 32.91/0.916 | 32.73/0.921 |
| Uformer-B[28] | 29.90/0.906 | 30.31/0.900 | 36.86/0.972 | 33.53/0.939 | 29.45/0.903 | 32.01/0.924 |
| IDT[11] | 29.69/0.905 | 29.95/0.898 | 37.01/0.971 | 33.38/0.937 | 31.38/0.908 | 32.28/0.924 |
| DCTR[29] | 30.91/0.912 | 30.74/0.892 | 38.19/0.974 | 33.89/0.941 | 33.57/0.926 | 33.46/0.929 |
| AFENet[30] | 30.51/0.918 | 31.22/0.901 | 37.66/0.978 | 33.13/0.925 | 33.82/0.944 | 33.27/0.933 |
| DPCNet[31] | 30.59/0.914 | 30.73/0.899 | 37.96/0.974 | 33.23/0.928 | 33.87/0.941 | 33.28/0.931 |
| FCLIP-UNet (Ours) | 31.23/0.924 | 30.82/0.903 | 38.06/0.972 | 34.03/0.943 | 33.27/0.928 | 33.48/0.934 |

Note: Bold marks the best value in each column and underline the second best (the same convention applies to the other tables in this paper).

Table 4 Cross-dataset comparison results (PSNR/SSIM)

| Algorithm | SPA-Data | HQ-RAIN | MPID |
|---|---|---|---|
| DID-MDN | 31.12/0.937 | 23.62/0.640 | 23.09/0.794 |
| RESCAN | 34.57/0.958 | 23.79/0.519 | 26.74/0.823 |
| MSPFN | 34.55/0.961 | 23.99/0.572 | 27.48/0.849 |
| MPRNet | 35.16/0.954 | 26.36/0.681 | 31.53/0.896 |
| Uformer-B | 35.03/0.948 | 26.67/0.685 | 31.47/0.893 |
| IDT | 35.61/0.957 | 26.88/0.679 | 31.63/0.899 |
| DCTR | 35.87/0.963 | 27.33/0.684 | 31.81/0.905 |
| DPCNet | 35.64/0.958 | 28.56/0.786 | 31.59/0.894 |
| FCLIP-UNet (Ours) | 36.39/0.967 | 30.36/0.858 | 31.96/0.913 |

Table 5 No-reference image quality assessment results on NTURain-R

| Metric | Input | DerainNet | DID-MDN | RESCAN | MSPFN | MPRNet | Uformer-B | IDT | DCTR | DPCNet | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NIQE | 6.211 | 6.432 | 5.988 | 5.745 | 4.873 | 4.834 | 4.473 | 4.352 | 4.533 | 4.378 | 4.286 |
| BRISQUE | 33.156 | 31.167 | 30.896 | 31.766 | 29.866 | 30.651 | 28.768 | 27.378 | 26.245 | 26.509 | 25.676 |

Table 6 Ablation results of different CLIP encoders on Rain100H

| Encoder | PSNR/dB | SSIM |
|---|---|---|
| ResNet50 | 16.53 | 0.546 |
| CLIP-ViT-B/32 | 29.78 | 0.878 |
| CLIP-ViT-B/16 | 30.25 | 0.885 |
| CLIP-RN50 | 30.82 | 0.903 |

Table 7 FLOPs and inference speed of different CLIP encoders

| CLIP encoder | RN50 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|
| FLOPs | 2.36×10^10 | 2.43×10^11 | 9.76×10^11 |
| Inference speed (s/frame) | 0.23 | 0.56 | 1.06 |

Table 8 Ablation results on Rain100H

| Network | UpDWBlock | LDFPS | PSNR/dB | SSIM |
|---|---|---|---|---|
| N1 | - | - | 30.02 | 0.879 |
| N2 | √ | - | 30.42 | 0.893 |
| N3 | - | √ | 30.29 | 0.887 |
| N4 | √ | √ | 30.82 | 0.903 |

Table 9 Ablation of different loss functions

| Loss function | Lmse | L1 | Lmse+Lp | L1+Lp |
|---|---|---|---|---|
| PSNR/dB | 29.44 | 29.76 | 29.39 | 30.82 |
| SSIM | 0.880 | 0.892 | 0.885 | 0.903 |

Table 10 Ablation of different λp values

| λp | 0.1 | 1 | 10 |
|---|---|---|---|
| PSNR/dB | 29.51 | 30.82 | 29.66 |
| SSIM | 0.884 | 0.903 | 0.887 |

Table 11 Ablation of perturbation strategies with different strengths (σi)

| Perturbation strategy | PSNR/dB | SSIM |
|---|---|---|
| Equal low strength (σ1=σ2=σ3=σ4=0.01) | 30.63 | 0.892 |
| Equal high strength (σ1=σ2=σ3=σ4=0.1) | 30.58 | 0.887 |
| Strength decreasing with depth (σ1=0.1, σ2=0.06, σ3=0.03, σ4=0.01) | 30.11 | 0.880 |
| Strength increasing with depth (σ1=0.01, σ2=0.03, σ3=0.06, σ4=0.1) | 30.82 | 0.903 |
-
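The best schedule in Table 11 injects Gaussian perturbations into the skip-connection features with strength increasing from shallow to deep layers. A minimal sketch of that idea follows; the function name, the additive-noise form, and the training-only gating are assumptions for illustration (the tables specify only the per-layer σ values):

```python
import numpy as np

# sigma_1 .. sigma_4, shallow -> deep (the best schedule in Table 11).
SIGMAS = [0.01, 0.03, 0.06, 0.1]

def perturb_skip_features(features, training=True, rng=None):
    """Apply per-layer Gaussian perturbation to a list of skip features.

    features: list of 4 arrays, one per encoder stage (shallow first).
    The perturbation is assumed active only during training; at inference
    the features pass through unchanged.
    """
    if not training:
        return features
    if rng is None:
        rng = np.random.default_rng()
    out = []
    for feat, sigma in zip(features, SIGMAS):
        # Deeper layers receive stronger noise, encouraging the decoder to
        # rely on semantics rather than exact feature values.
        noise = rng.normal(0.0, sigma, size=feat.shape).astype(feat.dtype)
        out.append(feat + noise)
    return out
```

The zero-mean noise leaves the expected feature values unchanged while forcing robustness to feature-level variation, which is one plausible reading of why the increasing schedule generalizes best.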
[1] LI Yufeng, LU Jiyang, CHEN Hongming, et al. Dilated convolutional transformer for high-quality image deraining[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 4199–4207. doi: 10.1109/CVPRW59228.2023.00442.
[2] KANG Liwei, LIN C W, and FU Y H. Automatic single-image-based rain streaks removal via image decomposition[J]. IEEE Transactions on Image Processing, 2012, 21(4): 1742–1755. doi: 10.1109/TIP.2011.2179057.
[3] ZHU Lei, FU C W, LISCHINSKI D, et al. Joint Bi-layer optimization for single-image rain streak removal[C]. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2545–2553. doi: 10.1109/ICCV.2017.276.
[4] FU Xueyang, HUANG Jiabin, DING Xinghao, et al. Clearing the skies: A deep network architecture for single-image rain removal[J]. IEEE Transactions on Image Processing, 2017, 26(6): 2944–2956. doi: 10.1109/TIP.2017.2691802.
[5] FU Xueyang, HUANG Jiabin, ZENG Delu, et al. Removing rain from single images via a deep detail network[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1715–1723. doi: 10.1109/CVPR.2017.186.
[6] MEI Tiancan, CAO Min, YANG Hong, et al. Two-stage rain image removal based on density guidance[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1383–1390. doi: 10.11999/JEIT220157.
[7] REN Dongwei, ZUO Wangmeng, HU Qinghua, et al. Progressive image deraining networks: A better and simpler baseline[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3932–3941. doi: 10.1109/CVPR.2019.00406.
[8] WEI Wei, MENG Deyu, ZHAO Qian, et al. Semi-supervised transfer learning for image rain removal[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3872–3881. doi: 10.1109/CVPR.2019.00400.
[9] YASARLA R, SINDAGI V A, and PATEL V M. Syn2Real transfer learning for image deraining using Gaussian processes[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2723–2733. doi: 10.1109/CVPR42600.2020.00280.
[10] JIANG Kui, WANG Zhongyuan, CHEN Chen, et al. Magic ELF: Image deraining meets association learning and transformer[C]. Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 827–836. doi: 10.1145/3503161.3547760.
[11] XIAO Jie, FU Xueyang, LIU Aiping, et al. Image de-raining transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12978–12995. doi: 10.1109/TPAMI.2022.3183612.
[12] CUI Yuning, REN Wenqi, CAO Xiaochun, et al. Revitalizing convolutional network for image restoration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 9423–9438. doi: 10.1109/TPAMI.2024.3419007.
[13] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. Proceedings of the 38th International Conference on Machine Learning, 2021: 8748–8763.
[14] MA Wenxin, ZHANG Xu, YAO Qingsong, et al. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 4744–4754. doi: 10.1109/CVPR52734.2025.00447.
[15] SUN Zeyi, FANG Ye, WU Tong, et al. Alpha-CLIP: A CLIP model focusing on wherever you want[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13019–13029. doi: 10.1109/CVPR52733.2024.01237.
[16] WANG Mengmeng, XING Jiazheng, JIANG Boyuan, et al. A multimodal, multi-task adapting framework for video action recognition[C]. Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2024: 5517–5525. doi: 10.1609/aaai.v38i6.28361.
[17] LUO Ziwei, GUSTAFSSON F K, ZHAO Zheng, et al. Controlling vision-language models for multi-task image restoration[C]. Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.
[18] LIN Jingbo, ZHANG Zhilu, WEI Yuxiang, et al. Improving image restoration through removing degradations in textual representations[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 2866–2878. doi: 10.1109/CVPR52733.2024.00277.
[19] WEN Yuanbo, GAO Tao, AN Yisheng, et al. Weather-degraded image restoration based on visual prompt learning[J]. Chinese Journal of Computers, 2024, 47(10): 2401–2416. doi: 10.11897/SP.J.1016.2024.02401.
[20] WANG Ruiyi, LI Wenhao, LIU Xiaohong, et al. HazeCLIP: Towards language guided real-world image dehazing[C]. 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/ICASSP49660.2025.10889509.
[21] CHENG Jun, LIANG Dong, and TAN Shan. Transfer CLIP for generalizable image denoising[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25974–25984. doi: 10.1109/CVPR52733.2024.02454.
[22] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
[23] ZAMIR S W, ARORA A, KHAN S, et al. Multi-stage progressive image restoration[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 14816–14826. doi: 10.1109/CVPR46437.2021.01458.
[24] ZHANG He and PATEL V M. Density-aware single image de-raining using a multi-stream dense network[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 695–704. doi: 10.1109/CVPR.2018.00079.
[25] ZHOU Tianfei, YUAN Ye, WANG Binglu, et al. Federated feature augmentation and alignment[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 11119–11135. doi: 10.1109/TPAMI.2024.3457751.
[26] LI Xia, WU Jianlong, LIN Zhouchen, et al. Recurrent squeeze-and-excitation context aggregation net for single image deraining[C]. Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 262–277. doi: 10.1007/978-3-030-01234-2_16.
[27] JIANG Kui, LIU Wenxuan, WANG Zheng, et al. DAWN: Direction-aware attention wavelet network for image deraining[C]. Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023: 7065–7074. doi: 10.1145/3581783.3611697.
[28] WANG Zhendong, CUN Xiaodong, BAO Jianmin, et al. Uformer: A general U-shaped transformer for image restoration[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 17662–17672. doi: 10.1109/CVPR52688.2022.01716.
[29] LI Yufeng, LU Jiyang, CHEN Hongming, et al. Dilated convolutional transformer for high-quality image deraining[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 4199–4207. doi: 10.1109/CVPRW59228.2023.00442.
[30] YAN Fei, HE Yuhong, CHEN Keyu, et al. Adaptive frequency enhancement network for single image deraining[C]. 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Kuching, Malaysia, 2024: 4534–4541. doi: 10.1109/SMC54092.2024.10831025.
[31] HE Yuhong, JIANG Aiwen, JIANG Lingfang, et al. Dual-path coupled image deraining network via spatial-frequency interaction[C]. 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024: 1452–1458. doi: 10.1109/ICIP51287.2024.10647753.
-