HWT-SRNet: Heterogeneous Windows Transformer Network for Image Super-Resolution

LU Di; DANG Anyuan

doi:10.11999/JEIT250868

cstr: 32379.14.JEIT250868

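LU Di^{1, 2}, DANG Anyuan^{1, 2}

1. School of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China
2. Heilongjiang Province Key Laboratory of Pattern Recognition and Information Perception, Harbin University of Science and Technology, Harbin 150080, China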

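Author biographies:
LU Di: female, professor, Ph.D.; research interests include data fusion and image processing.

DANG Anyuan: male, master's student; research interests include image processing and super-resolution reconstruction.

Corresponding author:
LU Di, ludizeng@hrbust.edu.cn

CLC number: TN911.73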
Publication history
- Revised: 2026-05-29
- Accepted: 2026-05-29
- Published online: 2026-06-08


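Abstract

In the era of big data, image quality varies widely, so reconstructing high-resolution images from low-quality inputs has significant research and practical value. Transformer-based single-image super-resolution methods usually restrict self-attention to local non-overlapping windows, which leads to a limited receptive field, distortion at window boundaries, and insufficient reconstruction of high-frequency details. To address these problems, this paper proposes a heterogeneous window attention network built on SwinIR, the Heterogeneous Windows Transformer Network for Image Super-Resolution (HWT-SRNet). First, a heterogeneous window attention mechanism is designed to fuse multi-scale features, which alleviates window-boundary distortion and effectively enlarges the receptive field. Second, to compensate for the Transformer's weakness in reconstructing high-frequency information, a high-frequency prior feature extraction network is proposed to strengthen the recovery of edge and texture details. Experimental results show that, on the five benchmark test sets Set5, Set14, BSD100, Urban100, and Manga109, HWT-SRNet improves PSNR by 0.10 dB to 0.37 dB over the baseline SwinIR; compared with other representative super-resolution models such as CAT, ACT, and ART, it also produces better visual results in image details and textures.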
Keywords: super-resolution reconstruction; Transformer; heterogeneous windows; high-frequency prior
Abstract: In the era of big data, image quality varies greatly, making the reconstruction of high-resolution images from low-quality inputs a critical task in computer vision. Existing super-resolution methods based on window self-attention, such as SwinIR, have a limited receptive field and capture high-frequency details poorly, which reduces their effectiveness in reconstructing fine image structures. To overcome these limitations, this study proposes the Heterogeneous Windows Transformer Network for Image Super-Resolution (HWT-SRNet), an architecture built upon SwinIR that enhances the extraction of high-frequency details while expanding the receptive field.

Methods: Building upon the SwinIR framework, this study incorporates two key modules to improve super-resolution reconstruction performance.

(1) Heterogeneous Windows Transformer Block (HWTB): Traditional window-based self-attention mechanisms suffer from a constrained receptive field, which limits their ability to capture long-range dependencies. HWTB alternates between square windows and pale windows, preserving local feature extraction while significantly expanding the receptive field. This alternation enables the network to model both fine-grained details and global structural information, improving overall reconstruction quality. The window size and alternation frequency are chosen to trade off computational cost against feature extraction capability.

(2) High-Frequency Prior Extraction Network (HFPEN): Transformer-based super-resolution models often struggle to capture high-frequency details because of their inherent bias towards low-frequency components. To mitigate this issue, HFPEN explicitly extracts high-frequency prior information from images using a Difference of Gaussians (DoG) filter. The DoG filter emphasizes high-frequency details, including edges and textures, by computing the difference between a lightly blurred image (retaining mid-frequency information) and a more heavily blurred one (retaining only low-frequency information). This high-frequency information is then fused with the heterogeneous window attention mechanism, allowing HWT-SRNet to enhance fine details while maintaining structural coherence. Because the DoG filter is applied in the spatial domain, the model can capture and reconstruct sharp edges and textures without frequency-domain transformations, focusing on high-frequency features while preserving the overall image structure.

Results and Discussions: To assess the effectiveness of HWT-SRNet, experiments were performed on five widely used benchmark datasets: Set5, Set14, BSD100, Urban100, and Manga109. The method was compared with representative state-of-the-art approaches, including ACT, ART, and CAT. The results demonstrate its superior performance across key evaluation metrics (see Table 1 for detailed comparisons). Specifically, at the ×4 scale HWT-SRNet improves PSNR by 0.10 dB to 0.37 dB over the baseline SwinIR. Structural similarity (SSIM) scores also improve consistently, indicating better perceptual quality and more visually pleasing reconstructions. Qualitative results further confirm that HWT-SRNet restores sharper edges, preserves textures, and reduces blurring artifacts compared with existing methods. To validate the contribution of each component, ablation studies analyzed the impact of the Heterogeneous Windows Transformer Block (HWTB) and the High-Frequency Prior Extraction Network (HFPEN) (see Table 5 for ablation results). The gains stem from the synergy between heterogeneous window attention and high-frequency prior extraction, which lets the network balance local feature refinement and global contextual understanding while expanding the receptive field and improving high-frequency detail reconstruction.

Conclusion: Considering the limitations of existing super-resolution algorithms, this paper proposes a novel Heterogeneous Windows Transformer Network (HWT-SRNet), designed to improve image reconstruction quality by addressing challenges in receptive field expansion and high-frequency detail capture. The integration of heterogeneous window attention and high-frequency prior feature extraction allows the model to fuse local and global features more effectively, leading to superior performance in both PSNR and SSIM. Experimental results confirm that HWT-SRNet surpasses existing state-of-the-art methods on the tested benchmarks. However, this study does not specifically explore the model's adaptability to noise interference in real-world scenarios. Future research can focus on improving HWT-SRNet's robustness to noisy and degraded inputs, and thus its applicability to practical image restoration tasks in diverse environments. The model's performance on specialized datasets, such as medical or satellite images, also remains to be explored and could further validate its generalization capability.
Keywords: Super-resolution reconstruction; Transformer; Alternating windows; High-frequency prior
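
The Difference of Gaussians step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' HFPEN: the function names, the two sigma values (1.0 and 2.0), the 3-sigma kernel truncation, and the reflect padding are all assumptions, since the abstract does not state the filter parameters.

```python
import numpy as np


def gaussian_kernel1d(sigma: float) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3.0 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Blur a 2-D grayscale image with a separable Gaussian (reflect padding)."""
    kernel = gaussian_kernel1d(sigma)
    radius = kernel.size // 2
    padded = np.pad(image.astype(np.float64), radius, mode="reflect")
    # Filter along rows, then along columns; "valid" strips the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def dog_high_frequency(image: np.ndarray,
                       sigma_fine: float = 1.0,     # assumed value
                       sigma_coarse: float = 2.0    # assumed value
                       ) -> np.ndarray:
    """Lightly blurred image minus heavily blurred image (spatial-domain DoG)."""
    if sigma_fine >= sigma_coarse:
        raise ValueError("sigma_fine must be smaller than sigma_coarse")
    return gaussian_blur(image, sigma_fine) - gaussian_blur(image, sigma_coarse)


# Usage: a vertical step edge produces a response only near the edge.
step = np.zeros((32, 32))
step[:, 16:] = 1.0
response = dog_high_frequency(step)
```

Flat regions cancel out because both blurs return the same constant, so the response is concentrated around edges and textures, which is the property the abstract relies on for the high-frequency prior.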

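Figure 1 Different attention mechanisms

Figure 2 Heterogeneous windows image super-resolution reconstruction network

Figure 3 Heterogeneous window attention module

Figure 4 High-frequency prior extraction network

Figure 5 Comparison of ×4 super-resolution reconstructions by different methods on the Set14 dataset

Figure 6 Comparison of ×4 super-resolution reconstructions by different methods on the Urban100 dataset

Figure 7 Visual comparison of ×4 super-resolution reconstructions of the "baby" image from the Set5 dataset with different modules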

Table 1 Comparison of mean PSNR (dB) and SSIM of different methods on each dataset

Method	Scale	Set5 (PSNR/SSIM)	Set14 (PSNR/SSIM)	BSD100 (PSNR/SSIM)	Urban100 (PSNR/SSIM)	Manga109 (PSNR/SSIM)
ESWT [8]	×2	38.33/0.9615	34.22/0.9233	32.47/0.9034	33.27/0.9397	39.79/0.9790
CAT-R [7]	×2	38.48/0.9625	34.53/0.9251	32.56/0.9045	34.08/0.9443	40.09/0.9804
SwinIR [6]	×2	38.42/0.9623	34.46/0.9250	32.53/0.9041	33.81/0.9427	39.92/0.9797
ACT [14]	×2	38.46/0.9626	34.60/0.9256	32.56/0.9048	34.07/0.9443	39.95/0.9804
CRAFT [18]	×2	38.23/0.9615	33.92/0.9211	32.33/0.9016	32.86/0.9343	39.39/0.9786
ART [11]	×2	38.56/0.9629	34.59/0.9267	32.58/0.9048	34.30/0.9452	40.24/0.9808
DFDN [20]	×2	38.19/0.9612	33.85/0.9199	32.30/0.9013	32.68/0.9335	-
MDIESR [21]	×2	38.17/0.9613	33.83/0.9200	32.31/0.9013	32.65/0.9331	-
HWT-SRNet	×2	38.59/0.9632	34.81/0.9287	32.58/0.9050	34.42/0.9453	40.25/0.9812
ESWT [8]	×3	34.63/0.9290	30.55/0.8464	29.23/0.8088	28.70/0.8628	34.05/0.9479
CAT-R [7]	×3	34.99/0.9320	31.00/0.8539	29.49/0.8154	29.91/0.8848	35.29/0.9542
SwinIR [6]	×3	34.97/0.9318	30.93/0.8534	29.46/0.8145	29.75/0.8826	35.12/0.9537
ACT [14]	×3	35.03/0.9321	31.08/0.8541	29.51/0.8164	30.08/0.8858	35.27/0.9540
CRAFT [18]	×3	34.71/0.9295	30.61/0.8469	29.24/0.8093	28.77/0.8635	34.29/0.9491
ART [11]	×3	35.07/0.9325	31.02/0.8541	29.51/0.8159	30.10/0.8871	35.39/0.9548
DFDN [20]	×3	34.69/0.9293	30.55/0.8464	29.25/0.8089	28.70/0.8630	-
MDIESR [21]	×3	34.69/0.9295	30.58/0.8465	29.25/0.8087	28.72/0.8634	-
HWT-SRNet	×3	35.12/0.9344	31.06/0.8551	29.57/0.8173	30.18/0.8889	35.48/0.9552
ESWT [8]	×4	32.46/0.8979	28.80/0.7866	27.70/0.7410	26.56/0.8006	30.94/0.9136
CAT-R [7]	×4	32.89/0.9044	29.13/0.7955	27.95/0.7500	27.62/0.8292	32.16/0.9269
SwinIR [6]	×4	32.92/0.9044	29.09/0.7950	27.92/0.7489	27.45/0.8254	32.03/0.9260
ACT [14]	×4	32.97/0.9031	29.18/0.7954	27.95/0.7507	27.74/0.8305	32.20/0.9267
CRAFT [18]	×4	32.52/0.8989	28.85/0.7872	27.72/0.7418	26.56/0.7995	31.18/0.9168
ART [11]	×4	33.04/0.9051	29.16/0.7958	27.97/0.7510	27.77/0.8321	32.31/0.9283
DFDN [20]	×4	32.56/0.8989	28.87/0.7880	27.73/0.7414	26.59/0.8008	-
MDIESR [21]	×4	32.49/0.8986	28.84/0.7867	27.73/0.7399	26.59/0.8007	-
HWT-SRNet	×4	33.08/0.9060	29.23/0.7975	28.02/0.7520	27.82/0.8370	32.35/0.9296
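
For readers unfamiliar with the PSNR values reported in Table 1, the metric can be sketched as below. This is a generic definition under stated assumptions, not the authors' evaluation script: the function name `psnr` and the 8-bit peak value of 255 are illustrative, and this page does not say whether the authors evaluate on the luminance channel or crop image borders, so neither step is included.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray,
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the peak pixel value (255 assumes 8-bit images).
    """
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    if ref.shape != est.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - est) ** 2)  # mean squared error over all pixels
    if mse == 0.0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Usage: a reconstruction that is off by one gray level everywhere.
hr = np.full((8, 8), 100.0)
sr = hr + 1.0
score = psnr(hr, sr)
```

Because PSNR is logarithmic in the mean squared error, the sub-decibel differences between methods in Table 1 correspond to small but systematic reductions in pixel error.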

Table 2 Comparison of mean LPIPS of different methods on each dataset (×4)

Method	Set5	Set14	BSD100	Urban100	Manga109
ESWT [8]	0.2078	0.2977	0.3383	0.2812	0.1912
CAT-R [7]	0.2061	0.2927	0.3279	0.2496	0.1819
SwinIR [6]	0.2079	0.2957	0.3321	0.2602	0.1847
ACT [14]	0.2078	0.2904	0.3235	0.2506	0.1840
CRAFT [18]	0.2136	0.3044	0.3389	0.2816	0.1920
ART [11]	0.2068	0.2913	0.3259	0.2464	0.1804
HWT-SRNet	0.2050	0.2907	0.3255	0.2448	0.1799

Table 3 Comparison of parameter counts and reconstruction time

Method	Parameters	Reconstruction time (s)
ESWT [8]	589K	1.20
CAT-R [7]	16.6M	4.41
SwinIR [6]	11.9M	1.38
ACT [14]	46M	10.51
CRAFT [18]	753K	1.92
ART [11]	16.55M	3.85
HWT-SRNet	16.63M	2.96

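Table 4 Comparison of results with different window sizes

No.	Window shape	Window size	Multi-Adds (GMac)	PSNR (dB)/SSIM
1	Square window	(8, 8)	53.6	32.92/0.9044
1	Square window	(16, 16)	63.8	32.98/0.9050
1	Square window	(32, 32)	119.3	33.01/0.9051
2	Pale-shaped window	2	79.5	32.56/0.8989
2	Pale-shaped window	4	82.4	32.82/0.9029
2	Pale-shaped window	8	94.1	32.99/0.9049
2	Pale-shaped window	16	120.3	33.01/0.9050
3	Heterogeneous windows	(8, 8), 8	81.4	33.00/0.9050
3	Heterogeneous windows	(8, 8), 16	87.5	33.01/0.9052
3	Heterogeneous windows	(16, 16), 4	73.1	32.90/0.9040
3	Heterogeneous windows	(16, 16), 8	87.0	33.03/0.9054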
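
The two window shapes compared in Table 4 can be illustrated with a short NumPy sketch. It is an assumption-laden illustration rather than the authors' implementation: `square_window_partition` and `pale_mask` are hypothetical helpers, and the pale layout (a fixed number of evenly interlaced rows plus the same number of interlaced columns) follows the general idea of pale-shaped attention in [9]; how HWT-SRNet defines and alternates its windows is not specified on this page.

```python
import numpy as np


def square_window_partition(feat: np.ndarray, window: int) -> np.ndarray:
    """Split an (H, W, C) map into non-overlapping window x window blocks.

    Returns an array of shape (num_windows, window * window, C); attention
    would then be computed independently inside each block.
    """
    h, w, c = feat.shape
    if h % window or w % window:
        raise ValueError("H and W must be divisible by the window size")
    blocks = feat.reshape(h // window, window, w // window, window, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # group the two block indices
    return blocks.reshape(-1, window * window, c)


def pale_mask(h: int, w: int, pale_size: int, index: int) -> np.ndarray:
    """Boolean (H, W) mask of the tokens that belong to one pale.

    A pale here is pale_size interlaced rows plus pale_size interlaced
    columns (assumed layout); index selects which pale of the map.
    """
    if h % pale_size or w % pale_size:
        raise ValueError("H and W must be divisible by the pale size")
    row_step, col_step = h // pale_size, w // pale_size
    if not 0 <= index < min(row_step, col_step):
        raise ValueError("pale index out of range")
    mask = np.zeros((h, w), dtype=bool)
    mask[index::row_step, :] = True  # interlaced full rows
    mask[:, index::col_step] = True  # interlaced full columns
    return mask


# Usage on a 16 x 16 single-channel map.
feat = np.arange(16 * 16, dtype=np.float64).reshape(16, 16, 1)
windows = square_window_partition(feat, window=8)
mask = pale_mask(16, 16, pale_size=4, index=0)
```

A square window only sees its own 8 × 8 neighbourhood, whereas every pale touches all rows and all columns of the map, which is one way to read why alternating the two shapes enlarges the receptive field while keeping local detail.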

Table 5 Comparison of results with different modules

No.	SwinIR	Heterogeneous windows	High-frequency prior extraction network	PSNR (dB)/SSIM
1	√	×	×	32.92/0.9044
2	√	√	×	33.03/0.9054
3	√	×	√	32.98/0.9049
4	√	√	√	33.08/0.9060
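
The per-module contributions implied by Table 5 can be worked out with a few lines of arithmetic. The PSNR values below are copied from the table; the dictionary keys and variable names are illustrative only.

```python
# PSNR (dB) for each configuration in Table 5.
psnr_db = {
    "swinir": 32.92,                # row 1: baseline only
    "swinir+windows": 33.03,        # row 2: plus heterogeneous windows
    "swinir+hf_prior": 32.98,       # row 3: plus high-frequency prior network
    "swinir+windows+hf_prior": 33.08,  # row 4: both modules
}

baseline = psnr_db["swinir"]

# Gain of each configuration over the baseline, rounded to table precision.
gains = {name: round(value - baseline, 2)
         for name, value in psnr_db.items() if name != "swinir"}

# Sum of the two single-module gains, for comparison with the joint gain.
sum_of_single_gains = round(
    gains["swinir+windows"] + gains["swinir+hf_prior"], 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB")
print(f"sum of single-module gains: +{sum_of_single_gains:.2f} dB")
```

Heterogeneous windows alone add 0.11 dB and the high-frequency prior network alone adds 0.06 dB, while the combined model gains 0.16 dB, so the two contributions are close to additive. The row 1 and row 4 values coincide with the ×4 Set5 entries for SwinIR and HWT-SRNet in Table 1.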

References (23)

[1]	DONG Chao, LOY C C, HE Kaiming, et al. Learning a deep convolutional network for image super-resolution[C]. The 13th European Conference on Computer Vision, Zurich, Switzerland, 2014: 184–199. doi: 10.1007/978-3-319-10593-2_13.
[2]	DONG Chao, LOY C C, and TANG Xiaoou. Accelerating the super-resolution convolutional neural network[C]. 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 391–407. doi: 10.1007/978-3-319-46475-6_25.
[3]	ZHANG Yulun, LI Kunpeng, LI Kai, et al. Image super-resolution using very deep residual channel attention networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 294–310. doi: 10.1007/978-3-030-01234-2_18.
[4]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. 9th International Conference on Learning Representations, Vienna, Austria, 2021: 1–21.
[5]	LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021: 9992–10002. doi: 10.1109/ICCV48922.2021.00986.
[6]	LIANG Jingyun, CAO Jiezhang, SUN Guolei, et al. SwinIR: Image restoration using Swin transformer[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, Canada, 2021: 1833–1844. doi: 10.1109/ICCVW54120.2021.00210.
[7]	CHEN Zheng, ZHANG Yulun, GU Jinjin, et al. Cross aggregation transformer for image restoration[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1847.
[8]	SHI Jinpeng, LI Hui, LIU Tianle, et al. Image super-resolution using efficient striped window transformer[EB/OL]. https://arxiv.org/abs/2301.09869, 2023.
[9]	WU Sitong, WU Tianyi, TAN Haoru, et al. Pale transformer: A general vision transformer backbone with pale-shaped attention[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, Palo Alto, USA, 2022: 2731–2739. doi: 10.1609/aaai.v36i3.20176.
[10]	WU Gang, JIANG Junjun, JIANG Kui, et al. Content-aware transformer for all-in-one image restoration[EB/OL]. https://arxiv.org/abs/2504.04869v1, 2025.
[11]	ZHANG Jiale, ZHANG Yulun, GU Jinjin, et al. Accurate image restoration with attention retractable transformer[C]. The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 2023: 1–13.
[12]	CHEN Zheng, ZHANG Yulun, GU Jinjin, et al. Recursive generalization transformer for image super-resolution[C]. The Twelfth International Conference on Learning Representations, Vienna, Austria, 2024: 1–12.
[13]	CHU Shuchuan, DOU Zhichao, PAN J S, et al. HMANet: Hybrid multi-axis aggregation network for image super-resolution[C]. Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, USA, 2024: 6257–6266. doi: 10.1109/CVPRW63382.2024.00629.
[14]	YOO J, KIM T, LEE S, et al. Enriched CNN-transformer feature aggregation networks for super-resolution[C]. Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2023: 4945–4954. doi: 10.1109/WACV56688.2023.00493.
[15]	SI Chenyang, YU Weihao, ZHOU Pan, et al. Inception transformer[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1707.
[16]	KORKMAZ C, TEKALP A M, and DOGAN Z. Training generative image super-resolution models by wavelet-domain losses enables better control of artifacts[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2024: 5926–5936. doi: 10.1109/CVPR52733.2024.00566.
[17]	HAN Yulan, CUI Yujie, LUO Yihong, et al. Frequency separation generative adversarial super-resolution reconstruction network based on dense residual and quality assessment[J]. Journal of Electronics & Information Technology, 2024, 46(12): 4563–4574. doi: 10.11999/JEIT240388.
[18]	LI Ao, ZHANG Le, LIU Yun, et al. Feature modulation transformer: Cross-refinement of global representation via high-frequency prior for image super-resolution[C]. Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 12480–12490. doi: 10.1109/iccv51070.2023.01150.
[19]	YAO Hongdou, HAN Pengfei, WANG Xiaofen, et al. Super-resolution via hierarchical attention and detail enhancement transformer network[J]. Optics & Laser Technology, 2025, 188: 112836. doi: 10.1016/j.optlastec.2025.112836.
[20]	CHENG Deqiang, YUAN Hang, QIAN Jiansheng, et al. Image super-resolution algorithms based on deep feature differentiation network[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1033–1042. doi: 10.11999/JEIT230179.
[21]	KOU Qiqi, LIU Gui, JIANG He, et al. Lightweight image super-resolution network based on multi-domain information enhancement[J]. Journal on Communications, 2025, 46(4): 144–159. doi: 10.11959/j.issn.1000-436x.2025059.
[22]	WANG Xintao, XIE Liangbin, DONG Chao, et al. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, Canada, 2021: 1905–1914. doi: 10.1109/ICCVW54120.2021.00217.
[23]	WANG Yufei, YANG Wenhan, CHEN Xinyuan, et al. SinSR: Diffusion-based image super-resolution in a single step[C]. Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25796–25805. doi: 10.1109/CVPR52733.2024.02437.