Dynamic Focus and Semantic Prompt Network for Fine-Grained Pest Classification
-
Abstract:
Objective  Agricultural pest images are commonly affected by complex background interference, significant appearance differences across morphological stages, diverse shooting angles, and large scale variations. These issues expose clear deficiencies in the feature extraction and morphological adaptability of existing fine-grained classification models. To address them, an Agricultural Pest Multi-dimensional Dataset (APMD) covering multiple morphological stages, viewing angles, and object scales is constructed, and a fine-grained pest classification network based on dynamic focus and semantic prompts (DFS-PestNet) is proposed. A decoupled parallel architecture combining a main feature stream and a prompt enhancement stream is designed. Through a Spatial Dependency Perception (SDP) module, crucial discriminative regions (e.g., pest spots and wing veins) are dynamically focused upon to enhance the extraction of subtle local features under complex backgrounds. An Advanced Haptic-Visual Prompting (AHVP) module integrates category semantics and spatial position information into shallow and middle-level features, improving adaptability to morphological variations across developmental stages. In addition, Dual-Branch Saliency Sampling (DSS) adaptively aggregates critical features of essential pest body parts through learnable prototype components and dual-branch saliency fusion, enhancing the recognition of small targets such as tiny pests and early-stage larvae. Experimental results demonstrate that the proposed model achieves superior classification performance compared to baseline and mainstream methods on both the public and self-constructed datasets, validating its effectiveness and application potential in complex agricultural scenarios and providing a technical reference for intelligent pest monitoring and precise control in smart agriculture.

Methods  To tackle insufficient classification accuracy under complex background interference and multi-morphological conditions, the Agricultural Pest Multi-dimensional Dataset (APMD) is first constructed. It spans multiple pest morphological stages, viewing angles, and scales, containing 15,680 images of 58 species divided into training, validation, and test sets at a 7:2:1 ratio (Fig. 1, Table 1), and provides high-quality data support for further research on fine-grained pest classification. On this basis, the Dynamic Focus and Semantic Prompt Network for fine-grained pest classification (DFS-PestNet) is proposed. Within this architecture, the Spatial Dependency Perception (SDP) module adaptively locates and structurally enhances the key discriminative regions of pests, overcoming pose variations and complex background interference to achieve more accurate fine-grained feature extraction. The Advanced Haptic-Visual Prompting (AHVP) module is introduced into the network pipeline to embed category semantics and spatial position information.
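Although no reference implementation accompanies the abstract, the dynamic-focus behavior attributed to the SDP module is reminiscent of deformable convolution [16], where learned offsets bend the sampling grid toward discriminative regions such as spots and wing veins. Below is a minimal PyTorch sketch of such a block; the class name, layer sizes, and residual layout are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of a dynamic-focus block in the spirit of SDP, built on
# deformable convolution [16]. All names and sizes are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DynamicFocusBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Predict a 2-D offset for every kernel position so the receptive
        # field can bend toward discriminative parts (spots, wing veins).
        self.offset_conv = nn.Conv2d(
            channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(
            channels, channels, kernel_size, padding=pad)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)       # (N, 2*k*k, H, W)
        y = self.deform_conv(x, offset)    # resample features at shifted positions
        return x + self.act(self.norm(y))  # residual keeps the main stream intact

feat = torch.randn(2, 256, 28, 28)
print(DynamicFocusBlock(256)(feat).shape)  # torch.Size([2, 256, 28, 28])
```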
This module guides the network to focus consistently on crucial discriminative features across different morphological periods, improving recognition robustness against the dramatic morphological changes throughout the pest life cycle. Furthermore, Dual-Branch Saliency Sampling (DSS) is proposed to adaptively aggregate the features of essential pest body parts, strengthening the recognition of challenging small targets and alleviating the inherent difficulty of small-target recognition in fine-grained pest classification.

Results and Discussions  The performance of DFS-PestNet in fine-grained pest classification is evaluated through multi-dimensional experiments. First, in qualitative visualization analysis, Grad-CAM heatmaps show that, unlike the baseline model, which is highly susceptible to interference from complex farmland backgrounds and plant stems, DFS-PestNet effectively suppresses background noise and focuses precisely on fine-grained discriminative parts such as pest heads and antennae (Fig. 6). Clear advantages are demonstrated in capturing tiny targets (e.g., leafhopper nymphs) and pests at different life stages (e.g., Chilo suppressalis hidden within stems). The t-SNE dimensionality reduction results further confirm that the proposed model alleviates feature confusion in multi-morphological scenarios, yielding clearer inter-class separation and tighter intra-class clustering in the two-dimensional visualization space (Fig. 7). Second, in quantitative experiments, the ablation studies confirm the synergistic effect of the three improved modules SDP, AHVP, and DSS (Table 2): their combination raises the baseline's classification accuracy by 2.21 percentage points to 77.24%, with all core evaluation metrics reaching their best values. Hyperparameter optimization determines the optimal number of prompt position tokens to be 6 and the optimal feature dropout rate to be 0.2 (Fig. 8); this configuration preserves complete semantic expression while best balancing the simulation of natural occlusion against overall model robustness. Finally, in comparison with state-of-the-art CNN and Transformer architectures such as Gate-ViT and EST, DFS-PestNet achieves the highest accuracies of 77.24% on the large-scale public dataset IP102 and 98.01% on the challenging self-constructed multi-dimensional dataset APMD (Table 3, Table 4), and leads on all fine-grained classification metrics. Moreover, while maintaining high classification accuracy, its inference speeds reach 158 frames/s and 164 frames/s on the two datasets, respectively.
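The reported optima (6 prompt position tokens, feature dropout rate 0.2) suggest a prompt-injection step of roughly the following form, in which learnable tokens are prepended to the patch tokens of a shallow or middle stage. This is a minimal sketch under those assumptions, not the AHVP module itself.

```python
# Hedged sketch of prompt-token injection in the spirit of AHVP. The sizes
# follow the reported optima (6 position tokens, dropout 0.2); the module
# structure itself is an illustrative assumption.
import torch
import torch.nn as nn

class PromptInjection(nn.Module):
    def __init__(self, dim: int, n_pos_tokens: int = 6, p_drop: float = 0.2):
        super().__init__()
        self.pos_prompts = nn.Parameter(torch.zeros(1, n_pos_tokens, dim))
        nn.init.trunc_normal_(self.pos_prompts, std=0.02)
        self.drop = nn.Dropout(p_drop)  # feature dropout, akin to natural occlusion

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, L, dim) patch tokens from a shallow/middle stage
        prompts = self.pos_prompts.expand(tokens.size(0), -1, -1)
        return torch.cat([prompts, self.drop(tokens)], dim=1)  # (N, 6+L, dim)

x = torch.randn(2, 196, 768)
print(PromptInjection(768)(x).shape)  # torch.Size([2, 202, 768])
```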
In summary, DFS-PestNet combines high classification accuracy with efficient inference in complex pest feature extraction across large scale variations and multiple morphological stages, laying a solid foundation for deployment in practical smart-agriculture scenarios.

Conclusions  To address multi-morphological variation and small-target recognition in fine-grained pest classification, the multi-dimensional dataset APMD is constructed and the DFS-PestNet model is proposed on top of the MPSA baseline. Specifically, the SDP module adaptively focuses on pose- and morphology-invariant discriminative features; the AHVP module embeds robust category semantics and spatial position information into the shallow and middle layers of the network; and the DSS module adaptively aggregates crucial body-part features to enhance small-target recognition (see the sketch below). Experimental results verify the superiority of DFS-PestNet over mainstream models on both the IP102 and APMD datasets across developmental stages, angles, and scales. Future work will explore lightweight model variants for efficient edge deployment and open-set recognition for early warning of unknown pest categories in complex real-world environments.
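The DSS idea of learnable prototype parts plus dual-branch saliency fusion can be pictured with the following minimal sketch, in which prototype queries cross-attend to saliency-reweighted feature tokens. Every detail here (the two saliency branches, attention-based pooling, part count) is an assumption for illustration, not the published design.

```python
# Hedged sketch of part aggregation in the spirit of DSS. Learnable prototype
# part queries cross-attend to the feature map; two simple saliency branches
# (activation energy and a learned 1x1 map) are fused to reweight spatial
# positions before aggregation. All details are assumptions.
import torch
import torch.nn as nn

class PartSampler(nn.Module):
    def __init__(self, dim: int, n_parts: int = 4):
        super().__init__()
        self.parts = nn.Parameter(torch.randn(n_parts, dim) * 0.02)  # prototype parts
        self.sal_conv = nn.Conv2d(dim, 1, kernel_size=1)             # learned saliency
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        n, c, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)                 # (N, H*W, C)
        # Branch 1: parameter-free activation energy; Branch 2: learned map.
        s1 = fmap.pow(2).mean(1, keepdim=True)                   # (N, 1, H, W)
        s2 = self.sal_conv(fmap)                                 # (N, 1, H, W)
        sal = torch.sigmoid(s1 + s2).flatten(2).transpose(1, 2)  # (N, H*W, 1)
        q = self.parts.unsqueeze(0).expand(n, -1, -1)            # (N, P, C)
        parts, _ = self.attn(q, tokens * sal, tokens * sal)      # saliency-weighted
        return parts                                             # (N, P, C) part features

out = PartSampler(256)(torch.randn(2, 256, 14, 14))
print(out.shape)  # torch.Size([2, 4, 256])
```
-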
Table 1  Hierarchical classification scheme of the APMD dataset
Order | Species | Morphological stages | Images | Train | Val | Test
Hemiptera | 23 | 3 (egg/nymph/adult) | 6210 | 4347 | 1242 | 621
Lepidoptera | 16 | 4 (egg/larva/pupa/adult) | 4320 | 3024 | 864 | 432
Coleoptera | 10 | 4 (egg/larva/pupa/adult) | 2700 | 1890 | 540 | 270
Diptera | 3 | 4 (egg/larva/pupa/adult) | 810 | 567 | 162 | 81
Hymenoptera | 2 | 4 (egg/larva/pupa/adult) | 540 | 378 | 108 | 54
Acari | 2 | 4 (egg/larva/pupa/adult) | 540 | 378 | 108 | 54
Orthoptera | 1 | 3 (egg/nymph/adult) | 270 | 189 | 54 | 27
Isoptera | 1 | 3 (egg/nymph/adult) | 270 | 189 | 54 | 27
Total | 58 | | 15660 | 10962 | 3132 | 1566
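The per-order counts in Table 1 follow the stated 7:2:1 ratio exactly (e.g., for Hemiptera, 4347 = 0.7 × 6210, 1242 = 0.2 × 6210, and 621 = 0.1 × 6210). Below is a minimal sketch of producing such a stratified split, assuming scikit-learn; the file names and labels are placeholders.

```python
# Hedged sketch: a stratified 7:2:1 train/val/test split like the one
# described for APMD. Assumes scikit-learn; paths/labels are placeholders.
from sklearn.model_selection import train_test_split

def split_721(paths, labels, seed=42):
    # Peel off 30% for val+test, then split that 30% into 2:1 (val:test).
    tr_p, rest_p, tr_y, rest_y = train_test_split(
        paths, labels, test_size=0.3, stratify=labels, random_state=seed)
    va_p, te_p, va_y, te_y = train_test_split(
        rest_p, rest_y, test_size=1/3, stratify=rest_y, random_state=seed)
    return (tr_p, tr_y), (va_p, va_y), (te_p, te_y)

paths = [f"img_{i:03d}.jpg" for i in range(100)]
labels = [i % 2 for i in range(100)]
(tr, _), (va, _), (te, _) = split_721(paths, labels)
print(len(tr), len(va), len(te))  # 70 20 10
```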
Table 2  Ablation results
No. | SDP | AHVP | DSS | A/% | P/% | R/% | F1-score/%
1 | | | | 75.03 | 67.85 | 64.60 | 66.18
2 | √ | | | 75.73 | 69.10 | 65.90 | 67.46
3 | | √ | | 75.67 | 68.95 | 65.85 | 67.36
4 | | | √ | 75.38 | 68.40 | 65.30 | 66.81
5 | √ | √ | | 76.52 | 70.40 | 67.45 | 68.89
6 | √ | | √ | 76.49 | 70.35 | 67.30 | 68.79
7 | | √ | √ | 76.50 | 70.38 | 67.40 | 68.86
8 | √ | √ | √ | 77.24 | 71.64 | 68.78 | 70.18
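In Tables 2 to 4, A denotes overall accuracy and P, R, and F1-score denote class-averaged precision, recall, and F1. A minimal sketch of computing them follows, assuming scikit-learn and macro averaging (the averaging scheme is not stated in the source).

```python
# Hedged sketch of the A/P/R/F1 columns. Macro averaging is an assumption;
# the source does not state the averaging scheme.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"A/%": 100 * acc, "P/%": 100 * p, "R/%": 100 * r, "F1/%": 100 * f1}

print(classification_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```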
Table 3  Comparison results on the IP102 dataset
Model | Backbone | A/% | P/% | R/% | F1-score/% | FPS
VRFNet[18] | EfficientNet | 68.34 | 68.37 | 68.33 | 68.34 | -
EfficientNet B7[19] | EfficientNet | 70.01 | - | - | - | -
ViT[20] | ViT-B/16 | 73.40 | 68.17 | 66.89 | 66.52 | 70
IELT[21] | ViT-B/16 | 75.40 | 69.50 | 67.82 | 68.65 | 62
FFEL-Net[22] | ViT-B/16 | 76.21 | 68.44 | 66.86 | 67.64 | -
FRCF[23] | ViT-B/16 | 74.69 | - | - | - | 91
Gate-ViT[24] | ViT-B/16 | 76.10 | 70.11 | 69.14 | 69.62 | 109
EST[25] | Swin-B | 71.84 | 65.85 | 63.22 | 64.06 | 84
Ours (DFS-PestNet) | Swin-B | 77.24 | 71.64 | 68.78 | 70.18 | 158
Table 4  Comparison results on the APMD dataset
Model | Backbone | A/% | P/% | R/% | F1-score/% | FPS
ViT[20] | ViT-B/16 | 88.92 | 89.80 | 87.45 | 88.61 | 125
IELT[21] | ViT-B/16 | 90.24 | 90.90 | 89.21 | 90.05 | 117
Gate-ViT[24] | ViT-B/16 | 91.60 | 92.10 | 90.87 | 91.48 | 112
CLCA[26] | ViT-B/16 | 93.52 | 94.00 | 92.79 | 93.39 | 110
GLSim[27] | ViT-B/16 | 95.78 | 96.10 | 95.32 | 95.71 | 105
EST[25] | Swin-B | 93.45 | 93.82 | 93.10 | 93.46 | 138
Ours (DFS-PestNet) | Swin-B | 98.01 | 98.20 | 98.00 | 97.90 | 164
-
[1] LU Yanhui, LIU Yang, YANG Xianming, et al. Advances in integrated management of agricultural insect pests in China: 2018-2022[J]. Plant Protection, 2023, 49(5): 145–166. doi: 10.16688/j.zwbh.2023207.
[2] ZHAO Xueru, LI Hui, HU Xinyi, et al. Survey of automatic identification of field pests based on deep learning[J]. Journal of Image and Signal Processing, 2023, 12(2): 77–88. doi: 10.12677/JISP.2023.122008.
[3] WU Xiaoping, ZHAN Chi, LAI Yukun, et al. IP102: A large-scale benchmark dataset for insect pest recognition[C]. The 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 8779–8788. doi: 10.1109/CVPR.2019.00899.
[4] CHEN Lei, LIU Libo, and WANG Xiaoli. A dataset of image-text cross-modal retrieval of Lycium barbarum pests in Ningxia in 2020[J]. China Scientific Data, 2022, 7(3): 1–8. doi: 10.11922/11-6035.nasdc.2021.0058.zh.
[5] LI Yanfen, WANG Hanxiang, DANG L M, et al. Crop pest recognition in natural scenes using convolutional neural networks[J]. Computers and Electronics in Agriculture, 2020, 169: 105174. doi: 10.1016/j.compag.2019.105174.
[6] BOLLIS E, PEDRINI H, and AVILA S. Weakly supervised learning guided by activation mapping applied to a novel citrus pest benchmark[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2020: 310–319. doi: 10.1109/CVPRW50498.2020.00043.
[7] FANG Mingwei, TAN Zhiping, TANG Yu, et al. Pest-ConFormer: A hybrid CNN-Transformer architecture for large-scale multi-class crop pest recognition[J]. Expert Systems with Applications, 2024, 255: 124833. doi: 10.1016/j.eswa.2024.124833.
[8] CHENG Zekai and XIA Wan. Fine-grained image classification on agricultural pest larvae[J]. IOP Conference Series: Earth and Environmental Science, 2021, 792: 012037. doi: 10.1088/1755-1315/792/1/012037.
[9] AMARATHUNGA D C, RATNAYAKE M N, GRUNDY J, et al. Fine-grained image classification of microscopic insect pest species: Western Flower thrips and Plague thrips[J]. Computers and Electronics in Agriculture, 2022, 203: 107462. doi: 10.1016/j.compag.2022.107462.
[10] WANG Linfeng, LIU Yong, LI Jiayao, et al. Based on the multi-scale information sharing network of fine-grained attention for agricultural pest detection[J]. PLoS One, 2023, 18(10): e0286732. doi: 10.1371/journal.pone.0286732.
[11] ZHAO Feng, GENG Miaomiao, LIU Hanqiang, et al. Convolutional neural network and vision Transformer-driven cross-layer multi-scale fusion network for hyperspectral image classification[J]. Journal of Electronics & Information Technology, 2024, 46(5): 2237–2248. doi: 10.11999/JEIT231209.
[12] WEN Hongli, HU Qinghao, HUANG Liwei, et al. Few-shot remote sensing image classification based on parameter-efficient vision transformer and multimodal guidance[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4689–4703. doi: 10.11999/JEIT250996.
[13] SONG Wanying, LIU Yuchen, WANG Jie, et al. Entropy-driven adaptive fusion network for scene classification of high-resolution remote sensing images[J/OL]. Journal of Electronics & Information Technology, 2025. https://link.cnki.net/urlid/11.4494.TN.20260405.2112.010.
[14] HAN Yuantao, ZHANG Cong, ZHAN Xiaoyun, et al. Crossing multiple life stages: Fine-grained classification of agricultural pests[J]. Plant Methods, 2024, 20(1): 191. doi: 10.1186/s13007-024-01317-w.
[15] WANG Jiahui, XU Qin, JIANG Bo, et al. Multi-granularity part sampling attention for fine-grained visual classification[J]. IEEE Transactions on Image Processing, 2024, 33: 4529–4542. doi: 10.1109/TIP.2024.3441813.
[16] DAI Jifeng, QI Haozhi, XIONG Yuwen, et al. Deformable convolutional networks[C]. The 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 764–773. doi: 10.1109/ICCV.2017.89.
[17] KIRKLAND E J. Advanced Computing in Electron Microscopy[M]. 2nd ed. New York: Springer, 2010: 261–263. doi: 10.1007/978-1-4419-6533-2.
[18] NANDHINI C and BRINDHA M. Visual regenerative fusion network for pest recognition[J]. Neural Computing and Applications, 2024, 36(6): 2867–2882. doi: 10.1007/s00521-023-09173-w.
[19] MALIK P and PARIDA M K. Classification of insect pest using transfer learning mechanism[C]. The 8th International Conference on Computer Vision and Image Processing, Jammu, India, 2023: 78–89. doi: 10.1007/978-3-031-58535-7_7.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations, 2021.
[21] XU Qin, WANG Jiahui, JIANG Bo, et al. Fine-grained visual classification via internal ensemble learning transformer[J]. IEEE Transactions on Multimedia, 2023, 25: 9015–9028. doi: 10.1109/TMM.2023.3244340.
[22] ZHANG Wenli and SONG Wei. Fine-grained image classification based on feature fusion and ensemble learning[J]. Laser & Optoelectronics Progress, 2024, 61(22): 2237010. doi: 10.3788/LOP240759.
[23] LIU Honglin, ZHAN Yongzhao, XIA Huifen, et al. Self-supervised transformer-based pre-training method using latent semantic masking auto-encoder for pest and disease classification[J]. Computers and Electronics in Agriculture, 2022, 203: 107448. doi: 10.1016/j.compag.2022.107448.
[24] LU Xiaowei, WANG Kanqi, WANG Peiyu, et al. Gate-ViT: Gated vision transformer for fine-grained visual classification[C]. The 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 2025: 468–479. doi: 10.1007/978-981-96-8180-8_37.
[25] LIU Wei and ZHANG Ao. Plant disease detection algorithm based on efficient Swin transformer[J]. Computers, Materials & Continua, 2025, 82(2): 3045–3068. doi: 10.32604/cmc.2024.058640.
[26] RIOS E A, YUANDA J C, GHANZ V L, et al. Cross-layer cache aggregation for token reduction in ultra-fine-grained image recognition[C]. 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/icassp49660.2025.10890489.
[27] RIOS E A, HU Minchun, and LAI Bocheng. Global-local similarity for efficient fine-grained image recognition with vision transformers[C]. 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, UK, 2025: 1–5. doi: 10.1109/ISCAS56072.2025.11043866.
-