Sample Generation Based on Conditional Diffusion Model for Few-Shot Object Detection
-
Abstract: Using generative models to supply additional samples for few-shot object detection is one way to alleviate sample scarcity. Existing sample-generation methods mostly focus on the diversity of the generated samples while neglecting their quality and representativeness. To address this problem, this paper proposes FQRS, a new data-generation-based framework for few-shot object detection. First, an inter-class conditional control module is constructed so that the data generator can learn the relations between different classes; the inter-class relations between base and novel classes help the model estimate the distributions of the novel classes, thereby improving the quality of the generated samples. Second, an intra-class conditional control module is designed that uses Intersection over Union (IOU) information to constrain where generated samples fall in the feature space; by steering generated samples toward their class centers, it ensures that they capture the key features of the corresponding class, thereby improving the representativeness of the generated samples. Evaluated on the PASCAL VOC and MS COCO datasets under various few-shot settings, the proposed model consistently outperforms the current best two-stage fine-tuning-based object detector, Decoupled Faster Region-based Convolutional Neural Network (DeFRCN). The experiments verify that the proposed method achieves excellent few-shot object detection performance.
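The inter-class condition described above is built from cosine similarities between class semantics. The following is a minimal sketch of how such a semantic relation embedding could be computed; it is not the authors' code, and the use of text embeddings of the class names (e.g., CLIP [21]) as the class semantics is an assumption made for illustration.

```python
# Minimal sketch of an inter-class semantic relation embedding based on
# cosine similarity, as described in the abstract. Not the authors' code;
# the source of the per-class semantic vectors (e.g., CLIP text embeddings
# of the class names) is an assumption made for illustration.
import torch
import torch.nn.functional as F

def semantic_relation_embedding(class_semantics: torch.Tensor) -> torch.Tensor:
    """class_semantics: (C, d) matrix with one semantic vector per class
    (base and novel). Returns a (C, C) matrix whose row i holds the cosine
    similarity of class i to every class; row i can then serve as the
    inter-class condition when generating samples for class i."""
    sem = F.normalize(class_semantics, dim=-1)  # unit-norm semantic vectors
    return sem @ sem.t()                        # pairwise cosine similarities

# Example: 15 base classes + 5 novel classes with 512-dim semantics.
semantics = torch.randn(20, 512)
relation = semantic_relation_embedding(semantics)  # shape (20, 20)
novel_conditions = relation[15:]                   # conditions for the 5 novel classes
```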
Objective: Deep learning-based object detection typically requires a large volume of high-quality annotated samples, which limits its practical applicability. Few-Shot Object Detection (FSOD) has therefore gained significant attention as a promising research area: it leverages base classes with abundant labeled data to recognize novel classes with only a few training samples. Several methods based on generative models have been proposed to address the scarcity of annotated data in FSOD, but limitations remain. (1) Most generative models fail to sufficiently capture the relationships between base and novel classes, which hinders their ability to estimate novel-class distributions accurately and degrades the quality of generated samples. (2) Existing methods often prioritize sample diversity and neglect representativeness; low representativeness can cause confusion between categories and reduce detection performance. Because the object detection network is trained on both original and generated samples, the quality of the generated samples directly affects detection performance, so these issues must be addressed. To this end, a novel framework for data generation in FSOD via additional high-Quality and Representative Samples (FQRS) is introduced. A conditional control module incorporating both inter-class and intra-class constraints improves the quality and representativeness of generated samples and, ultimately, the accuracy of FSOD.

Methods: The proposed architecture consists of a fine-tuning-based object detector and a data generator. First, the object detector is trained on base-class data. The pre-trained detector is then used to extract Region of Interest (RoI) features, which serve as training data for the generator. Once trained, the generator produces new samples for the novel classes. The data generator comprises a diffusion model for sample generation and an inter-class and intra-class conditional control module that guides the diffusion process. For inter-class conditional control, a semantic relation embedding is introduced that uses cosine similarity to represent the degree of correlation between classes, enabling the data generator to learn inter-class relations effectively; the relations between base and novel classes assist the diffusion model in estimating novel-class distributions, improving the quality of generated samples. For intra-class conditional control, Intersection over Union (IOU) information is used to constrain the position of generated samples within the corresponding feature space, ensuring that they cluster around their category centers, which enhances their representativeness and preserves important class characteristics. Finally, the object detector is fine-tuned on both the generated and the original training samples, with a hyperparameter in the loss function controlling the influence of the generated samples on training (a minimal sketch of this step follows the Conclusions).

Results and Discussions: The effectiveness and robustness of the proposed network are validated on two public datasets, PASCAL VOC and MS COCO, with detection accuracy evaluated using the mAP and mAP50 metrics. Quantitative comparisons (Tables 1 and 2) show that the proposed network outperforms existing methods on both datasets.
For example, on the MS COCO dataset under the 1-shot setting, the proposed method achieves a 16.9% relative improvement over the state-of-the-art DeFRCN approach (9.0 vs. 7.7 mAP, Table 2). A cross-domain experiment (Table 3), in which base-class and novel-class data come from different datasets, demonstrates the superior generalization capability of the proposed method. Visual comparisons (Fig. 5) show that the proposed method alleviates the missed detections and category confusion that arise from limited training data, thereby improving FSOD performance. Ablation studies (Tables 4, 5, and 6) confirm the efficacy of the proposed modules and reveal how different parameter configurations affect detection performance. t-SNE visualizations (Fig. 6) show that the inter-class and intra-class conditional control module tightens feature aggregation within each category while improving discriminability between categories and reducing categorical confusion. A quantitative analysis (Table 7) also examines the additional model complexity introduced by the data generator in terms of parameter count and floating-point operations.

Conclusions: This paper presents a novel data-generation-based framework for obtaining additional samples in FSOD. The framework integrates a data generator, built on a conditional diffusion model, into a fine-tuning-based object detection network. The generator learns category features together with inter-class relations, capturing distinct category characteristics and improving generalization to novel classes, and it enhances sample representativeness by constraining generated samples to cluster around category centers. These high-quality, representative samples facilitate the detector's training and improve FSOD accuracy. Under various few-shot settings, the proposed model outperforms the state-of-the-art fine-tuning-based object detection model, Decoupled Faster Region-based Convolutional Neural Network (DeFRCN), on both the PASCAL VOC and MS COCO datasets. Extensive experimental results validate the superiority of the proposed approach. -
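For concreteness, the sketch below illustrates how the intra-class IOU condition and the loss hyperparameter described in the Methods could enter the fine-tuning step. It is a minimal sketch under stated assumptions rather than the authors' implementation: `generator.sample` stands in for the trained conditional diffusion generator, `roi_head` for the detector's RoI classification head, only the classification term of the detection loss is shown, and the reading of Table 5's I as the IOU conditioning range and γ as the loss weight is an assumption.

```python
# Minimal sketch of fine-tuning on a mix of real and generated RoI features.
# Not the authors' code: `generator` and `roi_head` are placeholders, and only
# the RoI classification loss is shown. The IOU condition is drawn from a high
# range (e.g., [0.75, 1], cf. Table 5) so generated features stay near the
# class center, and gamma down-weights the loss on generated samples
# (assuming gamma corresponds to the γ reported in Table 5).
import torch
import torch.nn.functional as F

def finetune_step(roi_head, generator, real_feats, labels, relation,
                  gamma: float = 0.4, iou_range=(0.75, 1.0)) -> torch.Tensor:
    # Loss on the original few-shot training samples.
    loss_real = F.cross_entropy(roi_head(real_feats), labels)

    # Generate extra features for the same labels, conditioned on the
    # inter-class relation embedding and an IOU value near 1.
    iou = torch.empty(len(labels)).uniform_(*iou_range)
    with torch.no_grad():
        gen_feats = generator.sample(labels=labels,
                                     relation=relation[labels],
                                     iou=iou)
    loss_gen = F.cross_entropy(roi_head(gen_feats), labels)

    # gamma controls how strongly generated samples influence training.
    return loss_real + gamma * loss_gen
```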
Table 1  Comparison between the proposed method and other methods on the PASCAL VOC dataset

Method / shot | Novel Set 1 (1 / 2 / 3 / 5 / 10) | Novel Set 2 (1 / 2 / 3 / 5 / 10) | Novel Set 3 (1 / 2 / 3 / 5 / 10)
TFA[6] | 39.8 / 36.1 / 44.7 / 55.7 / 56.0 | 23.5 / 26.9 / 34.1 / 35.1 / 39.1 | 30.8 / 34.8 / 42.8 / 49.5 / 49.8
MPSR[23] | 41.7 / – / 51.4 / 55.2 / 61.8 | 24.4 / – / 39.2 / 39.9 / 47.8 | 35.6 / – / 42.3 / 48.0 / 49.7
TIP[24] | 27.7 / 36.5 / 43.3 / 50.2 / 59.6 | 22.7 / 30.1 / 33.8 / 40.9 / 46.9 | 21.7 / 30.6 / 38.1 / 44.5 / 50.9
DCNet[25] | 33.9 / 37.4 / 43.7 / 51.1 / 59.6 | 23.2 / 24.8 / 30.6 / 36.7 / 46.6 | 32.3 / 34.9 / 39.7 / 42.6 / 50.7
CME[26] | 41.5 / 47.5 / 50.4 / 58.2 / 60.9 | 27.2 / 30.2 / 41.4 / 42.5 / 46.8 | 34.3 / 39.6 / 45.1 / 48.3 / 51.5
SRR-FSD[10] | 47.8 / 50.5 / 51.3 / 55.2 / 56.8 | 32.5 / 35.3 / 39.1 / 40.8 / 43.8 | 40.1 / 41.5 / 44.3 / 46.9 / 46.4
FADI[27] | 50.3 / 54.8 / 54.2 / 59.3 / 63.2 | 30.6 / 35.0 / 40.3 / 42.8 / 48.0 | 45.7 / 49.7 / 49.1 / 55.0 / 59.6
DeFRCN[9] | 45.7 / 56.4 / 59.3 / 62.6 / 64.6 | 35.7 / 40.5 / 45.3 / 50.4 / 54.1 | 39.8 / 50.6 / 52.8 / 56.1 / 60.8
FCT[28] | 38.5 / 49.6 / 53.5 / 59.8 / 64.3 | 25.9 / 34.2 / 40.1 / 44.9 / 47.4 | 34.7 / 43.9 / 49.3 / 53.1 / 56.3
Meta-DETR[29] | 35.1 / 49.0 / 53.2 / 57.4 / 62.0 | 27.9 / 32.3 / 38.4 / 43.2 / 51.8 | 34.9 / 41.8 / 47.1 / 54.1 / 58.2
Ours (FQRS) | 47.8 / 56.6 / 59.3 / 63.2 / 65.6 | 37.5 / 43.1 / 47.5 / 52.0 / 56.3 | 40.2 / 50.8 / 53.8 / 56.6 / 62.4

Table 2  Comparison between the proposed method and other methods on the MS COCO dataset
Method | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot | 30-shot
TFA[6] | 4.4 | 5.4 | 6.0 | 7.7 | 10.0 | 13.7
MPSR[23] | 5.1 | 6.7 | 7.4 | 8.7 | 9.8 | 14.1
FSDetView[30] | 4.5 | 6.6 | 7.2 | 10.7 | 12.5 | 14.7
TIP[24] | – | – | – | – | 16.3 | 18.3
DCNet[25] | – | – | – | – | 12.8 | 18.6
CME[26] | – | – | – | – | 15.1 | 16.9
SRR-FSD[10] | – | – | – | – | 11.3 | 14.7
FADI[27] | 5.7 | 7.0 | 8.6 | 10.1 | 12.2 | 16.1
DeFRCN[9] | 7.7 | 11.4 | 13.2 | 15.5 | 18.5 | 22.4
FCT[28] | 5.1 | 7.2 | 9.8 | 12.0 | 15.3 | 20.2
Meta-DETR[29] | 7.5 | – | 13.5 | 15.4 | 19.0 | 22.2
Ours (FQRS) | 9.0 | 12.2 | 13.6 | 15.7 | 18.6 | 22.6

Table 4  Effectiveness of the inter-class and intra-class conditional control modules
Inter-class conditional control (Semantic embedding / Semantic relation embedding) | Intra-class conditional control | mAP (%)
√ | 8.3
√ √ | 8.8
√ √ | 8.6
√ √ √ | 9.0

Table 5  Comparison of results with different parameter values
Range of I | γ | mAP (%)
[0.5, 1] | 0.4 | 8.7
[0.5, 0.75] | 0.4 | 8.2
[0.75, 1] | 0.3 | 8.8
[0.75, 1] | 0.5 | 8.7
[0.75, 1] | 0.4 | 9.0

Table 6  Comparison of results with different values of parameter T
T | mAP (%)
900 | 8.9
1000 | 9.0
1100 | 9.0
-
[1] LIU Qiankun, LIU Rui, ZHENG Bolun, et al. Infrared small target detection with scale and location sensitivity[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 17490–17499. doi: 10.1109/CVPR52733.2024.01656.
[2] ZHANG Gang, CHEN Junnan, GAO Guohuan, et al. SAFDNet: A simple and effective network for fully sparse 3D object detection[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 14477–14486. doi: 10.1109/CVPR52733.2024.01372.
[3] YE Mingqiao, KE Lei, LI Siyuan, et al. Cascade-DETR: Delving into high-quality universal object detection[C]. 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 6704–6714. doi: 10.1109/ICCV51070.2023.00617.
[4] WANG C Y, BOCHKOVSKIY A, and LIAO H Y M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 7464–7475. doi: 10.1109/CVPR52729.2023.00721.
[5] WANG Yuxiong, RAMANAN D, and HEBERT M. Meta-learning to detect rare objects[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9925–9934. doi: 10.1109/ICCV.2019.01002.
[6] WANG Xin, HUANG T E, DARRELL T, et al. Frustratingly simple few-shot object detection[C]. The 37th International Conference on Machine Learning, 2020: 920.
[7] SUN Bo, LI Banghuai, CAI Shengcai, et al. FSCE: Few-shot object detection via contrastive proposal encoding[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7352–7362. doi: 10.1109/CVPR46437.2021.00727.
[8] YAN Xiaopeng, CHEN Ziliang, XU Anni, et al. Meta R-CNN: Towards general solver for instance-level low-shot learning[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9577–9586. doi: 10.1109/ICCV.2019.00967.
[9] QIAO Limeng, ZHAO Yuxuan, LI Zhiyuan, et al. DeFRCN: Decoupled faster R-CNN for few-shot object detection[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 8681–8690. doi: 10.1109/ICCV48922.2021.00856.
[10] ZHU Chenchen, CHEN Fangyi, AHMED U, et al. Semantic relation reasoning for shot-stable few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8782–8791. doi: 10.1109/CVPR46437.2021.00867.
[11] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031.
[12] ZHANG Weilin and WANG Yuxiong. Hallucination improves few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13008–13017. doi: 10.1109/CVPR46437.2021.01281.
[13] ZHU Pengkai, WANG Hanxiao, and SALIGRAMA V. Don't even look once: Synthesizing features for zero-shot detection[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11693–11702. doi: 10.1109/CVPR42600.2020.01171.
[14] XU Jingyi, LE H, and SAMARAS D. Generating features with increased crop-related diversity for few-shot object detection[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 19713–19722. doi: 10.1109/CVPR52729.2023.01888.
[15] HO J, JAIN A, and ABBEEL P. Denoising diffusion probabilistic models[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 574.
[16] HO J and SALIMANS T. Classifier-free diffusion guidance[EB/OL]. https://arxiv.org/abs/2207.12598, 2022.
[17] QI Tianhao, FANG Shancheng, WU Yanze, et al. DEADiff: An efficient stylization diffusion model with disentangled representations[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 8693–8702. doi: 10.1109/CVPR52733.2024.00830.
[18] GARBER T and TIRER T. Image restoration by denoising diffusion models with iteratively preconditioned guidance[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25245–25254. doi: 10.1109/CVPR52733.2024.02385.
[19] LI Muyang, CAI Tianle, CAO Jiaxin, et al. DistriFusion: Distributed parallel inference for high-resolution diffusion models[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 7183–7193. doi: 10.1109/CVPR52733.2024.00686.
[20] HUANG Ziqi, CHAN K C K, JIANG Yuming, et al. Collaborative diffusion for multi-modal face generation and editing[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6080–6090. doi: 10.1109/CVPR52729.2023.00589.
[21] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
[22] RONNEBERGER O, FISCHER P, and BROX T. U-net: Convolutional networks for biomedical image segmentation[C]. The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241. doi: 10.1007/978-3-319-24574-4_28.
[23] WU Jiaxi, LIU Songtao, HUANG Di, et al. Multi-scale positive sample refinement for few-shot object detection[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 456–472. doi: 10.1007/978-3-030-58517-4_27.
[24] LI Aoxue and LI Zhenguo. Transformation invariant few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 3094–3102. doi: 10.1109/CVPR46437.2021.00311.
[25] HU Hanzhe, BAI Shuai, LI Aoxue, et al. Dense relation distillation with context-aware aggregation for few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 10185–10194. doi: 10.1109/CVPR46437.2021.01005.
[26] LI Bohao, YANG Boyu, LIU Chang, et al. Beyond max-margin: Class margin equilibrium for few-shot object detection[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7363–7372. doi: 10.1109/CVPR46437.2021.00728.
[27] CAO Yuhang, WANG Jiaqi, JIN Ying, et al. Few-shot object detection via association and discrimination[C]. The 35th International Conference on Neural Information Processing Systems, 2021: 1267.
[28] HAN Guangxing, MA Jiawei, HUANG Shiyuan, et al. Few-shot object detection with fully cross-transformer[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 5321–5330. doi: 10.1109/CVPR52688.2022.00525.
[29] ZHANG Gongjie, LUO Zhipeng, CUI Kaiwen, et al. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12832–12843. doi: 10.1109/TPAMI.2022.3195735.
[30] XIAO Yang, LEPETIT V, and MARLET R. Few-shot object detection and viewpoint estimation for objects in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3090–3106. doi: 10.1109/TPAMI.2022.3174072.
-