Sample Generation Based on Conditional Diffusion Model for Few-Shot Object Detection
Abstract: Using generative models to provide additional samples for few-shot object detection is one way to address the scarcity of annotated data. Existing methods for generating additional samples mostly focus on the diversity of the generated samples while neglecting their quality and representativeness. To solve this problem, this paper proposes FQRS, a new data-generation-based framework for few-shot object detection. First, an inter-class conditional control module is constructed so that the data generator can learn the relations between different classes; the inter-class relation information between base and novel classes helps the model estimate the distributions of the novel classes, improving the quality of the generated samples. Second, an intra-class conditional control module is designed that uses Intersection Over Union (IOU) information to constrain the positions of generated samples in feature space; by steering the generated samples toward their class centers, it ensures that they capture the key characteristics of the corresponding classes, improving their representativeness. Evaluated on the PASCAL VOC and MS COCO datasets, the proposed model outperforms DeFRCN, the current best two-stage fine-tuning object detection model, under various few-shot settings. The experiments verify that the proposed method achieves excellent detection performance for few-shot object detection.
Objective: Deep learning-based object detection typically requires a large volume of high-quality annotated samples, which limits its practical applicability. Few-Shot Object Detection (FSOD) has gained significant attention as a promising research area. FSOD leverages base classes with abundant labeled data to recognize novel classes with limited training samples. Several methods based on generative models have been proposed to address the challenge of limited annotated data in FSOD. However, some limitations remain. (1) Most generative models fail to sufficiently capture the relationships between base and novel classes, which hinders their ability to accurately estimate novel class distributions and degrades the quality of generated samples. (2) Existing methods often prioritize increasing sample diversity, neglecting the critical need for representativeness. Low representativeness can cause confusion between categories, potentially reducing detection performance. Since the quality of generated samples directly affects the performance of the object detection network, which trains using both original and generated samples, this issue must be addressed. To address these challenges, a novel framework for data generation in FSOD via additional high-Quality and Representative Samples (FQRS) is introduced. A conditional control module, incorporating both inter-class and intra-class dynamics, is introduced to improve the quality and representativeness of generated samples, ultimately enhancing the accuracy of FSOD.

Methods: The proposed model architecture consists of a fine-tuning-based object detector and a data generator. First, the object detector is trained using base class data. Then, the pre-trained detector is employed to extract Region of Interest (RoI) features, which are used as training data for the generator. The generator, once trained, generates new samples for the novel classes. The architecture of the data generator includes a diffusion model for sample generation and an inter-class and intra-class conditional control module to guide the diffusion process. For inter-class conditional control, a semantic relation embedding is introduced, using cosine similarity to represent the degree of correlation between different classes. This enables the data generator to learn inter-class relations effectively. The relations between base and novel classes assist the diffusion model in estimating novel class distributions, improving the quality of generated samples. For intra-class conditional control, Intersection Over Union (IOU) information is utilized to constrain the position of generated samples within the corresponding feature space. This ensures that generated samples cluster around their respective category centers, enhancing their representativeness and preserving important class characteristics. Finally, the object detector is fine-tuned using both the generated samples and the original training samples. A hyperparameter in the loss function is introduced to control the influence of generated samples on the object detector’s training process.

Results and Discussions: The effectiveness and robustness of the proposed network are validated on two public datasets: PASCAL VOC and MS COCO. Detection accuracy is evaluated using mAP and mAP50 metrics. Quantitative comparisons (Tables 1 and 2) show that the proposed network outperforms existing methods across both datasets. For example, on the MS COCO dataset under the 1-shot setting, the proposed method achieves a 16.9% improvement over the state-of-the-art DeFRCN approach. A cross-domain experiment (Table 3), where base and novel class data are sourced from different datasets, demonstrates the superior generalization capability of the proposed method. Visual comparisons (Fig. 5) highlight that the proposed method effectively addresses issues like missed detections and category confusion arising from limited training data, thus improving the performance of FSOD. Ablation studies (Tables 4, 5, and 6) confirm the efficacy of the proposed modules and reveal the impact of varying parameter configurations on detection performance. t-SNE visualization results (Fig. 6) show that the inter-class and intra-class conditional control module enhances feature aggregation within the same category, while improving discriminability between categories and reducing categorical confusion. Additionally, quantitative analysis (Table 7) examines the variations in model complexity introduced by the data generator, focusing on both parameter count and floating-point operations.

Conclusions: This paper presents a novel data-generation-based framework for obtaining additional samples in FSOD. The framework integrates a data generator, built on a conditional diffusion model, into a fine-tuning-based object detection network. The proposed data generator learns category features in conjunction with inter-class relations, capturing distinct category characteristics and improving generalization to novel classes. Additionally, the generator enhances sample representativeness by constraining generated samples to cluster around category centers. These high-quality, representative generated samples facilitate the object detector’s training, leading to improved FSOD accuracy. In various few-shot settings, the proposed model outperforms the state-of-the-art fine-tuning object detection model, DeFRCN, on both the PASCAL VOC and MS COCO datasets. Extensive experimental results validate the superiority of the proposed approach.
1. Introduction

In recent years, deep learning-based object detection has demonstrated strong performance in many fields [1–4], but these methods depend heavily on the quantity and quality of annotated training samples. In domains such as healthcare and autonomous driving, collecting large numbers of high-quality annotated samples is time-consuming and expensive, which severely limits the application of deep learning. Few-shot object detection has therefore received wide attention [5–7]. It exploits base-class data with abundant labeled samples and aims to recognize novel classes from only a few training samples.

Most existing few-shot object detection studies are built on two-stage detection frameworks [8–11]. Because training data are scarce, the features produced by the Region Proposal Network (RPN) lack diversity, making it difficult to train a high-quality classifier [12]. A straightforward strategy to alleviate this problem is to provide the classifier with additional training samples [12–14]. For example, Zhang et al. [12] trained a feature hallucination network that uses the variation of base-class features to infer the variation of novel-class features, thereby increasing the diversity of novel-class samples.

However, when generating additional samples for novel classes, existing methods concentrate on increasing sample diversity while neglecting sample quality and representativeness. Lacking inter-class relation information between novel and base classes, such methods hinder the generative model from effectively estimating the distributions of the novel classes, which results in low-quality generated samples. Moreover, the generation process inevitably produces unrepresentative samples that lie far from their class centers; training the classifier with such samples biases its estimate of each category.

To address these problems, this paper proposes a new data-generation-based framework that assists few-shot object detection by generating high-quality and representative samples (Few-shot object detection via additional high-Quality and Representative Samples, FQRS). The framework consists of a two-stage object detector and a data generator. The data generator is built on a conditional denoising diffusion probabilistic model [15,16]; as an emerging family of generative models, diffusion models have shown great potential in image generation [17–20]. To improve the quality and representativeness of the samples produced by the diffusion model, an inter-class and intra-class conditional control module is proposed to supply conditioning information to the diffusion model. On the one hand, besides the semantic embedding, inter-class relations are used as category control conditions in the generative model. As shown in Fig. 1(a), during training the generative model no longer learns each base class in isolation but learns the relations between classes while learning the base classes. By exploiting the inter-class relations illustrated in Fig. 1(a) and Fig. 1(b), the generative model can generalize better from the base classes to the novel classes, improving the quality of the generated samples. On the other hand, Intersection-Over-Union (IOU) information is used to build the intra-class control condition, which constrains the positions of generated samples in feature space and ensures that they are representative and capture the key characteristics of their classes.
2. Proposed Method

2.1 Problem definition

In few-shot object detection, the training set consists of a base set Db and a novel set Dn. The base set contains base classes Cb with abundant annotated samples, while the novel set contains novel classes Cn with only a few annotated samples (Cb ∩ Cn = ∅). K-shot means that each novel class has K annotated objects. The goal of few-shot object detection is to learn from the base set Db and generalize to the novel set Dn, so as to detect the novel classes Cn.

2.2 Overall framework

The architecture of the proposed FQRS is shown in Fig. 2. On top of the two-stage fine-tuning method [9], a data generator consisting of a diffusion model and an inter-class and intra-class conditional control module is incorporated, where the control module provides the conditioning for the diffusion model.

As shown in Fig. 2, the pipeline consists of four steps: (1) the object detector is first trained with base-class data; (2) the pre-trained base-class detector is then used to extract Region of Interest (RoI) features of the base classes, which serve as the training data of the data generator; (3) the trained generator is then used to generate new samples for the novel classes; (4) finally, the generated samples and the original training samples are used together to fine-tune the detector for the novel classes. The object detector used in this paper is the Decoupled Faster Region-based Convolutional Neural Network (DeFRCN) [9], a representative few-shot object detection method.

2.3 Inter-class and intra-class conditional control module

The proposed inter-class and intra-class conditional control module regulates the samples produced by the data generation module. It consists of two parts: inter-class conditional control and intra-class conditional control. The inter-class condition guides the generator to produce samples for a specified class, and the intra-class condition constrains the positions of the generated samples in the corresponding feature space. The structure of the module is shown in Fig. 3.

2.3.1 Inter-class conditional control

In image generation, semantic embeddings are commonly used as the condition controlling the category of generated samples. In few-shot object detection, however, relying only on class names without any additional information makes it difficult for the model to accurately estimate the distributions of novel classes that are absent from the training data. To solve this problem, the model is trained to learn the relations between categories while learning the features of each category, rather than learning each category in isolation, so that it can exploit the relations between the novel and base classes to generalize better to the novel classes. In Fig. 1, without inter-class relation information the generative model faces the following task: learn the four base classes "bird", "car", "person" and "bear", and then generate samples for the unseen novel class "cat". This relies entirely on the correspondence between the semantics and visual appearance of each class learned from large-scale data by the Contrastive Language-Image Pre-training (CLIP) model [21], which provides the semantic embeddings, and it does not exploit the relations between categories to associate and distinguish the base and novel classes. With inter-class relation information, the task becomes: learn the four base classes together with their degrees of correlation, for example the correlations between "bird" and the four base classes are [1, 0.83, 0.85, 0.87], and then generate new samples for "cat" given that its correlations with the four base classes are [0.84, 0.86, 0.81, 0.83]. In this case, the generative model can use the correlation information to infer the unseen class "cat" from the classes it has already learned, so that it captures the distribution of the novel class better and transfers from the base classes to the novel classes.

Specifically, the inter-class conditional control consists of two parts: a semantic embedding and a semantic relation embedding. (1) The semantic embedding is a common way to guide the generator to synthesize samples for a specified class. It is denoted as y_sem ∈ R^S, where S = 512 is the dimension of the semantic embedding, and it is obtained by feeding the class name into the text encoder of the CLIP model. (2) The semantic relation embedding y_cor describes the degree of correlation between the target class and all base classes. Let Y_sem-base = {y_sem1, y_sem2, ..., y_semB} denote the semantic embeddings of all base classes, where B is the number of base classes. The semantic relation embedding y_cor ∈ R^B is then computed as the cosine similarity between the semantic embedding y_sem of the target class and the base-class embeddings Y_sem-base, as shown in Eq. (1):
{{\boldsymbol{y}}_{{\text{cor}}}} = \cos \left({{\boldsymbol{y}}_{{\text{sem}}}},{{\boldsymbol{Y}}_{{\text{sem-base}}}}\right) \qquad (1)
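The sketch below (not the authors' released code) illustrates how the semantic relation embedding of Eq. (1) can be computed with the OpenAI CLIP package and the ViT-B/32 weights mentioned in Section 3.1; the class names here are illustrative.

```python
# Sketch: semantic embedding y_sem from CLIP and semantic relation embedding y_cor (Eq. (1)).
# Assumes the OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

base_names = ["bird", "car", "person", "bear"]   # base classes (illustrative)
target_name = "cat"                              # novel class

with torch.no_grad():
    # Semantic embeddings from the CLIP text encoder (512-dimensional for ViT-B/32)
    Y_sem_base = model.encode_text(clip.tokenize(base_names).to(device)).float()  # (B, 512)
    y_sem = model.encode_text(clip.tokenize([target_name]).to(device)).float()    # (1, 512)

# Semantic relation embedding y_cor in R^B: cosine similarity to every base class
y_cor = F.cosine_similarity(y_sem, Y_sem_base, dim=-1)  # (B,)
print(y_cor)
```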
2.3.2 Intra-class conditional control

For a given class, samples at different positions in feature space affect model training differently. Easy samples are usually more concentrated, closer to the class center and more representative of the class, helping the model quickly learn its basic features and patterns. Hard samples are usually more scattered in feature space and force the model to learn more complex features and handle more challenging cases. In few-shot object detection, however, the available samples are often insufficient to capture the typical characteristics of a class. When representative easy samples are scarce, generating too many hard samples interferes with the detector's learning of the basic class features, destabilizes the gradients during back-propagation, and makes the loss decrease slowly or even fail to converge. In addition, sample generation with a generative model is inherently stochastic; without constraints, samples far from their class centers may lie too close to neighboring classes, causing interference and confusion between categories. It is therefore necessary to constrain the positions of the generated samples in feature space so that they lie closer to their class centers and provide the detector with more representative samples.

To this end, a parameter I is introduced to constrain the positions of the generated samples in feature space. Fig. 4(a) shows RoIs with different IOUs relative to the ground-truth box of the same object: the top-left image shows the ground-truth box, and the other four show RoI regions whose IOUs with the ground truth are 0.19, 0.43, 0.58 and 0.78, respectively. The IOU reflects how much useful information an RoI feature contains: a larger IOU means more useful information and less cluttered background. Fig. 4(b) uses t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize features of several classes from the MS COCO dataset at different IOUs, where different colors denote different classes; within each class, RoI features with IOU in [0.75, 1] are drawn in darker, less transparent colors, and those with IOU in [0.5, 0.75) in lighter, more transparent colors. Features with larger IOUs tend to lie closer to their class centers and are more representative. Therefore, the IOU is used as the intra-class position parameter I and encoded by a fully connected layer to obtain the intra-class condition y_intra ∈ R^512.

As shown in Fig. 3, a small network \mathcal{F}(\cdot) maps the inter-class and intra-class conditions to the same dimension and sums them to obtain the final condition y ∈ R^2048:
{\boldsymbol{y}} = \mathcal{F}\left[ {{{\boldsymbol{y}}_{{\text{sem}}}}\left( a \right),{{\boldsymbol{y}}_{{\text{cor}}}}\left( {a,A} \right),{{\boldsymbol{y}}_{{\text{intra}}}}\left( I \right)} \right] \qquad (2)
where a is the name of the target class and A is the set of all base-class names.
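A minimal PyTorch sketch of the conditional control module is given below. The dimensions follow the text (S = 512, y_intra ∈ R^512, final condition in R^2048), but the IOU encoder and the exact form of the small network \mathcal{F}(\cdot) are assumptions rather than the authors' implementation.

```python
# Sketch: inter-class and intra-class conditional control (Eqs. (1)-(2)), assumed layer shapes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalControl(nn.Module):
    def __init__(self, num_base_classes: int, sem_dim: int = 512, cond_dim: int = 2048):
        super().__init__()
        self.iou_encoder = nn.Linear(1, sem_dim)           # encode the scalar IOU into y_intra (512-d)
        # F(.): project each condition to the common dimension before summation
        self.proj_sem = nn.Linear(sem_dim, cond_dim)
        self.proj_cor = nn.Linear(num_base_classes, cond_dim)
        self.proj_intra = nn.Linear(sem_dim, cond_dim)

    def forward(self, y_sem, base_sems, iou):
        # y_sem: (N, 512) CLIP embedding of the target class
        # base_sems: (B, 512) CLIP embeddings of all base classes
        # iou: (N, 1) intra-class position parameter I
        y_cor = F.cosine_similarity(y_sem.unsqueeze(1), base_sems.unsqueeze(0), dim=-1)  # (N, B), Eq. (1)
        y_intra = self.iou_encoder(iou)                                                  # (N, 512)
        # Eq. (2): unify the dimensions and sum to obtain the final condition y in R^2048
        return self.proj_sem(y_sem) + self.proj_cor(y_cor) + self.proj_intra(y_intra)
```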
2.4 Data generation and fine-tuning

As shown in Fig. 2, the pre-trained DeFRCN detector is used to obtain the RoI features of the base classes, and features with IOU greater than 0.5 are selected as the input x_0 ∈ R^2048 of the data generator. The generator is built on a conditional diffusion model. The class name and IOU of each RoI feature are fed into the inter-class and intra-class conditional control module, whose output condition y controls the diffusion model.

When training the generator, noise ε is gradually added to the input x_0 until Gaussian noise x_T ∈ R^2048 is obtained. A neural network is then trained, under the guidance of the condition y, to predict the noise ε_θ added at each step, which realizes the reverse denoising process. The prediction network is a U-Net [22]. Experiments show that setting the total number of diffusion steps T to 1 000 gives the best results. The loss function for training the generator is
L={\mathbb{E}}_{{{\boldsymbol{x}}}_{0},{\boldsymbol{\varepsilon}},t}\left[{\Vert {\boldsymbol{\varepsilon}} -{{\boldsymbol{\varepsilon}} }_{\theta }\left({{\boldsymbol{x}}}_{t},{\boldsymbol{y}},t\right)\Vert }_{2}^{2}\right] \qquad (3)
where \mathbb{E} denotes the expectation of the loss between the predicted noise ε_θ and the added noise ε, x_0 is the input sample, t is the time step, y is the condition guiding the diffusion model, x_t is the sample at step t, ε is noise drawn from the normal distribution \mathcal{N}(0,1), and ε_θ(x_t, y, t) is the predicted noise.
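The following is a minimal sketch of one training step under the DDPM objective of Eq. (3) [15]; the linear noise schedule and the `denoiser` interface (the conditional U-Net ε_θ(x_t, y, t)) are assumptions, not the authors' exact implementation.

```python
# Sketch: DDPM training loss of Eq. (3) on RoI features, with an assumed linear noise schedule.
import torch
import torch.nn.functional as F

T = 1000                                           # total number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)              # assumed schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative products \bar{alpha}_t

def diffusion_loss(denoiser, x0, y):
    """MSE between the injected noise and the noise predicted under condition y."""
    n = x0.shape[0]
    t = torch.randint(0, T, (n,), device=x0.device)          # random time step per sample
    eps = torch.randn_like(x0)                               # noise ~ N(0, I)
    a_bar = alphas_bar.to(x0.device)[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps     # forward noising of the RoI feature
    eps_pred = denoiser(x_t, y, t)                           # predicted noise eps_theta(x_t, y, t)
    return F.mse_loss(eps_pred, eps)
```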
After the generator has been trained on the base-class data, it is used to generate additional samples for the novel classes. To ensure that the generated samples are representative, the intra-class position parameter I is restricted to the range [0.75, 1]. The inter-class and intra-class conditional control module takes the novel class name and the parameter I as input and outputs the condition y. Sampling starts from Gaussian noise x_T, and the trained U-Net iteratively removes the noise ε under the control of y, finally producing the generated sample x̄_0.
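A corresponding sketch of the reverse (sampling) process is shown below; it follows standard DDPM ancestral sampling [15] and reuses the `denoiser`, `betas`, `alphas_bar` and `T` defined in the training sketch above.

```python
# Sketch: generate synthetic RoI features by iteratively denoising Gaussian noise under condition y.
import torch

@torch.no_grad()
def generate_samples(denoiser, y, num_samples: int, feat_dim: int = 2048):
    x = torch.randn(num_samples, feat_dim, device=y.device)  # x_T ~ N(0, I)
    b = betas.to(y.device)
    a = 1.0 - b
    a_bar = alphas_bar.to(y.device)
    for t in reversed(range(T)):
        t_batch = torch.full((num_samples,), t, device=y.device, dtype=torch.long)
        eps_pred = denoiser(x, y, t_batch)
        mean = (x - (1 - a[t]) / (1 - a_bar[t]).sqrt() * eps_pred) / a[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + b[t].sqrt() * noise                        # sample x_{t-1}
    return x                                                  # generated features \bar{x}_0
```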
The generated samples are then used together with the original training samples in the dataset to fine-tune the object detection network. The loss function in the fine-tuning stage is
L = {L_{{\text{cls-ori}}}} + {L_{{\text{reg-ori}}}} + \gamma {L_{{\text{cls-n}}}} \qquad (4)
where L_cls-ori and L_reg-ori are the classification and regression losses on the original samples, L_cls-n is the classification loss on the generated samples, and γ is a hyperparameter controlling the influence of the generated samples, set to 0.4.
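A small sketch of how the fine-tuning loss of Eq. (4) could be assembled is given below; `roi_classifier`, `gen_feats` and `gen_labels` are hypothetical names for the detector's RoI classification head and the generated features with their class labels.

```python
# Sketch: total fine-tuning loss of Eq. (4); generated features contribute only a classification term.
import torch
import torch.nn.functional as F

gamma = 0.4  # hyperparameter controlling the influence of the generated samples

def finetune_loss(cls_loss_ori, reg_loss_ori, roi_classifier, gen_feats, gen_labels):
    cls_loss_gen = F.cross_entropy(roi_classifier(gen_feats), gen_labels)   # loss on generated samples
    return cls_loss_ori + reg_loss_ori + gamma * cls_loss_gen               # Eq. (4)
```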
3. Experiments

The proposed method is evaluated on the PASCAL VOC (07 + 12) and MS COCO datasets. For fair comparison, the data splits and evaluation protocols of previous work [6,9] are followed. On the MS COCO dataset, the mean Average Precision (mAP) is used as the evaluation metric; all reported results are the average of 10 runs. On the PASCAL VOC dataset, the mean Average Precision at an IOU threshold of 0.5 (mAP50) is used, with three different base/novel class splits (Novel Set 1/2/3).

To verify the effectiveness of the proposed method, it is compared with several few-shot object detection methods: the Two-stage Fine-tuning Approach (TFA) [6], Multi-scale Positive Sample Refinement (MPSR) [23], the Transformation Invariant Principle network (TIP) [24], Dense relation distillation with Context-aware aggregation (DCNet) [25], the Class Margin Equilibrium approach (CME) [26], the Semantic Relation Reasoning Few-Shot Detector (SRR-FSD) [10], Few-shot object detection via Association and DIscrimination (FADI) [27], the Decoupled Faster Region-based Convolutional Neural Network (DeFRCN) [9], the Fully Cross-Transformer based model (FCT) [28], the Meta DEtection TRansformer (Meta-DETR) [29], the few-shot object detection and viewpoint estimation method FSDetView [30], and Meta R-CNN (Meta Region-based Convolutional Neural Network) [8].

3.1 Experimental setup
The experiments are run on a machine with an Intel Core i9-13900K CPU and two NVIDIA GeForce RTX 4090 GPUs, running Ubuntu 18.04 with CUDA 11.0, Python 3.7 and PyTorch 1.7. The object detection network follows the settings of the baseline model: the optimizer is SGD with momentum 0.9 and weight decay 5×10^-5, the batch size per GPU is 16, the learning rate is 0.02 for base-class training and 0.01 for novel-class fine-tuning, and the backbone is initialized with weights pre-trained on ImageNet. During fine-tuning, the parameters of the RoI head are frozen and the remaining parameters are updated. The CLIP model is the official OpenAI CLIP ViT-B/32 model with its released weights. The generative model is trained with the AdamW optimizer, a learning rate of 10^-3 and a weight decay of 10^-4.
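A sketch of the optimizer configuration described above is given below; the two nn.Linear modules are placeholders standing in for the detector and the generator so that the snippet stays self-contained.

```python
# Sketch: optimizer settings from Section 3.1 (placeholder modules, not the authors' code).
import torch
import torch.nn as nn

detector = nn.Linear(2048, 2048)    # placeholder for the DeFRCN-based detector
generator = nn.Linear(2048, 2048)   # placeholder for the conditional diffusion generator

det_optimizer = torch.optim.SGD(
    detector.parameters(),
    lr=0.02,                        # 0.02 for base training; switched to 0.01 for novel-class fine-tuning
    momentum=0.9,
    weight_decay=5e-5,
)
gen_optimizer = torch.optim.AdamW(
    generator.parameters(),
    lr=1e-3,
    weight_decay=1e-4,
)
```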
3.2 Comparison with existing methods

(1) Experiments on the PASCAL VOC dataset. Table 1 compares the detection results of the proposed method with those of other methods on the PASCAL VOC dataset. The proposed method outperforms the existing methods in most few-shot settings.
Table 1  Comparison between the proposed method and other methods on the PASCAL VOC dataset (mAP50). Columns are grouped by Novel Set 1, Novel Set 2 and Novel Set 3, each under the 1/2/3/5/10-shot settings.

| Method / shot | 1 | 2 | 3 | 5 | 10 | 1 | 2 | 3 | 5 | 10 | 1 | 2 | 3 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TFA [6] | 39.8 | 36.1 | 44.7 | 55.7 | 56.0 | 23.5 | 26.9 | 34.1 | 35.1 | 39.1 | 30.8 | 34.8 | 42.8 | 49.5 | 49.8 |
| MPSR [23] | 41.7 | – | 51.4 | 55.2 | 61.8 | 24.4 | – | 39.2 | 39.9 | 47.8 | 35.6 | – | 42.3 | 48 | 49.7 |
| TIP [24] | 27.7 | 36.5 | 43.3 | 50.2 | 59.6 | 22.7 | 30.1 | 33.8 | 40.9 | 46.9 | 21.7 | 30.6 | 38.1 | 44.5 | 50.9 |
| DCNet [25] | 33.9 | 37.4 | 43.7 | 51.1 | 59.6 | 23.2 | 24.8 | 30.6 | 36.7 | 46.6 | 32.3 | 34.9 | 39.7 | 42.6 | 50.7 |
| CME [26] | 41.5 | 47.5 | 50.4 | 58.2 | 60.9 | 27.2 | 30.2 | 41.4 | 42.5 | 46.8 | 34.3 | 39.6 | 45.1 | 48.3 | 51.5 |
| SRR-FSD [10] | 47.8 | 50.5 | 51.3 | 55.2 | 56.8 | 32.5 | 35.3 | 39.1 | 40.8 | 43.8 | 40.1 | 41.5 | 44.3 | 46.9 | 46.4 |
| FADI [27] | 50.3 | 54.8 | 54.2 | 59.3 | 63.2 | 30.6 | 35.0 | 40.3 | 42.8 | 48.0 | 45.7 | 49.7 | 49.1 | 55.0 | 59.6 |
| DeFRCN [9] | 45.7 | 56.4 | 59.3 | 62.6 | 64.6 | 35.7 | 40.5 | 45.3 | 50.4 | 54.1 | 39.8 | 50.6 | 52.8 | 56.1 | 60.8 |
| FCT [28] | 38.5 | 49.6 | 53.5 | 59.8 | 64.3 | 25.9 | 34.2 | 40.1 | 44.9 | 47.4 | 34.7 | 43.9 | 49.3 | 53.1 | 56.3 |
| Meta-DETR [29] | 35.1 | 49.0 | 53.2 | 57.4 | 62.0 | 27.9 | 32.3 | 38.4 | 43.2 | 51.8 | 34.9 | 41.8 | 47.1 | 54.1 | 58.2 |
| Proposed method | 47.8 | 56.6 | 59.3 | 63.2 | 65.6 | 37.5 | 43.1 | 47.5 | 52.0 | 56.3 | 40.2 | 50.8 | 53.8 | 56.6 | 62.4 |

(2) Experiments on the MS COCO dataset. Table 2 compares the detection results of the proposed method with those of other methods on the MS COCO dataset; the proposed method achieves the best results in most few-shot settings. Fig. 5 compares the detections of the proposed method and DeFRCN under the 1-shot setting. In the first image, the proposed model reduces the missed detections of DeFRCN. In the second image, the lack of data biases DeFRCN, which falsely detects a person because of features such as clothes and glasses, whereas the proposed model alleviates this problem by generating new features. In the third image, DeFRCN misclassifies a dog as a cow while the proposed model detects it correctly, showing that the proposed method mitigates category confusion; however, both methods miss the cat in the upper-left corner of that image.
Table 2  Comparison between the proposed method and other methods on the MS COCO dataset (mAP).

| Method | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot | 30-shot |
|---|---|---|---|---|---|---|
| TFA [6] | 4.4 | 5.4 | 6 | 7.7 | 10.0 | 13.7 |
| MPSR [23] | 5.1 | 6.7 | 7.4 | 8.7 | 9.8 | 14.1 |
| FSDetView [30] | 4.5 | 6.6 | 7.2 | 10.7 | 12.5 | 14.7 |
| TIP [24] | – | – | – | – | 16.3 | 18.3 |
| DCNet [25] | – | – | – | – | 12.8 | 18.6 |
| CME [26] | – | – | – | – | 15.1 | 16.9 |
| SRR-FSD [10] | – | – | – | – | 11.3 | 14.7 |
| FADI [27] | 5.7 | 7 | 8.6 | 10.1 | 12.2 | 16.1 |
| DeFRCN [9] | 7.7 | 11.4 | 13.2 | 15.5 | 18.5 | 22.4 |
| FCT [28] | 5.1 | 7.2 | 9.8 | 12.0 | 15.3 | 20.2 |
| Meta-DETR [29] | 7.5 | – | 13.5 | 15.4 | 19.0 | 22.2 |
| Proposed method | 9.0 | 12.2 | 13.6 | 15.7 | 18.6 | 22.6 |

(3) Cross-domain experiment from MS COCO to PASCAL VOC. Following the same setting as previous work [9,23], the 60 base classes of the MS COCO dataset are used as the base classes and the 20 classes of the PASCAL VOC dataset as the novel classes, with 10 samples per novel class. Table 3 compares the detection results of different methods; the proposed method achieves good detection performance, showing that it generalizes well in the cross-domain setting.
3.3 Ablation studies

A series of ablation studies is conducted on the MS COCO dataset under the 1-shot setting. As shown in Tables 4, 5 and 6, the effectiveness of the inter-class and intra-class conditional control module is verified, and the influence of different parameter settings on detection performance is evaluated.
Table 4  Effectiveness of the inter-class and intra-class conditional control module

| Semantic embedding (inter-class) | Semantic relation embedding (inter-class) | Intra-class conditional control | mAP (%) |
|---|---|---|---|
| √ | | | 8.3 |
| √ | √ | | 8.8 |
| √ | | √ | 8.6 |
| √ | √ | √ | 9.0 |

Table 5  Results with different parameter settings

| Range of I | γ | mAP (%) |
|---|---|---|
| [0.5, 1] | 0.4 | 8.7 |
| [0.5, 0.75] | 0.4 | 8.2 |
| [0.75, 1] | 0.3 | 8.8 |
| [0.75, 1] | 0.5 | 8.7 |
| [0.75, 1] | 0.4 | 9.0 |

Table 6  Results with different values of T

| T | mAP (%) |
|---|---|
| 900 | 8.9 |
| 1 000 | 9.0 |
| 1 100 | 9.0 |

(1) Effectiveness of the inter-class and intra-class conditional control module. As a baseline, the data generator is first added to the detection framework using only the semantic embedding as the condition guiding the generator to synthesize samples for the specified classes; the result is shown in the first row of Table 4. Three further settings are then tested: (a) adding the semantic relation embedding as a condition controlling the category of the generated samples; (b) adding the intra-class conditional control to constrain the positions of the generated samples in feature space; and (c) adding both. As Table 4 shows, the semantic relation embedding and the intra-class conditional control improve the few-shot detection mAP by 6.02% and 3.61%, respectively, and using both together improves it by 8.43%.
For the 20 novel classes of the MS COCO dataset, Fig. 6 shows t-SNE visualizations of the samples generated under different control conditions. With the inter-class and intra-class conditional control module, the generated samples of the same class become more compact and different classes become more separable, reducing confusion between categories and demonstrating the effectiveness of the proposed method.

(2) Influence of the parameter I. The parameter I determines the representativeness of the generated samples; a larger I indicates higher representativeness. Rows 1, 2 and 5 of Table 5 compare different ranges of I. In each experiment, six values are sampled uniformly from the given range as the values of I, and each value of I corresponds to 10 generated samples per class. The results show that larger values of I lead to better detection performance.

(3) Influence of the parameter γ. Rows 3 to 5 of Table 5 show how detection performance changes when the hyperparameter γ in Eq. (4) is set to different values. The model performs best when γ is set to 0.4, which is therefore used as the default value in the experiments.

(4) Influence of the parameter T. The parameter T is the total number of diffusion steps in the diffusion model; a larger T means more diffusion steps and more computation during sample generation. Table 6 shows the effect of different values of T on detection performance. When T < 1 000, the computation required for sample generation decreases but detection performance drops; when T > 1 000, the computation increases without a clear performance gain. T is therefore set to 1 000 in the experiments.
3.4 Model complexity analysis

Table 7 reports the changes in the number of Parameters (#Param.) and FLoating Point Operations (FLOPs) of the proposed method relative to the baseline model. FLOPs of the object detector are computed with an input image size of 800×800, and FLOPs of the data generator are computed with a fixed RoI feature vector of length 2 048. The complexity of the different stages of the framework is analyzed as follows. (1) Training the data generator involves 34.6M trainable parameters and 32.5G FLOPs per forward pass. Because the detector takes images as input while the generator takes RoI features, which contain far less data than images, one forward pass of the generator costs much less computation than the detector. (2) When generating samples, producing one batch of new samples requires T denoising steps, i.e. T forward passes, so the computational cost is relatively high. (3) During fine-tuning of the detection network, the generated samples can be saved to local files and read directly, so sample generation does not need to run concurrently with fine-tuning and the fine-tuning cost does not increase noticeably. (4) During inference on the novel classes, the structure of the detector itself is unchanged relative to the baseline, so the fine-tuned detector has the same number of parameters and FLOPs as the baseline model.
4. Conclusion

This paper proposes a new data-generation-based framework that provides additional samples for few-shot classes by adding a diffusion-model-based data generator to a two-stage fine-tuning object detection network. An inter-class and intra-class conditional control module is proposed to regulate the data generator: learning inter-class relations improves the quality of the generated samples, and constraining the positions of the samples improves their representativeness, which ultimately improves few-shot detection accuracy. Future work will explore applying the method to few-shot semantic segmentation and other few-shot tasks.
References
[1] LIU Qiankun, LIU Rui, ZHENG Bolun, et al. Infrared small target detection with scale and location sensitivity[C]. Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 17490–17499. doi: 10.1109/CVPR52733.2024.01656.
[2] ZHANG Gang, CHEN Junnan, GAO Guohuan, et al. SAFDNet: A simple and effective network for fully sparse 3D object detection[C]. Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 14477–14486. doi: 10.1109/CVPR52733.2024.01372.
[3] YE Mingqiao, KE Lei, LI Siyuan, et al. Cascade-DETR: Delving into high-quality universal object detection[C]. Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 6704–6714. doi: 10.1109/ICCV51070.2023.00617.
[4] WANG C Y, BOCHKOVSKIY A, and LIAO H Y M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 7464–7475. doi: 10.1109/CVPR52729.2023.00721.
[5] WANG Yuxiong, RAMANAN D, and HEBERT M. Meta-learning to detect rare objects[C]. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9925–9934. doi: 10.1109/ICCV.2019.01002.
[6] WANG Xin, HUANG T E, DARRELL T, et al. Frustratingly simple few-shot object detection[C]. Proceedings of the 37th International Conference on Machine Learning, 2020: 920.
[7] SUN Bo, LI Banghuai, CAI Shengcai, et al. FSCE: Few-shot object detection via contrastive proposal encoding[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7352–7362. doi: 10.1109/CVPR46437.2021.00727.
[8] YAN Xiaopeng, CHEN Ziliang, XU Anni, et al. Meta R-CNN: Towards general solver for instance-level low-shot learning[C]. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9577–9586. doi: 10.1109/ICCV.2019.00967.
[9] QIAO Limeng, ZHAO Yuxuan, LI Zhiyuan, et al. DeFRCN: Decoupled faster R-CNN for few-shot object detection[C]. Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 8681–8690. doi: 10.1109/ICCV48922.2021.00856.
[10] ZHU Chenchen, CHEN Fangyi, AHMED U, et al. Semantic relation reasoning for shot-stable few-shot object detection[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8782–8791. doi: 10.1109/CVPR46437.2021.00867.
[11] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031.
[12] ZHANG Weilin and WANG Yuxiong. Hallucination improves few-shot object detection[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13008–13017. doi: 10.1109/CVPR46437.2021.01281.
[13] ZHU Pengkai, WANG Hanxiao, and SALIGRAMA V. Don't even look once: Synthesizing features for zero-shot detection[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11693–11702. doi: 10.1109/CVPR42600.2020.01171.
[14] XU Jingyi, LE H, and SAMARAS D. Generating features with increased crop-related diversity for few-shot object detection[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 19713–19722. doi: 10.1109/CVPR52729.2023.01888.
[15] HO J, JAIN A, and ABBEEL P. Denoising diffusion probabilistic models[C]. Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 574.
[16] HO J and SALIMANS T. Classifier-free diffusion guidance[EB/OL]. https://arxiv.org/abs/2207.12598, 2022.
[17] QI Tianhao, FANG Shancheng, WU Yanze, et al. DEADiff: An efficient stylization diffusion model with disentangled representations[C]. Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 8693–8702. doi: 10.1109/CVPR52733.2024.00830.
[18] GARBER T and TIRER T. Image restoration by denoising diffusion models with iteratively preconditioned guidance[C]. Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 25245–25254. doi: 10.1109/CVPR52733.2024.02385.
[19] LI Muyang, CAI Tianle, CAO Jiaxin, et al. DistriFusion: Distributed parallel inference for high-resolution diffusion models[C]. Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 7183–7193. doi: 10.1109/CVPR52733.2024.00686.
[20] HUANG Ziqi, CHAN K C K, JIANG Yuming, et al. Collaborative diffusion for multi-modal face generation and editing[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6080–6090. doi: 10.1109/CVPR52729.2023.00589.
[21] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. Proceedings of the 38th International Conference on Machine Learning, 2021: 8748–8763.
[22] RONNEBERGER O, FISCHER P, and BROX T. U-net: Convolutional networks for biomedical image segmentation[C]. Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241. doi: 10.1007/978-3-319-24574-4_28.
[23] WU Jiaxi, LIU Songtao, HUANG Di, et al. Multi-scale positive sample refinement for few-shot object detection[C]. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 2020: 456–472. doi: 10.1007/978-3-030-58517-4_27.
[24] LI Aoxue and LI Zhenguo. Transformation invariant few-shot object detection[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 3094–3102. doi: 10.1109/CVPR46437.2021.00311.
[25] HU Hanzhe, BAI Shuai, LI Aoxue, et al. Dense relation distillation with context-aware aggregation for few-shot object detection[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 10185–10194. doi: 10.1109/CVPR46437.2021.01005.
[26] LI Bohao, YANG Boyu, LIU Chang, et al. Beyond max-margin: Class margin equilibrium for few-shot object detection[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7363–7372. doi: 10.1109/CVPR46437.2021.00728.
[27] CAO Yuhang, WANG Jiaqi, JIN Ying, et al. Few-shot object detection via association and discrimination[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 1267.
[28] HAN Guangxing, MA Jiawei, HUANG Shiyuan, et al. Few-shot object detection with fully cross-transformer[C]. Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 5321–5330. doi: 10.1109/CVPR52688.2022.00525.
[29] ZHANG Gongjie, LUO Zhipeng, CUI Kaiwen, et al. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12832–12843. doi: 10.1109/TPAMI.2022.3195735.
[30] XIAO Yang, LEPETIT V, and MARLET R. Few-shot object detection and viewpoint estimation for objects in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3090–3106. doi: 10.1109/TPAMI.2022.3174072.