A Multi-class Local Distribution-based Weighted Oversampling Algorithm for Multi-class Imbalanced Datasets

TAO Xinmin, XU Annan, SHI Lihang, LI Junxuan, GUO Xinyue, ZHANG Yanping

Citation: TAO Xinmin, XU Annan, SHI Lihang, LI Junxuan, GUO Xinyue, ZHANG Yanping. A Multi-class Local Distribution-based Weighted Oversampling Algorithm for Multi-class Imbalanced Datasets[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250381

doi: 10.11999/JEIT250381 cstr: 32379.14.JEIT250381
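Funds: The National Natural Science Foundation of China (62176050), the Natural Science Foundation of Shandong Province (ZR2024QA140)
Details
    Author biographies:

    TAO Xinmin: male, professor; research interests: small-sample classification, data mining

    XU Annan: male, master's student; research interest: imbalanced data mining

    SHI Lihang: female, master's student; research interest: data mining

    LI Junxuan: male, master's student; research interest: small-sample classification

    GUO Xinyue: female, master's student; research interest: data mining

    ZHANG Yanping: female, associate professor; research interest: data mining

    Corresponding author:

    TAO Xinmin taoxinmin@nefu.edu.cn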

  • CLC number: TN911

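  • Abstract: Beyond class imbalance itself, imbalanced datasets exhibit complicating factors such as class overlap, small disjuncts, outliers, and low-density regions. These factors further degrade classifier performance, especially in multi-class imbalanced classification. To address this, the paper proposes an adaptive oversampling algorithm based on the local distribution characteristics of multi-class imbalanced data (MC-LDWO). The algorithm first builds hyperspheres centered on the samples of every dynamically determined minority class, with radii that depend on the distribution of the current minority class. It then selects the minority samples that take part in oversampling according to the sample distribution inside each hypersphere, and designs an adaptive weight-allocation strategy from per-class local density indicators so that samples in low-density regions and near class boundaries are more likely to be oversampled. Next, a low-density vector is computed from the local distribution of the combined majority class and the minority class, and a random vector together with a truncation threshold determines where each synthetic sample is generated. Finally, an optimized decomposition strategy is used to solve the multi-class imbalanced classification problem. Experiments on multiple datasets show that MC-LDWO significantly outperforms the compared algorithms on all evaluation metrics, which supports its effectiveness on multi-class imbalanced data with these complicating factors.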
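The weighting idea in the abstract (minority samples in sparse regions should be picked more often as oversampling seeds) can be illustrated with a minimal Python sketch. This is not the MC-LDWO procedure: the fixed `radius`, the inverse-count weight, and the name `density_weights` are assumptions made only for illustration, whereas the paper derives radii and weights from the class distributions.

```python
import numpy as np

def density_weights(minority, radius):
    """Sampling weights that favour minority points in sparse regions.

    Assumed rule (illustrative only): weight is proportional to
    1 / (number of minority points inside the hypersphere of the given radius).
    """
    # Pairwise Euclidean distances between minority samples.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Count minority samples inside each hypersphere (the centre counts itself).
    counts = (dist <= radius).sum(axis=1)
    inverse = 1.0 / counts
    return inverse / inverse.sum()

# Three tightly clustered points and one isolated point.
minority = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
weights = density_weights(minority, radius=0.5)

# Draw the indices of the samples that would seed synthetic points.
rng = np.random.default_rng(0)
seeds = rng.choice(len(minority), size=10, p=weights)
```

In this toy case the isolated point receives half of the total weight, so it is drawn as a seed far more often than any single point of the dense cluster.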
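  • Fig. 1  Oversampling results of various oversampling algorithms on the synthetic datasets

    Fig. 2  Detailed strategy for determining the generation position when the hypersphere contains both combined-majority-class samples and minority-class samples

    Fig. 3  Detailed strategy for determining the generation position when the hypersphere contains only combined-majority-class samples besides the minority sample at its center

    Fig. 4  Detailed strategy for determining the generation position

    Fig. 5  Oversampling results obtained by the MC-LDWO algorithm

    Fig. 6  Average ranks of the compared algorithms over all test datasets with the C5.0 classifier

    Fig. 7  Average ranks of the compared algorithms over all test datasets with the MLP classifier

    Fig. 8  Average ranks of the compared algorithms over all test datasets with the NB classifier

    Fig. 9  $ AvG $ results for different values of $ t $

    Fig. 10  $ AvG $ results for different values of $ \beta $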

    Table 1  Detailed characteristics of the multi-class imbalanced datasets used in the experiments

    | Dataset | Samples | Classes | Features | Samples per class | IR |
    | breast-tissue | 106 | 6 | 9 | 22/21/14/15/16/18 | 1.38 |
    | wine | 178 | 3 | 13 | 59/71/48 | 1.48 |
    | led7digit | 500 | 10 | 7 | 45/37/51/57/52/52/47/57/53/49 | 1.54 |
    | hayes-roth | 132 | 3 | 4 | 51/51/30 | 1.70 |
    | contraceptive | 1473 | 3 | 9 | 629/333/511 | 1.89 |
    | satimage | 6435 | 6 | 36 | 1533/703/1358/626/707/1508 | 2.45 |
    | vertebral_c | 310 | 3 | 6 | 60/100/150 | 2.50 |
    | new-thyroid | 215 | 3 | 5 | 150/35/30 | 5.00 |
    | dermatology* | 358 | 6 | 34 | 111/60/71/48/48/20 | 5.55 |
    | balance* | 625 | 3 | 4 | 288/49/288 | 5.88 |
    | flare* | 1066 | 6 | 11 | 211/331/147/239/95/43 | 7.70 |
    | glass* | 214 | 6 | 9 | 70/76/17/13/9/29 | 8.44 |
    | plates-faults1* | 1941 | 7 | 27 | 158/190/391/72/55/402/673 | 12.24 |
    | cleveland* | 303 | 5 | 13 | 164/55/36/35/13 | 12.62 |
    | plates-faults3* | 1941 | 5 | 27 | 55/72/348/793/673 | 15.69 |
    | SkillCraft1* | 3395 | 8 | 19 | 167/347/553/811/806/621/35/55 | 23.17 |
    | thyroid* | 7200 | 3 | 21 | 166/368/6666 | 40.16 |
    | Anuran Calls* | 7195 | 10 | 22 | 672/3478/542/310/472/1121/270/114/68/148 | 51.15 |
    | winequality-red* | 1599 | 6 | 11 | 10/53/681/638/199/18 | 68.10 |
    | yeast* | 1484 | 10 | 8 | 244/429/463/44/51/163/35/30/20/5 | 92.60 |
    | Mapping* | 10545 | 6 | 28 | 7431/1441/969/446/205/53 | 140.21 |
    | pageblocks* | 5472 | 5 | 10 | 4913/329/28/87/115 | 175.56 |
    | winequality-white* | 4898 | 7 | 11 | 20/163/1457/2198/880/175/5 | 439.60 |
    | shuttle* | 57999 | 7 | 9 | 45586/49/171/8903/3267/10/13 | 4558.60 |

    "*" marks a dataset that is regarded as complex.
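The IR column is not defined on this page. A common reading, assumed here, is the imbalance ratio: the size of the largest class divided by the size of the smallest. The short Python sketch below applies that assumed definition to three rows of Table 1; the function name `imbalance_ratio` is illustrative. This reading reproduces most, though not all, of the tabulated values.

```python
def imbalance_ratio(class_sizes):
    """Largest class size divided by smallest class size (assumed definition of IR)."""
    return max(class_sizes) / min(class_sizes)

# Per-class sample counts copied from Table 1.
wine = [59, 71, 48]
glass = [70, 76, 17, 13, 9, 29]
shuttle = [45586, 49, 171, 8903, 3267, 10, 13]

ir_wine = round(imbalance_ratio(wine), 2)        # 71 / 48
ir_glass = round(imbalance_ratio(glass), 2)      # 76 / 9
ir_shuttle = round(imbalance_ratio(shuttle), 2)  # 45586 / 10
```

For these three rows the results agree with the table (1.48, 8.44 and 4558.60).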

    Table 2  Results of the Holm test with MC-LDWO as the control algorithm

    | Metric | Algorithm | C5.0 $ {\alpha }_{0.05} $ | C5.0 $ p $-value | MLP $ {\alpha }_{0.05} $ | MLP $ p $-value | NB $ {\alpha }_{0.05} $ | NB $ p $-value |
    | AvF | STATIC-SMOTE | 0.0064 | 1.1800e-44 | 0.0057 | 2.9003e-71 | 0.0057 | 2.9614e-27 |
    | AvF | MDO | 0.0057 | 4.6247e-47 | 0.0064 | 3.2744e-64 | 0.0064 | 7.7082e-26 |
    | AvF | MC-RBO | 0.0085 | 2.4986e-31 | 0.0102 | 4.0471e-48 | 0.0073 | 1.0580e-16 |
    | AvF | MC-CCR | 0.0102 | 8.1267e-27 | 0.0085 | 1.7582e-53 | 0.0102 | 3.9708e-15 |
    | AvF | EOS | 0.0073 | 1.5390e-36 | 0.0073 | 2.5027e-54 | 0.0085 | 5.8226e-16 |
    | AvF | CCO | 0.0127 | 2.5865e-18 | 0.0170 | 7.6682e-22 | 0.0170 | 3.4142e-11 |
    | AvF | AdaBoost.AD | 0.0253 | 3.9674e-05 | 0.05 | 3.5897e-04 | 0.05 | 1.1912e-03 |
    | AvF | MC-MBRC | 0.0170 | 7.4773e-17 | 0.0253 | 2.0960e-09 | 0.0127 | 2.2229e-11 |
    | AvF | HCE-MCD | 0.05 | 2.2639e-04 | 0.0127 | 4.1612e-23 | 0.0253 | 7.2076e-04 |
    | AvG | STATIC-SMOTE | 0.0057 | 3.5994e-53 | 0.0057 | 4.1636e-65 | 0.0057 | 4.4918e-56 |
    | AvG | MDO | 0.0064 | 1.3933e-49 | 0.0064 | 4.2750e-62 | 0.0064 | 1.0076e-55 |
    | AvG | MC-RBO | 0.0102 | 5.9281e-27 | 0.0085 | 1.0189e-47 | 0.0085 | 1.3766e-34 |
    | AvG | MC-CCR | 0.0085 | 2.7611e-29 | 0.0102 | 9.7086e-41 | 0.0102 | 1.2688e-31 |
    | AvG | EOS | 0.0073 | 1.0580e-36 | 0.0073 | 1.5495e-54 | 0.0073 | 7.3381e-48 |
    | AvG | CCO | 0.0127 | 5.1465e-15 | 0.0170 | 3.9797e-20 | 0.0170 | 4.6367e-18 |
    | AvG | AdaBoost.AD | 0.0253 | 7.0072e-08 | 0.0253 | 5.5773e-07 | 0.0127 | 5.7676e-21 |
    | AvG | MC-MBRC | 0.0170 | 4.3712e-12 | 0.0127 | 1.1056e-26 | 0.05 | 8.8066e-06 |
    | AvG | HCE-MCD | 0.05 | 6.6996e-07 | 0.05 | 7.5201e-07 | 0.0253 | 2.5207e-06 |
    | AvAUC | STATIC-SMOTE | 0.0057 | 9.1653e-39 | 0.0057 | 4.5353e-57 | 0.0057 | 2.4406e-41 |
    | AvAUC | MDO | 0.0064 | 1.4965e-36 | 0.0064 | 3.1707e-53 | 0.0064 | 3.5996e-37 |
    | AvAUC | MC-RBO | 0.0085 | 4.8793e-20 | 0.0102 | 2.7833e-37 | 0.0102 | 5.3970e-22 |
    | AvAUC | MC-CCR | 0.0127 | 5.0720e-17 | 0.0085 | 1.3510e-39 | 0.0085 | 1.5324e-22 |
    | AvAUC | EOS | 0.0073 | 4.9240e-25 | 0.0073 | 2.58889e-49 | 0.0073 | 1.0680e-29 |
    | AvAUC | CCO | 0.0102 | 2.8768e-17 | 0.0127 | 5.1594e-19 | 0.0127 | 5.7970e-17 |
    | AvAUC | AdaBoost.AD | 0.0170 | 8.3507e-14 | 0.0170 | 1.2937e-17 | 0.05 | 2.3991e-05 |
    | AvAUC | MC-MBRC | 0.0253 | 9.9428e-06 | 0.05 | 4.6246e-03 | 0.0253 | 6.4738e-06 |
    | AvAUC | HCE-MCD | 0.05 | 2.0335e-05 | 0.0253 | 3.9545e-06 | 0.0170 | 4.2003e-14 |
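The threshold values in Table 2 appear consistent with a step-down procedure whose thresholds have the form $ 1-(1-\alpha)^{1/m} $ for m = 9, 8, ..., 1 remaining comparisons at $ \alpha = 0.05 $. That form is inferred from the tabulated numbers and is not stated on this page. A minimal Python sketch under that assumption, applied to the C5.0 / AvF column (the function names are illustrative):

```python
def step_down_thresholds(n_comparisons, alpha=0.05):
    """Thresholds 1 - (1 - alpha)**(1/m) for m = n, n-1, ..., 1 (assumed form)."""
    return [1 - (1 - alpha) ** (1 / m) for m in range(n_comparisons, 0, -1)]

def step_down_test(p_values, alpha=0.05):
    """Map each algorithm to True if its null hypothesis is rejected."""
    # Test p-values from smallest to largest against growing thresholds.
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    thresholds = step_down_thresholds(len(ordered), alpha)
    rejected, still_rejecting = {}, True
    for (name, p), threshold in zip(ordered, thresholds):
        # Once one hypothesis is retained, every later one is retained too.
        still_rejecting = still_rejecting and p < threshold
        rejected[name] = still_rejecting
    return rejected

# p-values copied from the C5.0 / AvF column of Table 2.
c50_avf = {
    "STATIC-SMOTE": 1.1800e-44, "MDO": 4.6247e-47, "MC-RBO": 2.4986e-31,
    "MC-CCR": 8.1267e-27, "EOS": 1.5390e-36, "CCO": 2.5865e-18,
    "AdaBoost.AD": 3.9674e-05, "MC-MBRC": 7.4773e-17, "HCE-MCD": 2.2639e-04,
}

thresholds = [round(t, 4) for t in step_down_thresholds(len(c50_avf))]
result = step_down_test(c50_avf)
```

Under this assumption the rounded thresholds reproduce the nine distinct values listed in the table, and every comparison in this column is rejected at the 0.05 level.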
Publication history
  • Received:  2025-04-29
  • Revised:  2025-09-17
  • Published online:  2025-09-23
