A Multi-class Local Distribution-based Weighted Oversampling Algorithm for Multi-class Imbalanced Datasets
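Abstract (translated from the Chinese): Beyond class imbalance, imbalanced datasets also involve complicating factors such as class overlap, small disjuncts, outliers, and low-density regions. These factors further degrade classifier performance, especially in multi-class imbalanced classification. To address this, the paper proposes an adaptive oversampling algorithm based on the local distribution characteristics of multi-class imbalanced data (MC-LDWO). The algorithm first constructs hyperspheres centered on the samples of all dynamically determined minority classes, with radii that depend on the distribution of the current minority class. Then, based on the sample distribution inside each hypersphere, it selects the minority class samples that take part in oversampling and designs an adaptive weight allocation strategy from per-class local density indicators, so that samples in low-density regions and near class boundaries have a higher probability of being oversampled. Next, a low-density vector is computed from the combined local distribution information of the majority and minority classes, and a random vector and a cutoff threshold are introduced to determine where synthetic samples are generated. Finally, an optimized decomposition strategy is used to solve the multi-class imbalanced classification problem. Experiments on multiple datasets show that MC-LDWO significantly outperforms the compared algorithms on all evaluation metrics, which verifies its effectiveness on multi-class imbalanced classification problems with complex factors.
Abstract: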
Objective: Classification with imbalanced datasets remains one of the most challenging problems in machine learning. In addition to class imbalance, such datasets often contain complicating factors, including class overlap, small disjuncts, outliers, and low-density regions, all of which can substantially degrade classifier performance, particularly in multi-class settings. To address these challenges together, this study proposes the Multi-class Local Distribution-based Weighted Oversampling Algorithm (MC-LDWO).
Methods: MC-LDWO first constructs hyperspheres centered on the samples of dynamically determined minority classes, with radii estimated from the distribution of each class. Within these hyperspheres, minority class samples are selected for oversampling according to their local distribution, and an adaptive weight allocation strategy is designed using local density metrics, so that samples in low-density regions and near class boundaries receive higher probabilities of being oversampled. Next, a low-density vector is computed from the local distribution of both the majority and minority classes. A random vector is combined with the low-density vector, and a cutoff threshold is applied to determine where synthetic samples are generated, which reduces class overlap during boundary oversampling. Finally, an improved decomposition strategy tailored to multi-class imbalance is used to further enhance classification performance.
Results and Discussions: MC-LDWO dynamically identifies the minority and combined majority class sample sets and constructs hyperspheres centered on each minority class sample, with radii determined by the distribution of the corresponding minority class. These hyperspheres guide the subsequent oversampling process. A trade-off parameter ($ \beta $) balances the influence of the local densities of the combined majority class and the minority class. Experimental results on KEEL datasets show that this approach prevents class overlap during boundary oversampling while assigning higher oversampling weights to critical minority samples located near boundaries and in low-density regions. This improves the boundary distribution and addresses within-class imbalance at the same time. When the trade-off parameter is set to 0.5, MC-LDWO balances boundary distribution against the varying densities that data difficulty factors produce within minority classes, which supports improved performance in downstream classification tasks (Fig. 10).
Conclusions: Comparative results against other state-of-the-art oversampling algorithms demonstrate the following.
(1) MC-LDWO prevents overlap when strengthening decision boundaries by setting the cutoff threshold ($ T $), and it adaptively assigns oversampling weights according to two local density indicators, one for the minority class and one for the combined majority class within the hypersphere. This addresses within-class imbalance caused by data difficulty factors and improves the boundary distribution.
(2) By jointly considering density and boundary distribution, with the trade-off parameter set to 0.5, the algorithm can mitigate within-class imbalance and reinforce the boundary information of minority classes at the same time.
(3) On highly imbalanced datasets with complex decision boundaries and data difficulty factors such as outliers and small disjuncts, MC-LDWO significantly improves the boundary distribution of each minority class while managing within-class imbalance, thereby enhancing the performance of subsequent classifiers.
Key words:
- Imbalanced datasets
- Classification
- Oversampling
- Within-class imbalance
- Hypersphere
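The weighting and generation steps summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the radius rule (mean nearest-neighbour distance within the minority class), the two density indicators (`sparsity`, `boundary`), the way `beta` blends them, and the reading of the cutoff as a cap on displacement relative to the radius. The low-density vector is not modeled; a purely random direction stands in for it. The function names `local_density_weights` and `synthesize` are hypothetical.

```python
import numpy as np


def local_density_weights(X_min, X_other, beta=0.5):
    """Illustrative oversampling weights for one minority class.

    X_min:   (n, d) samples of the minority class, n >= 2.
    X_other: (m, d) samples of all other classes combined.
    Returns (weights, radius). All formulas are assumptions, not the
    published MC-LDWO definitions.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_other = np.asarray(X_other, dtype=float)

    # Pairwise Euclidean distances, minority-to-minority and minority-to-other.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    d_other = np.linalg.norm(X_min[:, None, :] - X_other[None, :, :], axis=2)
    np.fill_diagonal(d_min, np.inf)  # a sample is not its own neighbour

    # ASSUMPTION: one radius per class, the mean nearest-neighbour distance.
    radius = float(d_min.min(axis=1).mean())

    # Counts of same-class and other-class samples inside each hypersphere.
    n_same = (d_min <= radius).sum(axis=1)
    n_other = (d_other <= radius).sum(axis=1)

    # ASSUMPTION: two simple indicators blended by beta.
    sparsity = 1.0 / (1.0 + n_same)                  # high in low-density regions
    boundary = n_other / (1.0 + n_same + n_other)    # high near other classes
    weights = beta * boundary + (1.0 - beta) * sparsity

    total = weights.sum()
    if total == 0.0:  # e.g. beta = 1 with no other-class neighbours anywhere
        return np.full(len(X_min), 1.0 / len(X_min)), radius
    return weights / total, radius


def synthesize(X_min, weights, radius, n_new, cutoff=0.5, seed=0):
    """Draw seed samples by weight and displace them by at most cutoff * radius."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    seeds = rng.choice(len(X_min), size=n_new, p=weights)

    # Random unit directions (stand-in for the low-density + random vector).
    directions = rng.normal(size=(n_new, X_min.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # ASSUMPTION: the cutoff caps the step length as a fraction of the radius.
    steps = rng.uniform(0.0, cutoff * radius, size=(n_new, 1))
    return X_min[seeds] + directions * steps
```

With `beta = 0.5` the two indicators contribute equally, which mirrors the setting the abstract reports; values closer to 1 would favour samples near other classes, and values closer to 0 would favour isolated samples.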
Table 1 Detailed characteristics of the multi-class imbalanced datasets used in the experiments

| Dataset | Total samples | Classes | Features | Samples per class | IR |
|---|---|---|---|---|---|
| breast-tissue | 106 | 6 | 9 | 22/21/14/15/16/18 | 1.38 |
| wine | 178 | 3 | 13 | 59/71/48 | 1.48 |
| led7digit | 500 | 10 | 7 | 45/37/51/57/52/52/47/57/53/49 | 1.54 |
| hayes-roth | 132 | 3 | 4 | 51/51/30 | 1.70 |
| contraceptive | 1473 | 3 | 9 | 629/333/511 | 1.89 |
| satimage | 6435 | 6 | 36 | 1533/703/1358/626/707/1508 | 2.45 |
| vertebral_c | 310 | 3 | 6 | 60/100/150 | 2.50 |
| new-thyroid | 215 | 3 | 5 | 150/35/30 | 5.00 |
| dermatology* | 358 | 6 | 34 | 111/60/71/48/48/20 | 5.55 |
| balance* | 625 | 3 | 4 | 288/49/288 | 5.88 |
| flare* | 1066 | 6 | 11 | 211/331/147/239/95/43 | 7.70 |
| glass* | 214 | 6 | 9 | 70/76/17/13/9/29 | 8.44 |
| plates-faults1* | 1941 | 7 | 27 | 158/190/391/72/55/402/673 | 12.24 |
| cleveland* | 303 | 5 | 13 | 164/55/36/35/13 | 12.62 |
| plates-faults3* | 1941 | 5 | 27 | 55/72/348/793/673 | 15.69 |
| SkillCraft1* | 3395 | 8 | 19 | 167/347/553/811/806/621/35/55 | 23.17 |
| thyroid* | 7200 | 3 | 21 | 166/368/6666 | 40.16 |
| Anuran Calls* | 7195 | 10 | 22 | 672/3478/542/310/472/1121/270/114/68/148 | 51.15 |
| winequality-red* | 1599 | 6 | 11 | 10/53/681/638/199/18 | 68.10 |
| yeast* | 1484 | 10 | 8 | 244/429/463/44/51/163/35/30/20/5 | 92.60 |
| Mapping* | 10545 | 6 | 28 | 7431/1441/969/446/205/53 | 140.21 |
| pageblocks* | 5472 | 5 | 10 | 4913/329/28/87/115 | 175.56 |
| winequality-white* | 4898 | 7 | 11 | 20/163/1457/2198/880/175/5 | 439.60 |
| shuttle* | 57999 | 7 | 9 | 45586/49/171/8903/3267/10/13 | 4558.60 |

"*" indicates that the corresponding dataset is a complex dataset.

Table 2 Holm test results with MC-LDWO as the control algorithm
| Metric | Algorithm | C5.0 $ \alpha_{0.05} $ | C5.0 $ p $-value | MLP $ \alpha_{0.05} $ | MLP $ p $-value | NB $ \alpha_{0.05} $ | NB $ p $-value |
|---|---|---|---|---|---|---|---|
| AvF | STATIC-SMOTE | 0.0064 | 1.1800e-44 | 0.0057 | 2.9003e-71 | 0.0057 | 2.9614e-27 |
| AvF | MDO | 0.0057 | 4.6247e-47 | 0.0064 | 3.2744e-64 | 0.0064 | 7.7082e-26 |
| AvF | MC-RBO | 0.0085 | 2.4986e-31 | 0.0102 | 4.0471e-48 | 0.0073 | 1.0580e-16 |
| AvF | MC-CCR | 0.0102 | 8.1267e-27 | 0.0085 | 1.7582e-53 | 0.0102 | 3.9708e-15 |
| AvF | EOS | 0.0073 | 1.5390e-36 | 0.0073 | 2.5027e-54 | 0.0085 | 5.8226e-16 |
| AvF | CCO | 0.0127 | 2.5865e-18 | 0.0170 | 7.6682e-22 | 0.0170 | 3.4142e-11 |
| AvF | AdaBoost.AD | 0.0253 | 3.9674e-05 | 0.05 | 3.5897e-04 | 0.05 | 1.1912e-03 |
| AvF | MC-MBRC | 0.0170 | 7.4773e-17 | 0.0253 | 2.0960e-09 | 0.0127 | 2.2229e-11 |
| AvF | HCE-MCD | 0.05 | 2.2639e-04 | 0.0127 | 4.1612e-23 | 0.0253 | 7.2076e-04 |
| AvG | STATIC-SMOTE | 0.0057 | 3.5994e-53 | 0.0057 | 4.1636e-65 | 0.0057 | 4.4918e-56 |
| AvG | MDO | 0.0064 | 1.3933e-49 | 0.0064 | 4.2750e-62 | 0.0064 | 1.0076e-55 |
| AvG | MC-RBO | 0.0102 | 5.9281e-27 | 0.0085 | 1.0189e-47 | 0.0085 | 1.3766e-34 |
| AvG | MC-CCR | 0.0085 | 2.7611e-29 | 0.0102 | 9.7086e-41 | 0.0102 | 1.2688e-31 |
| AvG | EOS | 0.0073 | 1.0580e-36 | 0.0073 | 1.5495e-54 | 0.0073 | 7.3381e-48 |
| AvG | CCO | 0.0127 | 5.1465e-15 | 0.0170 | 3.9797e-20 | 0.0170 | 4.6367e-18 |
| AvG | AdaBoost.AD | 0.0253 | 7.0072e-08 | 0.0253 | 5.5773e-07 | 0.0127 | 5.7676e-21 |
| AvG | MC-MBRC | 0.0170 | 4.3712e-12 | 0.0127 | 1.1056e-26 | 0.05 | 8.8066e-06 |
| AvG | HCE-MCD | 0.05 | 6.6996e-07 | 0.05 | 7.5201e-07 | 0.0253 | 2.5207e-06 |
| AvAUC | STATIC-SMOTE | 0.0057 | 9.1653e-39 | 0.0057 | 4.5353e-57 | 0.0057 | 2.4406e-41 |
| AvAUC | MDO | 0.0064 | 1.4965e-36 | 0.0064 | 3.1707e-53 | 0.0064 | 3.5996e-37 |
| AvAUC | MC-RBO | 0.0085 | 4.8793e-20 | 0.0102 | 2.7833e-37 | 0.0102 | 5.3970e-22 |
| AvAUC | MC-CCR | 0.0127 | 5.0720e-17 | 0.0085 | 1.3510e-39 | 0.0085 | 1.5324e-22 |
| AvAUC | EOS | 0.0073 | 4.9240e-25 | 0.0073 | 2.58889e-49 | 0.0073 | 1.0680e-29 |
| AvAUC | CCO | 0.0102 | 2.8768e-17 | 0.0127 | 5.1594e-19 | 0.0127 | 5.7970e-17 |
| AvAUC | AdaBoost.AD | 0.0170 | 8.3507e-14 | 0.0170 | 1.2937e-17 | 0.05 | 2.3991e-05 |
| AvAUC | MC-MBRC | 0.0253 | 9.9428e-06 | 0.05 | 4.6246e-03 | 0.0253 | 6.4738e-06 |
| AvAUC | HCE-MCD | 0.05 | 2.0335e-05 | 0.0253 | 3.9545e-06 | 0.0170 | 4.2003e-14 |
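The step-down logic behind Table 2 can be sketched as follows. With nine comparison algorithms, the classic Holm thresholds $ \alpha/m $ (for $ m = 9, 8, \ldots, 1 $) start at 0.0056, whereas the $ \alpha_{0.05} $ values listed in the table (0.0057, 0.0064, ..., 0.05) agree to four decimals with the Šidák-style form $ 1-(1-\alpha)^{1/m} $. Which formula the authors used is not stated in this section, so the `"sidak"` rule below is an inference from the numbers, and the function names are hypothetical. The p-values are the C5.0 AvF column of Table 2.

```python
def step_down_thresholds(k, alpha=0.05, rule="holm"):
    """Per-rank significance levels for k comparisons, smallest p-value first."""
    thresholds = []
    for i in range(k):
        m = k - i  # hypotheses still in play at this rank
        if rule == "holm":
            thresholds.append(alpha / m)
        elif rule == "sidak":
            thresholds.append(1.0 - (1.0 - alpha) ** (1.0 / m))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return thresholds


def step_down_test(p_values, alpha=0.05, rule="holm"):
    """Map each algorithm to True if its null hypothesis is rejected.

    p_values: dict of algorithm name -> p-value against the control.
    Rejection stops at the first p-value that exceeds its threshold.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    thresholds = step_down_thresholds(len(ordered), alpha, rule)
    decisions = {}
    still_rejecting = True
    for (name, p), threshold in zip(ordered, thresholds):
        still_rejecting = still_rejecting and p <= threshold
        decisions[name] = still_rejecting
    return decisions


# C5.0 / AvF p-values copied from Table 2 (MC-LDWO is the control).
p_c50_avf = {
    "STATIC-SMOTE": 1.1800e-44,
    "MDO": 4.6247e-47,
    "MC-RBO": 2.4986e-31,
    "MC-CCR": 8.1267e-27,
    "EOS": 1.5390e-36,
    "CCO": 2.5865e-18,
    "AdaBoost.AD": 3.9674e-05,
    "MC-MBRC": 7.4773e-17,
    "HCE-MCD": 2.2639e-04,
}

sidak_levels = [round(t, 4) for t in step_down_thresholds(9, rule="sidak")]
decisions = step_down_test(p_c50_avf, rule="sidak")
```

Under either rule every p-value in this column falls below its threshold, so all nine null hypotheses are rejected for C5.0 on AvF.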