A Multi-class Local Distribution-based Weighted Oversampling Algorithm for Multi-class Imbalanced Datasets

TAO Xinmin; XU Annan; SHI Lihang; LI Junxuan; GUO Xinyue; ZHANG Yanping

doi: 10.11999/JEIT250381 cstr: 32379.14.JEIT250381

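1. College of Civil Engineering and Transportation, Northeast Forestry University, Harbin 150040, China
2. School of Economics and Management, North China Electric Power University, Beijing 102206, China
3. Shandong Provincial Key Laboratory of Industrial Big Data and Intelligent Manufacturing, Jinan 250200, China

Funding: National Natural Science Foundation of China (62176050); Natural Science Foundation of Shandong Province (ZR2024QA140)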

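Author biographies:
TAO Xinmin: male, professor; research interests: small-sample classification and data mining

XU Annan: male, master's student; research interest: imbalanced data mining

SHI Lihang: female, master's student; research interest: data mining

LI Junxuan: male, master's student; research interest: small-sample classification

GUO Xinyue: female, master's student; research interest: data mining

ZHANG Yanping: female, associate professor; research interest: data mining

Corresponding author:
TAO Xinmin, taoxinmin@nefu.edu.cn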

CLC number: TN911
Publication history
- Received: 2025-04-29
- Revised: 2025-09-17
- Published online: 2025-09-23
- Issue date: 2025-11-10

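Abstract

Beyond class imbalance itself, imbalanced datasets also exhibit complicating factors such as class overlap, small disjuncts, outliers, and low-density regions. These factors further degrade classifier performance, especially in multi-class imbalanced classification. This paper therefore proposes an adaptive oversampling algorithm based on the local distribution features of multi-class imbalanced data (MC-LDWO). The algorithm first constructs hyperspheres centered on all dynamically determined minority-class samples, with radii that depend on the distribution of the current minority class. It then selects the minority-class samples that take part in oversampling according to the sample distribution inside each hypersphere, and designs an adaptive weight-allocation strategy from the local density indicators of each class, so that samples in low-density regions and near class boundaries have a higher probability of being oversampled. Next, a low-density vector is computed from the local distribution information of the combined majority class and the minority class, and a random vector and a cutoff threshold are introduced to determine where synthetic samples are generated. Finally, an optimized, dedicated decomposition strategy is used to solve the multi-class imbalanced classification problem. Experiments on multiple datasets show that MC-LDWO significantly outperforms the compared algorithms on all evaluation metrics, confirming its effectiveness for multi-class imbalanced classification with complicating factors.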
Abstract:

Objective Classification with imbalanced datasets remains one of the most challenging problems in machine learning. In addition to class imbalance, such datasets often contain complex factors including class overlap, small disjuncts, outliers, and low-density regions, all of which can substantially degrade classifier performance, particularly in multi-class settings. To address these challenges simultaneously, this study proposes the Multi-class Local Distribution-based Weighted Oversampling Algorithm (MC-LDWO).

Methods The MC-LDWO algorithm first constructs hyperspheres centered on dynamically determined minority classes, with radii estimated from the distribution of each class. Within these hyperspheres, minority class samples are selected for oversampling according to their local distribution, and an adaptive weight allocation strategy is designed using local density metrics. This ensures that samples in low-density regions and near class boundaries are assigned higher probabilities of being oversampled. Next, a low-density vector is computed from the local distribution of both majority and minority classes. A random vector is then introduced and integrated with the low-density vector, and a cutoff threshold is applied to determine the generation sites of synthetic samples, thereby reducing class overlap during boundary oversampling. Finally, an improved decomposition strategy tailored for multi-class imbalance is employed to further enhance classification performance in multi-class imbalanced scenarios.

Results and Discussions The MC-LDWO algorithm dynamically identifies the minority and combined majority class sample sets and constructs hyperspheres centered on each minority class sample, with radii determined by the distribution of the corresponding minority class. These hyperspheres guide the subsequent oversampling process. A trade-off parameter ($ \beta $) is introduced to balance the influence of local densities between the combined majority and minority classes. Experimental results on KEEL datasets show that this approach effectively prevents class overlap during boundary oversampling while assigning higher oversampling weights to critical minority samples located near boundaries and in low-density regions. This improves boundary distribution and simultaneously addresses within-class imbalance. When the trade-off parameter is set to 0.5, MC-LDWO achieves a balanced consideration of both boundary distribution and the diverse densities present in minority classes due to data difficulty factors, thereby supporting improved performance in downstream classification tasks (Fig. 10).

Conclusions Comparative results with other state-of-the-art oversampling algorithms demonstrate that: (1) The MC-LDWO algorithm effectively prevents overlap when strengthening decision boundaries by setting the cutoff threshold ($ T $) and adaptively assigns oversampling weights according to two local density indicators for the minority and combined majority classes within the hypersphere. This approach addresses within-class imbalance caused by data difficulty factors and enhances boundary distribution. (2) By jointly considering density and boundary distribution, and setting the trade-off parameter to 0.5, the proposed algorithm can simultaneously mitigate within-class imbalance and reinforce the boundary information of minority classes. (3) When applied to highly imbalanced datasets characterized by complex decision boundaries and data difficulty factors such as outliers and small disjuncts, MC-LDWO significantly improves the boundary distribution of each minority class while effectively managing within-class imbalance, thereby enhancing the performance of subsequent classifiers.
Keywords: Imbalanced datasets; Classification; Oversampling; Within-class imbalance; Hypersphere
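The abstract describes the weighting step only in words, so the following is a minimal sketch under stated assumptions and not the authors' formulas. It assumes one hypersphere per minority sample, a shared radius equal to the mean nearest-neighbour distance within the minority class, and a score that mixes a boundary indicator (share of majority samples inside the sphere) with a sparsity indicator (inverse of the minority count inside the sphere) through the trade-off parameter `beta`. The function name `oversampling_weights` and both indicators are hypothetical.

```python
import math


def oversampling_weights(minority, majority, beta=0.5):
    """Return one sampling probability per minority point (hypothetical scheme)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")

    # Assumed radius rule: mean nearest-neighbour distance inside the minority class.
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(minority) if j != i)
        for i, p in enumerate(minority)
    ]
    radius = sum(nearest) / len(nearest)

    scores = []
    for i, p in enumerate(minority):
        # Count the other minority samples and the majority samples inside the sphere.
        n_min = sum(
            1 for j, q in enumerate(minority)
            if j != i and math.dist(p, q) <= radius
        )
        n_maj = sum(1 for q in majority if math.dist(p, q) <= radius)
        inside = n_min + n_maj
        boundary = n_maj / inside if inside else 0.0  # assumed boundary indicator
        sparsity = 1.0 / (1.0 + n_min)                # assumed low-density indicator
        scores.append(beta * boundary + (1.0 - beta) * sparsity)

    total = sum(scores)
    if total == 0:
        # Possible only when beta = 1 and no sphere contains a majority sample.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


# Three clustered minority points and one isolated point next to a majority sample.
minority_pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
majority_pts = [(3.1, 3.0), (5.0, 5.0)]
weights = oversampling_weights(minority_pts, majority_pts, beta=0.5)
```

In this toy input the isolated point near the majority sample receives the largest weight, which is the qualitative behaviour the abstract describes for samples near boundaries and in low-density regions.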

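Fig. 1 Oversampling results obtained by various oversampling algorithms on the artificial dataset

Fig. 2 Detailed strategy for determining the generation position when both combined majority class samples and minority class samples are present in the hypersphere

Fig. 3 Detailed strategy for determining the generation position when the hypersphere contains only combined majority class samples besides the minority class sample at its center

Fig. 4 Detailed strategy for determining the generation position

Fig. 5 Oversampling results obtained by the MC-LDWO algorithm

Fig. 6 Average ranks of the compared algorithms over all test datasets with the C5.0 classifier

Fig. 7 Average ranks of the compared algorithms over all test datasets with the MLP classifier

Fig. 8 Average ranks of the compared algorithms over all test datasets with the NB classifier

Fig. 9 AvG results for different values of $ t $

Fig. 10 AvG results for different values of $ \beta $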
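The generation-position strategy named in the figure captions is not spelled out on this page, so the sketch below only illustrates the general idea from the abstract: combine a low-density vector with a random vector and truncate the step with a cutoff threshold. Everything concrete here is an assumption: the low-density vector is taken as the sum of unit vectors pointing away from the neighbours inside the hypersphere, the two directions are mixed with equal weight, and the cutoff is read as a fraction of the hypersphere radius. The names `synthesize` and `_unit` are hypothetical.

```python
import math
import random


def _unit(vec):
    """Scale vec to unit length; return None for a zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else None


def synthesize(center, neighbours, radius, cutoff=0.5, rng=None):
    """Generate one synthetic point near `center` (hypothetical scheme)."""
    rng = rng or random.Random()
    dim = len(center)

    # Assumed low-density vector: sum of unit vectors pointing away from neighbours.
    away = [0.0] * dim
    for q in neighbours:
        d = math.dist(center, q)
        if d > 0:
            for k in range(dim):
                away[k] += (center[k] - q[k]) / d

    # Random direction, redrawn in the (practically impossible) zero-vector case.
    rand_dir = None
    while rand_dir is None:
        rand_dir = _unit([rng.gauss(0.0, 1.0) for _ in range(dim)])

    low_dir = _unit(away)
    if low_dir is None:
        direction = rand_dir
    else:
        mixed = _unit([a + b for a, b in zip(low_dir, rand_dir)])
        direction = mixed if mixed is not None else rand_dir

    # Assumed cutoff rule: the step never exceeds cutoff * radius.
    step = rng.random() * cutoff * radius
    return [c + step * u for c, u in zip(center, direction)]


sample = synthesize((0.0, 0.0), [(1.0, 0.0), (0.5, 0.2)], radius=1.0,
                    cutoff=0.5, rng=random.Random(0))
```

Under these assumptions every synthetic point stays within `cutoff * radius` of the minority sample at the sphere center, which is one simple way to keep boundary oversampling from reaching into the neighbouring class.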

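Table 1 Detailed characteristics of the multi-class imbalanced datasets used in the experiments

Dataset	Total samples	Classes	Features	Samples per class	IR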
breast-tissue	106	6	9	22/21/14/15/16/18	1.38
wine	178	3	13	59/71/48	1.48
led7digit	500	10	7	45/37/51/57/52/52/47/57/53/49	1.54
hayes-roth	132	3	4	51/51/30	1.70
contraceptive	1473	3	9	629/333/511	1.89
satimage	6435	6	36	1533/703/1358/626/707/1508	2.45
vertebral_c	310	3	6	60/100/150	2.50
new-thyroid	215	3	5	150/35/30	5.00
dermatology*	358	6	34	111/60/71/48/48/20	5.55
balance*	625	3	4	288/49/288	5.88
flare*	1066	6	11	211/331/147/239/95/43	7.70
glass*	214	6	9	70/76/17/13/9/29	8.44
plates-faults1*	1941	7	27	158/190/391/72/55/402/673	12.24
cleveland*	303	5	13	164/55/36/35/13	12.62
plates-faults3*	1941	5	27	55/72/348/793/673	15.69
SkillCraft1*	3395	8	19	167/347/553/811/806/621/35/55	23.17
thyroid*	7200	3	21	166/368/6666	40.16
Anuran Calls*	7195	10	22	672/3478/542/310/472/1121/270/114/68/148	51.15
winequality-red*	1599	6	11	10/53/681/638/199/18	68.10
yeast*	1484	10	8	244/429/463/44/51/163/35/30/20/5	92.60
Mapping*	10545	6	28	7431/1441/969/446/205/53	140.21
pageblocks*	5472	5	10	4913/329/28/87/115	175.56
winequality-white*	4898	7	11	20/163/1457/2198/880/175/5	439.60
shuttle*	57999	7	9	45586/49/171/8903/3267/10/13	4558.60
"*" indicates that the corresponding dataset is a complex dataset.
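The table does not define IR. A minimal sketch, assuming IR is the ratio of the largest class size to the smallest class size, shows how the column can be derived from the per-class counts. Under this assumption most listed values are reproduced (for example wine, glass, and shuttle), while a few rows such as breast-tissue and plates-faults3 come out differently, so the authors' exact definition may differ. The helper names are hypothetical.

```python
def parse_counts(cell):
    """Turn a 'a/b/c' cell from the per-class column into a list of integers."""
    return [int(part) for part in cell.split("/")]


def imbalance_ratio(class_counts):
    """Assumed definition: largest class size divided by smallest class size."""
    return max(class_counts) / min(class_counts)


rows = {
    "wine": "59/71/48",
    "glass": "70/76/17/13/9/29",
    "shuttle": "45586/49/171/8903/3267/10/13",
}
ratios = {name: round(imbalance_ratio(parse_counts(cell)), 2)
          for name, cell in rows.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```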


Table 2 Results of the Holm test with MC-LDWO as the control algorithm

Classifier		C5.0		MLP		NB
Algorithm		$ {\alpha }_{0.05} $	$ p $-value	$ {\alpha }_{0.05} $	$ p $-value	$ {\alpha }_{0.05} $	$ p $-value
AvF	STATIC-SMOTE	0.0064	1.1800e–44	0.0057	2.9003e–71	0.0057	2.9614e–27
	MDO	0.0057	4.6247e–47	0.0064	3.2744e–64	0.0064	7.7082e–26
	MC-RBO	0.0085	2.4986e–31	0.0102	4.0471e–48	0.0073	1.0580e–16
	MC-CCR	0.0102	8.1267e–27	0.0085	1.7582e–53	0.0102	3.9708e–15
	EOS	0.0073	1.5390e–36	0.0073	2.5027e–54	0.0085	5.8226e–16
	CCO	0.0127	2.5865e–18	0.0170	7.6682e–22	0.0170	3.4142e–11
	AdaBoost.AD	0.0253	3.9674e–05	0.05	3.5897e–04	0.05	1.1912e–03
	MC-MBRC	0.0170	7.4773e–17	0.0253	2.0960e–09	0.0127	2.2229e–11
	HCE-MCD	0.05	2.2639e–04	0.0127	4.1612e–23	0.0253	7.2076e–04
AvG	STATIC-SMOTE	0.0057	3.5994e–53	0.0057	4.1636e–65	0.0057	4.4918e–56
	MDO	0.0064	1.3933e–49	0.0064	4.2750e–62	0.0064	1.0076e–55
	MC-RBO	0.0102	5.9281e–27	0.0085	1.0189e–47	0.0085	1.3766e–34
	MC-CCR	0.0085	2.7611e–29	0.0102	9.7086e–41	0.0102	1.2688e–31
	EOS	0.0073	1.0580e–36	0.0073	1.5495e–54	0.0073	7.3381e–48
	CCO	0.0127	5.1465e–15	0.0170	3.9797e–20	0.0170	4.6367e–18
	AdaBoost.AD	0.0253	7.0072e–08	0.0253	5.5773e–07	0.0127	5.7676e–21
	MC-MBRC	0.0170	4.3712e–12	0.0127	1.1056e–26	0.05	8.8066e–06
	HCE-MCD	0.05	6.6996e–07	0.05	7.5201e–07	0.0253	2.5207e–06
AvAUC	STATIC-SMOTE	0.0057	9.1653e–39	0.0057	4.5353e–57	0.0057	2.4406e–41
	MDO	0.0064	1.4965e–36	0.0064	3.1707e–53	0.0064	3.5996e–37
	MC-RBO	0.0085	4.8793e–20	0.0102	2.7833e–37	0.0102	5.3970e–22
	MC-CCR	0.0127	5.0720e–17	0.0085	1.3510e–39	0.0085	1.5324e–22
	EOS	0.0073	4.9240e–25	0.0073	2.58889e–49	0.0073	1.0680e–29
	CCO	0.0102	2.8768e–17	0.0127	5.1594e–19	0.0127	5.7970e–17
	AdaBoost.AD	0.0170	8.3507e–14	0.0170	1.2937e–17	0.05	2.3991e–05
	MC-MBRC	0.0253	9.9428e–06	0.05	4.6246e–03	0.0253	6.4738e–06
	HCE-MCD	0.05	2.0335e–05	0.0253	3.9545e–06	0.0170	4.2003e–14
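The $ {\alpha }_{0.05} $ column takes the values 0.0057, 0.0064, 0.0073, 0.0085, 0.0102, 0.0127, 0.0170, 0.0253, and 0.05. To four decimals these agree with the step-down thresholds $ 1-(1-\alpha )^{1/(m-i+1)} $ for $ m=9 $ comparisons and $ \alpha =0.05 $; whether the authors computed them this way is an assumption. The sketch below applies that step-down rule to the AvF/C5.0 column; the function names are hypothetical.

```python
def stepdown_thresholds(m, alpha=0.05):
    """Thresholds for the i-th smallest p-value, i = 1..m (assumed formula)."""
    return [1.0 - (1.0 - alpha) ** (1.0 / (m - i)) for i in range(m)]


def stepdown_test(p_values, alpha=0.05):
    """Map each algorithm to (threshold, rejected) using a step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    thresholds = stepdown_thresholds(len(ordered), alpha)
    results = {}
    rejecting = True
    for (name, p), threshold in zip(ordered, thresholds):
        # Once one hypothesis is retained, all larger p-values are retained too.
        rejecting = rejecting and p <= threshold
        results[name] = (threshold, rejecting)
    return results


# p-values copied from the AvF rows of the C5.0 column.
avf_c50 = {
    "STATIC-SMOTE": 1.1800e-44,
    "MDO": 4.6247e-47,
    "MC-RBO": 2.4986e-31,
    "MC-CCR": 8.1267e-27,
    "EOS": 1.5390e-36,
    "CCO": 2.5865e-18,
    "AdaBoost.AD": 3.9674e-05,
    "MC-MBRC": 7.4773e-17,
    "HCE-MCD": 2.2639e-04,
}
outcome = stepdown_test(avf_c50)
for name, (threshold, rejected) in outcome.items():
    print(f"{name}: threshold={threshold:.4f}, rejected={rejected}")
```

With these inputs every comparison against the control is rejected, which matches the table, where each p-value lies below its threshold.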

