A Multi-class Local Distribution-based Weighted Oversampling Algorithm for Multi-class Imbalanced Datasets

TAO Xinmin; XU Annan; SHI Lihang; LI Junxuan; GUO Xinyue; ZHANG Yanping

doi:10.11999/JEIT250381

TAO Xinmin, XU Annan, SHI Lihang, LI Junxuan, GUO Xinyue, ZHANG Yanping. A Multi-class Local Distribution-based Weighted Oversampling Algorithm for Multi-class Imbalanced Datasets[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250381

doi: 10.11999/JEIT250381 cstr: 32379.14.JEIT250381

1. College of Civil Engineering and Transportation, University of Northeast Forestry, Harbin 150040, China
2. College of Management and Economics, University of North China Electric Power, Beijing 102206, China
3. Shandong Provincial Key Laboratory of Industrial Big Data and Intelligent Manufacturing, Jinan 25000, China

Funds: The National Natural Science Foundation of China (62176050), The Natural Science Foundation of Shandong Province (ZR2024QA140)

Received Date: 2025-04-29
Rev Recd Date: 2025-09-17

Available Online: 2025-09-23

Abstract

Objective: Classification with imbalanced datasets remains one of the most challenging problems in machine learning. In addition to class imbalance, such datasets often contain complex factors including class overlap, small disjuncts, outliers, and low-density regions, all of which can substantially degrade classifier performance, particularly in multi-class settings. To address these challenges simultaneously, this study proposes the Multi-class Local Distribution-based Weighted Oversampling Algorithm (MC-LDWO).

Methods: The MC-LDWO algorithm first constructs hyperspheres centered on each sample of the dynamically determined minority classes, with radii estimated from the distribution of the corresponding class. Within these hyperspheres, minority class samples are selected for oversampling according to their local distribution, and an adaptive weight allocation strategy is designed using local density metrics. This ensures that samples in low-density regions and near class boundaries are assigned higher probabilities of being oversampled. Next, a low-density vector is computed from the local distribution of both majority and minority classes. A random vector is then combined with the low-density vector, and a cutoff threshold is applied to determine where synthetic samples are generated, thereby reducing class overlap during boundary oversampling. Finally, an improved decomposition strategy tailored for multi-class imbalance is employed to further enhance classification performance in multi-class imbalanced scenarios.

Results and Discussions: The MC-LDWO algorithm dynamically identifies the minority and combined majority class sample sets and constructs hyperspheres centered on each minority class sample, with radii determined by the distribution of the corresponding minority class. These hyperspheres guide the subsequent oversampling process. A trade-off parameter ($\beta$) is introduced to balance the influence of the local densities of the combined majority and minority classes. Experimental results on KEEL datasets show that this approach prevents class overlap during boundary oversampling while assigning higher oversampling weights to critical minority samples located near boundaries and in low-density regions. This improves the boundary distribution and simultaneously addresses within-class imbalance. When the trade-off parameter is set to 0.5, MC-LDWO balances the boundary distribution against the diverse densities that data difficulty factors produce within minority classes, thereby supporting improved performance in downstream classification tasks (Fig. 10).

Conclusions: Comparisons with other state-of-the-art oversampling algorithms show the following. (1) The MC-LDWO algorithm prevents overlap when strengthening decision boundaries by setting the cutoff threshold ($T$), and it adaptively assigns oversampling weights according to two local density indicators, one for the minority class and one for the combined majority class within the hypersphere. This approach addresses within-class imbalance caused by data difficulty factors and enhances the boundary distribution. (2) By jointly considering density and boundary distribution, with the trade-off parameter set to 0.5, the proposed algorithm can simultaneously mitigate within-class imbalance and reinforce the boundary information of minority classes. (3) When applied to highly imbalanced datasets characterized by complex decision boundaries and data difficulty factors such as outliers and small disjuncts, MC-LDWO significantly improves the boundary distribution of each minority class while managing within-class imbalance, thereby enhancing the performance of subsequent classifiers.
- Imbalanced datasets,
- Classification,
- Oversampling,
- Within-class imbalance,
- Hypersphere
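
The adaptive weight allocation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the formulas, so everything concrete below is an assumption made for illustration: the function name `oversampling_weights`, the neighbour-count density estimates, the way $\beta$ mixes the two terms, and the fixed `radius` argument (the paper estimates radii from each class distribution). It is not the authors' implementation.

```python
import numpy as np


def oversampling_weights(X_min, X_maj, radius, beta=0.5):
    """Hypothetical weighting: one probability per minority sample.

    Assumption: local density is approximated by counting neighbours
    inside a hypersphere of the given radius around each minority sample.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)

    # Pairwise Euclidean distances from every minority sample.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    d_maj = np.linalg.norm(X_min[:, None, :] - X_maj[None, :, :], axis=2)

    # Neighbour counts inside each hypersphere (the sample itself is excluded).
    n_min = (d_min <= radius).sum(axis=1) - 1
    n_maj = (d_maj <= radius).sum(axis=1)

    # Assumed indicators: sparsity is high in low-density minority regions,
    # boundary is high when majority samples fall inside the hypersphere.
    sparsity = 1.0 / (1.0 + n_min)
    boundary = n_maj / (1.0 + n_min + n_maj)

    # beta trades off the two indicators (assumed linear mix).
    score = beta * sparsity + (1.0 - beta) * boundary
    total = score.sum()
    if total <= 0:
        # Degenerate case: fall back to uniform weights.
        return np.full(len(score), 1.0 / len(score))
    return score / total


# Toy data: a dense minority cluster plus one isolated sample near the majority.
X_min = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [2.0, 2.0]]
X_maj = [[2.2, 2.0], [2.0, 2.3], [5.0, 5.0]]
w = oversampling_weights(X_min, X_maj, radius=0.5, beta=0.5)

# Draw seed samples for oversampling in proportion to the weights.
rng = np.random.default_rng(0)
seeds = rng.choice(len(X_min), size=10, p=w)
```

In this toy example the isolated sample at `[2.0, 2.0]` receives the largest weight because it lies in a sparse minority region and has majority neighbours inside its hypersphere, which matches the behaviour the abstract describes for boundary and low-density samples.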
