Feature Selection Algorithm for Class Imbalanced Internet Traffic
-
摘要: 针对网络流量分类过程中出现的类不平衡问题,该文提出一种基于加权对称不确定性(WSU)和近似马尔科夫毯(AMB)的特征选择算法。首先,根据类别分布信息,定义了偏向于小类别的特征度量,使得与小类别具有强相关性的特征更容易被选择出来;其次,充分考虑特征与类别间、特征与特征之间的相关性,利用加权对称不确定性和近似马尔科夫毯删除不相关特征及冗余特征;最后,利用基于相关性度量的特征评估函数以及序列搜索算法进一步降低特征维数,确定最优特征子集。实验表明,在保证算法整体分类精确率的前提下,算法能够有效提高小类别的分类性能。Abstract: Class imbalance always exists in the process of network traffic classification. Considering the problem, a new feature selection algorithm using Weighted Symmetric Uncertainty (WSU) and Approximate Markov Blanket (AMB) is proposed. Firstly, a feature metric is defined using category distribution information, which is biased to minority classes. This makes it easier pick out features which have strong correlation with minority classes. Then, considering the correlation between features and categories and between features and features, the weighted symmetry uncertainty and approximate Markov blanket are used to delete the unrelated features and redundant features. Finally, the feature dimension is further reduced to determine the optimal feature subset, by using feature evaluation functions based on correlation measures and sequence search algorithms. The experimental results demonstrate that the algorithm can effectively improve the classification performance of minority classes without sacrificing the accuracy of the overall classification.
-
表 1 基于WSU_AMB的特征选择算法
输入:$D({f_1},{f_2}, ···,{f_N},C)$,WSU阈值δ,$F = \{ {f_1},{f_2}, ···,{f_N}\} $,最优特征子集中特征数目L 输出:最优特征子集${F_{\rm{O}}}$ 第1阶段:确定候选特征集合 (1) FOR ${f_i} \in F$ (2) 计算${\rm{WSU}}({f_i},\;C)$ (3) 将特征按${\rm{WSU}}({f_i},\;C)$值降序排列 (4) IF ${\rm{WSU}}({f_i},\;C) > \delta $ (5) 将特征${f_i}$添加到特征子集${S^*}$中 (6) WHILE ${S^*} \ne \varnothing$ (7) 选择${S^*}$中的第1个特征${f_i}$作为显著特征,将特征${f_i}$加入特征子集$S$,从特征集合${S^*}$中删除特征${f_i}$ (8) 查找以特征${f_i}$为近似马尔科夫毯的特征子集{${f_j}$} (9) 将特征子集{${f_j}$}从${S^*}$中删除 第2阶段:选择最优特征子集 (10) FOR ${f_d} \in S$ (11) 计算$J({f_d})$ (12) IF $J\left( {{f_a}} \right) = \max \left\{ {J\left( {{f_d}} \right)} \right\}$ (13) 将特征${f_a}$加入目标特征子集${F_{\rm{O}}}$,从候选特征集合$S$中删除特征${f_a}$ (14) FOR ${f_x} \in S$ (15) 计算$J({F_{\rm{O}}} \cup {f_a})$ (16) IF $J\left( {{f_1}} \right) = \max \left\{ {J({F_{\rm{O}}} \cup {f_a})} \right\}$ (17) 将特征${f_1}$加入目标特征子集${F_{\rm{O}}}$,从候选特征集合S中删除特征 (18) FOR ${\rm{Length}}({F_{\rm{O}}}) < L$ (19) 重复(14)—(17)行 (20) 输出${F_{\rm{O}}}$ 表 2 Moore数据集的统计信息
类别 应用实例 流量数 百分比(%) WWW www 328,092 86.905 MAIL Imap, pop2/3, smtp 28,567 7.567 FTP-CONTROL ftp-control 3,054 0.809 FTP-PASV ftp-pasv 2,688 0.712 ATTACK Internet worm, virus attacks 1,793 0.475 P2P KaZaA, BitTorrent, GnuTella 2,094 0.555 DATABASE Postgres, sqlnet oracle, ingres 2,648 0.702 FTP-DATA ftp-data 5,797 1.536 MULTIMEDIA Windows media player, Real 576 0.152 SERVICES X11, dns, ident, Idap, ntp 2,099 0.556 INTERACTIVE ssh, klogin, rlogin, telenet 110 0.029 GAMES Half-Life 8 0.002 total 28 377,526 100 表 3 不同特征选择方法所选特征数目
数据集 FCBF CSDT EFOA MFHFS WSU_AMB DataSet1 8 9 9 8 6 DataSet2 7 5 6 7 6 DataSet3 5 10 7 6 6 DataSet4 6 14 5 5 6 DataSet5 7 5 8 7 6 DataSet6 6 9 5 6 6 DataSet7 7 11 6 5 6 DataSet8 8 9 8 7 6 DataSet9 6 14 7 7 6 DataSet10 7 6 7 6 6 表 4 不同特征选择方法的时间复杂度分析
算法 时间复杂度 FCBF $O(MN{\log _2}N)$ CSDT $O({N^2} + {\log _2}L)$ EFOA $O\left({\displaystyle\sum\nolimits_{t=1}^{D}{N}_{t} \cdot {C}_{t} }\right)+MN{\mathrm{log} }_{2}N)$ MFHFS $O(N{\log _2}N{\rm{ + }}{L^3})$ WSU_AMB $O(M{N^2}) + O(N{K^2})$ 表 5 分类时间的比较(ms)
算法 分类时间的均值 FCBF 153.6 CSDT 215.3 EFOA 234.6 MFHFS 146.4 WSU_AMB 120.7 -
XUE Yibo, ZHANG Luoshi, and WANG Dawei. Traffic classification: Issues and challenges[J]. Journal of Communications, 2013, 8(4): 240–248. doi: 10.12720/jcm.8.4.240-248 NGUYEN T T T and ARMITAGE G. A survey of techniques for internet traffic classification using machine learning[J]. IEEE Communications Surveys & Tutorials, 2008, 10(4): 56–76. doi: 10.1109/SURV.2008.080406 DAINOTTI A, PESCAPE A, and CLAFFY K C. Issues and future directions in traffic classification[J]. IEEE Network, 2012, 26(1): 35–40. doi: 10.1109/mnet.2012.6135854 MOORE A W and PAPAGIANNAKI K. Toward the accurate identification of network applications[C]. The 6th International Workshop on Passive and Active Network Measurement, Boston, USA, 2005: 41–54. doi: 10.1007/978-3-540-31966-5_4. 叶春明, 王珍, 陈思, 等. 基于节点行为特征分析的网络流量分类方法[J]. 电子与信息学报, 2014, 36(9): 2158–2165. doi: 10.3724/SP.J.1146.2013.01600YE Chunming, WANG Zhen, CHEN Si, et al. Internet Traffic classification based on hosts behavior analysis[J]. Journal of Electronics &Information Technology, 2014, 36(9): 2158–2165. doi: 10.3724/SP.J.1146.2013.01600 DIAS K L, PONGELUPE M A, CAMINHAS W M, et al. An innovative approach for real-time network traffic classification[J]. Computer Networks, 2019, 158: 143–157. doi: 10.1016/j.comnet.2019.04.004 鲁刚, 张宏莉, 叶麟. P2P流量识别[J]. 软件学报, 2011, 22(6): 1281–1298. doi: 10.3724/SP.J.1001.2011.03995LU Gang, ZHANG Hongli, and YE Lin. P2P traffic identification[J]. Journal of Software, 2011, 22(6): 1281–1298. doi: 10.3724/SP.J.1001.2011.03995 MOORE A W and ZUZV D. Internet traffic classification using Bayesian analysis techniques[J]. ACM SIGMETRICS Performance Evaluation Review, 2005, 33(1): 50–60. doi: 10.1145/1071690.1064220 DAI Lei, YUN Xiaochun, and XIAO Jun. Optimizing traffic classification using hybrid feature selection[C]. The 9th International Conference on Web-Age Information Management, Zhangjiajie, China, 2008: 520–525. doi: 10.1109/WAIM.2008.30. XU Huali, YU Shuhao, CHEN Jiajun, et al. An improved firefly algorithm for feature selection in classification[J]. Wireless Personal Communications, 2018, 102(4): 2823–2834. doi: 10.1007/s11277-018-5309-1 张震, 汪斌强, 陈鸿昶, 等. 互联网中基于用户连接图的流量分类机制[J]. 电子与信息学报, 2013, 35(4): 958–964. doi: 10.3724/SP.J.1146.2012.01040ZHANG Zhen, WANG Binqiang, CHEN Hongchang, et al. Internet traffic classification based on host connection graph[J]. Journal of Electronics &Information Technology, 2013, 35(4): 958–964. doi: 10.3724/SP.J.1146.2012.01040 SHAFIQ M, YU Xiangzhan, BASHIR A K, et al. A machine learning approach for feature selection traffic classification using security analysis[J]. The Journal of Supercomputing, 2018, 74(10): 4867–4892. doi: 10.1007/s11227-018-2263-3 SHI Hongtao, LI Hongping, ZHANG Dan, et al. An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification[J]. Computer Networks, 2018, 132: 81–89. doi: 10.1016/j.comnet.2018.01.007 WANG Youwei and FENG Lizhou. A new hybrid feature selection based on multi-filter weights and multi-feature weights[J]. Applied Intelligence, 2019, 49(12): 4033–4057. doi: 10.1007/s10489-019-01470-z 王勇, 周慧怡, 俸皓, 等. 基于深度卷积神经网络的网络流量分类方法[J]. 通信学报, 2018, 39(1): 14–23. doi: 10.11959/j.issn.1000-436x.2018018WANG Yong, ZHOU Huiyi, FENG Hao, et al. Network traffic classification method basing on CNN[J]. Journal on Communications, 2018, 39(1): 14–23. doi: 10.11959/j.issn.1000-436x.2018018 REN Xinming, GU Huaxi, and WEI Wenting. Tree-RNN: Tree structural recurrent neural network for network traffic classification[J]. Expert Systems with Applications, 2021, 167: 114363. doi: 10.1016/j.eswa.2020.114363 LIN S Z, SHI Yong, and XUE Zhi. Character-level intrusion detection based on convolutional neural networks[C]. 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 2018: 1–8. doi: 10.1109/IJCNN.2018.8488987. 夏栋梁, 刘玉坤, 鲁书喜. 基于蚁群算法和改进SSO的混合网络入侵检测方法[J]. 重庆邮电大学学报: 自然科学版, 2016, 28(3): 406–413. doi: 10.3979/j.issn.1673-825X.2016.03.021XIA Dongliang, LIU Yukun, and LU Shuxi. Hybrid network intrusion detection method based on ant colony algorithm and improved simplified swarm optimization[J]. Journal of Chongqing University of Posts and Telecommunications:Natural Science Edition, 2016, 28(3): 406–413. doi: 10.3979/j.issn.1673-825X.2016.03.021 LOPEZ-MARTIN M, CARRO B, SANCHEZ-ESGUEVILLAS A, et al. Shallow neural network with kernel approximation for prediction problems in highly demanding data networks[J]. Expert Systems with Applications, 2019, 124: 196–208. doi: 10.1016/j.eswa.2019.01.063 DASH M and LIU Huan. Consistency-based search in feature selection[J]. Artificial Intelligence, 2003, 151(1/2): 155–176. doi: 10.1016/s0004-3702(03)00079-1 ZHANG Hongli, LU Gang, QASSRAWI M T, et al. Feature selection for optimizing traffic classification[J]. Computer Communications, 2012, 35(12): 1457–1471. doi: 10.1016/j.comcom.2012.04.012 崔自峰, 徐宝文, 张卫丰, 等. 一种近似Markov Blanket最优特征选择算法[J]. 计算机学报, 2007, 30(12): 2074–2081. doi: 10.3321/j.issn:0254-4164.2007.12.002CUI Zifeng, XU Baowen, ZHANG Weifeng, et al. An approximate markov blanket feature selection algorithm[J]. Chinese Journal of Computers, 2007, 30(12): 2074–2081. doi: 10.3321/j.issn:0254-4164.2007.12.002 MOORE A W. Dataset[EB/OL]. https://www.cl.cam.ac.uk/research/srg/netos/nprobe/data/papers/sigmetrics/index.html, 2005.