面向类不平衡网络流量的特征选择算法

唐宏; 刘丹; 姚立霜; 王云锋; 裴作飞

doi:10.11999/JEIT190992

面向类不平衡网络流量的特征选择算法

doi: 10.11999/JEIT190992 cstr: 32379.14.JEIT190992

唐宏^{1, 2},
刘丹^{1, 2, ,},
姚立霜^{1, 2},
王云锋¹,
裴作飞^{1, 2}

1.
重庆邮电大学通信与信息工程学院重庆 400065
2.
移动通信技术重庆市重点实验室重庆 400065

基金项目: 长江学者和创新团队发展计划(IRT_16R72)

详细信息

作者简介:
唐宏：男，1967年生，教授，研究方向为计算机网络、移动通信

刘丹：女，1995年生，硕士生，研究方向为网络管理、机器学习

姚立霜：女，1993年生，硕士生，研究方向为网络管理、机器学习

王云锋：男，1992年生，硕士生，研究方向为机器学习、数据挖掘

裴作飞：男，1994年生，硕士生，研究方向为机器学习、数据挖掘

通讯作者:
刘丹　s170101113@stu.cqupt.edu.cn

中图分类号: TP393
计量
- 文章访问数: 1374
- HTML全文浏览量: 685
- PDF下载量: 85
- 被引次数: 0
出版历程
- 收稿日期: 2019-12-11
- 修回日期: 2021-02-22
- 网络出版日期: 2021-03-04
- 刊出日期: 2021-04-20

Feature Selection Algorithm for Class Imbalanced Internet Traffic

Hong TANG^{1, 2},
Dan LIU^{1, 2
, ,},
LiShuang YAO^{1, 2},
Yunfeng WANG¹,
Zuofei PEI^{1, 2}

1.
School of Communication and Information Engineering, Chongqing University of Posts and Communications, Chongqing 400065, China
2.
Key Laboratory of Mobile Communications Technology, Chongqing University of Posts and Communications, Chongqing 400065, China

Funds: Changjiang Scholars and Innovative Research Team in University (IRT_16R72)

摘要

摘要: 针对网络流量分类过程中出现的类不平衡问题，该文提出一种基于加权对称不确定性(WSU)和近似马尔科夫毯(AMB)的特征选择算法。首先，根据类别分布信息，定义了偏向于小类别的特征度量，使得与小类别具有强相关性的特征更容易被选择出来；其次，充分考虑特征与类别间、特征与特征之间的相关性，利用加权对称不确定性和近似马尔科夫毯删除不相关特征及冗余特征；最后，利用基于相关性度量的特征评估函数以及序列搜索算法进一步降低特征维数，确定最优特征子集。实验表明，在保证算法整体分类精确率的前提下，算法能够有效提高小类别的分类性能。
- 流量分类 /
- 特征选择 /
- 类不平衡 /
- 加权对称不确定性 /
- 近似马尔科夫毯
Abstract: Class imbalance always exists in the process of network traffic classification. Considering the problem, a new feature selection algorithm using Weighted Symmetric Uncertainty (WSU) and Approximate Markov Blanket (AMB) is proposed. Firstly, a feature metric is defined using category distribution information, which is biased to minority classes. This makes it easier pick out features which have strong correlation with minority classes. Then, considering the correlation between features and categories and between features and features, the weighted symmetry uncertainty and approximate Markov blanket are used to delete the unrelated features and redundant features. Finally, the feature dimension is further reduced to determine the optimal feature subset, by using feature evaluation functions based on correlation measures and sequence search algorithms. The experimental results demonstrate that the algorithm can effectively improve the classification performance of minority classes without sacrificing the accuracy of the overall classification.
- Traffic classification /
- Feature selection /
- Class imbalance /
- Weighted Symmetric Uncertainty (WSU) /
- Approximate Markov Blanket (AMB)

HTML全文

图 1 特征选择流程图

下载: 全尺寸图片幻灯片

图 2 WSU_AMB特征选择算法的总体框架

下载: 全尺寸图片幻灯片

图 3 特征子集数目L对算法的影响

下载: 全尺寸图片幻灯片

图 4 阈值δ对算法的影响

下载: 全尺寸图片幻灯片

图 5 不同算法在各数据集上的特征选择时间对比

下载: 全尺寸图片幻灯片

图 6 不同数据子集下各特征选择算法在4种分类器上的整体精确率对比

下载: 全尺寸图片幻灯片

图 7 各特征选择算法的小类准确率对比

下载: 全尺寸图片幻灯片

图 8 各特征选择算法的小类召回率对比

下载: 全尺寸图片幻灯片

图 9 各特征选择算法的小类F1值对比

下载: 全尺寸图片幻灯片

表 1 基于WSU_AMB的特征选择算法

输入：$D({f_1},{f_2}, ···,{f_N},C)$，WSU阈值δ，$F = \{ {f_1},{f_2}, ···,{f_N}\} $，最优特征子集中特征数目L
输出：最优特征子集${F_{\rm{O}}}$
第1阶段：确定候选特征集合
(1) FOR ${f_i} \in F$
(2)　　计算${\rm{WSU}}({f_i},\;C)$
(3)　　将特征按${\rm{WSU}}({f_i},\;C)$值降序排列
(4)　　IF ${\rm{WSU}}({f_i},\;C) > \delta $
(5)　　　　将特征${f_i}$添加到特征子集${S^*}$中
(6)　　WHILE ${S^*} \ne \varnothing$
(7)　　　　选择${S^}$中的第1个特征${f_i}$作为显著特征，将特征${f_i}$加入特征子集$S$，从特征集合${S^}$中删除特征${f_i}$
(8)　　　　查找以特征${f_i}$为近似马尔科夫毯的特征子集{${f_j}$}
(9)　　　将特征子集{${f_j}$}从${S^*}$中删除
第2阶段：选择最优特征子集
(10) FOR ${f_d} \in S$
(11)　　计算$J({f_d})$
(12)　　IF $J\left( {{f_a}} \right) = \max \left\{ {J\left( {{f_d}} \right)} \right\}$
(13)　　　　将特征${f_a}$加入目标特征子集${F_{\rm{O}}}$，从候选特征集合$S$中删除特征${f_a}$
(14) FOR ${f_x} \in S$
(15)　　计算$J({F_{\rm{O}}} \cup {f_a})$
(16)　　IF $J\left( {{f_1}} \right) = \max \left\{ {J({F_{\rm{O}}} \cup {f_a})} \right\}$
(17)　　　　将特征${f_1}$加入目标特征子集${F_{\rm{O}}}$，从候选特征集合S中删除特征
(18) FOR ${\rm{Length}}({F_{\rm{O}}}) < L$
(19)　　重复(14)—(17)行
(20) 输出${F_{\rm{O}}}$

下载: 导出CSV

表 2 Moore数据集的统计信息

类别	应用实例	流量数	百分比(%)
WWW	www	328,092	86.905
MAIL	Imap, pop2/3, smtp	28,567	7.567
FTP-CONTROL	ftp-control	3,054	0.809
FTP-PASV	ftp-pasv	2,688	0.712
ATTACK	Internet worm, virus attacks	1,793	0.475
P2P	KaZaA, BitTorrent, GnuTella	2,094	0.555
DATABASE	Postgres, sqlnet oracle, ingres	2,648	0.702
FTP-DATA	ftp-data	5,797	1.536
MULTIMEDIA	Windows media player, Real	576	0.152
SERVICES	X11, dns, ident, Idap, ntp	2,099	0.556
INTERACTIVE	ssh, klogin, rlogin, telenet	110	0.029
GAMES	Half-Life	8	0.002
total	28	377,526	100

下载: 导出CSV

表 3 不同特征选择方法所选特征数目

数据集	FCBF	CSDT	EFOA	MFHFS	WSU_AMB
DataSet1	8	9	9	8	6
DataSet2	7	5	6	7	6
DataSet3	5	10	7	6	6
DataSet4	6	14	5	5	6
DataSet5	7	5	8	7	6
DataSet6	6	9	5	6	6
DataSet7	7	11	6	5	6
DataSet8	8	9	8	7	6
DataSet9	6	14	7	7	6
DataSet10	7	6	7	6	6

下载: 导出CSV

表 4 不同特征选择方法的时间复杂度分析

算法	时间复杂度
FCBF	$O(MN{\log _2}N)$
CSDT	$O({N^2} + {\log _2}L)$
EFOA	$O\left({\displaystyle\sum\nolimits_{t=1}^{D}{N}_{t} \cdot {C}_{t} }\right)+MN{\mathrm{log} }_{2}N)$
MFHFS	$O(N{\log _2}N{\rm{ + }}{L^3})$
WSU_AMB	$O(M{N^2}) + O(N{K^2})$

下载: 导出CSV

表 5 分类时间的比较(ms)

算法	分类时间的均值
FCBF	153.6
CSDT	215.3
EFOA	234.6
MFHFS	146.4
WSU_AMB	120.7

下载: 导出CSV

参考文献(23)

XUE Yibo, ZHANG Luoshi, and WANG Dawei. Traffic classification: Issues and challenges[J]. Journal of Communications, 2013, 8(4): 240–248. doi: 10.12720/jcm.8.4.240-248

NGUYEN T T T and ARMITAGE G. A survey of techniques for internet traffic classification using machine learning[J]. IEEE Communications Surveys & Tutorials, 2008, 10(4): 56–76. doi: 10.1109/SURV.2008.080406

DAINOTTI A, PESCAPE A, and CLAFFY K C. Issues and future directions in traffic classification[J]. IEEE Network, 2012, 26(1): 35–40. doi: 10.1109/mnet.2012.6135854

MOORE A W and PAPAGIANNAKI K. Toward the accurate identification of network applications[C]. The 6th International Workshop on Passive and Active Network Measurement, Boston, USA, 2005: 41–54. doi: 10.1007/978-3-540-31966-5_4.

叶春明, 王珍, 陈思, 等. 基于节点行为特征分析的网络流量分类方法[J]. 电子与信息学报, 2014, 36(9): 2158–2165. doi: 10.3724/SP.J.1146.2013.01600

YE Chunming, WANG Zhen, CHEN Si, et al. Internet Traffic classification based on hosts behavior analysis[J]. Journal of Electronics &Information Technology, 2014, 36(9): 2158–2165. doi: 10.3724/SP.J.1146.2013.01600

DIAS K L, PONGELUPE M A, CAMINHAS W M, et al. An innovative approach for real-time network traffic classification[J]. Computer Networks, 2019, 158: 143–157. doi: 10.1016/j.comnet.2019.04.004

鲁刚, 张宏莉, 叶麟. P2P流量识别[J]. 软件学报, 2011, 22(6): 1281–1298. doi: 10.3724/SP.J.1001.2011.03995

LU Gang, ZHANG Hongli, and YE Lin. P2P traffic identification[J]. Journal of Software, 2011, 22(6): 1281–1298. doi: 10.3724/SP.J.1001.2011.03995

MOORE A W and ZUZV D. Internet traffic classification using Bayesian analysis techniques[J]. ACM SIGMETRICS Performance Evaluation Review, 2005, 33(1): 50–60. doi: 10.1145/1071690.1064220

DAI Lei, YUN Xiaochun, and XIAO Jun. Optimizing traffic classification using hybrid feature selection[C]. The 9th International Conference on Web-Age Information Management, Zhangjiajie, China, 2008: 520–525. doi: 10.1109/WAIM.2008.30.

XU Huali, YU Shuhao, CHEN Jiajun, et al. An improved firefly algorithm for feature selection in classification[J]. Wireless Personal Communications, 2018, 102(4): 2823–2834. doi: 10.1007/s11277-018-5309-1

张震, 汪斌强, 陈鸿昶, 等. 互联网中基于用户连接图的流量分类机制[J]. 电子与信息学报, 2013, 35(4): 958–964. doi: 10.3724/SP.J.1146.2012.01040

ZHANG Zhen, WANG Binqiang, CHEN Hongchang, et al. Internet traffic classification based on host connection graph[J]. Journal of Electronics &Information Technology, 2013, 35(4): 958–964. doi: 10.3724/SP.J.1146.2012.01040

SHAFIQ M, YU Xiangzhan, BASHIR A K, et al. A machine learning approach for feature selection traffic classification using security analysis[J]. The Journal of Supercomputing, 2018, 74(10): 4867–4892. doi: 10.1007/s11227-018-2263-3

SHI Hongtao, LI Hongping, ZHANG Dan, et al. An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification[J]. Computer Networks, 2018, 132: 81–89. doi: 10.1016/j.comnet.2018.01.007

WANG Youwei and FENG Lizhou. A new hybrid feature selection based on multi-filter weights and multi-feature weights[J]. Applied Intelligence, 2019, 49(12): 4033–4057. doi: 10.1007/s10489-019-01470-z

王勇, 周慧怡, 俸皓, 等. 基于深度卷积神经网络的网络流量分类方法[J]. 通信学报, 2018, 39(1): 14–23. doi: 10.11959/j.issn.1000-436x.2018018

WANG Yong, ZHOU Huiyi, FENG Hao, et al. Network traffic classification method basing on CNN[J]. Journal on Communications, 2018, 39(1): 14–23. doi: 10.11959/j.issn.1000-436x.2018018

REN Xinming, GU Huaxi, and WEI Wenting. Tree-RNN: Tree structural recurrent neural network for network traffic classification[J]. Expert Systems with Applications, 2021, 167: 114363. doi: 10.1016/j.eswa.2020.114363

LIN S Z, SHI Yong, and XUE Zhi. Character-level intrusion detection based on convolutional neural networks[C]. 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 2018: 1–8. doi: 10.1109/IJCNN.2018.8488987.

夏栋梁, 刘玉坤, 鲁书喜. 基于蚁群算法和改进SSO的混合网络入侵检测方法[J]. 重庆邮电大学学报: 自然科学版, 2016, 28(3): 406–413. doi: 10.3979/j.issn.1673-825X.2016.03.021

XIA Dongliang, LIU Yukun, and LU Shuxi. Hybrid network intrusion detection method based on ant colony algorithm and improved simplified swarm optimization[J]. Journal of Chongqing University of Posts and Telecommunications:Natural Science Edition, 2016, 28(3): 406–413. doi: 10.3979/j.issn.1673-825X.2016.03.021

LOPEZ-MARTIN M, CARRO B, SANCHEZ-ESGUEVILLAS A, et al. Shallow neural network with kernel approximation for prediction problems in highly demanding data networks[J]. Expert Systems with Applications, 2019, 124: 196–208. doi: 10.1016/j.eswa.2019.01.063

DASH M and LIU Huan. Consistency-based search in feature selection[J]. Artificial Intelligence, 2003, 151(1/2): 155–176. doi: 10.1016/s0004-3702(03)00079-1

ZHANG Hongli, LU Gang, QASSRAWI M T, et al. Feature selection for optimizing traffic classification[J]. Computer Communications, 2012, 35(12): 1457–1471. doi: 10.1016/j.comcom.2012.04.012

崔自峰, 徐宝文, 张卫丰, 等. 一种近似Markov Blanket最优特征选择算法[J]. 计算机学报, 2007, 30(12): 2074–2081. doi: 10.3321/j.issn:0254-4164.2007.12.002

CUI Zifeng, XU Baowen, ZHANG Weifeng, et al. An approximate markov blanket feature selection algorithm[J]. Chinese Journal of Computers, 2007, 30(12): 2074–2081. doi: 10.3321/j.issn:0254-4164.2007.12.002

MOORE A W. Dataset[EB/OL]. https://www.cl.cam.ac.uk/research/srg/netos/nprobe/data/papers/sigmetrics/index.html, 2005.

施引文献

资源附件(0)

访问统计