Feature Selection Algorithm for Class Imbalanced Internet Traffic

Hong TANG, Dan LIU, LiShuang YAO, Yunfeng WANG, Zuofei PEI

Citation: Hong TANG, Dan LIU, LiShuang YAO, Yunfeng WANG, Zuofei PEI. Feature Selection Algorithm for Class Imbalanced Internet Traffic[J]. Journal of Electronics & Information Technology, 2021, 43(4): 923-930. doi: 10.11999/JEIT190992

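doi: 10.11999/JEIT190992
Funds: Changjiang Scholars and Innovative Research Team in University (IRT_16R72)
Details
    Author biographies:

    Hong TANG: male, born in 1967, professor; research interests: computer networks and mobile communications

    Dan LIU: female, born in 1995, master's student; research interests: network management and machine learning

    LiShuang YAO: female, born in 1993, master's student; research interests: network management and machine learning

    Yunfeng WANG: male, born in 1992, master's student; research interests: machine learning and data mining

    Zuofei PEI: male, born in 1994, master's student; research interests: machine learning and data mining

    Corresponding author:

    Dan LIU s170101113@stu.cqupt.edu.cn

  • CLC number: TP393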
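  • Abstract: To address the class imbalance problem that arises in network traffic classification, this paper proposes a feature selection algorithm based on Weighted Symmetric Uncertainty (WSU) and the Approximate Markov Blanket (AMB). First, using the class distribution information, a feature measure biased toward small classes is defined, so that features strongly correlated with small classes are more likely to be selected. Second, taking into account both feature-class and feature-feature correlations, WSU and AMB are used to remove irrelevant and redundant features. Finally, a correlation-based feature evaluation function and a sequential search algorithm further reduce the feature dimensionality and determine the optimal feature subset. Experiments show that the algorithm effectively improves classification performance on small classes while maintaining the overall classification accuracy.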
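The abstract names symmetric uncertainty as the base measure but does not give the weighted form. The following is a minimal sketch of the standard, unweighted measure SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)) for discrete variables, with optional per-sample weights. The helper `inverse_frequency_weights` is a hypothetical weighting chosen only to illustrate a bias toward small classes; it is an assumption and should not be read as the paper's WSU definition.

```python
import math
from collections import Counter


def entropy(weights_by_value):
    """Shannon entropy in bits of a distribution given as {value: weight}."""
    total = sum(weights_by_value.values())
    return -sum((w / total) * math.log2(w / total)
                for w in weights_by_value.values() if w > 0)


def symmetric_uncertainty(x, y, sample_weights=None):
    """Standard SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete sequences."""
    if sample_weights is None:
        sample_weights = [1.0] * len(x)
    px, py, pxy = Counter(), Counter(), Counter()
    for xi, yi, w in zip(x, y, sample_weights):
        px[xi] += w
        py[yi] += w
        pxy[(xi, yi)] += w
    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    # Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def inverse_frequency_weights(labels):
    """Hypothetical weighting (assumption): each sample gets 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]
```

With these weights every class contributes the same total mass to the entropy estimates, so a feature that separates a rare class is not swamped by the dominant class. Passing `sample_weights=inverse_frequency_weights(labels)` to `symmetric_uncertainty` gives the weighted variant of this sketch.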
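  • Figure 1  Feature selection flowchart

    Figure 2  Overall framework of the WSU_AMB feature selection algorithm

    Figure 3  Effect of the feature subset size L on the algorithm

    Figure 4  Effect of the threshold δ on the algorithm

    Figure 5  Feature selection time of the different algorithms on each dataset

    Figure 6  Overall accuracy of each feature selection algorithm with four classifiers on different data subsets

    Figure 7  Small-class precision of each feature selection algorithm

    Figure 8  Small-class recall of each feature selection algorithm

    Figure 9  Small-class F1 score of each feature selection algorithm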

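    Table 1  Feature selection algorithm based on WSU_AMB

     Input: $D({f_1},{f_2}, \cdots ,{f_N},C)$, WSU threshold $\delta$, $F = \{ {f_1},{f_2}, \cdots ,{f_N}\} $, number of features $L$ in the optimal feature subset
     Output: optimal feature subset ${F_{\rm{O}}}$
     Stage 1: determine the candidate feature set
     (1) FOR ${f_i} \in F$
     (2)  Compute ${\rm{WSU}}({f_i},\;C)$
     (3)  Sort the features in descending order of ${\rm{WSU}}({f_i},\;C)$
     (4)  IF ${\rm{WSU}}({f_i},\;C) > \delta $
     (5)    Add feature ${f_i}$ to the feature subset ${S^*}$
     (6)  WHILE ${S^*} \ne \varnothing$
     (7)    Take the first feature ${f_i}$ in ${S^*}$ as the salient feature, add ${f_i}$ to the feature subset $S$, and remove ${f_i}$ from ${S^*}$
     (8)    Find the feature subset $\{ {f_j}\} $ for which ${f_i}$ is an approximate Markov blanket
     (9)    Remove the feature subset $\{ {f_j}\} $ from ${S^*}$
     Stage 2: select the optimal feature subset
     (10) FOR ${f_d} \in S$
     (11)  Compute $J({f_d})$
     (12)  IF $J\left( {{f_a}} \right) = \max \left\{ {J\left( {{f_d}} \right)} \right\}$
     (13)    Add feature ${f_a}$ to the target feature subset ${F_{\rm{O}}}$ and remove ${f_a}$ from the candidate feature set $S$
     (14) FOR ${f_x} \in S$
     (15)  Compute $J({F_{\rm{O}}} \cup \{ {f_x}\} )$
     (16)  IF $J({F_{\rm{O}}} \cup \{ {f_1}\} ) = \max \left\{ {J({F_{\rm{O}}} \cup \{ {f_x}\} )} \right\}$
     (17)    Add feature ${f_1}$ to the target feature subset ${F_{\rm{O}}}$ and remove ${f_1}$ from the candidate feature set $S$
     (18) WHILE ${\rm{Length}}({F_{\rm{O}}}) < L$
     (19)  Repeat lines (14) to (17)
     (20) Output ${F_{\rm{O}}}$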
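The two stages of Table 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The relevance values and the pairwise measure are supplied by the caller. The approximate Markov blanket test follows the FCBF convention (a higher-ranked feature f_i blankets f_j when their pairwise measure is at least the relevance of f_j), which is assumed here because the page does not state the condition. The evaluation function J is not defined on this page, so `score` is a caller-supplied placeholder, and all names and toy values are hypothetical.

```python
def select_candidates(relevance, pairwise, delta):
    """Stage 1: threshold on relevance, then remove redundant features.

    relevance: {feature: measure between the feature and the class}
    pairwise:  function (f_i, f_j) -> measure between two features
    """
    # Steps (1)-(5): keep features above the threshold, sorted by relevance.
    s_star = sorted((f for f in relevance if relevance[f] > delta),
                    key=lambda f: relevance[f], reverse=True)
    selected = []
    # Steps (6)-(9): take the top feature, then drop every feature it blankets.
    while s_star:
        f_i = s_star.pop(0)
        selected.append(f_i)
        # Assumed blanket test: pairwise(f_i, f_j) >= relevance[f_j].
        s_star = [f_j for f_j in s_star
                  if pairwise(f_i, f_j) < relevance[f_j]]
    return selected


def forward_search(candidates, score, size):
    """Stage 2: greedy sequential forward search up to `size` features.

    score: function (list of features) -> J value, higher is better.
    """
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < size:
        best = max(remaining, key=lambda f: score(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy values, invented for illustration only.
toy_relevance = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.05}
toy_pairs = {frozenset("ab"): 0.85, frozenset("ac"): 0.1, frozenset("bc"): 0.2}


def toy_pairwise(f_i, f_j):
    return toy_pairs.get(frozenset((f_i, f_j)), 0.0)


def toy_score(subset):
    # Placeholder for J: sum of relevances of the subset.
    return sum(toy_relevance[f] for f in subset)


candidates = select_candidates(toy_relevance, toy_pairwise, delta=0.1)
print(candidates)
print(forward_search(candidates, toy_score, size=1))
```

In the toy run, feature "d" falls below the threshold and "b" is dropped as redundant with "a", so stage 2 searches only over the surviving candidates.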

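    Table 2  Statistics of the Moore dataset

    | Class | Example applications | Number of flows | Percentage (%) |
    |---|---|---|---|
    | WWW | www | 328,092 | 86.905 |
    | MAIL | imap, pop2/3, smtp | 28,567 | 7.567 |
    | FTP-CONTROL | ftp-control | 3,054 | 0.809 |
    | FTP-PASV | ftp-pasv | 2,688 | 0.712 |
    | ATTACK | Internet worm, virus attacks | 1,793 | 0.475 |
    | P2P | KaZaA, BitTorrent, GnuTella | 2,094 | 0.555 |
    | DATABASE | Postgres, sqlnet oracle, ingres | 2,648 | 0.702 |
    | FTP-DATA | ftp-data | 5,797 | 1.536 |
    | MULTIMEDIA | Windows media player, Real | 576 | 0.152 |
    | SERVICES | X11, dns, ident, ldap, ntp | 2,099 | 0.556 |
    | INTERACTIVE | ssh, klogin, rlogin, telnet | 110 | 0.029 |
    | GAMES | Half-Life | 8 | 0.002 |
    | Total | 28 | 377,526 | 100 |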
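The degree of class imbalance in Table 2 can be checked with a few lines of arithmetic. The sketch below recomputes the total and the per-class shares from the flow counts listed in the table and reports the ratio between the largest and smallest class. The variable names are illustrative.

```python
# Flow counts copied from Table 2.
flows = {
    "WWW": 328092, "MAIL": 28567, "FTP-CONTROL": 3054, "FTP-PASV": 2688,
    "ATTACK": 1793, "P2P": 2094, "DATABASE": 2648, "FTP-DATA": 5797,
    "MULTIMEDIA": 576, "SERVICES": 2099, "INTERACTIVE": 110, "GAMES": 8,
}

total = sum(flows.values())
# Share of each class as a percentage of all flows.
share = {name: 100.0 * n / total for name, n in flows.items()}

largest = max(flows, key=flows.get)
smallest = min(flows, key=flows.get)
imbalance_ratio = flows[largest] / flows[smallest]

for name in sorted(flows, key=flows.get, reverse=True):
    print(f"{name:<12} {flows[name]:>8,} {share[name]:8.3f}%")
print(f"total {total:,}; {largest} to {smallest} ratio = {imbalance_ratio:.1f}")
```

The counts sum to the 377,526 flows given in the table, and the largest class outnumbers the smallest by more than four orders of magnitude, which is the imbalance the feature measure is designed to counteract.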

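    Table 3  Number of features selected by each feature selection method

    | Dataset | FCBF | CSDT | EFOA | MFHFS | WSU_AMB |
    |---|---|---|---|---|---|
    | DataSet1 | 8 | 9 | 9 | 8 | 6 |
    | DataSet2 | 7 | 5 | 6 | 7 | 6 |
    | DataSet3 | 5 | 10 | 7 | 6 | 6 |
    | DataSet4 | 6 | 14 | 5 | 5 | 6 |
    | DataSet5 | 7 | 5 | 8 | 7 | 6 |
    | DataSet6 | 6 | 9 | 5 | 6 | 6 |
    | DataSet7 | 7 | 11 | 6 | 5 | 6 |
    | DataSet8 | 8 | 9 | 8 | 7 | 6 |
    | DataSet9 | 6 | 14 | 7 | 7 | 6 |
    | DataSet10 | 7 | 6 | 7 | 6 | 6 |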

    Table 4  Time complexity of the feature selection methods

    | Algorithm | Time complexity |
    |---|---|
    | FCBF | $O(MN{\log _2}N)$ |
    | CSDT | $O({N^2} + {\log _2}L)$ |
    | EFOA | $O\left( \sum\nolimits_{t=1}^{D} {N_t} \cdot {C_t} + MN{\log _2}N \right)$ |
    | MFHFS | $O(N{\log _2}N + {L^3})$ |
    | WSU_AMB | $O(M{N^2}) + O(N{K^2})$ |

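    Table 5  Comparison of classification time (ms)

    | Algorithm | Mean classification time (ms) |
    |---|---|
    | FCBF | 153.6 |
    | CSDT | 215.3 |
    | EFOA | 234.6 |
    | MFHFS | 146.4 |
    | WSU_AMB | 120.7 |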
Publication history
  • Received:  2019-12-11
  • Revised:  2021-02-22
  • Published online:  2021-03-04
  • Issue date:  2021-04-20
