A Classifier Learning Method Based on Tree-Augmented Naïve Bayes

Xi CHEN, Kun ZHANG

Citation: Xi CHEN, Kun ZHANG. A Classifier Learning Method Based on Tree-Augmented Naïve Bayes[J]. Journal of Electronics & Information Technology, 2019, 41(8): 2001-2008. doi: 10.11999/JEIT180886

doi: 10.11999/JEIT180886

Funds: The National Natural Science Foundation of China (61772087)

Author information:

    Xi CHEN: male, born in 1963, professor and master's supervisor; research interest: data mining

    Kun ZHANG: male, born in 1993, master's student; research interest: data mining

Corresponding author: Kun ZHANG, zonkis2016@outlook.com

  • CLC number: TP311.1
  • Abstract: The Tree-Augmented Naïve Bayes (TAN) structure forces every attribute node to take the class node and exactly one other attribute node as parents, and it ignores differences in how strongly individual attributes correlate with the class, which hurts classification accuracy. To improve the accuracy of TAN, this paper first extends the TAN structure so that an attribute node may have no attribute parent or at most one. It then proposes a learning method that builds a tree-shaped Bayesian classification model with a decomposable scoring function: low-order Conditional Independence (CI) tests are first used to prune uninformative attributes, and a greedy search guided by an improved Bayesian Information Criterion (BIC) score then selects the parent of each attribute node, yielding the classification model. Compared with Naïve Bayes (NB) and TAN, the resulting classifier performs better on several classification metrics, demonstrating the advantage of the method.
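The attribute pruning described above ranks each attribute by its mutual information with the class. A minimal sketch of that ranking step for discrete attributes (helper name `mutual_information` is illustrative, not the authors' code):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) for two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * log(p_joint / p_indep)
    return mi

# Attributes whose I(C, X_i) falls below a threshold epsilon are pruned
# before the structure search begins.
```

Attributes independent of the class score near zero and are discarded; a perfectly predictive binary attribute scores log 2 on balanced data.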
  • Figure 1  Structure diagram

    Figure 2  Classification accuracy under different thresholds $\varepsilon $

    Figure 3  Comparison of AUC polar plots on the multi-class datasets

    Figure 4  Comparison of average classification accuracy

    Figure 5  ROC curves on the binary-class datasets

    Figure 6  Comparison of average classification accuracy

    Figure 7  SETAN structure diagram

    Table 1  Algorithm description

     Input: variable set $V$, sample dataset $D$
     Output: SETAN structure
     Step 1  For each $X_i \in V$: $I(C,X_i) = {\rm{Calc\_MI}}(C,X_i)$  # compute the mutual information between each attribute and the class
     Step 2  Store each $I(C,X_i)$ in a list, sorted in descending order
     Step 3  For each $I(C,X_i) > \varepsilon $ in the list
       $S_1 = S_1 \cup \{ X_i\} $
       Add edge $C - X_i$ to graph $E$  # add an edge from class $C$ to attribute $X_i$ if none exists
       $S_2 = S_2 \cup \{ X_j\} $, $X_j \in \{ I(C,X_j) < \varepsilon \} $  # nodes whose mutual information falls below the threshold $\varepsilon $ are added to set $S_2$
       Add edge $X_i - X_j$ to graph $E$
       Remove $I(C,X_i)$ from the list
       End for
     Step 4  For each $E' \in E$
       ${\rm{Score}}(E') = {\rm{Calc\_BIC}}(E')$  # compute the improved BIC score
       K2-style search for the optimal SETAN structure  # search with the scoring function
       End for
     Step 5  Return $G = (V',E')$ with the best BIC score
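The score used in Step 4 is decomposable, so each node's family (the node plus its parent set) can be evaluated independently during the search. A sketch of a standard BIC family score for discrete data; the paper's improved BIC variant differs, and all names here are illustrative:

```python
from collections import Counter
from math import log

def bic_family_score(data, child, parents):
    """Standard BIC score of one node given its parent set.

    data: list of dicts mapping variable name -> discrete value.
    Returns log-likelihood minus the (log n / 2) complexity penalty.
    """
    n = len(data)
    child_vals = {row[child] for row in data}
    # Counts of (parent configuration, child value) and of parent configurations.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    q = max(len(parent_counts), 1)  # number of observed parent configurations
    penalty = 0.5 * log(n) * q * (len(child_vals) - 1)
    return loglik - penalty
```

A greedy K2-style search then adds, for each attribute node, the candidate parent that most increases this family score, stopping when no addition improves it.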

    Table 2  Network structure learning results

    $\xi $   | Asia network (A/D/R) | Alarm network (A/D/R)
    0.010    | 3502025
    0.001    | 1173245
    0.0001   | 30815345

    Table 3  Experimental dataset information

    Dataset  | Samples | Class distribution   | Attributes | Classes
    Balance  | 625     | 49/288/288           | 4          | 3
    Car      | 1728    | 1210/384/69/65       | 6          | 4
    Connect  | 67558   | 44473/16635/6449     | 42         | 3
    Mushroom | 8124    | 4208/3916            | 22         | 2
    Nursery  | 12960   | 4320/2/328/4266/4044 | 8          | 5
    SPECT    | 80      | 40/40                | 22         | 2
    Cancer   | 286     | 85/201               | 9          | 2
    Votes    | 435     | 168/267              | 16         | 2

    Table 4  Threshold $\varepsilon $ information

    Dataset | Threshold $\varepsilon $ | Average accuracy
    Balance | 0.01/0.05/0.10           | 0.915/0.914/0.910
    Connect | 0.01/0.05/0.10           | 0.767/0.764/0.760
    SPECT   | 0.01/0.05/0.10           | 0.740/0.738/0.733
    Cancer  | 0.01/0.05/0.10           | 0.710/0.710/0.698

    Table 5  Comparison of classification metrics for NB, TAN, and SETAN

    Dataset  | Algorithm | Accuracy | F1    | Recall | Precision | AUC
    Balance  | NB        | 0.914    | 0.876 | 0.914  | 0.842     | 0.961
    Balance  | TAN       | 0.861    | 0.834 | 0.861  | 0.836     | 0.904
    Balance  | SETAN     | 0.914    | 0.876 | 0.914  | 0.842     | 0.962
    Car      | NB        | 0.857    | 0.849 | 0.857  | 0.854     | 0.976
    Car      | TAN       | 0.908    | 0.911 | 0.908  | 0.920     | 0.983
    Car      | SETAN     | 0.946    | 0.947 | 0.946  | 0.947     | 0.988
    Connect  | NB        | 0.721    | 0.681 | 0.721  | 0.681     | 0.807
    Connect  | TAN       | 0.763    | 0.722 | 0.763  | 0.731     | 0.864
    Connect  | SETAN     | 0.764    | 0.724 | 0.764  | 0.735     | 0.866
    Mushroom | NB        | 0.958    | 0.958 | 0.958  | 0.960     | 0.998
    Mushroom | TAN       | 0.999    | 1.000 | 0.999  | 1.000     | 1.000
    Mushroom | SETAN     | 1.000    | 1.000 | 1.000  | 1.000     | 1.000
    Nursery  | NB        | 0.903    | 0.894 | 0.903  | 0.906     | 0.982
    Nursery  | TAN       | 0.928    | 0.920 | 0.928  | 0.929     | 0.991
    Nursery  | SETAN     | 0.937    | 0.927 | 0.937  | 0.937     | 0.993
    SPECT    | NB        | 0.738    | 0.735 | 0.738  | 0.745     | 0.802
    SPECT    | TAN       | 0.713    | 0.709 | 0.713  | 0.724     | 0.668
    SPECT    | SETAN     | 0.738    | 0.736 | 0.738  | 0.741     | 0.755
    Cancer   | NB        | 0.734    | 0.727 | 0.734  | 0.723     | 0.702
    Cancer   | TAN       | 0.706    | 0.692 | 0.706  | 0.687     | 0.667
    Cancer   | SETAN     | 0.710    | 0.700 | 0.710  | 0.695     | 0.624
    Votes    | NB        | 0.901    | 0.902 | 0.901  | 0.905     | 0.973
    Votes    | TAN       | 0.940    | 0.940 | 0.940  | 0.941     | 0.986
    Votes    | SETAN     | 0.949    | 0.950 | 0.949  | 0.950     | 0.985
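The per-dataset values in Table 5 are consistent with support-weighted averaging of the per-class metrics (under that convention, weighted recall coincides with plain accuracy, as the table shows). A minimal sketch of that convention, assumed here rather than taken from the authors' evaluation code:

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all classes."""
    support = Counter(y_true)
    n = len(y_true)
    prec = rec = f1 = 0.0
    for c, s in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)
        p_c = tp / pred_c if pred_c else 0.0   # per-class precision
        r_c = tp / s                            # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        w = s / n                               # weight by class support
        prec += w * p_c
        rec += w * r_c
        f1 += w * f_c
    return prec, rec, f1
```

Because each true positive is counted once and weights sum to one, the weighted recall equals the overall fraction of correct predictions.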
Figures (7) / Tables (5)
Publication history
  • Received: 2018-09-18
  • Revised: 2019-03-27
  • Published online: 2019-04-20
  • Issue published: 2019-08-01
