Malicious Domain Name Detection Model Based on CNN and LSTM

Bin ZHANG; Renjie LIAO

doi:10.11999/JEIT200679

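Affiliations:
1. PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
2. Henan Key Laboratory of Information Security, Zhengzhou 450001, China

Funding: The Foundation and Frontier Technology Research Project of Henan Province (142300413201); the Open Fund Project of the Information Assurance Technology Key Laboratory (KJ-15-109); the Research Project of Information Engineering University (2019f3303)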

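About the authors:
Bin ZHANG: male, born in 1969, professor and doctoral supervisor; research interest: information system security

Renjie LIAO: male, born in 1996, master's student; research interest: machine-learning-based malicious domain name detection

Corresponding author:
Renjie LIAO, lrj2803@163.com

CLC number: TN915.08; TP393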
Publication history
- Received: 2020-08-04
- Revised: 2020-12-13
- Available online: 2021-02-06
- Publication date: 2021-10-18


Abstract

Abstract: To improve the accuracy of malicious domain name detection, this paper proposes a detection model that combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network. The model classifies a domain name by extracting sequence features from character combinations of different lengths in the domain name string. First, to avoid the sparse distribution of N-Gram features, a CNN with different kernels extracts character-combination features from the domain name string and converts them into dense vectors of fixed dimension. Second, to fully exploit the context of the domain name string, an LSTM extracts deep sequence features that capture the dependencies between preceding and following character combinations. An attention mechanism assigns small weights to the output features at padding positions, which reduces the interference of padding characters and strengthens the extraction of long-distance sequence features. Finally, the model combines the strength of the CNN in extracting local features with that of the LSTM in extracting sequence features, obtaining sequence features of character combinations of different lengths for detection. Experiments show that the proposed model achieves higher recall and F1-score than models that use a CNN or an LSTM alone. In particular, for the matsnu and suppobox malicious domain families, its detection accuracy is 24.8 and 3.77 percentage points higher, respectively, than that of the LSTM-only model.

Keywords:
- Malicious domain name
- Convolutional Neural Network (CNN)
- Long Short-Term Memory (LSTM)
- Attention mechanism
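
The padding-aware attention step described in the abstract can be illustrated with a small, dependency-free Python sketch. It is a toy under stated assumptions, not the authors' implementation: the names `encode_domain`, `attention_weights`, `attend`, `PAD_ID`, the fixed length of 16, the alphabet, and the hard penalty applied to padded positions are all hypothetical choices made for illustration. In the paper the attention weights are learned, so padded positions receive small weights rather than being masked by rule.

```python
import math

PAD_ID = 0          # hypothetical id reserved for the padding character
MAX_LEN = 16        # hypothetical fixed input length
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # real characters start at 1


def encode_domain(domain, max_len=MAX_LEN):
    """Map a domain string to a fixed-length list of integer ids, right-padded."""
    ids = [CHAR_TO_ID[ch] for ch in domain.lower() if ch in CHAR_TO_ID][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def attention_weights(scores, ids, penalty=1e9):
    """Softmax over per-position scores, pushing padded positions toward zero."""
    masked = [s - penalty if i == PAD_ID else s for s, i in zip(scores, ids)]
    peak = max(masked)                       # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, weights):
    """Weighted sum of per-position feature vectors (the attended summary)."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


ids = encode_domain("example.com")
# Stand-in scores and features; a trained model would derive these from LSTM outputs.
scores = [0.1 * k for k in range(MAX_LEN)]
features = [[float(i), 1.0] for i in ids]
weights = attention_weights(scores, ids)
summary = attend(features, weights)
```

Here the five padded positions of `example.com` contribute nothing to `summary`, which is the effect the attention module is meant to approximate by learning.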

Figure 1 Malicious domain name detection model combining CNN and LSTM (LSTM-Parallel CNN ATT-LSTM, L-PCAL)

Figure 2 LSTM unit combined with the attention mechanism (ATT-LSTM)

Figure 3 Comparison of ROC curves

Figure 4 Visualization of attention weights

Table 1 Comparison of model detection performance

Model	Recall (%)	Precision (%)	FPR (%)	F1-Score	Test time (s)
Bi-Gram DT	84.37	75.32	22.60	0.7959	1.05
LSTM	93.75	93.58	6.57	0.9367	4.46
Bi-LSTM	90.88	96.49	3.38	0.9360	7.34
Stack-CNN	86.31	94.01	5.62	0.9001	0.62
Parallel-CNN	88.39	94.54	5.22	0.9136	0.57
PCAL	92.66	95.96	3.98	0.9428	12.16
L-PCL	92.17	96.38	3.54	0.9423	13.26
CAL-PCAL	93.02	95.41	3.98	0.9420	11.94
L-PCAL (this paper)	93.91	95.42	4.61	0.9466	12.67
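
As a consistency check on Table 1, the F1-Score column can be recomputed from the Recall and Precision columns. The sketch below assumes the standard definition F1 = 2PR / (P + R), which this page does not state explicitly; the helper name `f1_score` and the dictionary `TABLE_1` are hypothetical. Because the inputs are already rounded to two decimals, a recomputed value may differ from the reported one in the last digit.

```python
def f1_score(recall_pct, precision_pct):
    """Harmonic mean of precision and recall, both given in percent."""
    r, p = recall_pct / 100.0, precision_pct / 100.0
    if p + r == 0:
        return 0.0  # convention for the degenerate case
    return 2 * p * r / (p + r)


# (recall %, precision %, reported F1) copied from Table 1
TABLE_1 = {
    "Bi-Gram DT": (84.37, 75.32, 0.7959),
    "LSTM": (93.75, 93.58, 0.9367),
    "Bi-LSTM": (90.88, 96.49, 0.9360),
    "Stack-CNN": (86.31, 94.01, 0.9001),
    "Parallel-CNN": (88.39, 94.54, 0.9136),
    "PCAL": (92.66, 95.96, 0.9428),
    "L-PCL": (92.17, 96.38, 0.9423),
    "CAL-PCAL": (93.02, 95.41, 0.9420),
    "L-PCAL": (93.91, 95.42, 0.9466),
}

for name, (rec, prec, reported) in TABLE_1.items():
    computed = f1_score(rec, prec)
    print(f"{name:12s} computed={computed:.4f} reported={reported:.4f}")
```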


Table 2 Comparison of TPR and AUC across models

Model	TPR (%) at FPR = 1%	TPR (%) at FPR = 2%	TPR (%) at FPR = 3%	AUC
LSTM	80.12	85.82	89.83	0.9846
Bi-LSTM	83.11	88.19	90.23	0.9840
Stack-CNN	72.58	79.82	82.24	0.9613
Parallel-CNN	77.13	82.04	84.85	0.9671
L-PCAL (this paper)	85.74	90.40	92.17	0.9867
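
Table 2 reports the true positive rate at fixed false positive rates. One common way to read such a figure is as the best TPR achievable by any score threshold whose FPR does not exceed the target. The following sketch shows that reading on made-up data; the function name `tpr_at_fpr`, the toy scores, and the toy labels are hypothetical and unrelated to the paper's experiments, and the authors' exact procedure may differ.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest TPR reachable by a score threshold whose FPR stays <= max_fpr.

    scores: higher means "more likely malicious"; labels: 1 = malicious, 0 = benign.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malicious and one benign sample")
    best_tpr = 0.0
    # Try every distinct score as a threshold: flag a sample when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Toy classifier outputs (illustrative only)
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

for target in (0.0, 0.2, 1.0):
    print(f"FPR <= {target:.0%}: TPR = {tpr_at_fpr(toy_scores, toy_labels, target):.0%}")
```

A tighter FPR budget forces a higher threshold, so the achievable TPR drops; this is why the TPR columns in Table 2 rise from left to right.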


Table 3 Comparison of detection accuracy on word-concatenation malicious domain names

Model	Accuracy (%) on matsnu	Accuracy (%) on suppobox
LSTM	0.78	81.57
Bi-LSTM	0.78	74.59
Stack-CNN	0	18.39
Parallel-CNN	0.78	16.08
PCAL	0	66.86
L-PCL	37.98	74.59
CAL-PCAL	7.75	74.59
L-PCAL (this paper)	25.58	85.34
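
The improvements quoted in the abstract (24.8% and 3.77% over the LSTM-only model) correspond to absolute differences between rows of Table 3, that is, percentage points rather than relative gains. A short check, with the hypothetical names `ACCURACY` and `gain_in_points`:

```python
# Accuracy (%) copied from the LSTM and L-PCAL rows of Table 3
ACCURACY = {
    "LSTM": {"matsnu": 0.78, "suppobox": 81.57},
    "L-PCAL": {"matsnu": 25.58, "suppobox": 85.34},
}


def gain_in_points(family, model="L-PCAL", baseline="LSTM"):
    """Absolute accuracy gain of `model` over `baseline`, in percentage points."""
    return round(ACCURACY[model][family] - ACCURACY[baseline][family], 2)


for family in ("matsnu", "suppobox"):
    print(family, gain_in_points(family))
```

The two differences come out to 24.8 and 3.77 points, matching the figures in the abstract.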

