Performance Prediction Based on Random Forest for the Stream Processing Checkpoint
-
Abstract: The development of the Internet of Things (IoT) keeps increasing both the volume and the variety of streaming data. As real-time processing scenarios multiply, checkpoint configuration strategies for stream processing that rely on empirical knowledge show clear shortcomings: they are time-consuming and labor-intensive, and they easily cause system anomalies. To address these challenges, a regression-based method for predicting checkpoint performance is proposed. First, six kinds of features that affect checkpoint performance are analyzed. Then the feature vectors of the training set are fed into the Random Forest (RF) regression algorithm for training. Finally, the trained model is used to predict on the test set. Experimental results show that, compared with other machine learning algorithms, RF regression achieves lower error, higher accuracy, and faster execution when predicting checkpoint performance on the CPU-intensive, memory-intensive, and network-intensive benchmarks.
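The train-then-predict workflow described in the abstract can be illustrated with a minimal, standard-library-only sketch of random forest regression (bootstrap sampling plus a random feature subset at each split). Everything concrete here is an assumption made for illustration: the four synthetic features merely stand in for operator record-rate features, the target formula is invented, and the hyperparameters (`n_trees`, `max_depth`, `n_try`) are arbitrary. It is not the authors' implementation or data.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _sse(values):
    """Sum of squared deviations from the mean (the split criterion)."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(X, y, rng, max_depth=4, min_size=4, n_try=2, depth=0):
    """Grow one regression tree. A leaf is a float; a node is
    (feature, threshold, left_subtree, right_subtree)."""
    if depth >= max_depth or len(y) < min_size:
        return _mean(y)
    best = None
    for f in rng.sample(range(len(X[0])), n_try):    # random feature subset
        # Skipping the smallest value keeps both sides of the split non-empty.
        for t in sorted({row[f] for row in X})[1:]:
            left = [i for i, row in enumerate(X) if row[f] < t]
            right = [i for i, row in enumerate(X) if row[f] >= t]
            score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                                 # sampled features were constant
        return _mean(y)
    _, f, t, left, right = best
    kids = [grow_tree([X[i] for i in idx], [y[i] for i in idx],
                      rng, max_depth, min_size, n_try, depth + 1)
            for idx in (left, right)]
    return (f, t, kids[0], kids[1])


def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node


def fit_forest(X, y, n_trees=10, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]     # bootstrap sample
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return _mean([predict_tree(tree, row) for tree in forest])


def make_synthetic(n, seed):
    """Hypothetical data: 4 features in [0, 1] and an invented noisy target."""
    rng = random.Random(seed)
    X = [[rng.uniform(0, 1) for _ in range(4)] for _ in range(n)]
    y = [0.2 * a + 0.8 * b + 0.1 * c + 0.5 * d + rng.gauss(0, 0.02)
         for a, b, c, d in X]
    return X, y


X, y = make_synthetic(150, seed=1)
split = len(X) * 8 // 10                             # 80% train, 20% test
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

forest = fit_forest(X_train, y_train)
preds = [predict_forest(forest, row) for row in X_test]

forest_mae = _mean([abs(p - t) for p, t in zip(preds, y_test)])
baseline_mae = _mean([abs(_mean(y_train) - t) for t in y_test])
print(f"forest MAE={forest_mae:.4f}  mean-baseline MAE={baseline_mae:.4f}")
```

The comparison against a predict-the-training-mean baseline is only a sanity check that the forest has learned something from the features. In practice a tested library implementation would be preferable to this hand-rolled version.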
-
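Table 1 Summary of dynamic features

| Feature name | Description |
|---|---|
| Local input records | Number of local records the operator receives per second. |
| Remote input records | Number of remote records the operator receives per second. |
| Local buffered records | Number of local records the operator buffers per second. |
| Remote buffered records | Number of remote records the operator buffers per second. |

Table 2 Dataset description

| Benchmark | Samples | Features | Training samples | Prediction samples |
|---|---|---|---|---|
| CKCPU | 47100 | 332 | 37680 | 9420 |
| CKMEM | 10290 | 172 | 8232 | 2058 |
| CKNET | 18900 | 524 | 15120 | 3780 |

Table 3 Prediction errors of different regression algorithms

| Benchmark | Regression algorithm | MAE | RMSE | MedianAE |
|---|---|---|---|---|
| CKCPU | SVR poly | 0.107006 | 1.900023 | 37.921288 |
| CKCPU | SVR linear | 0.095006 | 27.06338 | 37.529361 |
| CKCPU | KNN | 0.108006 | 0.323870 | 0.286494 |
| CKCPU | BPNN | 0.042380 | 0.070043 | 0.129856 |
| CKCPU | RF | 0.040178 | 0.068811 | 0.125560 |
| CKMEM | SVR poly | 0.115007 | 0.037560 | 10.924428 |
| CKMEM | SVR linear | 0.178010 | 2.524596 | 4.085918 |
| CKMEM | KNN | 0.148008 | 0.370660 | 0.373577 |
| CKMEM | BPNN | 0.097356 | 0.199461 | 0.214980 |
| CKMEM | RF | 0.096046 | 0.196619 | 0.206272 |
| CKNET | SVR poly | 0.091005 | 0.645619 | 0.634070 |
| CKNET | SVR linear | 0.301017 | 0.545833 | 0.523365 |
| CKNET | KNN | 0.102006 | 0.742873 | 0.742375 |
| CKNET | BPNN | 0.020343 | 0.103857 | 0.147659 |
| CKNET | RF | 0.019501 | 0.089315 | 0.089082 |

-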
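For readers unfamiliar with the three error metrics reported in Table 3, the sketch below computes mean absolute error (MAE), root mean squared error (RMSE), and median absolute error under their standard definitions. The function names and the four sample values are illustrative assumptions and are not taken from the paper's data.

```python
import math
from statistics import median


def mae(y_true, y_pred):
    """Mean absolute error: average of |t - p|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average of (t - p)^2."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def median_ae(y_true, y_pred):
    """Median absolute error: median of |t - p|, robust to outliers."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Illustrative values only; the absolute errors are 0.5, 0.0, 1.0, 2.0.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 6.0]

print(mae(y_true, y_pred))        # 0.875
print(rmse(y_true, y_pred))       # sqrt(1.3125)
print(median_ae(y_true, y_pred))  # 0.75
```

For any fixed set of predictions, RMSE is never smaller than MAE, and the two are equal only when every absolute error has the same magnitude. The median absolute error ignores the size of the largest errors, so it can fall well below both when a few predictions are far off.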