面向众核密码处理器的高效负载均衡技术

戴紫彬; 尹安琪; 曲彤洲; 南龙梅

doi:10.11999/JEIT180623

面向众核密码处理器的高效负载均衡技术

doi: 10.11999/JEIT180623

戴紫彬¹,
尹安琪^1, ,,
曲彤洲¹,
南龙梅^{1, 2}

1.
解放军信息工程大学郑州 450001
2.
复旦大学专用集成电路与系统国家重点实验室上海 201203

详细信息

作者简介:
戴紫彬：男，1966年生，教授，博士生导师，研究方向为可重构计算与安全专用芯片设计

尹安琪：女，1995年生，硕士生，研究方向为可重构计算与信息安全

曲彤洲：男，1994年生，硕士生，研究方向为可重构计算与信息安全

南龙梅：女，1981年生，讲师，博士，研究方向为可重构安全芯片设计

通讯作者:
尹安琪　yinaq0222@foxmail.com

中图分类号: TP338.6
计量
- 文章访问数: 1566
- HTML全文浏览量: 523
- PDF下载量: 60
- 被引次数: 0
出版历程
- 收稿日期: 2018-06-26
- 修回日期: 2018-11-27
- 网络出版日期: 2018-12-03
- 刊出日期: 2019-02-01

Efficient Workload Balance Technology on Many-core Crypto Processor

Zibin DAI¹,
Anqi YIN^{1
, ,},
Tongzhou QU¹,
Longmei NAN^{1, 2}

1.
The PLA Information Engineering University, Zhengzhou 450001, China
2.
State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203, China

摘要

摘要:
工作负载分配不均是制约众核密码平台资源利用率提高的重要因素，动态负载分配可提高平台资源利用率，但具有一定开销；所以更高的负载均衡频率并不一定带来更高的负载均衡增益。因此，该文建立了关于负载均衡增益率与负载均衡频率的数学模型。基于模型，提出一种面向众核密码平台的无冲突负载均衡策略和一种基于硬件作业队列的“可扩展-可移植”负载均衡引擎——“簇间微网络-簇内环阵列”。实验证明：在性能、延时功耗积、资源利用率和负载均衡度方面，该文设计的负载均衡引擎与基于“作业窃取”的软件技术相比平均优化约4.06倍、7.17倍、23.01%和2.15倍；与基于“作业窃取”的硬件技术相比约优化1.75倍、2.45倍、10.2%、和1.41倍；与理想硬件技术相比，密码算法吞吐率平均只降低了约5.67%(最低3%)。实验结果表明该文技术具有良好的可扩展性和可移植性。
- 众核密码处理器 /
- 负载均衡策略 /
- 负载均衡引擎 /
- 无冲突
Abstract:
Imbalanced workload distribution results in low resource utilization of many-core crypto-platform. Dynamic workload allocation can improve the resource utilization with some overhead. Therefore, a higher frequency of workload balancing is not equivalent to higher gains. This paper establishes a mathematical model for gain rate and frequency of workload balancing. Based on this model, a collision-free workload balancing policy is proposed for many-core crypto systems, and a hierarchical "expandable-portable" engine is put forward, which consists of "Inter-cluster micro-network and intra-cluster ring-array" adopting hardware job queue technology. Experiment results show that the proposed workload-balancing engine is 4.06, 7.17, 23.01% and 2.15 times higher than the software technology based on " job stealing” in terms of performance, delay power consumption, resource utilization and workload balance; 1.75, 2.45, 10.2%, and 1.41 times better compared with the hardware technology based on "job stealing". By contrast with the ideal hardware technology, the average throughput of encryption algorithms is only decreased by 5.67% (the lowest 3%). The experiment also proves the scalability and portability of the proposed technique.
- Many-core crypto processor /
- Workload balance strategy /
- Workload balance engine /
- Collision-free

HTML全文

图 1 众核密码处理器架构模型

下载: 全尺寸图片幻灯片

图 2 簇内作业队列架构

下载: 全尺寸图片幻灯片

图 3 全局及局部均衡操作执行流程

下载: 全尺寸图片幻灯片

图 4 簇间负载均衡微网络

下载: 全尺寸图片幻灯片

图 5 性能对比图

下载: 全尺寸图片幻灯片

图 6 延时功耗积及资源利用率、负载均衡度对比图

下载: 全尺寸图片幻灯片

表 1 无冲突负载均衡算法

Require:${L_{{\rm{td}}}},{N_{{\rm{cl}}}},{L_{{\rm{dt}}}},{T_{{\rm{cp}}}},{\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}},{\rm{Thpu}}{{\rm{t}}_{{\rm{Nt}}}},{S_{\rm{c}}}$//簇间均衡操作集合

　1 Assign ${\rm{Balance\_c}} \leftarrow {\rm{True,}}{{{N}}_2} \leftarrow {\rm{Num}}[{S_{\rm{c}}}],{n_2} \leftarrow 0,{t_2} \leftarrow {T_{{\rm{op}}}}$
　　 ${t_3} \leftarrow ({L_{{\rm{td}}}} + {L_{{\rm{dt}}}})/{\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}},{t_4} \leftarrow ({L_{{\rm{td}}}} + {L_{{\rm{dt}}}})/{\rm{Thpu}}{{\rm{t}}_{Nt}}$

　2 while new ${\rm{job[}}k{\rm{][}}i{\rm{] = = True}}$ do//更新负载情况

　3 ${t_{\rm{1}}} \leftarrow {L_{{\rm{tdki}}}} \cdot {N_{{\rm{clki}}}} \cdot {L_{{\rm{dtki}}}};{\rm{Addc}}[k] \leftarrow {\rm{Addc}}[k] + {t_1};$

　4 end while

　5 while ${\rm{Balance\_c}} = = {\rm{True}}$ do//簇间负载均衡

　6 for $k \leftarrow 0$ to ${N_2}$ do

　7 for $w \leftarrow 0$ to ${N_2}$ do

　8 if $w \ne k$ then

　9 if ${\rm{Thpu}}{{\rm{t}}_{{\rm{SM}}}} \ge {\rm{Thpu}}{{\rm{t}}_{Nt}}$ then

　10 if ${\rm{Addp[}}k{\rm{] - Addp[}}w{\rm{] - }}{t_{\rm{1}}}{\rm{ > }}{t_{2}}{\rm{ + }}{t_{\rm{4}}}$ then//式(13)

　11 ${\rm{Balancep\_Request[}}w{\rm{][}}k{]} \leftarrow {\rm{True;}}$

　12 ${\rm{Balancec\_Request[}}w{]} = {\rm{Balancec\_Request[}}w{]} + 1;$

　13 end if

　14 else then

　15 if ${\rm{Addp[}}k{\rm{] - Addp[}}w{\rm{] - }}{t_{\rm{1}}} > {t_2} + {t_3}$ then//式(13)

　16 ${\rm{Balancep\_Request[}}w{\rm{][}}k{]} \leftarrow {\rm{True;}}$

　17 ${\rm{Balancec\_Request[}}w{]} = {\rm{Balancec\_Request[}}w{]} + 1;$

　18 end if

　19 end else

　20 end for

　21 if ${\rm{Balancec\_Request[}}w{]} > 1$ then

　22 ${\rm{choose }}\ p{\rm{[}}{{\rm{n}}_{2}}{\rm{];}}{n_2} \leftarrow {(}{n_{2}}{\rm{ + 1)mol}}{N_{2}};$

　23 end if

　24 end for

　25 end while

下载: 导出CSV

表 2 仿真参数

参数	值
核心数目	4～64
本地缓存大小	8 kB
共享缓存大小	16 MB
簇内互连方式	二向环
簇间互连方式	2D-mesh
共享访问延时/本地访问延时	3

下载: 导出CSV

表 3 测试基准说明

测试基准	算法分类	算法特征 (bit)	作业数目	平均作业大小(kB)	测试基准	算法分类	算法特征 (bit)	作业数目	平均作业大小(kB)	备注
DES	分组	64	1024	54.4	A5	序列	1	1024	41.1	算法特征表示分组/杂凑算法的处理位宽或者序列算法的输出位宽
AES	分组	128	1024	46.3	ZUC	序列	32	1024	50.4
SM4	分组	512	1024	32.2	RC4	序列	8	1024	47.3
SHA256	杂凑	512	1024	44.6	RSA	公钥	—	1024	40.3
SM3	杂凑	512	1024	38.8	SM2	公钥	—	1024	22.8
SHA-1	杂凑	512	1024	33.2

下载: 导出CSV

参考文献(15)

KIM C and HUH J. Exploring the design space of fair scheduling supports for asymmetric multicore systems[J]. IEEE Transactions on Computers, 2018, 67(8): 1136–1152. doi: 10.1109/TC.2018.2796077

KIM K W, CHO Y, EO J, et al. System-wide time versus density tradeoff in real-time multicore fluid scheduling[J]. IEEE Transactions on Computers, 2018, 67(7): 1007–1022. doi: 10.1109/TC.2018.2793919

LEE J, NICOPOULOS C, LEE H G, et al. IsoNet: Hardware-based job queue management for many core architectures[J]. IEEE Transactions on Very Large Scale Integration Systems, 2013, 21(6): 1080–1093. doi: 10.1109/TVLSI.2012.2202699

KUMAR S, HUGHES C J and NGUYEN A. Carbon: Architectural support for fine-grained parallelism on chip multiprocessors[C]. International Symposium on Computer Architecture, California, USA, 2007: 162–173.

CHEN J, JUANG P, KO K, et al. Hardware-modulated parallelism in chip multiprocessors[J]. ACM Sigarch Computer Architecture News, 2005, 33(4): 54–63. doi: 10.1145/1105734.1105742

LEE J, NICOPOULOS C, LEE Y, et al. Hardware-based job queue management for manycore architectures and OpenMP environments[J]. IEEE International Parallel & Distributed Processing Symposium, 2011, 21(6): 407–418. doi: 10.1109/IPDPS.2011.47

刘宗斌, 马原, 荆继武, 等. SM3哈希算法的硬件实现与研究[J]. 信息网络安全, 2011, 59(9): 191–193. doi: 10.3969/j.issn.1671-1122.2011.09.059

LIU Zongbin, MA Yuan, JING Jiwu, et al. Implementation of SM3 hash function on FPGA[J]. Information Network Security, 2011, 59(9): 191–193. doi: 10.3969/j.issn.1671-1122.2011.09.059

徐金甫, 杨宇航. SM4算法在粗粒度阵列平台的并行化映射[J]. 电子技术应用, 2017, 43(4): 39–42.

XU Jinfu and YANG Yuhang. Parallel mapping of SM4 algorithm on coarse-grained array platform[J]. Electronic Technology Application, 2017, 43(4): 39–42.

DUBEY P. Recognition, mining and synthesis moves computers to the era of tera[J]. Technology@Intel Magazine, 2005, 9(2): 1–10.

RATTNER J. Cool codes for hot chips: A quantitative basis for multi-core design[C]. Hot Chips 18 Symposium IEEE, Cupertino, USA, 2016: 1–28.

AN H, TAURA K, and SPOTTER D. A tool for spotting scheduler-caused delays in task parallel runtime systems[C]. IEEE International Conference on CLUSTER Computing, Hawaii, USA, 2017: 114–125.

KWOK Y K and AHMAD I. Static scheduling algorithms for allocating directed task graphs to multiprocessors[J]. ACM Computing Surveys, 1999, 31(4): 406–471. doi: 10.1145/344588.344618

TITHI J J, MATANI D, MENGHANI G, et al. Avoiding locks and atomic instructions in shared-memory parallel BFS using optimistic parallelization[C]. Parallel and Distributed Processing Symposium Workshops & Phd Forum IEEE, Cambridge, UK, 2013: 1628–1637.

MOON S W, REXFORD J, and SHIN K G. Scalable hardware priority queue architectures for high-speed packet switches[J]. IEEE Transactions on Computers, 2000, 49(11): 1215–1227. doi: 10.1109/RTTAS.1997.601359

CHEN Quan, ZHENG Long, and GUO Minyi. Adaptive Demand-aware Work-stealing in Multi-programmed Multi-core Architectures[J]. Concurrency and Computation: practice & Experience, 2016, 28(2): 455–471. doi: 10.1002.cpe.3619

施引文献

资源附件(0)

访问统计