Graph Algorithm Optimization for Spintronics-based In-memory Computing Architecture
Abstract: Graph computing is widely applied in social network analysis, recommendation systems, and other emerging fields. However, large-scale graph computing under the traditional von Neumann architecture faces a performance bottleneck caused by memory access. The newly developed in-memory computing architecture has become a promising alternative for accelerating large-scale graph computing. In particular, owing to its ultra-high write endurance and ultra-fast write speed, non-volatile Magnetoresistive Random Access Memory (MRAM) has great potential for building efficient in-memory accelerators. One of the key challenges in realizing this potential is how to optimize graph algorithm design under the in-memory computing architecture. Our previous work showed that triangle counting and graph connected component computation can be implemented with bitwise operations and thereby accelerated efficiently in spintronic in-memory processing cores. This paper explores optimized implementations of more graph algorithms, such as single-source shortest path, K-core, and link prediction, and proposes an optimized graph algorithm design model for the new in-memory computing architecture. This research is of key significance for overcoming the memory access bottleneck of large-scale graph computing under the von Neumann architecture.
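To make the bitwise formulation above concrete, here is a minimal Python sketch of triangle counting over bitmap adjacency rows: counting the common neighbors of an edge's endpoints reduces to a bitwise AND followed by a population count, which is the primitive a spintronic in-memory array can execute in place. This is a host-side software illustration under our own assumptions (integer bitmaps, software popcount), not the MRAM circuit design itself.

```python
def count_triangles(edges, n):
    """Count triangles in an undirected graph given as an edge list."""
    # One integer bitmap per vertex: bit v of row[u] is set iff {u, v} is an edge.
    row = [0] * n
    for u, v in edges:
        row[u] |= 1 << v
        row[v] |= 1 << u
    # For each edge {u, v}, popcount(row[u] & row[v]) counts the triangles through
    # that edge; every triangle is seen once per each of its three edges, hence // 3.
    # bin(x).count("1") stands in for the in-memory population count.
    total = sum(bin(row[u] & row[v]).count("1") for u, v in edges)
    return total // 3

if __name__ == "__main__":
    # A 4-clique contains exactly 4 triangles.
    k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(k4_edges, 4))  # -> 4
```

Because each AND-plus-popcount operates on whole adjacency rows, the same reduction can run row-parallel inside the memory array, removing the data movement that dominates the CPU baseline.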
Table 1 Optimization model of graph algorithms for in-memory computing
Table 2 Key parameters of the MTJ

Parameter                            Value
MTJ surface length                   40 nm
MTJ surface width                    40 nm
Spin Hall angle                      0.3
MTJ resistance-area product          10⁻¹² Ω·m²
Oxide barrier thickness              0.82 nm
Tunnel magnetoresistance (TMR)       100%
Saturation field                     10⁶ A/m
Gilbert damping constant             0.03
Perpendicular magnetic anisotropy    4.5×10⁵ A/m
Temperature                          300 K

Table 3 Runtime comparison of the in-memory graph acceleration architecture (PIM) against the CPU implementation for single-source shortest path (SSSP), K-core, and link prediction (s)
Graph dataset       SSSP CPU   SSSP PIM   K-core CPU   K-core PIM   Link pred. CPU   Link pred. PIM
p2p-Gnutella06      0.063      0.0014     0.187        0.006        0.001            0.00012
p2p-Gnutella31      1.187      0.021      1.084        0.031        0.002            0.00016
email-Enron         2.972      0.071      1.383        0.056        0.001            0.00010
email-EuAll         2.826      0.081      3.832        0.193        0.003            0.00041
soc-Slashdot0922    10.989     0.203      10.137       0.241        0.002            0.00017
web-NotreDame       12.567     0.184      43.975       0.754        0.005            0.00037
amazon0302          17.837     0.329      14.392       0.381        0.002            0.00013
amazon0505          38.608     0.565      50.632       1.649        0.003            0.00043
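The speedups implied by Table 3 are simply the ratio of the CPU and PIM columns; across all eight datasets they range from roughly 7x (link prediction on amazon0505) to roughly 68x (SSSP on amazon0505). The short sketch below (our own helper, using two representative rows) shows the computation.

```python
# Speedup = CPU time / PIM time, read directly from Table 3 (times in seconds).
table3 = {
    # dataset: (SSSP CPU, SSSP PIM, K-core CPU, K-core PIM, link-pred CPU, link-pred PIM)
    "p2p-Gnutella06": (0.063, 0.0014, 0.187, 0.006, 0.001, 0.00012),
    "amazon0505": (38.608, 0.565, 50.632, 1.649, 0.003, 0.00043),
}
for name, (sc, sp, kc, kp, lc, lp) in table3.items():
    print(f"{name}: SSSP {sc/sp:.0f}x, K-core {kc/kp:.0f}x, link prediction {lc/lp:.0f}x")
# p2p-Gnutella06: SSSP 45x, K-core 31x, link prediction 8x
# amazon0505: SSSP 68x, K-core 31x, link prediction 7x
```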