Collaborative Optimization Strategies for Matrix Multiplication-Accumulation Operators on Commercial Processing-In-Memory Architectures

HE Yukai, XIE Tongxin, ZHU Zhenhua, GAO Lan, LI Bing

Citation: HE Yukai, XIE Tongxin, ZHU Zhenhua, GAO Lan, LI Bing. Collaborative Optimization Strategies for Matrix Multiplication-Accumulation Operators on Commercial Processing-In-Memory Architectures[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250364


doi: 10.11999/JEIT250364 cstr: 32379.14.JEIT250364
Funds: The National Natural Science Foundation of China (62204164)
Details
    About the authors:

    HE Yukai: male, master's student; research interests: computer architecture, near-memory computing, in-memory computing

    XIE Tongxin: male, Ph.D. student; research interests: near-memory computing, compilers, in-memory computing

    ZHU Zhenhua: male, postdoctoral researcher; research interests: computing-in-memory, computer architecture, hardware-software co-design

    GAO Lan: female, Ph.D., associate professor; research interests: GPU parallel programming, parallel program performance optimization

    LI Bing: female, Ph.D., associate research fellow; research interests: computer architecture, in-memory computing architecture and chip design

    Corresponding author:

    LI Bing, libing2024@ime.ac.cn

  • CLC classification: TN401

  • Abstract: Given the potential of near-memory architectures to accelerate data-intensive programs, Samsung and other vendors have introduced near-memory chips based on High-Bandwidth Memory with Processing-In-Memory (HBM-PIM) for large-model acceleration; thanks to HBM's high bandwidth and inherent parallelism, near-memory computing delivers excellent speedups for large models. This paper finds, however, that as matrix sizes vary, the acceleration delivered by the HBM-PIM architecture becomes unstable, limiting the speedup attainable when deploying large models. To unlock HBM-PIM's acceleration potential, this paper analyzes in depth the root cause of the performance differences among operators of different sizes on HBM-PIM: the current HBM-PIM provides insufficient support for the partitioning, mapping, and execution of matrix-multiplication data. It then proposes an optimization method that combines dynamic Bank allocation, odd-even Bank interleaved address mapping, and tile-virtualized computation, effectively improving resource utilization and computational parallelism. Evaluation results show that the proposed method achieves speedups of 1.894×–8.225× for matrix computations of different sizes, a 2.7× average performance improvement over the unoptimized design. The proposed scheme effectively enhances the scalability and adaptability of the PIM architecture across tasks of multiple scales, and offers a useful reference for the efficient mapping and scheduling of AI operators on in-memory computing platforms.
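The abstract names odd-even Bank interleaved address mapping as one of the three techniques, but this page does not give its formula. The sketch below is only a hypothetical illustration of the idea: alternate matrix rows between even- and odd-numbered banks so that row activations spread across all banks rather than piling onto a few. The function `interleaved_bank` and its parity scheme are assumptions for illustration, not the paper's actual mapping.

```python
# Hypothetical illustration of an odd-even bank interleaved mapping.
# The parity-based formula below is an assumption for illustration only;
# the paper's actual address-mapping scheme is not given on this page.

NUM_BANKS = 16  # bank count of the evaluated platform (Table 3)

def interleaved_bank(row: int) -> int:
    """Map a matrix row to a bank, alternating even/odd banks by row parity."""
    parity = row % 2                          # even rows -> even banks, odd -> odd
    within = (row // 2) % (NUM_BANKS // 2)    # round-robin inside each parity group
    return 2 * within + parity
```

Under such a scheme, any run of consecutive rows touches both bank-parity groups and all 16 banks in rotation, which is the kind of balancing the activation skew in Table 2 calls for.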
  • Figure 1  Schematic comparison of different computing architectures

    Figure 2  Speedup trend of the GEMV operator across data sizes

    Figure 3  Control flow of dynamic Bank-group allocation and mapping

    Figure 4  Execution cycles of a 2 048×1 024 task in PIM mode with different Bank counts

    Figure 5  Speedup comparison of different virtual-lane configurations at the 4 096×4 096 data size

    Figure 6  Schematic comparison of the conventional scheme and tile-virtualized computation

    Figure 7  PIM-mode execution cycles/speedups before and after optimization across data sizes

    Table 1  Quantified distribution of Bank-state behavior

    Data size    Total Bank states   Idle states   Precharge states   Activate states
    2 048×2 048  14 274 560          12 662 464    79 296             457 600
    4 096×4 096  26 963 968          23 870 272    122 368            820 928
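The counts in Table 1 imply that banks spend the overwhelming majority of their states idle; a quick arithmetic check of the idle fraction, using only the table's numbers:

```python
# Idle-state fraction of total bank states, computed from Table 1's counts.
table1 = {
    "2048x2048": {"total": 14_274_560, "idle": 12_662_464},
    "4096x4096": {"total": 26_963_968, "idle": 23_870_272},
}

def idle_fraction(size: str) -> float:
    row = table1[size]
    return row["idle"] / row["total"]

# idle_fraction("2048x2048") is about 0.887 and
# idle_fraction("4096x4096") about 0.885: banks are idle ~88-89% of the time.
```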

    Table 2  Detailed activated-row counts of Banks 0–7 at the 2 048×2 048 data size

    Bank ID         0        1        2      3      4      5      6      7
    Activated rows  163 200  229 568  4 224  4 224  4 224  4 224  4 224  4 352

    Algorithm 1  Instruction packaging and execution of tile-virtualized computation

     Input: data array operand; virtual tile length 16; hardware lane width 8
     Output: the result of one tile computation
     (1) Initialize: remain ← 16, offset ← 0
     (2) For each 8-lane chunk of data:
     (3)  (a) cur_load ← min(remain, 8)
     (4)  (b) write cur_load elements into the GRF
     (5)  (c) execute addBarrier() on each channel for synchronization
     (6)  (d) call addTransactionAll() to trigger the MAC operation
     (7)  (e) remain ← remain − cur_load, offset ← offset + cur_load
     (8) Repeat steps (a)–(e) until remain = 0
     (9) Return
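The loop of Algorithm 1 can be sketched in runnable Python. The callbacks `grf_write`, `barrier`, and `mac_all` stand in for the simulator's GRF write, addBarrier(), and addTransactionAll() operations and are placeholders, not real PIMSimulator APIs:

```python
# Minimal sketch of Algorithm 1: issue one 16-element virtual tile as
# ceil(16 / 8) = 2 passes over an 8-lane hardware unit.
# grf_write/barrier/mac_all are hypothetical stand-ins for the simulator calls.

LANE_WIDTH = 8   # physical SIMD lanes per PIM unit
TILE_LEN   = 16  # virtual tile length

def run_virtual_tile(operand, grf_write, barrier, mac_all):
    """Execute one virtual tile; returns the number of hardware passes issued."""
    remain, offset = TILE_LEN, 0
    passes = 0
    while remain > 0:
        cur_load = min(remain, LANE_WIDTH)              # (a) lanes used this pass
        grf_write(operand[offset:offset + cur_load])    # (b) stage operands in the GRF
        barrier()                                       # (c) per-channel synchronization
        mac_all()                                       # (d) trigger MAC on all banks
        remain -= cur_load                              # (e) advance to the next chunk
        offset += cur_load
        passes += 1
    return passes
```

With these parameters a 16-element tile always completes in exactly two passes of 8 lanes each.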

    Table 3  Experimental platform and resource configuration

    Category   OS                Simulator             Simulation accuracy   Banks   GRF
    Setting    Ubuntu 20.04 LTS  Samsung PIMSimulator  Cycle-level           16      8-lane

    Table 4  Speedups of each optimization strategy across matrix sizes

    Data size    Baseline  DBAS   Odd-even Bank           Tile-virtualized  Collaborative
                                  interleaved mapping     computation       optimization
    1 024×1 024  0.681     0.693  0.686                   1.869             1.894
    2 048×1 024  1.243     1.264  1.251                   3.427             3.456
    2 048×2 048  1.296     1.308  1.305                   3.438             3.479
    4 096×2 048  2.637     2.661  2.656                   6.152             6.215
    4 096×4 096  2.741     2.754  2.751                   8.195             8.225
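The abstract's "2.7× average improvement" claim follows directly from Table 4 by dividing the collaborative-optimization column by the Baseline column, size by size:

```python
# Reproducing the 2.7x average-improvement figure from Table 4's columns.
baseline      = [0.681, 1.243, 1.296, 2.637, 2.741]  # Baseline speedups
collaborative = [1.894, 3.456, 3.479, 6.215, 8.225]  # collaborative optimization

ratios = [c / b for c, b in zip(collaborative, baseline)]
avg_improvement = sum(ratios) / len(ratios)   # about 2.72, i.e. ~2.7x on average
```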

    Table 5  Comparison of the proposed collaborative optimization strategy with UniNDP

    Dimension                                UniNDP        Proposed collaborative strategy
    Operator type                            MVM (GEMV)    GEMV
    Test data size                           4 096×4 096   4 096×4 096
    HBM-PIM performance improvement          1.02×         3.00×
    Average improvement across data sizes    1.10×–1.62×   2.70× on average, better at larger sizes
  • [1] GHOLAMI A, YAO Zhewei, KIM S, et al. AI and memory wall[J]. IEEE Micro, 2024, 44(3): 33–39. doi: 10.1109/MM.2024.3373763.
    [2] CHI Ping, LI Shuangchen, XU Cong, et al. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory[J]. ACM SIGARCH Computer Architecture News, 2016, 44(3): 27–39. doi: 10.1145/3007787.3001140.
    [3] DAKKAK A, LI Cheng, XIONG Jinjun, et al. Accelerating reduction and scan using tensor core units[C]. Proceedings of the ACM International Conference on Supercomputing, Phoenix, USA, 2019: 46–57. doi: 10.1145/3330345.3331057.
    [4] ALIAN M, MIN S W, ASGHARIMOGHADDAM H, et al. Application-transparent near-memory processing architecture with memory channel network[C]. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, Fukuoka, Japan, 2018: 802–814. doi: 10.1109/MICRO.2018.00070.
    [5] GÓMEZ-LUNA J, EL HAJJ I, FERNANDEZ I, et al. Benchmarking a new paradigm: An experimental analysis of a real processing-in-memory architecture[J]. arXiv: 2105.03814, 2021: 1–25. doi: 10.48550/arXiv.2105.03814.
    [6] KIM J and KIM Y. HBM: Memory solution for bandwidth-hungry processors[C]. Proceedings of 2014 IEEE Hot Chips 26 Symposium, Cupertino, USA, 2014: 1–24. doi: 10.1109/HOTCHIPS.2014.7478812.
    [7] LEE S, KANG S H, LEE J, et al. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product[C]. Proceedings of 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, Valencia, Spain, 2021: 43–56. doi: 10.1109/ISCA52012.2021.00013.
    [8] KIM D, KUNG J, CHAI S, et al. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory[J]. ACM SIGARCH Computer Architecture News, 2016, 44(3): 380–392. doi: 10.1145/3007787.3001178.
    [9] ASGARI B, HADIDI R, CAO Jiashen, et al. FAFNIR: Accelerating sparse gathering by using efficient near-memory intelligent reduction[C]. Proceedings of 2021 IEEE International Symposium on High-Performance Computer Architecture, Seoul, Korea (South), 2021: 908–920. doi: 10.1109/HPCA51647.2021.00080.
    [10] WANG Haoyang, ZHANG Shengbing, FAN Xiaoya, et al. NDPGNN: A near-data processing architecture for GNN training and inference acceleration[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024, 43(11): 3997–4008. doi: 10.1109/TCAD.2024.3446871.
    [11] LEE S, KIM K, OH S, et al. A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based accelerator-in-memory supporting 1TFLOPS MAC operation and various activation functions for deep-learning applications[C]. Proceedings of 2022 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2022: 1–3. doi: 10.1109/ISSCC42614.2022.9731711.
    [12] DAI Guohao, ZHU Zhenhua, FU Tianyu, et al. DIMMining: Pruning-efficient and parallel graph mining on near-memory-computing[C]. Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, USA, 2022: 130–145. doi: 10.1145/3470496.3527388.
    [13] KE Liu, GUPTA U, CHO B Y, et al. RecNMP: Accelerating personalized recommendation with near-memory processing[C]. Proceedings of 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture, Valencia, Spain, 2020: 790–803. doi: 10.1109/ISCA45697.2020.00070.
    [14] WILKINSON F, COCKREAN A, LIN Weichen, et al. Assessing the GPU offload threshold of GEMM and GEMV kernels on modern heterogeneous HPC systems[C]. Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, USA, 2024: 1481–1495. doi: 10.1109/SCW63240.2024.00188.
    [15] HONG Ke, DAI Guohao, XU Jiaming, et al. FlashDecoding++: Faster large language model inference on GPUs[C]. Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024: 1–15.
    [16] IBRAHIM M A, ISLAM M, and AGA S. PIMnast: Balanced data placement for GEMV acceleration with processing-in-memory[C]. Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, USA, 2024: 970–981. doi: 10.1109/SCW63240.2024.00137.
    [17] XIE Tongxin, ZHU Zhenhua, LI Bing, et al. UniNDP: A unified compilation and simulation tool for near DRAM processing architectures[C]. Proceedings of 2025 IEEE International Symposium on High Performance Computer Architecture, Las Vegas, USA, 2025: 624–640. doi: 10.1109/HPCA61900.2025.00054.
    [18] Samsung Advanced Institute of Technology. PIMSimulator[EB/OL]. https://github.com/SAITPublic/PIMSimulator.
Figures (7) / Tables (6)
Publication history
  • Received: 2025-05-06
  • Published online: 2025-09-18

目录

    /

    返回文章
    返回