Design of Transformer Accelerator with Regular Compression Model and Flexible Architecture

JIANG Xiaobo, DENG Hanke, MO Zhijie, LI Hongyuan

Citation: JIANG Xiaobo, DENG Hanke, MO Zhijie, LI Hongyuan. Design of Transformer Accelerator with Regular Compression Model and Flexible Architecture[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1079-1088. doi: 10.11999/JEIT230188


doi: 10.11999/JEIT230188
Details
    About the authors:

    JIANG Xiaobo: Male, Associate Professor. His research interests include artificial intelligence, natural language processing, and AI chip design

    DENG Hanke: Male, M.S. candidate. His research interests include large-scale integrated circuit design and natural language processing

    MO Zhijie: Male, M.S. candidate. His research interests include large-scale integrated circuit design and LDPC encoding/decoding

    LI Hongyuan: Male, Engineer. His research interests include natural speech processing, communication coding/decoding, and intelligent robots

    Corresponding author:

    LI Hongyuan, gditlhy@163.com

  • CLC number: TN912.34


Funds: The National Natural Science Foundation of China (U1801262), Science and Technology Project of Guangdong Province (2019B010154003), Science and Technology Project of Guangzhou City (202102080579)
  • Abstract: The attention-based Transformer model offers superior performance, and a dedicated Transformer accelerator can substantially improve inference performance and reduce inference power consumption. The complexity of the Transformer model is both quantitative and structural; the structural complexity causes a mismatch between the irregular model and regular hardware, which lowers the efficiency of mapping the model onto hardware. Current accelerator research focuses mainly on the quantitative complexity of the model, while relatively little work addresses its structural complexity. This paper first proposes a regular compression model that reduces the structural complexity of the model, improves the match between model and hardware, and raises the efficiency of mapping the model onto hardware. It then proposes a hardware-friendly model compression method that uses a regular offset-diagonal weight pruning scheme and simplified hardware quantization inference logic. In addition, an efficient and flexible hardware architecture is proposed, including a block-based weight-stationary systolic array and a quasi-distributed storage architecture. This architecture efficiently maps the algorithm onto the computing array while achieving high data storage efficiency and reduced data movement. Experimental results show that this work achieves a 93.75% compression rate with negligible performance loss, and the accelerator implemented on an FPGA processes the compressed Transformer model efficiently, improving energy efficiency by 12.45× over a Central Processing Unit (CPU) and 4.17× over a Graphics Processing Unit (GPU).
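As a quick back-of-the-envelope check on the 93.75% figure (a sketch assuming the 176 MB baseline of Table 3 is stored in FP32): 4×4 offset-diagonal pruning keeps one element per row of each 4×4 block, i.e. 1/4 of the weights (Table 4), and INT8 quantization keeps 1/4 of the bits per weight, so roughly 1/16 of the original storage remains, matching the 11 MB / 93.75% entry of Table 5.

```python
base_mb = 176.0      # Transformer-base weights, assumed FP32 (Table 3)
prune_keep = 1 / 4   # 4x4 offset-diagonal pruning keeps one diagonal per block (Table 4)
quant_keep = 8 / 32  # INT8 weights versus the assumed FP32 baseline

compressed_mb = base_mb * prune_keep * quant_keep  # 11.0 MB, as in Table 5
compression_rate = 1 - prune_keep * quant_keep     # 0.9375 -> 93.75%
print(compressed_mb, compression_rate)
```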
  • Figure 1  Offset diagonal matrix

    Figure 2  Offset diagonal pruning

    Figure 3  Overall hardware architecture of the proposed accelerator

    Figure 4  Overall dataflow of the accelerator

    Figure 5  Schematic of the quasi-distributed storage architecture

    Figure 6  Dataflow scheme of the computing units

    Figure 7  Internal structure of the PE

    Figure 8  Data inside the computing unit

    Figure 9  Adder unit

    Figure 10  Sparse weight storage scheme for the offset diagonal matrix

    Figure 11  Accuracy degradation trend during the batched pruning process

    Algorithm 1  Unit offset diagonal pruning
     Input: model
     Output: pruned model
     model.train;
     GetOffset(model.weight(Q, K, V));
     Prune(model.weight(Q, K, V));
     model.train;
     GetOffset(model.weight(O));
     Prune(model.weight(O));
     model.train;
     GetOffset(model.weight(FFN1));
     Prune(model.weight(FFN1));
     model.train;
     GetOffset(model.weight(FFN2));
     Prune(model.weight(FFN2));
     model.train;
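Below is a minimal PyTorch-flavoured sketch of the prune-then-retrain schedule that Algorithm 1 outlines. The helper names (`keep_offset_diagonal`, `fine_tune`), the weight grouping, and the magnitude-based choice of offset are assumptions made for illustration; the paper's GetOffset/Prune steps and training loop are not reproduced here.

```python
import torch
import torch.nn as nn

def keep_offset_diagonal(w: torch.Tensor, block: int = 4) -> torch.Tensor:
    """Keep one cyclic (offset) diagonal per block x block tile and zero the rest.
    The offset is chosen by largest magnitude (assumed criterion); the matrix
    dimensions are assumed to be divisible by `block`."""
    out = torch.zeros_like(w)
    for r in range(0, w.shape[0], block):
        for c in range(0, w.shape[1], block):
            tile = w[r:r + block, c:c + block]
            # score every cyclic diagonal of the tile by total magnitude
            scores = torch.stack([
                sum(tile[i, (i + off) % block].abs() for i in range(block))
                for off in range(block)
            ])
            off = int(scores.argmax())
            for i in range(block):
                out[r + i, c + (i + off) % block] = tile[i, (i + off) % block]
    return out

def batched_prune(model: nn.Module, weight_groups, fine_tune, block: int = 4):
    """Algorithm 1 schedule: train, then for each group in turn fix the
    per-block offsets, prune, and retrain before moving to the next group."""
    fine_tune(model)                # initial model.train step
    for group in weight_groups:     # e.g. [[W_q, W_k, W_v], [W_o], [W_ffn1], [W_ffn2]]
        for w in group:
            w.data.copy_(keep_offset_diagonal(w.data, block))
        fine_tune(model)            # retrain with the pruned group in place
```

With block = 4 this keeps one of every four weights, which matches the 75% pruning rate of the 4×4 sub-matrix configuration in Table 4.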

    Table 1  Storage cost comparison of sparse matrix formats

    | Category | COO [18] | CSR [19] | BCSR [20] | Bitmap [21] | MBR [22] | Wmark [23] | This work |
    |---|---|---|---|---|---|---|---|
    | Value | 2 500 | 2 500 | 5 000 | 2 500 | 2 500 | 2 500 | 2 500 |
    | Col_idx | 3 125 | 3 125 | 312.5 | 312.5 | 312.5 | 312.5 | – |
    | Row_idx | 3 125 | 7.8 | 1.6 | 312.5 | 1.6 | – | – |
    | Index | – | – | – | 351.6 | – | – | – |
    | Bitmap | – | – | – | 625 | 625 | 312.5 | – |
    | Perm | – | – | – | – | – | – | 156.25 |
    | Total (Kb) | 8 750 | 5 632.8 | 5 314.1 | 4 101.6 | 3 439.1 | 3 125 | 2 656.25 |
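For intuition on why the last column needs only Value plus a small Perm entry: with offset-diagonal pruning, each b×b block keeps exactly b non-zeros on one cyclic diagonal, so besides the values themselves only one small offset index per block must be stored. The sketch below packs a matrix that way; it is an assumed layout for illustration, not necessarily the paper's on-chip storage format (`pack_offset_diagonal` is a hypothetical helper).

```python
import numpy as np

def pack_offset_diagonal(w: np.ndarray, block: int = 4):
    """Pack a matrix whose b x b blocks each keep one cyclic diagonal:
    'values' holds b non-zeros per block, 'offsets' holds one small index
    per block (assumes kept entries are non-zero and dims divisible by block)."""
    rows, cols = w.shape
    values, offsets = [], []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            tile = w[r:r + block, c:c + block]
            # recover the offset of the surviving diagonal from row 0
            off = int(np.flatnonzero(tile[0])[0])
            offsets.append(off)
            values.append([tile[i, (i + off) % block] for i in range(block)])
    return np.array(values), np.array(offsets, dtype=np.uint8)
```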

    Table 2  Transformer parameter settings

    | Model | Transformer |
    |---|---|
    | Number of encoder/decoder layers | 6 |
    | Number of attention heads | 8 |
    | Word embedding dimension | 512 |
    | Q, K, V dimension | 64 |
    | Feed-forward hidden layer dimension | 2 048 |

    Table 3  Experimental results of the Transformer (base) model

    | Dataset | Parameters (MB) | BLEU |
    |---|---|---|
    | IWSLT-2014 (De-En) | 176 | 34.5 |
    | IWSLT-2014 (En-De) | 176 | 28.5 |

    Table 4  Pruning results for the Transformer model

    | Dataset | Parameters (MB) | Sub-matrix size | BLEU | Compression rate (%) | Performance loss (%) |
    |---|---|---|---|---|---|
    | IWSLT-2014 (De-En) | 44 | 4 | 34.31 | 75.0 | 0.55 |
    | IWSLT-2014 (De-En) | 22 | 8 | 32.34 | 87.5 | 6.26 |
    | IWSLT-2014 (En-De) | 44 | 4 | 28.07 | 75.0 | 1.50 |
    | IWSLT-2014 (En-De) | 22 | 8 | 26.80 | 87.5 | 5.96 |

    Table 5  Quantization results for the pruned Transformer model

    | Dataset | Parameters (MB) | BLEU | Compression rate (%) | Performance loss (%) |
    |---|---|---|---|---|
    | IWSLT-2014 (De-En) | 11.0 | 34.16 | 93.75 | 0.98 |
    | IWSLT-2014 (De-En) | 5.5 | 32.14 | 96.87 | 6.84 |
    | IWSLT-2014 (En-De) | 11.0 | 27.95 | 93.75 | 1.93 |
    | IWSLT-2014 (En-De) | 5.5 | 26.58 | 96.87 | 6.74 |
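Table 5 stacks INT8 quantization on top of the pruned models. The snippet below is a generic symmetric per-tensor INT8 quantizer, included only as an illustration of this class of hardware-friendly schemes; it is an assumption, not the paper's exact quantization or its simplified inference logic.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q, with q in [-127, 127]."""
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate floating-point weight for accuracy checks."""
    return q.astype(np.float32) * scale
```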

    Table 6  Comparison of algorithm results (%)

    | Work | Model | Compression method | Dataset (task) | Compression rate | Performance loss |
    |---|---|---|---|---|---|
    | Ref. [13] | RoBERTa | Block-circulant matrices | IMDB (sentiment classification) | 93.75 | 4.30 |
    | Ref. [24] | Transformer | Hardware-aware search | IWSLT-2014 (De-En translation) | 28.20 | 0 |
    | Ref. [23] | Transformer | Hierarchical pruning | SST-2 (sentiment classification) | 90.00 | 2.37 |
    | Ref. [14] | Transformer | Memory-aware structured pruning | Multi30K (De-En translation) | 95.00 | 0.77 |
    | Ref. [25] | BERT | Fully quantized compression | SST-2 (sentiment classification) | 87.50 | 0.88 |
    | Ref. [15] | Transformer | Irregular pruning | WMT-2015 (De-En translation) | 77.25 | 1.92 |
    | This work | Transformer | Offset-diagonal structured pruning + hardware-friendly INT8 quantization | IWSLT-2014 (De-En translation) | 93.75 | 0.98 |

    Table 7  Resource utilization report

    | | LUT | FF | BRAM | DSP |
    |---|---|---|---|---|
    | Available | 218 600 | 437 200 | 545 | 900 |
    | Used | 152 425 | 187 493 | 262 | 576 |
    | Utilization (%) | 69.73 | 42.88 | 48.07 | 64.00 |

    Table 8  Computation time comparison with other computing platforms

    | Network layer | GPU [26] | Ref. [26] | This work |
    |---|---|---|---|
    | MHA-RL | 1557.8 µs (1.0×) | 106.7 µs (14.6×) | 115.16 µs (13.5×) |
    | FFN-RL | 713.4 µs | 210.5 µs (3.4×) | 170.71 µs (4.2×) |

    Table 9  Overall hardware performance comparison of the accelerators

    | Work | Compute performance (GOPS) | Power (W) | Inference latency (ms) | Throughput (fps) | Energy efficiency (fps/W) | Equivalent energy efficiency |
    |---|---|---|---|---|---|---|
    | CPU (Intel i7-8700K) | 138 | 79.0 | 18.6 | 53.0 | 0.68 | 1.00× |
    | GPU (NVIDIA 1080Ti) | 417 | 80.0 | 6.13 | 163 | 2.04 | 3.00× |
    | Ref. [26] | – | 13.2 | 23.8 | 42.0 | 3.20 | 4.71× |
    | Ref. [14] | – | 22.5 | 6.80 | 147 | 6.50 | 9.56× |
    | Ref. [25] | – | 16.7 | 8.90 | 112 | 6.70 | 9.85× |
    | Ref. [13] | – | 22.5 | 2.90 | 681 | 30.3 | 44.6× |
    | Ref. [23] | – | 14.1 | 6.45 | – | – | – |
    | This work | 305 | 14.0 | 8.40 | 119 | 8.50 | 12.5× |
  • [1] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [2] SUN Yu, WANG Shuohuan, LI Yukun, et al. Ernie 2.0: A continual pre-training framework for language understanding[C]. The 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 8968–8975.
    [3] LIU Yinhan, OTT M, GOYAL N, et al. Roberta: A robustly optimized BERT pretraining approach[EB/OL]. https://doi.org/10.48550/arXiv.1907.11692, 2019.
    [4] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, USA, 2018: 4171–4186.
    [5] YANG Zhilin, DAI Zihang, YANG Yiming, et al. XLNet: Generalized autoregressive pretraining for language understanding[C]. The 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 517.
    [6] ROSSET C. Turing-NLG: A 17-billion-parameter language model by microsoft[EB/OL]. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, 2020.
    [7] ZHANG Xiang, ZHAO Junbo, and LECUN Y. Character-level convolutional networks for text classification[C]. The 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015: 649–657.
    [8] LAI Siwei, XU Liheng, LIU Kang, et al. Recurrent convolutional neural networks for text classification[C]. Twenty-ninth AAAI Conference on Artificial Intelligence, Austin, USA, 2015: 2267–2273.
    [9] VOITA E, TALBOT D, MOISEEV F, et al. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned[C]. The 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019: 5797–5808.
    [10] LIN Zi, LIU J, YANG Zi, et al. Pruning redundant mappings in transformer models via spectral-normalized identity prior[C]. Findings of the Association for Computational Linguistics: EMNLP 2020, 2020: 719–730.
    [11] PENG Hongwu, HUANG Shaoyi, GENG Tong, et al. Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning[C]. 2021 22nd International Symposium on Quality Electronic Design (ISQED), Santa Clara, USA, 2021: 142–148.
    [12] QI Panjie, SONG Yuhong, PENG Hongwu, et al. Accommodating transformer onto FPGA: Coupling the balanced model compression and FPGA-implementation optimization[C]. 2021 on Great Lakes Symposium on VLSI, 2021: 163–168.
    [13] LI Bingbing, PANDEY S, FANG Haowen, et al. FTRANS: Energy-efficient acceleration of transformers using FPGA[C]. ACM/IEEE International Symposium on Low Power Electronics and Design, Boston, USA, 2020: 175–180.
    [14] ZHANG Xinyi, WU Yawen, ZHOU Peipei, et al. Algorithm-hardware co-design of attention mechanism on FPGA devices[J]. ACM Transactions on Embedded Computing Systems, 2021, 20(5s): 71. doi: 10.1145/3477002.
    [15] PARK J, YOON H, AHN D, et al. OPTIMUS: OPTImized matrix MUltiplication structure for transformer neural network accelerator[C]. Machine Learning and Systems, Austin, USA, 2020: 363–378.
    [16] DENG Chunhua, LIAO Siyu, and YUAN Bo. PermCNN: Energy-efficient convolutional neural network hardware architecture with permuted diagonal structure[J]. IEEE Transactions on Computers, 2021, 70(2): 163–173. doi: 10.1109/TC.2020.2981068.
    [17] WU Shuang, LI Guoqi, DENG Lei, et al. L1-norm batch normalization for efficient training of deep neural networks[J]. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(7): 2043–2051. doi: 10.1109/TNNLS.2018.2876179.
    [18] BARRETT R, BERRY M, CHAN T F, et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods[M]. Philadelphia: Society for Industrial and Applied Mathematics, 1994: 5–37.
    [19] LIU Weifeng and VINTER B. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication[C]. The 29th ACM on International Conference on Supercomputing, Newport Beach, USA, 2015: 339–350.
    [20] PINAR A and HEATH M T. Improving performance of sparse matrix-vector multiplication[C]. SC'99: Proceedings of 1999 ACM/IEEE Conference on Supercomputing, Portland, USA, 1999: 30.
    [21] ZACHARIADIS O, SATPUTE N, GÓMEZ-LUNA J, et al. Accelerating sparse matrix–matrix multiplication with GPU tensor cores[J]. Computers & Electrical Engineering, 2020, 88: 106848. doi: 10.1016/j.compeleceng.2020.106848.
    [22] KANNAN R. Efficient sparse matrix multiple-vector multiplication using a bitmapped format[C]. 20th Annual International Conference on High Performance Computing, Bengaluru, India, 2013: 286–294.
    [23] QI Panjie, SHA E H M, ZHUGE Q, et al. Accelerating framework of transformer by hardware design and model compression co-optimization[C]. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), Munich, Germany, 2021: 1–9.
    [24] WANG Hanrui, WU Zhanghao, LIU Zhijian, et al. HAT: Hardware-aware transformers for efficient natural language processing[C]. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 7675–7688.
    [25] LIU Zejian, LI Gang, and CHENG Jian. Hardware acceleration of fully quantized BERT for efficient natural language processing[C]. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 2021: 513–516.
    [26] LU Siyuan, WANG Meiqi, LIANG Shuang, et al. Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer[C]. 2020 IEEE 33rd International System-on-Chip Conference (SOCC), Las Vegas, USA, 2020: 84–89.
Publication history
  • Received: 2023-03-28
  • Revised: 2023-08-21
  • Published online: 2023-08-25
  • Published: 2024-03-27
