Design of Transformer Accelerator with Regular Compression Model and Flexible Architecture
Abstract: Transformer models based on the attention mechanism deliver superior performance, and a dedicated Transformer accelerator can greatly improve inference performance and reduce inference power consumption. The complexity of the Transformer model is both quantitative and structural; the structural complexity causes a mismatch between irregular models and regular hardware, which reduces the efficiency of mapping the model onto the hardware. Current accelerator research focuses mainly on the quantitative complexity of the model, and little work addresses its structural complexity. This paper first proposes a regular compression model that reduces the structural complexity of the model, improves the match between model and hardware, and raises mapping efficiency. It then introduces a hardware-friendly model compression method that combines a regular offset-diagonal weight pruning scheme with simplified hardware quantization inference logic. In addition, an efficient and flexible hardware architecture is presented, comprising a block-based weight-stationary systolic array and a quasi-distributed storage architecture. This architecture maps the algorithm onto the computing array efficiently while achieving high data storage efficiency and reducing data movement. Experimental results show that the proposed approach achieves a compression rate of 93.75% with minimal performance loss, and that the accelerator implemented on an FPGA efficiently processes the compressed Transformer model, improving energy efficiency by 12.45 times over a Central Processing Unit (CPU) and 4.17 times over a Graphics Processing Unit (GPU).
Key words:
- Natural language processing
- Transformer
- Model compression
- Hardware accelerator
- Machine translation
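The abstract names the offset-diagonal pruning scheme without spelling out the pattern. The sketch below is one plausible reading, not the authors' implementation: it assumes every block x block sub-matrix keeps a single wrapped diagonal shifted by a per-tile offset. The function names and the `offsets` layout are hypothetical. Under this assumption a sub-matrix size of 4 removes 75% of the weights and a size of 8 removes 87.5%, which matches the pruning compression rates reported in Table 4.

```python
def offset_diagonal_mask(rows, cols, block, offsets):
    """Build a 0/1 mask that keeps one wrapped diagonal per block x block tile.

    offsets[i][j] is the diagonal offset (0..block-1) assumed for tile (i, j).
    """
    assert rows % block == 0 and cols % block == 0
    mask = [[0] * cols for _ in range(rows)]
    for ti in range(rows // block):
        for tj in range(cols // block):
            k = offsets[ti][tj]
            for r in range(block):
                c = (r + k) % block  # wrap around inside the tile
                mask[ti * block + r][tj * block + c] = 1
    return mask


def compression_rate(mask):
    """Fraction of weights removed by the mask."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


# An 8x8 weight matrix split into four 4x4 tiles with offsets 0, 1, 2, 3.
mask = offset_diagonal_mask(8, 8, 4, [[0, 1], [2, 3]])
for row in mask:
    print("".join("x" if v else "." for v in row))
print(compression_rate(mask))  # 0.75
```

Because every tile has exactly one nonzero per row and per column, the position of each kept weight is fully determined by one small offset per tile, which is what makes the pattern regular from the hardware's point of view.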
Algorithm 1 Unit offset-diagonal pruning
Input: model
Output: pruned model
    model.train
    GetOffset(model.weight(Q, K, V)); Prune(model.weight(Q, K, V))
    model.train
    GetOffset(model.weight(O)); Prune(model.weight(O))
    model.train
    GetOffset(model.weight(FFN1)); Prune(model.weight(FFN1))
    model.train
    GetOffset(model.weight(FFN2)); Prune(model.weight(FFN2))
    model.train

Table 1 Storage cost comparison of sparse matrix formats
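Algorithm 1 alternates training with pruning one weight group at a time (Q/K/V, then O, then FFN1, then FFN2) and ends with a final training pass. A minimal Python sketch of that control flow follows. Several parts are assumptions made for illustration rather than details from the paper: `train` is a placeholder, the dictionary layout of `model` is hypothetical, and `get_offset` picks the wrapped diagonal with the largest sum of absolute values in each tile, a criterion the source does not state.

```python
import random


def get_offset(weight, block):
    """Pick one diagonal offset per tile (assumed criterion: largest L1 norm)."""
    rows, cols = len(weight), len(weight[0])
    offsets = []
    for ti in range(rows // block):
        row_offsets = []
        for tj in range(cols // block):
            def diag_norm(k):
                return sum(
                    abs(weight[ti * block + r][tj * block + (r + k) % block])
                    for r in range(block)
                )
            row_offsets.append(max(range(block), key=diag_norm))
        offsets.append(row_offsets)
    return offsets


def prune(weight, block, offsets):
    """Zero, in place, every entry that is off the chosen diagonal of its tile."""
    for r, row in enumerate(weight):
        for c in range(len(row)):
            k = offsets[r // block][c // block]
            if c % block != (r % block + k) % block:
                row[c] = 0.0


def train(model):
    """Placeholder for a fine-tuning pass (a real one would keep pruned entries at zero)."""
    model["train_calls"] += 1


def offset_diagonal_pruning(model, block):
    """Train, then prune each weight group in the order given by Algorithm 1."""
    for group in ("QKV", "O", "FFN1", "FFN2"):
        train(model)
        for name in model["groups"][group]:
            w = model["weights"][name]
            prune(w, block, get_offset(w, block))
    train(model)  # final recovery pass
    return model


def rand_matrix(n):
    return [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


random.seed(0)
model = {
    "weights": {name: rand_matrix(8) for name in ("q", "k", "v", "o", "ffn1", "ffn2")},
    "groups": {"QKV": ["q", "k", "v"], "O": ["o"], "FFN1": ["ffn1"], "FFN2": ["ffn2"]},
    "train_calls": 0,
}
pruned = offset_diagonal_pruning(model, block=4)
kept = sum(1 for row in pruned["weights"]["q"] for v in row if v != 0.0)
print(kept, "of 64 weights kept in q;", pruned["train_calls"], "training passes")
```

Pruning one group per training pass gives the remaining dense groups a chance to compensate before they are pruned in turn, which is a common motivation for staged schedules of this kind.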
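Table 2 Transformer parameter settings

| Model | Transformer |
| --- | --- |
| Encoder/decoder layers | 6 |
| Attention heads | 8 |
| Word embedding dimension | 512 |
| Q, K, V dimension | 64 |
| Feed-forward network hidden dimension | 2048 |

Table 3 Experimental results of the Transformer model (base)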
| Dataset | Parameters (MB) | BLEU |
| --- | --- | --- |
| IWSLT-2014 (De-En) | 176 | 34.5 |
| IWSLT-2014 (En-De) | 176 | 28.5 |

Table 4 Pruning results of the Transformer model
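| Dataset | Parameters (MB) | Sub-matrix size | BLEU | Compression rate (%) | Performance loss (%) |
| --- | --- | --- | --- | --- | --- |
| IWSLT-2014 (De-En) | 44 | 4 | 34.31 | 75.0 | 0.55 |
| IWSLT-2014 (De-En) | 22 | 8 | 32.34 | 87.5 | 6.26 |
| IWSLT-2014 (En-De) | 44 | 4 | 28.07 | 75.0 | 1.50 |
| IWSLT-2014 (En-De) | 22 | 8 | 26.80 | 87.5 | 5.96 |

Table 5 Quantization results of the pruned Transformer model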
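| Dataset | Parameters (MB) | BLEU | Compression rate (%) | Performance loss (%) |
| --- | --- | --- | --- | --- |
| IWSLT-2014 (De-En) | 11.0 | 34.16 | 93.75 | 0.98 |
| IWSLT-2014 (De-En) | 5.5 | 32.14 | 96.87 | 6.84 |
| IWSLT-2014 (En-De) | 11.0 | 27.95 | 93.75 | 1.93 |
| IWSLT-2014 (En-De) | 5.5 | 26.58 | 96.87 | 6.74 |

Table 6 Comparison of algorithm results (%)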
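| Work | Model | Compression method | Dataset (task) | Compression rate | Performance loss |
| --- | --- | --- | --- | --- | --- |
| Ref. [13] | RoBERTa | Block-circulant matrix | IMDB (sentiment classification) | 93.75 | 4.30 |
| Ref. [24] | Transformer | Hardware-aware search | IWSLT-2014 (De-En translation) | 28.20 | 0 |
| Ref. [23] | Transformer | Hierarchical pruning | SST-2 (sentiment classification) | 90.00 | 2.37 |
| Ref. [14] | Transformer | Memory-aware structured pruning | Multi30K (De-En translation) | 95.00 | 0.77 |
| Ref. [25] | BERT | Fully quantized compression | SST-2 (sentiment classification) | 87.50 | 0.88 |
| Ref. [15] | Transformer | Irregular pruning | WMT-2015 (De-En translation) | 77.25 | 1.92 |
| This work | Transformer | Offset-diagonal structured pruning, hardware-friendly INT8 quantization | IWSLT-2014 (De-En translation) | 93.75 | 0.98 |

Table 7 Resource utilization report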
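| Resource | LUT | FF | BRAM | DSP |
| --- | --- | --- | --- | --- |
| Available | 218 600 | 437 200 | 545 | 900 |
| Used | 152 425 | 187 493 | 262 | 576 |
| Utilization (%) | 69.73 | 42.88 | 48.07 | 64.00 |

Table 8 Comparison of computation time with other computing platforms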
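The utilization row of Table 7 is simply used resources divided by available resources. The short check below recomputes it from the two count rows; the variable and function names are illustrative only.

```python
# Resource counts copied from Table 7.
available = {"LUT": 218600, "FF": 437200, "BRAM": 545, "DSP": 900}
used = {"LUT": 152425, "FF": 187493, "BRAM": 262, "DSP": 576}


def utilization(used, available):
    """Return the percentage of each resource type that is in use."""
    return {name: round(100.0 * used[name] / available[name], 2) for name in available}


print(utilization(used, available))
# {'LUT': 69.73, 'FF': 42.88, 'BRAM': 48.07, 'DSP': 64.0}
```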
Table 9 Overall hardware performance comparison of accelerators
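The compression rates and parameter sizes in Tables 4 and 5 can be reproduced with simple arithmetic. The sketch below rests on two assumptions: pruning keeps one weight in every row of each b x b sub-matrix, and quantization reduces 32-bit weights to INT8 (the 4x size reduction between Tables 4 and 5 and the INT8 entry in Table 6 suggest this). The function names are illustrative only.

```python
def pruning_rate(block):
    """Fraction of weights removed when 1 of every `block` weights is kept (assumption)."""
    return 1.0 - 1.0 / block


def combined_rate(block, bits_before=32, bits_after=8):
    """Compression from pruning followed by quantization (assumed FP32 to INT8)."""
    return 1.0 - (1.0 / block) * (bits_after / bits_before)


def relative_loss(baseline_bleu, bleu):
    """Performance loss in percent, relative to the uncompressed model's BLEU."""
    return 100.0 * (baseline_bleu - bleu) / baseline_bleu


BASE_MB = 176  # parameter size of the base model in Table 3

for block in (4, 8):
    p, c = pruning_rate(block), combined_rate(block)
    print(
        f"sub-matrix size {block}: pruning {100 * p:.2f}% -> {BASE_MB * (1 - p):.1f} MB, "
        f"pruning + INT8 {100 * c:.3f}% -> {BASE_MB * (1 - c):.1f} MB"
    )

# De-En loss after pruning with sub-matrix size 4 (Table 4).
print(f"{relative_loss(34.5, 34.31):.2f}")  # 0.55
```

The recomputed values agree with the tables to within rounding: 75% and 87.5% for pruning alone, 93.75% and 96.875% after quantization, with parameter sizes of 44, 22, 11 and 5.5 MB.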
