高级搜索

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

面向格密码的高面积效率基-4 NTT硬件架构与无访存冲突优化设计

郑集文 赵石磊 张子悦 刘志伟 于斌 黄海

郑集文, 赵石磊, 张子悦, 刘志伟, 于斌, 黄海. 面向格密码的高面积效率基-4 NTT硬件架构与无访存冲突优化设计[J]. 电子与信息学报. doi: 10.11999/JEIT250687
引用本文: 郑集文, 赵石磊, 张子悦, 刘志伟, 于斌, 黄海. 面向格密码的高面积效率基-4 NTT硬件架构与无访存冲突优化设计[J]. 电子与信息学报. doi: 10.11999/JEIT250687
ZHENG Jiwen, ZHAO Shilei, ZHANG Ziyue, LIU Zhiwei, YU Bin, HUANG Hai. High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250687
Citation: ZHENG Jiwen, ZHAO Shilei, ZHANG Ziyue, LIU Zhiwei, YU Bin, HUANG Hai. High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250687

面向格密码的高面积效率基-4 NTT硬件架构与无访存冲突优化设计

doi: 10.11999/JEIT250687 cstr: 32379.14.JEIT250687
基金项目: 国家重点研发计划“微纳电子技术”重点专项(2023YFB4403500)
详细信息
    作者简介:

    郑集文:男,博士生,研究方向为后量子密码算法硬件实现、高性能计算

    赵石磊:男,副教授,研究方向为信息安全、可重构计算、后量子密码芯片设计等

    张子悦:男,本科生,研究方向为后量子密码算法硬件实现

    刘志伟:男,副教授,研究方向为可重构计算、密码芯片的安全设计等

    于斌:男,副教授,研究方向为密码算法、密码芯片设计和数字集成电路设计等

    黄海:男,教授,研究方向为信息安全、可重构计算、集成电路设计等

    通讯作者:

    黄海 ic@hrbust.edu.cn

  • 中图分类号: TP309

High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography

Funds: National Key R&D Program of China (2023YFB4403500)
  • 摘要: 针对格基后量子密码(PQC)算法中基-2数论变换(NTT)计算效率较低以及原位计算内存访问模式复杂的问题,该文提出一种高面积效率的基-4 NTT硬件设计。首先,介绍了负包裹卷积方法的运算流程及适用条件,在此基础上提出了一种恒定几何(CG)结构的低计算复杂度基-4 NTT/INTT算法。其次,深入分析不同PQC算法中模数的共性特征,设计了基于K2-RED约简的可扩展模乘单元。最后,通过优化存储器与蝶形单元之间的数据分解与重组,提出一种基于顺序循环和阶梯循环访存的读写地址生成方案,实现了高效的无访存冲突。与传统的乒乓存储模式相比,该方案可减少12.5%的存储空间。实验结果表明,在(项数,模数位宽)分别为(256, 13)、(256, 23)和(1024, 14)的3种配置下,该设计的面积-时间积(ATP)较现有方案分别降低56.4%、69.8%和50.3%以上,具有更高的面积效率。
  • 图  1  整体架构

    图  2  基-4蝶形计算单元架构

    图  3  基于K2-RED约简的可扩展模乘硬件架构

    图  4  内存空间与存储数据分配机制

    图  5  64点数据读写过程

    表  1  基-4 CG NTT算法

     输入:$ \mathrm{n} $$a(x) \in {R_p}$,${\varphi _{2N}} \in {\mathbb{Z}_p}$,$\omega _4^1$
     输出:$A(x) = NTT(a(x))$
     (a)存储预计算的旋转因子
     1. for $s = 1$ to ${\log _4}N$ do
     2.  $l = N{\text{/}}{4^s}$
     3.  ${\omega _m} = \varphi _{2N}^l$
     4.  for $j = 0$ to ${4^{s - 1}} - 1$ do
     5.   ROM_1.append$ [\omega _m^{2{\text{bit}} - {\text{reversed(}}j) + 1}] $
     6.   ROM_2.append$[\omega _m^{2 \cdot (2{\text{bit}} - {\text{reversed(}}j) + 1)}]$
     7.   ROM_3.append$[\omega _m^{3 \cdot (2{\text{bit}} - {\text{reversed(}}j) + 1)}]$
     8.  end for
     9. end for
     (b)基-4 CG NTT运算
     1. for $s = 1$ to ${\log _4}N$ do
     2.  for $i = 0$ to $N{\text{/}}4 - 1$ do
     3.   ${r^{'}} = i{\rm{mod}} {4^{s - 1}}$
     4.   $r = {\text{convert(}}{r^{'}}) + ({4^{s - 1}} - 1)/3$
     5.   ${\omega _1} = {\text{ROM\_1[}}r{\text{]}}$
     6.   ${\omega _2} = {\text{ROM}}\_2[r]$
     7.   ${\omega _3} = {\text{ROM\_3[}}r{\text{]}}$
     8.   ${T_0} = (a[i] + a[i + 2N/4] \cdot {\omega _2}){\rm{mod}} p$
     9.   ${T_1} = (a[i] - a[i + 2N/4] \cdot {\omega _2}){\rm{mod}} p$
     10.   ${T_2} = (a[i + N/4] \cdot {\omega _1} + a[i + 3N/4] \cdot {\omega _3}){\rm{mod}} p$
     11.   ${T_3} = (a[i + N/4] \cdot {\omega _1} - a[i + 3N/4] \cdot {\omega _3}){\rm{mod}} p$
     12.   $A[4i] = ({T_0} + {T_2}){\rm{mod}} p$
     13.   $A[4i + 1] = ({T_1} + {T_3} \cdot \omega _4^1){\rm{mod}} p$
     14.   $A[4i + 2] = ({T_0} - {T_2}){\rm{mod}} p$
     15.   $A[4i + 3] = ({T_1} - {T_3} \cdot \omega _4^1){\rm{mod}} p$
     16. end for
     17.end for
    下载: 导出CSV

    表  2  基-4 CG INTT算法

     输入:$ \mathrm{n} $$A(x) \in {R_P}$,${\varphi _{2N}} \in {\mathbb{Z}_p}$,$\omega _4^1$
     输出:$a(x) = INTT(A(x))$
     1. for $s = {\log _4}N$ to 1 do
     2.  for $i = 0$ to $N{\text{/}}4 - 1$ do
     3.   ${i^{'}} = N{\text{/}}4 - 1 - i$
     4.   ${r^{'}} = {i^{'}}{\rm{mod}} {4^{s - 1}}$
     5.   $r = {\text{convert(}}{r^{'}}) + ({4^{s - 1}} - 1)/3$
     6.   ${\omega _1} = {\text{ROM\_1}}[r]$
     7.   ${\omega _2} = {\text{ROM\_2[}}r]$
     8.   ${\omega _3} = {\text{ROM\_3}}[r]$
     9.   ${T_0} = {2^{ - 1}} \cdot (A[4i] + A[4i + 2]){\rm{mod}} p$
     10.   ${T_1} = (A[4i + 2] - A[4i]) \cdot \omega _4^1{\rm{mod}} p$
     11.   ${T_2} = {2^{ - 1}} \cdot (A[4i + 1] + A[4i + 3]){\rm{mod}} p$
     12.   ${T_3} = {2^{ - 1}} \cdot (A[4i + 3] - A[4i + 1]){\rm{mod}} p$
     13.   $a[i] = {2^{ - 1}} \cdot ({T_0} + {T_2}){\rm{mod}} p$
     14.   $a[i + N/4] = {2^{ - 1}} \cdot ({T_1} + {T_3}) \cdot {\omega _1}{\rm{mod}} p$
     15.   $a[i + 2N/4] = {2^{ - 1}} \cdot ({T_2} - {T_0}) \cdot {\omega _2}{\rm{mod}} p$
     16.   $a[i + 3N/4] = {2^{ - 1}} \cdot ({T_3} - {T_1}) \cdot {\omega _3}{\rm{mod}} p$
     17. end for
     18.end for
    下载: 导出CSV

    表  3  K2-RED约简算法

     输入:$Z \in [0,{p^2})$, $p = k \cdot {2^m} + 1$, $n = \left\lfloor {{{\log }_2}p} \right\rfloor $
     输出:${Z^{''}} = {k^2} \cdot Z{\rm{mod}} p$
     1: ${Z_l} = Z[m - 1:0]$ $ {\mathrm{Z}}_{\mathrm{l}}=\mathrm{Z}\mathrm{ }\left(\mathrm{m}\mathrm{o}\mathrm{d}{2}^{\mathrm{m}}\right) $
     2: ${Z_h} = Z[2n - 1:m]$
     3: ${Z^{'}} = k{Z_l} - {Z_h}$
     4: ${Z_l}^{'} = {Z^{'}}[m - 1:0]$
     5: ${Z_h}^{'} = {Z^{'}}[2n - m - 1:m]$
     6: ${Z^{''}} = k{Z_l}^{'} - {Z_h}^{'}$
     7: return ${Z^{''}}$
    下载: 导出CSV

    表  4  格基PQC方案中模乘核心步骤的定量比较

    Kyber v1算法 Dilithium算法 Falcon算法
    $(m,k,n)$ (9,15,13) (13,1023,23) (12,3,14)
    两阶段K-RED约简 ${Z^{'}} = ({Z_l} \lt \lt 4) - {Z_l} - {Z_h}$${Z^{''}} = ({Z_l}^{'} \lt \lt 4) - {Z_l}^{'} - {Z_h}^{'}$ ${Z^{'}} = ({Z_l} \lt \lt 10) - {Z_l} - {Z_h}$
    ${Z^{''}} = ({Z_l}^{'} \lt \lt 10) - {Z_l}^{'} - {Z_h}^{'}$
    ${Z^{'}} = ({Z_l} \lt \lt 2) - {Z_l} - {Z_h}$
    ${Z^{''}} = ({Z_l}^{'} \lt \lt 2) - {Z_l}^{'} - {Z_h}^{'}$
    修正值 ${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 01$ 7681 8380417 12289
    ${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 10$① -7425 -7331841 -12273
    ${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 10$② 256 1048576 16
    ${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 11$
    下载: 导出CSV

    表  5  时间性能及资源消耗对比

    $(N,l)$ 方案 平台 周期 频率/MHz 时延/μs 资源消耗(LUT/FF/BRAM/DSP) ATP(×103)

    (256,13)
    文献[7] Virtex-7 151 329 0.5 2800 2500 20 24 5.6
    文献[9] Virtex-7 160 125 1.28 2371 - 12 24 10.71
    文献[22] Virtex-7 135 256 0.5 6245 1864 12 24 6.12
    文献[23] Virtex-7 91 195 0.47 14758 8745 24 64 13.33
    本文 Virtex-7 172 363 0.47 2397 1904 4 16 2.44

    (256,23)
    文献[9] Virtex-7 200 125 1.6 5071 - 12 56 22.83
    文献[21] UltraScale 156 220 0.71 11006 8791 12 24 12.07
    文献[23] Virtex-7 107 195 0.55 14758 8745 24 64 15.60
    本文 Virtex-7 172 338 0.51 3760 3331 6 16 3.65
    (1024,14) 文献[7] Virtex-7 649 307 2.1 2900 2600 20 24 23.73
    文献[8] Virtex-7 668 250 2.7 2953 1875 5.5 27 19.72
    文献[9] Virtex-7 680 125 5.4 2584 - 16 24 52.83
    文献[16] Virtex-7 650 292 2.22 3400 2300 12 16 19.09
    文献[22] Virtex-7 654 179 3.7 3472 1789 7.5 25 30.42
    文献[23] Virtex-7 343 225 1.52 12836 7213 28 64 42.01
    本文 Virtex-7 654 357 1.83 2379 2219 4 16 9.48
    下载: 导出CSV
  • [1] SHOR P W. Algorithms for quantum computation: Discrete logarithms and factoring[C]. Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, USA, 1994: 124–134. doi: 10.1109/SFCS.1994.365700.
    [2] MOODY D. Post-quantum cryptography: NIST’s plan for the future[C]. The Seventh International Conference on Post-Quntum Cryptography, Fukuoka, Japan, 2016. (查阅网上资料, 未找到本条文献信息, 请确认).
    [3] GUO Wenbo and LI Shuguo. Split-radix based compact hardware architecture for CRYSTALS-Kyber[J]. IEEE Transactions on Computers, 2024, 73(1): 97–108. doi: 10.1109/TC.2023.3320040.
    [4] LI Guangyan, CHEN Donglong, MAO Gaoyu, et al. Algorithm-hardware co-design of split-radix discrete Galois transformation for KyberKEM[J]. IEEE Transactions on Emerging Topics in Computing, 2023, 11(4): 824–838. doi: 10.1109/TETC.2023.3270971.
    [5] ZHANG Neng, YANG Bohan, CHEN Chen, et al. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2020, 2020(2): 49–72. doi: 10.13154/tches.v2020.i2.49-72.
    [6] XING Yufei and LI Shuguo. A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-Kyber on FPGA[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021, 2021(2): 328–356. doi: 10.46586/tches.v2021.i2.328-356.
    [7] LIU Sihuang, KUO C Y, MO Yannan, et al. An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024, 32(3): 519–529. doi: 10.1109/TVLSI.2023.3336951.
    [8] CHEN Xiangren, YANG Bohan, YIN Shouyi, et al. CFNTT: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021, 2022(1): 94–126. doi: 10.46586/tches.v2022.i1.94-126.
    [9] MERT A C, KARABULUT E, ÖZTÜRK E, et al. An extensive study of flexible design methods for the number theoretic transform[J]. IEEE Transactions on Computers, 2022, 71(11): 2829–2843. doi: 10.1109/TC.2020.3017930.
    [10] ZHAO Yanrui, LIU Xiaofan, HU Yue, et al. Design of an efficient NTT/INTT architecture with low-complex memory mapping scheme[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 71(1): 400–404. doi: 10.1109/TCSII.2023.3296492.
    [11] 付秋兴, 李伟, 别梦妮, 等. 格基后量子密码的可重构NTT运算单元与高效调度算法研究[J]. 电子学报, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788.

    FU Qiuxing, LI Wei, BIE Mengni, et al. Research on reconfigurable NTT arithmetic unit and efficient scheduling algorithm for lattice post-quantum cryptography[J]. Acta Electronica Sinica, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788.
    [12] BANERJEE U, UKYAB T S, and CHANDRAKASAN A P. Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2019, 2019(4): 17–61. doi: 10.13154/tches.v2019.i4.17-61.
    [13] SU Yang, YANG Bailong, YANG Chen, et al. A highly unified reconfigurable multicore architecture to speed up NTT/INTT for homomorphic polynomial multiplication[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(8): 993–1006. doi: 10.1109/TVLSI.2022.3166355.
    [14] WANG Jianfei, YANG Chen, MENG Yishuo, et al. A reconfigurable and area-efficient polynomial multiplier using a novel in-place constant-geometry NTT/INTT and conflict-free memory mapping scheme[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2025, 72(3): 1358–1371. doi: 10.1109/TCSI.2024.3483229.
    [15] SUN Junyan and BAI Xuefei. A high-speed hardware architecture of an NTT accelerator for CRYSTALS-Kyber[J]. Integrated Circuits and Systems, 2024, 1(2): 92–102. doi: 10.23919/ICS.2024.3419562.
    [16] GENG Yue, HU Xiao, LI Minghao, et al. Rethinking parallel memory access pattern in number theoretic transform design[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2023, 70(5): 1689–1693. doi: 10.1109/TCSII.2023.3260811.
    [17] ALHASSANI A and BENAISSA M. High-speed polynomials multiplication HW accelerator for CRYSTALS-Kyber[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2024, 71(12): 6105–6113. doi: 10.1109/TCSI.2024.3427011.
    [18] AIKATA A, MERT A C, IMRAN M, et al. KaLi: A crystal for post-quantum security using Kyber and Dilithium[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 747–758. doi: 10.1109/TCSI.2022.3219555.
    [19] HU Xiao, TIAN Jing, LI Minghao, et al. AC-PM: An area-efficient and configurable polynomial multiplier for lattice based cryptography[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 719–732. doi: 10.1109/TCSI.2022.3218192.
    [20] BISHEH-NIASAR M, AZARDERAKHSH R, and MOZAFFARI-KERMANI M. High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography[C]. 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 2021: 94–101. doi: 10.1109/ARITH51176.2021.00028.
    [21] LI Bin, YAN Yunfei, WEI Yuanxin, et al. Scalable and parallel optimization of the number theoretic transform based on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024, 32(2): 291–304. doi: 10.1109/TVLSI.2023.3312423.
    [22] MU Jianan, REN Yi, WANG Wen, et al. Scalable and conflict-free NTT hardware accelerator design: Methodology, proof, and implementation[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(5): 1504–1517. doi: 10.1109/TCAD.2022.3205552.
    [23] 周清雷, 韩贺茹, 李斌, 等. 面向格密码的可配置基-4 NTT硬件优化与实现[J]. 通信学报, 2024, 45(10): 163–179. doi: 10.11959/j.issn.1000-436x.2024188.

    ZHOU Qinglei, HAN Heru, LI Bin, et al. Configurable radix-4 NTT hardware optimization and implementation for lattice-based cryptography[J]. Journal on Communications, 2024, 45(10): 163–179. doi: 10.11959/j.issn.1000-436x.2024188.
  • 加载中
图(5) / 表(5)
计量
  • 文章访问数:  33
  • HTML全文浏览量:  19
  • PDF下载量:  3
  • 被引次数: 0
出版历程
  • 修回日期:  2025-11-05
  • 录用日期:  2025-11-05
  • 网络出版日期:  2025-11-15

目录

    /

    返回文章
    返回