Advanced Search
Turn off MathJax
Article Contents
ZHENG Jiwen, ZHAO Shilei, ZHANG Ziyue, LIU Zhiwei, YU Bin, HUANG Hai. High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250687
Citation: ZHENG Jiwen, ZHAO Shilei, ZHANG Ziyue, LIU Zhiwei, YU Bin, HUANG Hai. High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250687

High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography

doi: 10.11999/JEIT250687 cstr: 32379.14.JEIT250687
Funds:  National Key R&D Program of China (2023YFB4403500)
  • Accepted Date: 2025-11-05
  • Rev Recd Date: 2025-11-05
  • Available Online: 2025-11-15
  •   Objective  With the advancement of Post-Quantum Cryptography (PQC) standardization, the efficient implementation of Number Theoretic Transform (NTT) modules has become crucial for enhancing the performance of PQC algorithms. While existing research on high-radix NTT primarily focuses on optimizing in-place computation structures and achieving configurability, it often overlooks performance bottlenecks caused by complex memory access patterns and lacks dedicated optimizations for the specific parameters of PQC algorithms. To address these challenges, this paper proposes a high area-efficiency radix-4 NTT design with a Constant-Geometry (CG) structure. The solution implements targeted optimizations for the modular multiplication unit by analyzing the common characteristics of moduli and integrating multi-level operations, while optimizing memory allocation and address generation strategies to effectively reduce memory capacity and improve data access efficiency. This paper provides an efficient solution for implementing radix-4 CG NTT in out-of-place storage, achieving conflict-free memory access.  Methods  At the algorithmic level, the proposed radix-4 CG NTT/INTT employs a low-complexity design and eliminates the bit-reversal operation to reduce multiplication count and computation cycles, with a correspondingly redesigned twiddle factor access mechanism. Regarding the most time-consuming modular multiplication in the radix-4 butterfly unit, the critical path is effectively shortened by integrating the multiplication with the first-stage ${\text{K}} - {\text{RED}}$ reduction and simplifying the correction logic. Furthermore, to support three parameter configurations, a scalable modular multiplication scheme is proposed by analyzing common characteristics of different moduli. At the architectural level, a strategy is employed where two coefficients are concatenated and stored at the same memory address. By designing a data decomposition and reorganization scheme, the interaction between the memory and the dual-butterfly units is efficiently managed. To efficiently achieve conflict-free memory access, a cyclic memory space reuse strategy is employed, and read and write address generation schemes based on sequential and stepped access patterns are designed, significantly reducing the required memory capacity and the complexity of control logic.  Results and Discussions  Experimental results on Field Programmable Gate Arrays (FPGAs) demonstrate that the proposed NTT architecture achieves high operating frequencies and low resource consumption under three parameter configurations, while also significantly improving the Area-Time Product (ATP) compared to existing designs (Table 5). For the configuration with 256 terms and a modulus of 7681, it consumes 2397 slices, 4 BRAMs, and 16 DSPs, achieving an operating frequency of 363 MHz, resulting in at least a 56.4% improvement in ATP. For the configuration with 256 terms and a modulus of 8380417, it consumes 3760 slices, 6 BRAMs, and 16 DSPs, achieving an operating frequency of 338 MHz, resulting in at least a 69.8% improvement in ATP. For the configuration with 1024 terms and a modulus of 12289, it consumes 2379 slices, 4 BRAMs, and 16 DSPs, achieving an operating frequency of 357 MHz, resulting in at least a 50.3% improvement in ATP.  Conclusions  This paper proposes a high area-efficiency radix-4 NTT hardware architecture for lattice-based PQC algorithms. By employing a low-complexity radix-4 CG NTT/INTT and eliminating the bit-reversal operation, the latency is reduced. By analyzing the common characteristics of three specific moduli and merging partial computations, a scalable modular multiplication architecture based on ${\text{K}}{}^2 - {\text{RED}}$ reduction is designed. This paper tackles the challenges of storage space doubling and address generation complexity by efficiently reusing memory and designing address generation schemes based on sequential and stepped access patterns. Experimental results demonstrate that the proposed scheme significantly improves operating frequency and reduces resource consumption, resulting in lower ATP under three parameter configurations. However, this study only considers a dual-butterfly unit architecture. Future research should further explore architectural designs with higher degrees of parallelism to meet performance requirements for various application scenarios.
  • loading
  • [1]
    SHOR P W. Algorithms for quantum computation: Discrete logarithms and factoring[C]. Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, USA, 1994: 124–134. doi: 10.1109/SFCS.1994.365700.
    [2]
    MOODY D. Post-quantum cryptography: NIST’s plan for the future[C]. The Seventh International Conference on Post-Quntum Cryptography, Fukuoka, Japan, 2016. (查阅网上资料, 未找到本条文献信息, 请确认).
    [3]
    GUO Wenbo and LI Shuguo. Split-radix based compact hardware architecture for CRYSTALS-Kyber[J]. IEEE Transactions on Computers, 2024, 73(1): 97–108. doi: 10.1109/TC.2023.3320040.
    [4]
    LI Guangyan, CHEN Donglong, MAO Gaoyu, et al. Algorithm-hardware co-design of split-radix discrete Galois transformation for KyberKEM[J]. IEEE Transactions on Emerging Topics in Computing, 2023, 11(4): 824–838. doi: 10.1109/TETC.2023.3270971.
    [5]
    ZHANG Neng, YANG Bohan, CHEN Chen, et al. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2020, 2020(2): 49–72. doi: 10.13154/tches.v2020.i2.49-72.
    [6]
    XING Yufei and LI Shuguo. A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-Kyber on FPGA[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021, 2021(2): 328–356. doi: 10.46586/tches.v2021.i2.328-356.
    [7]
    LIU Sihuang, KUO C Y, MO Yannan, et al. An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024, 32(3): 519–529. doi: 10.1109/TVLSI.2023.3336951.
    [8]
    CHEN Xiangren, YANG Bohan, YIN Shouyi, et al. CFNTT: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021, 2022(1): 94–126. doi: 10.46586/tches.v2022.i1.94-126.
    [9]
    MERT A C, KARABULUT E, ÖZTÜRK E, et al. An extensive study of flexible design methods for the number theoretic transform[J]. IEEE Transactions on Computers, 2022, 71(11): 2829–2843. doi: 10.1109/TC.2020.3017930.
    [10]
    ZHAO Yanrui, LIU Xiaofan, HU Yue, et al. Design of an efficient NTT/INTT architecture with low-complex memory mapping scheme[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 71(1): 400–404. doi: 10.1109/TCSII.2023.3296492.
    [11]
    付秋兴, 李伟, 别梦妮, 等. 格基后量子密码的可重构NTT运算单元与高效调度算法研究[J]. 电子学报, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788.

    FU Qiuxing, LI Wei, BIE Mengni, et al. Research on reconfigurable NTT arithmetic unit and efficient scheduling algorithm for lattice post-quantum cryptography[J]. Acta Electronica Sinica, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788.
    [12]
    BANERJEE U, UKYAB T S, and CHANDRAKASAN A P. Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2019, 2019(4): 17–61. doi: 10.13154/tches.v2019.i4.17-61.
    [13]
    SU Yang, YANG Bailong, YANG Chen, et al. A highly unified reconfigurable multicore architecture to speed up NTT/INTT for homomorphic polynomial multiplication[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(8): 993–1006. doi: 10.1109/TVLSI.2022.3166355.
    [14]
    WANG Jianfei, YANG Chen, MENG Yishuo, et al. A reconfigurable and area-efficient polynomial multiplier using a novel in-place constant-geometry NTT/INTT and conflict-free memory mapping scheme[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2025, 72(3): 1358–1371. doi: 10.1109/TCSI.2024.3483229.
    [15]
    SUN Junyan and BAI Xuefei. A high-speed hardware architecture of an NTT accelerator for CRYSTALS-Kyber[J]. Integrated Circuits and Systems, 2024, 1(2): 92–102. doi: 10.23919/ICS.2024.3419562.
    [16]
    GENG Yue, HU Xiao, LI Minghao, et al. Rethinking parallel memory access pattern in number theoretic transform design[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2023, 70(5): 1689–1693. doi: 10.1109/TCSII.2023.3260811.
    [17]
    ALHASSANI A and BENAISSA M. High-speed polynomials multiplication HW accelerator for CRYSTALS-Kyber[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2024, 71(12): 6105–6113. doi: 10.1109/TCSI.2024.3427011.
    [18]
    AIKATA A, MERT A C, IMRAN M, et al. KaLi: A crystal for post-quantum security using Kyber and Dilithium[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 747–758. doi: 10.1109/TCSI.2022.3219555.
    [19]
    HU Xiao, TIAN Jing, LI Minghao, et al. AC-PM: An area-efficient and configurable polynomial multiplier for lattice based cryptography[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 719–732. doi: 10.1109/TCSI.2022.3218192.
    [20]
    BISHEH-NIASAR M, AZARDERAKHSH R, and MOZAFFARI-KERMANI M. High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography[C]. 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 2021: 94–101. doi: 10.1109/ARITH51176.2021.00028.
    [21]
    LI Bin, YAN Yunfei, WEI Yuanxin, et al. Scalable and parallel optimization of the number theoretic transform based on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024, 32(2): 291–304. doi: 10.1109/TVLSI.2023.3312423.
    [22]
    MU Jianan, REN Yi, WANG Wen, et al. Scalable and conflict-free NTT hardware accelerator design: Methodology, proof, and implementation[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(5): 1504–1517. doi: 10.1109/TCAD.2022.3205552.
    [23]
    周清雷, 韩贺茹, 李斌, 等. 面向格密码的可配置基-4 NTT硬件优化与实现[J]. 通信学报, 2024, 45(10): 163–179. doi: 10.11959/j.issn.1000-436x.2024188.

    ZHOU Qinglei, HAN Heru, LI Bin, et al. Configurable radix-4 NTT hardware optimization and implementation for lattice-based cryptography[J]. Journal on Communications, 2024, 45(10): 163–179. doi: 10.11959/j.issn.1000-436x.2024188.
  • 加载中

Catalog

    通讯作者: 陈斌, bchen63@163.com
    • 1. 

      沈阳化工大学材料科学与工程学院 沈阳 110142

    1. 本站搜索
    2. 百度学术搜索
    3. 万方数据库搜索
    4. CNKI搜索

    Figures(5)  / Tables(5)

    Article Metrics

    Article views (24) PDF downloads(0) Cited by()
    Proportional views
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return