面向格密码的高面积效率基-4 NTT硬件架构与无访存冲突优化设计

郑集文; 赵石磊; 张子悦; 刘志伟; 于斌; 黄海

doi:10.11999/JEIT250687

面向格密码的高面积效率基-4 NTT硬件架构与无访存冲突优化设计

doi: 10.11999/JEIT250687 cstr: 32379.14.JEIT250687

哈尔滨理工大学计算机科学与技术学院，哈尔滨，150080

基金项目: 国家重点研发计划“微纳电子技术”重点专项(2023YFB4403500)

详细信息

作者简介:
郑集文：男，博士生，研究方向为后量子密码算法硬件实现、高性能计算

赵石磊：男，副教授，研究方向为信息安全、可重构计算、后量子密码芯片设计等

张子悦：男，本科生，研究方向为后量子密码算法硬件实现

刘志伟：男，副教授，研究方向为可重构计算、密码芯片的安全设计等

于斌：男，副教授，研究方向为密码算法、密码芯片设计和数字集成电路设计等

黄海：男，教授，研究方向为信息安全、可重构计算、集成电路设计等

通讯作者:
黄海　ic@hrbust.edu.cn

中图分类号: TP309
计量
- 文章访问数: 33
- HTML全文浏览量: 19
- PDF下载量: 3
- 被引次数: 0
出版历程
- 修回日期: 2025-11-05
- 录用日期: 2025-11-05
- 网络出版日期: 2025-11-15

High Area-efficiency Radix-4 NTT Hardware Architecture with Conflict-free Memory Access Optimization for Lattice-based Cryptography

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China

Funds: National Key R&D Program of China (2023YFB4403500)

摘要

摘要: 针对格基后量子密码(PQC)算法中基-2数论变换(NTT)计算效率较低以及原位计算内存访问模式复杂的问题，该文提出一种高面积效率的基-4 NTT硬件设计。首先，介绍了负包裹卷积方法的运算流程及适用条件，在此基础上提出了一种恒定几何(CG)结构的低计算复杂度基-4 NTT/INTT算法。其次，深入分析不同PQC算法中模数的共性特征，设计了基于K²-RED约简的可扩展模乘单元。最后，通过优化存储器与蝶形单元之间的数据分解与重组，提出一种基于顺序循环和阶梯循环访存的读写地址生成方案，实现了高效的无访存冲突。与传统的乒乓存储模式相比，该方案可减少12.5%的存储空间。实验结果表明，在(项数,模数位宽)分别为(256, 13)、(256, 23)和(1024, 14)的3种配置下，该设计的面积-时间积(ATP)较现有方案分别降低56.4%、69.8%和50.3%以上，具有更高的面积效率。
- 格基后量子密码 /
- 数论变换 /
- 恒定几何结构 /
- K²-RED约简 /
- 无访存冲突
Abstract: Objective With the advancement of Post-Quantum Cryptography (PQC) standardization, the efficient implementation of Number Theoretic Transform (NTT) modules has become crucial for enhancing the performance of PQC algorithms. While existing research on high-radix NTT primarily focuses on optimizing in-place computation structures and achieving configurability, it often overlooks performance bottlenecks caused by complex memory access patterns and lacks dedicated optimizations for the specific parameters of PQC algorithms. To address these challenges, this paper proposes a high area-efficiency radix-4 NTT design with a Constant-Geometry (CG) structure. The solution implements targeted optimizations for the modular multiplication unit by analyzing the common characteristics of moduli and integrating multi-level operations, while optimizing memory allocation and address generation strategies to effectively reduce memory capacity and improve data access efficiency. This paper provides an efficient solution for implementing radix-4 CG NTT in out-of-place storage, achieving conflict-free memory access. Methods At the algorithmic level, the proposed radix-4 CG NTT/INTT employs a low-complexity design and eliminates the bit-reversal operation to reduce multiplication count and computation cycles, with a correspondingly redesigned twiddle factor access mechanism. Regarding the most time-consuming modular multiplication in the radix-4 butterfly unit, the critical path is effectively shortened by integrating the multiplication with the first-stage ${\text{K}} - {\text{RED}}$ reduction and simplifying the correction logic. Furthermore, to support three parameter configurations, a scalable modular multiplication scheme is proposed by analyzing common characteristics of different moduli. At the architectural level, a strategy is employed where two coefficients are concatenated and stored at the same memory address. By designing a data decomposition and reorganization scheme, the interaction between the memory and the dual-butterfly units is efficiently managed. To efficiently achieve conflict-free memory access, a cyclic memory space reuse strategy is employed, and read and write address generation schemes based on sequential and stepped access patterns are designed, significantly reducing the required memory capacity and the complexity of control logic. Results and Discussions Experimental results on Field Programmable Gate Arrays (FPGAs) demonstrate that the proposed NTT architecture achieves high operating frequencies and low resource consumption under three parameter configurations, while also significantly improving the Area-Time Product (ATP) compared to existing designs (Table 5). For the configuration with 256 terms and a modulus of 7681, it consumes 2397 slices, 4 BRAMs, and 16 DSPs, achieving an operating frequency of 363 MHz, resulting in at least a 56.4% improvement in ATP. For the configuration with 256 terms and a modulus of 8380417, it consumes 3760 slices, 6 BRAMs, and 16 DSPs, achieving an operating frequency of 338 MHz, resulting in at least a 69.8% improvement in ATP. For the configuration with 1024 terms and a modulus of 12289, it consumes 2379 slices, 4 BRAMs, and 16 DSPs, achieving an operating frequency of 357 MHz, resulting in at least a 50.3% improvement in ATP. Conclusions This paper proposes a high area-efficiency radix-4 NTT hardware architecture for lattice-based PQC algorithms. By employing a low-complexity radix-4 CG NTT/INTT and eliminating the bit-reversal operation, the latency is reduced. By analyzing the common characteristics of three specific moduli and merging partial computations, a scalable modular multiplication architecture based on ${\text{K}}{}^2 - {\text{RED}}$ reduction is designed. This paper tackles the challenges of storage space doubling and address generation complexity by efficiently reusing memory and designing address generation schemes based on sequential and stepped access patterns. Experimental results demonstrate that the proposed scheme significantly improves operating frequency and reduces resource consumption, resulting in lower ATP under three parameter configurations. However, this study only considers a dual-butterfly unit architecture. Future research should further explore architectural designs with higher degrees of parallelism to meet performance requirements for various application scenarios.

HTML全文

图 1 整体架构

下载: 全尺寸图片幻灯片

图 2 基-4蝶形计算单元架构

下载: 全尺寸图片幻灯片

图 3 基于K²-RED约简的可扩展模乘硬件架构

下载: 全尺寸图片幻灯片

图 4 内存空间与存储数据分配机制

下载: 全尺寸图片幻灯片

图 5 64点数据读写过程

下载: 全尺寸图片幻灯片

表 1 基-4 CG NTT算法

输入：$ \mathrm{n} $$a(x) \in {R_p}$,${\varphi _{2N}} \in {\mathbb{Z}_p}$,$\omega _4^1$
输出：$A(x) = NTT(a(x))$
(a)存储预计算的旋转因子
1. for $s = 1$ to ${\log _4}N$ do
2. 　$l = N{\text{/}}{4^s}$
3. 　${\omega _m} = \varphi _{2N}^l$
4. 　for $j = 0$ to ${4^{s - 1}} - 1$ do
5. 　　ROM_1.append$ [\omega _m^{2{\text{bit}} - {\text{reversed(}}j) + 1}] $
6. 　　ROM_2.append$[\omega _m^{2 \cdot (2{\text{bit}} - {\text{reversed(}}j) + 1)}]$
7. 　　ROM_3.append$[\omega _m^{3 \cdot (2{\text{bit}} - {\text{reversed(}}j) + 1)}]$
8. 　end for
9. end for
(b)基-4 CG NTT运算
1. for $s = 1$ to ${\log _4}N$ do
2. 　for $i = 0$ to $N{\text{/}}4 - 1$ do
3. 　　${r^{'}} = i{\rm{mod}} {4^{s - 1}}$
4. 　　$r = {\text{convert(}}{r^{'}}) + ({4^{s - 1}} - 1)/3$
5. 　　${\omega _1} = {\text{ROM\_1[}}r{\text{]}}$
6. 　　${\omega _2} = {\text{ROM}}\_2[r]$
7. 　　${\omega _3} = {\text{ROM\_3[}}r{\text{]}}$
8. 　　${T_0} = (a[i] + a[i + 2N/4] \cdot {\omega _2}){\rm{mod}} p$
9. 　　${T_1} = (a[i] - a[i + 2N/4] \cdot {\omega _2}){\rm{mod}} p$
10. 　 ${T_2} = (a[i + N/4] \cdot {\omega _1} + a[i + 3N/4] \cdot {\omega _3}){\rm{mod}} p$
11. 　 ${T_3} = (a[i + N/4] \cdot {\omega _1} - a[i + 3N/4] \cdot {\omega _3}){\rm{mod}} p$
12. 　 $A[4i] = ({T_0} + {T_2}){\rm{mod}} p$
13. 　 $A[4i + 1] = ({T_1} + {T_3} \cdot \omega _4^1){\rm{mod}} p$
14. 　 $A[4i + 2] = ({T_0} - {T_2}){\rm{mod}} p$
15. 　 $A[4i + 3] = ({T_1} - {T_3} \cdot \omega _4^1){\rm{mod}} p$
16. end for
17.end for

下载: 导出CSV

表 2 基-4 CG INTT算法

输入：$ \mathrm{n} $$A(x) \in {R_P}$,${\varphi _{2N}} \in {\mathbb{Z}_p}$,$\omega _4^1$
输出：$a(x) = INTT(A(x))$
1. for $s = {\log _4}N$ to 1 do
2. 　for $i = 0$ to $N{\text{/}}4 - 1$ do
3. 　　${i^{'}} = N{\text{/}}4 - 1 - i$
4. 　　${r^{'}} = {i^{'}}{\rm{mod}} {4^{s - 1}}$
5. 　　$r = {\text{convert(}}{r^{'}}) + ({4^{s - 1}} - 1)/3$
6. 　　${\omega _1} = {\text{ROM\_1}}[r]$
7. 　　${\omega _2} = {\text{ROM\_2[}}r]$
8. 　　${\omega _3} = {\text{ROM\_3}}[r]$
9. 　　${T_0} = {2^{ - 1}} \cdot (A[4i] + A[4i + 2]){\rm{mod}} p$
10. 　 ${T_1} = (A[4i + 2] - A[4i]) \cdot \omega _4^1{\rm{mod}} p$
11. 　 ${T_2} = {2^{ - 1}} \cdot (A[4i + 1] + A[4i + 3]){\rm{mod}} p$
12. 　 ${T_3} = {2^{ - 1}} \cdot (A[4i + 3] - A[4i + 1]){\rm{mod}} p$
13. 　 $a[i] = {2^{ - 1}} \cdot ({T_0} + {T_2}){\rm{mod}} p$
14. 　 $a[i + N/4] = {2^{ - 1}} \cdot ({T_1} + {T_3}) \cdot {\omega _1}{\rm{mod}} p$
15. 　 $a[i + 2N/4] = {2^{ - 1}} \cdot ({T_2} - {T_0}) \cdot {\omega _2}{\rm{mod}} p$
16. 　 $a[i + 3N/4] = {2^{ - 1}} \cdot ({T_3} - {T_1}) \cdot {\omega _3}{\rm{mod}} p$
17. end for
18.end for

下载: 导出CSV

表 3 K²-RED约简算法

输入：$Z \in [0,{p^2})$, $p = k \cdot {2^m} + 1$, $n = \left\lfloor {{{\log }_2}p} \right\rfloor $
输出：${Z^{''}} = {k^2} \cdot Z{\rm{mod}} p$
1: ${Z_l} = Z[m - 1:0]$ $ {\mathrm{Z}}_{\mathrm{l}}=\mathrm{Z}\mathrm{ }\left(\mathrm{m}\mathrm{o}\mathrm{d}{2}^{\mathrm{m}}\right) $
2: ${Z_h} = Z[2n - 1:m]$
3: ${Z^{'}} = k{Z_l} - {Z_h}$
4: ${Z_l}^{'} = {Z^{'}}[m - 1:0]$
5: ${Z_h}^{'} = {Z^{'}}[2n - m - 1:m]$
6: ${Z^{''}} = k{Z_l}^{'} - {Z_h}^{'}$
7: return ${Z^{''}}$

下载: 导出CSV

表 4 格基PQC方案中模乘核心步骤的定量比较

		Kyber v1算法	Dilithium算法	Falcon算法
$(m,k,n)$		(9,15,13)	(13,1023,23)	(12,3,14)
两阶段K-RED约简		${Z^{'}} = ({Z_l} \lt \lt 4) - {Z_l} - {Z_h}$${Z^{''}} = ({Z_l}^{'} \lt \lt 4) - {Z_l}^{'} - {Z_h}^{'}$	${Z^{'}} = ({Z_l} \lt \lt 10) - {Z_l} - {Z_h}$ ${Z^{''}} = ({Z_l}^{'} \lt \lt 10) - {Z_l}^{'} - {Z_h}^{'}$	${Z^{'}} = ({Z_l} \lt \lt 2) - {Z_l} - {Z_h}$ ${Z^{''}} = ({Z_l}^{'} \lt \lt 2) - {Z_l}^{'} - {Z_h}^{'}$
修正值	${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 01$	7681	8380417	12289
	${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 10$①	-7425	-7331841	-12273
	${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 10$②	256	1048576	16
	${\text{Sig}}{{\text{n}}_1}{\text{Sig}}{{\text{n}}_2} = 11$	256	1048576	16

下载: 导出CSV

表 5 时间性能及资源消耗对比

$(N,l)$	方案	平台	周期	频率/MHz	时延/μs	资源消耗(LUT/FF/BRAM/DSP)				ATP(×10³)
(256,13)	文献[7]	Virtex-7	151	329	0.5	2800	2500	20	24	5.6
	文献[9]	Virtex-7	160	125	1.28	2371	-	12	24	10.71
	文献[22]	Virtex-7	135	256	0.5	6245	1864	12	24	6.12
	文献[23]	Virtex-7	91	195	0.47	14758	8745	24	64	13.33
	本文	Virtex-7	172	363	0.47	2397	1904	4	16	2.44
(256,23)	文献[9]	Virtex-7	200	125	1.6	5071	-	12	56	22.83
	文献[21]	UltraScale	156	220	0.71	11006	8791	12	24	12.07
	文献[23]	Virtex-7	107	195	0.55	14758	8745	24	64	15.60
	本文	Virtex-7	172	338	0.51	3760	3331	6	16	3.65
(1024,14)	文献[7]	Virtex-7	649	307	2.1	2900	2600	20	24	23.73
	文献[8]	Virtex-7	668	250	2.7	2953	1875	5.5	27	19.72
	文献[9]	Virtex-7	680	125	5.4	2584	-	16	24	52.83
	文献[16]	Virtex-7	650	292	2.22	3400	2300	12	16	19.09
	文献[22]	Virtex-7	654	179	3.7	3472	1789	7.5	25	30.42
	文献[23]	Virtex-7	343	225	1.52	12836	7213	28	64	42.01
	本文	Virtex-7	654	357	1.83	2379	2219	4	16	9.48

下载: 导出CSV

参考文献(23)

[1]	SHOR P W. Algorithms for quantum computation: Discrete logarithms and factoring[C]. Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, USA, 1994: 124–134. doi: 10.1109/SFCS.1994.365700.
[2]	MOODY D. Post-quantum cryptography: NIST’s plan for the future[C]. The Seventh International Conference on Post-Quntum Cryptography, Fukuoka, Japan, 2016. (查阅网上资料, 未找到本条文献信息, 请确认).
[3]	GUO Wenbo and LI Shuguo. Split-radix based compact hardware architecture for CRYSTALS-Kyber[J]. IEEE Transactions on Computers, 2024, 73(1): 97–108. doi: 10.1109/TC.2023.3320040.
[4]	LI Guangyan, CHEN Donglong, MAO Gaoyu, et al. Algorithm-hardware co-design of split-radix discrete Galois transformation for KyberKEM[J]. IEEE Transactions on Emerging Topics in Computing, 2023, 11(4): 824–838. doi: 10.1109/TETC.2023.3270971.
[5]	ZHANG Neng, YANG Bohan, CHEN Chen, et al. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2020, 2020(2): 49–72. doi: 10.13154/tches.v2020.i2.49-72.
[6]	XING Yufei and LI Shuguo. A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-Kyber on FPGA[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021, 2021(2): 328–356. doi: 10.46586/tches.v2021.i2.328-356.
[7]	LIU Sihuang, KUO C Y, MO Yannan, et al. An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024, 32(3): 519–529. doi: 10.1109/TVLSI.2023.3336951.
[8]	CHEN Xiangren, YANG Bohan, YIN Shouyi, et al. CFNTT: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021, 2022(1): 94–126. doi: 10.46586/tches.v2022.i1.94-126.
[9]	MERT A C, KARABULUT E, ÖZTÜRK E, et al. An extensive study of flexible design methods for the number theoretic transform[J]. IEEE Transactions on Computers, 2022, 71(11): 2829–2843. doi: 10.1109/TC.2020.3017930.
[10]	ZHAO Yanrui, LIU Xiaofan, HU Yue, et al. Design of an efficient NTT/INTT architecture with low-complex memory mapping scheme[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 71(1): 400–404. doi: 10.1109/TCSII.2023.3296492.
[11]	付秋兴, 李伟, 别梦妮, 等. 格基后量子密码的可重构NTT运算单元与高效调度算法研究[J]. 电子学报, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788. FU Qiuxing, LI Wei, BIE Mengni, et al. Research on reconfigurable NTT arithmetic unit and efficient scheduling algorithm for lattice post-quantum cryptography[J]. Acta Electronica Sinica, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788.
[12]	BANERJEE U, UKYAB T S, and CHANDRAKASAN A P. Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2019, 2019(4): 17–61. doi: 10.13154/tches.v2019.i4.17-61.
[13]	SU Yang, YANG Bailong, YANG Chen, et al. A highly unified reconfigurable multicore architecture to speed up NTT/INTT for homomorphic polynomial multiplication[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(8): 993–1006. doi: 10.1109/TVLSI.2022.3166355.
[14]	WANG Jianfei, YANG Chen, MENG Yishuo, et al. A reconfigurable and area-efficient polynomial multiplier using a novel in-place constant-geometry NTT/INTT and conflict-free memory mapping scheme[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2025, 72(3): 1358–1371. doi: 10.1109/TCSI.2024.3483229.
[15]	SUN Junyan and BAI Xuefei. A high-speed hardware architecture of an NTT accelerator for CRYSTALS-Kyber[J]. Integrated Circuits and Systems, 2024, 1(2): 92–102. doi: 10.23919/ICS.2024.3419562.
[16]	GENG Yue, HU Xiao, LI Minghao, et al. Rethinking parallel memory access pattern in number theoretic transform design[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2023, 70(5): 1689–1693. doi: 10.1109/TCSII.2023.3260811.
[17]	ALHASSANI A and BENAISSA M. High-speed polynomials multiplication HW accelerator for CRYSTALS-Kyber[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2024, 71(12): 6105–6113. doi: 10.1109/TCSI.2024.3427011.
[18]	AIKATA A, MERT A C, IMRAN M, et al. KaLi: A crystal for post-quantum security using Kyber and Dilithium[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 747–758. doi: 10.1109/TCSI.2022.3219555.
[19]	HU Xiao, TIAN Jing, LI Minghao, et al. AC-PM: An area-efficient and configurable polynomial multiplier for lattice based cryptography[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 719–732. doi: 10.1109/TCSI.2022.3218192.
[20]	BISHEH-NIASAR M, AZARDERAKHSH R, and MOZAFFARI-KERMANI M. High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography[C]. 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 2021: 94–101. doi: 10.1109/ARITH51176.2021.00028.
[21]	LI Bin, YAN Yunfei, WEI Yuanxin, et al. Scalable and parallel optimization of the number theoretic transform based on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024, 32(2): 291–304. doi: 10.1109/TVLSI.2023.3312423.
[22]	MU Jianan, REN Yi, WANG Wen, et al. Scalable and conflict-free NTT hardware accelerator design: Methodology, proof, and implementation[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(5): 1504–1517. doi: 10.1109/TCAD.2022.3205552.
[23]	周清雷, 韩贺茹, 李斌, 等. 面向格密码的可配置基-4 NTT硬件优化与实现[J]. 通信学报, 2024, 45(10): 163–179. doi: 10.11959/j.issn.1000-436x.2024188. ZHOU Qinglei, HAN Heru, LI Bin, et al. Configurable radix-4 NTT hardware optimization and implementation for lattice-based cryptography[J]. Journal on Communications, 2024, 45(10): 163–179. doi: 10.11959/j.issn.1000-436x.2024188.