Research on the Architecture of Dual-field Reconfigurable Polynomial Multiplication Unit for Lattice-based Post-quantum Cryptography
-
摘要: 多项式乘法作为格密码算法的核心运算,通过快速数论变换(NTT)/快速傅里叶变换(FFT)设计成多项式乘法架构能够降低计算复杂度,提升格密码算法的运算速度。为支持多参数的多项式乘法运算,并提升多项式乘法的运算速度,该文提出一种双域可重构多项式乘法运算单元架构。首先基于Kyber, Dilithium和Falcon算法中的参数特征,提取多项式乘法的运算网络,并对内部双域乘法运算在算法层进行优化设计。其次,基于多项式乘法运算网络设计出双域可重构多项式乘法运算单元架构,并对双域可重构乘法单元进行优化设计以提升运算速度。最后,为提高运算单元架构的资源利用率,该文对多项式乘法运算单元架构进行并行度分析,当支持1路64 bit、2路32 bit或4路16 bit的运算时,多项式乘法运算架构的面积效率最高。该文在Xilinx FPGA xc7v2000tflg1925上进行实验验证,能够支持Kyber, Dilithium和Falcon算法中所有多项式乘法运算,工作频率达到169 MHz,面积时间积降低了50%以上。Abstract:
Objective Polynomial multiplication accounts for more than 80% of the computational time in lattice cryptography algorithms. The Number Theoretic Transform (NTT) and Fast Fourier Transform (FFT) reduce the computational complexity of polynomial multiplication from exponential to logarithmic order. However, mainstream lattice cryptography algorithms, including Kyber, Dilithium, and Falcon, differ considerably in their parameter sets and polynomial multiplication implementations. To support polynomial multiplication under multiple parameter configurations and improve resource utilization, a dual-field reconfigurable polynomial multiplication unit architecture is proposed. Methods First, the computational network for polynomial multiplication is extracted according to the parameter characteristics of Kyber, Dilithium, and Falcon. The internal dual-field multiplication operations are optimized at the algorithm level. Next, a dual-field reconfigurable polynomial multiplication unit architecture is designed for the polynomial multiplication network. The dual-field reconfigurable multiplication unit is further optimized to improve computational speed. Finally, a parallelism analysis is conducted to improve resource utilization of the computational architecture. The proposed architecture achieves the highest area efficiency when supporting 1-lane 64 bit, 2-lane 32 bit, or 4-lane 16 bit operations. Results and Discussions The architecture is experimentally validated on the Xilinx FPGA XC7V2000TFLG1925. It simultaneously supports one channel of complex-form floating-point operations or two channels of 17$ \sim $32 bit internal NTT operations and four channels of 16 bit internal NTT operations. At an operating frequency of 169 MHz, the architecture reduces the area-time product by more than 50%. Conclusions The proposed dual-field reconfigurable processing unit architecture provides advantages in scalability, area efficiency, and core unit performance. Its configurable bit-width design adapts more easily to traditional cryptographic processors and provides a practical approach for migrating conventional public-key cryptosystems to post-quantum cryptography. -
表 1 格基后量子密码多项式乘法参数分析
算法 系数类型 位宽(bit) 项数n 模数q Kyber 整数 12 256 3 329 ≡1modnDilithium 整数 23 256 8380417 ≡1mod2nFalcon 整数 14 512/ 1024 12289 ≡1mod2n浮点数形式的复数 64 512/ 1024 - 表 2 多项式乘法运算提取
Kyber Dilithium/Falcon 变换 $ \left.\begin{array}{c}{a}^{\prime}=a+b\cdot \omega \\{b}^{\prime}=a-b\cdot \omega \end{array}\right\}\Rightarrow \begin{array}{c}a+b\cdot c\\a-b\cdot c\end{array} $ $ \left.\begin{array}{c}{a}^{\prime}=a+b\cdot \omega \\{b}^{\prime}=a-b\cdot \omega \end{array}\right\}\Rightarrow \begin{array}{c}a+b\cdot c\\a-b\cdot c\end{array} $ 点乘 $ \left.\begin{array}{c}{\hat{c}}_{2i}=({\hat{a}}_{2i}\cdot {\hat{b}}_{2i})+({\hat{a}}_{2i+1}\cdot {\hat{b}}_{2i+1})\cdot \omega \\{\hat{c}}_{2i+1}=({\hat{a}}_{2i}\cdot {\hat{b}}_{2i+1})+{\hat{a}}_{2i+1}\cdot {\hat{b}}_{2i}\end{array}\right\}\Rightarrow \begin{array}{c}b\cdot c\\a+d\cdot e\end{array} $ $ {c}_{i}={a}_{i}\cdot {b}_{i}\Rightarrow b\cdot c $ 逆变换 $ \left.\begin{array}{c}{a}^{\prime}=(a+b)/2\\{b}^{\prime}=\omega \cdot (a-b)/2\end{array}\right\}\Rightarrow \begin{array}{c}(a+b)/2\\\omega \cdot (a-b)/2\end{array} $ $ \left.\begin{array}{c}{a}^{\prime}=(a+b)/2\\{b}^{\prime}=\omega \cdot (a-b)/2\end{array}\right\}\Rightarrow \begin{array}{c}(a+b)/2\\\omega \cdot (a-b)/2\end{array} $ 算法1 Montgomery模乘 算法2 Barrett模乘 算法3 K-RED模乘 输入:$ A,B,R,P,{P}^{\prime} $ 输入:$ a,b,m=\left\lfloor {2}^{k}/q\right\rfloor ;k\approx 2\left\lceil \log_2 q\right\rceil $ 输入:$ a,b,q=k\times {2}^{m}+1 $ 输出:$ u=A\times B\times {R}^{-1}(\mathrm{mod}P) $ 输出:$ O=a\times b(\mathrm{mod}q) $ 输出:$ {C}^{\prime}=k\times a\times b\mathrm{mod}q $ 1: $ t=A\times B $ 1: $ O=a\times b $ 1: $ C=a\times b $ 2: $ s=(t+(t\times {P}^{\prime}(\mathrm{mod}R)\times P)))/R $ 2: $ t=(O\times m)\gg k $ 2: $ {C}_{l}=C\mathrm{mod}{2}^{m} $ 3: $ {\mathrm{if}}(s \gt P) $ 3: $ O=O-(t\times q) $ 3: $ {C}_{h}=C\gg m $ 4: $ s=s-P $ 4: $ {\mathrm{if}}(O\geq q) $ 4: $ {C}^{\prime}=k{C}_{l}-{C}_{h} $ 5: $ {\rm{returns}} $ 5: $ O=O-q $ 5: $ {\rm{return}}{C}^{\prime} $ 6: $ {\rm{returnO}} $ 4 双域可重构乘法单元
输入:$ \mathrm{a}\_ \mathrm{r},\mathrm{a}\_ \mathrm{i},\mathrm{b}\_ \mathrm{r},\mathrm{b}\_ \mathrm{i},{q},{m},{k},\mathrm{sel} $ 输出:$ \mathrm{c}\_ \mathrm{r},\mathrm{c}\_ \mathrm{i} $ 1: $ \mathrm{mul}\_ 1=\mathrm{mul}(\mathrm{a}\_ \mathrm{r},\mathrm{b}\_ \mathrm{r}) $ 2: $ \mathrm{add}\_ 1=\mathrm{add}(\mathrm{a}\_ \mathrm{r},\mathrm{a}\_ \mathrm{i}) $ 3: $ \mathrm{add}\_ 2=\mathrm{add}(\mathrm{b}\_ \mathrm{r},\mathrm{b}\_ \mathrm{i}) $ //sel=1,完成模乘运算;sel=0,完成复数乘法// 4: $ \mathrm{if}(\mathrm{sel}=1) $ 5: $ \mathrm{mul}\_ 2=\mathrm{mul}(\mathrm{mul}\_ 1[{w}-1\colon 0],{m}) $ 6: $ \mathrm{mul}\_ 3=\mathrm{mul}(\mathrm{mul}\_ 1[2{w}-1\colon {w}],{m}) $ 7: $ \mathrm{else} $ 8: $ \mathrm{mul}\_ 2=\mathrm{mul}(\mathrm{a}\_ \mathrm{i},\mathrm{b}\_ \mathrm{i}) $ 9: $ \mathrm{mul}\_ 3=\mathrm{mul}(\mathrm{add}\_ 1,\mathrm{add}\_ 2) $ 10: $ \mathrm{if}(\mathrm{sel}=1) $ 11: $ \mathrm{add}\_ 3=\mathrm{add}(\mathrm{mul}\_ 2,\mathrm{mul}\_ 3) $ 12: $ \mathrm{add}\_ {3}^{\prime}=\mathrm{add}\_ 3 \gt > {k} $ 13: $ \mathrm{add}\_ {3}^{\prime\prime}=\mathrm{mul}\_ \mathrm{q}(\mathrm{add}\_ {3}^{\prime}) $ 14: $ \mathrm{else} $ 15: $ \mathrm{add}\_ 3=\mathrm{add}(\mathrm{mul}\_ 1,\mathrm{mul}\_ 2) $ 16: $ \mathrm{if}(\mathrm{sel}=1) $ 17: $ \mathrm{sub}\_ 1=\mathrm{sub}(\mathrm{mul}\_ 1,\mathrm{add}\_ {3}^{\prime\prime}) $ 18: $ \mathrm{sub}\_ 2=\mathrm{sub}(\mathrm{sub}\_ 1,{q}) $ 19: $ \mathrm{if}(\mathrm{sub}\_ 1 \lt {q}) $ 20: $ \mathrm{c}\_ \mathrm{r}=\mathrm{sub}\_ 1;\mathrm{c}\_ \mathrm{i}=0 $ 21: $ \mathrm{else} $ 22: $ \mathrm{c}\_ \mathrm{r}=\mathrm{sub}\_ 2;\mathrm{c}\_ \mathrm{i}=0 $ 23: $ \mathrm{else} $ 24: $ \mathrm{sub}\_ 1=\mathrm{sub}(\mathrm{mul}\_ 1,\mathrm{mul}\_ 2) $ 25: $ \mathrm{sub}\_ 2=\mathrm{sub}(\mathrm{mul}\_ 3,\mathrm{add}\_ 3) $ 26: $ \mathrm{c}\_ \mathrm{r}=\mathrm{sub}\_ 1;\mathrm{c}\_ \mathrm{i}=\mathrm{sub}\_ 2 $ 27:$ \mathrm{return}\mathrm{c}\_ \mathrm{r},\mathrm{c}\_ \mathrm{i} $ 表 3 运算单元架构子模块FPGA资源对比
硬件单元 LUT FF DSP add 590 0 0 sub 475 0 0 div 143 0 0 add_64 64 0 0 add_96 288 0 0 mul_q 371 128 0 sub_64 256 0 0 表 4 多项式乘法单元性能对比
文献 频率
(MHz)时钟
周期时延
(μs)器件 n 系数位宽
(bit)资源消耗
(LUT/FF/DSP/BRAM)ATP 算法 电子学报2025[15] 169 204** 0.4 V7 256 16 18120 /4000 /64/134.79 Kyber 444** 1.56 512 14 75.72 Falcon TCAS-I 2022[21] 224 192 8.571 A7 256 13 25674 /3137 /64/6317.22 Kyber
DilithiumTCAS-I
2024[4]334 638 1.91 A7 256 13 1052 /1058 /0/13.511.77 Kyber TCAD
2023[19]154 1470 9.55 V7 512 14 28616 /4211 /24/48473.94 Falcon TCAS-I
2023[7]140 867 6.19 A7 256 23 11100 /8700 /32/4149.80 Dilithium TCAS-II
2025[13]362 - - US+ 1024 64 11408 /11634 /60/20.50.097* FFT 本文 169 859 5.08 V7 256 <16 6522 /3018 /48/580.47 Kyber 1883 11.15 512 <16 176.62 Falcon 1691 10.01 256 <32 158.56 Dilithium 7451 44.11 1024 64 698.70
(0.094*)FFT 注:1.在上述FPGA型号中,US+为Zynq UltraScale+, V7为Virtex-7, A7为Artix-7;
2.*表示处理单次运算的ATP值;
3.**表示由于未查询到整个多项式乘法流程消耗的资源,因此将NTT消耗的时钟进行3倍处理。 -
[1] BOS J, DUCAS L, KILTZ E, et al. CRYSTALS - Kyber: A CCA-secure module-lattice-based KEM[C]. 2018 IEEE European Symposium on Security and Privacy, London, UK, 2018: 353–367. doi: 10.1109/EuroSP.2018.00032. [2] DUCAS L, KILTZ E, LEPOINT T, et al. CRYSTALS-Dilithium: A lattice-based digital signature scheme[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2018, 2018(1): 238–268. doi: 10.13154/tches.v2018.i1.238-268. [3] FOUQUE P A, HOFFSTEIN J, KIRCHNER P, et al. Falcon: Fast-Fourier lattice-based compact signatures over NTRU[R]. Submission to the NIST’s Post-Quantum Cryptogr. Standardization Process, 2020. [4] ALHASSANI A and BENAISSA M. High-speed polynomials multiplication HW accelerator for CRYSTALS-Kyber[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2024, 71(12): 6105–6113. doi: 10.1109/TCSI.2024.3427011. [5] LI Aobo, LU Jiahao, LIU Dongsheng, et al. A 40nm 2.76μJ/Op energy-efficient secure post-quantum crypto-processor for crystals-Kyber on module-LWE[C]. 2023 IEEE Asian Solid-State Circuits Conference (A-SSCC), Haikou, China, 2023: 1–3. doi: 10.1109/A-SSCC58667.2023.10347915. [6] LI Xiang, LU Jiahao, LIU Dongsheng, et al. A high speed post-quantum crypto-processor for crystals-dilithium[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 71(1): 435–439. doi: 10.1109/TCSII.2023.3304416. [7] HU Xiao, TIAN Jing, LI Minghao, et al. AC-PM: An area-efficient and configurable polynomial multiplier for lattice based cryptography[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 719–732. doi: 10.1109/TCSI.2022.3218192. [8] 陈韬, 李慧琴, 吴艾青, 等. 基于2KNTT的多项式乘法单元设计[J]. 电子学报, 2024, 52(2): 455–467. doi: 10.12263/DZXB.20220629.CHEN Tao, LI Huiqin, WU Aiqing, et al. A polynomial multiplier design based on 2KNTT[J]. Acta Electronica Sinica, 2024, 52(2): 455–467. doi: 10.12263/DZXB.20220629. [9] XU Tianyu, CUI Yijun, LIU Dongsheng, et al. Lightweight and efficient hardware implementation for saber using NTT multiplication[C]. 2022 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), Shenzhen, China, 2022: 601–605. doi: 10.1109/APCCAS55924.2022.10090310. [10] AIKATA A, MERT A C, IMRAN M, et al. KaLi: A crystal for post-quantum security using Kyber and dilithium[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(2): 747–758. doi: 10.1109/TCSI.2022.3219555. [11] LU Jiahao, ZHANG Jiaming, LUO Zhixiang, et al. An efficient and configurable hardware architecture of polynomial modular operation for CRYSTALS-Kyber and dilithium[C]. 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, USA, 2024: 29–32. doi: 10.1109/MWSCAS60917.2024.10658892. [12] ZHU Yihong, ZHU Wenping, OUYANG Yi, et al. 16.2 A 28nm 69.4kOPS 4.4μJ/Op versatile post-quantum crypto-processor across multiple mathematical problems[C]. 2024 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, USA, 2024: 298–300. doi: 10.1109/ISSCC49657.2024.10454332. [13] WANG Jianfei, YANG Chen, HOU Jia, et al. A compact and efficient hardware accelerator for RNS-CKKS En/decoding and En/decryption[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2025, 72(1): 243–247. doi: 10.1109/TCSII.2024.3454024. [14] 陈韬, 李慧琴, 李伟, 等. 面向格基后量子密码算法的可重构多项式乘法架构[J]. 电子与信息学报, 2023, 45(9): 3380–3392. doi: 10.11999/JEIT230284.CHEN Tao, LI Huiqin, LI Wei, et al. Reconfigurable polynomial multiplication architecture for lattice-based post-quantum cryptography algorithms[J]. Journal of Electronics & Information Technology, 2023, 45(9): 3380–3392. doi: 10.11999/JEIT230284. [15] 付秋兴, 李伟, 别梦妮, 等. 格基后量子密码的可重构NTT运算单元与高效调度算法研究[J]. 电子学报, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788.FU Qiuxing, LI Wei, BIE Mengni, et al. Research on reconfigurable NTT arithmetic unit and efficient scheduling algorithm for lattice post-quantum cryptography[J]. Acta Electronica Sinica, 2025, 53(4): 1182–1191. doi: 10.12263/DZXB.20240788. [16] LYUBASHEVSKY V, MICCIANCIO D, PEIKERT C, et al. SWIFFT: A modest proposal for FFT hashing[C]. The 15th International Workshop on Fast Software Encryption, Lausanne, Switzerland, 2008: 54–72. doi: 10.1007/978-3-540-71039-4_4. [17] COOLEY J W and TUKEY J W. An algorithm for the machine calculation of complex Fourier series[J]. Mathematics of Computation, 1965, 19(90): 297–301. doi: 10.1090/S0025-5718-1965-0178586-1. [18] GENTLEMAN W M and SANDE G. Fast Fourier transforms: For fun and profit[C]. The November 7-10, 1966, Fall Joint Computer Conference, San Francisco, USA, 1966: 563–578. doi: 10.1145/1464291.1464352. [19] MU Jianan, REN Yi, WANG Wen, et al. Scalable and conflict-free NTT hardware accelerator design: Methodology, proof, and implementation[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(5): 1504–1517. doi: 10.1109/TCAD.2022.3205552. [20] LIAO Junhao, CHO K T, and SHIEH M D. A unified FFT/NTT design for efficient NTRU equation solving in FALCON cryptography[C]. 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, United Kingdom, 2025: 1–5. doi: 10.1109/ISCAS56072.2025.11043318. [21] ZHAO Yifan, XIE Ruiqi, XIN Guozhu, et al. A high-performance domain-specific processor with matrix extension of RISC-V for module-LWE applications[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2022, 69(7): 2871–2884. doi: 10.1109/TCSI.2022.3162593. -
下载:
下载: