应用于格密码的可重构多通道数论变换硬件设计

刘冬生; 赵文定; 刘子龙; 张聪; 刘星杰

doi:10.11999/JEIT210114

应用于格密码的可重构多通道数论变换硬件设计

doi: 10.11999/JEIT210114 cstr: 32379.14.JEIT210114

华中科技大学光学与电子信息学院武汉 430074

基金项目: 国家自然科学基金(61874163)；国家科技重大专项基金(2017ZX01032-101)；中央高校基本科研基金(HUST: 2018KFYYXJJ056)

详细信息

作者简介:
刘冬生：男，1979年生，博士，教授，研究方向为能效无线传输技术及芯片设计、后量子密码算法及密码芯片设计

赵文定：男，1997年生，硕士生，研究方向为数论变换、格密码

刘子龙：男，1990年生，博士，研究方向为硬件安全

张聪：男，1994年生，博士生，研究方向为后量子密码

刘星杰：男，1997年生，硕士生，研究方向为后量子密码硬件实现

通讯作者:
赵文定　zhaowend@foxmail.com

中图分类号: TN918.2; TN402
计量
- 文章访问数: 1849
- HTML全文浏览量: 1095
- PDF下载量: 190
- 被引次数: 0
出版历程
- 收稿日期: 2021-02-01
- 修回日期: 2021-06-10
- 网络出版日期: 2021-06-22
- 刊出日期: 2022-02-25

Reconfigurable Hardware Design of Multi-lanes Number Theoretic Transform for Lattice-based Cryptography

School of Optical and Electronic Information, Huazhong University of Science and Technology, Wuhan 430074, China

Funds: The National Natural Science Foundation of China (61874163), The National Science and Technology Major Project (2017ZX01032-101), The Fundamental Research Funds for the Central Universities (HUST: 2018KFYYXJJ056)

摘要

摘要: 针对不同格密码体制带来的数论变换参数多样性，以及数论变换的性能优化设计，该文提出一种基于随机存取存储器(RAM)的可重构多通道数论变换单元。在数论变换单元设计中，在按时间抽取的基础上改进多通道架构，并提出一种优化地址分配方法。最后基于Xilinx Artix-7现场可编程逻辑门阵列(FPGA)平台进行原型实现，结果显示，所设计的数论变换单元消耗的资源为1744 Slices, 16 DSP，完成1次多项式乘法的时间为2.01 μs(n=256), 3.57 μs(n=512), 6.71 μs(n=1024)和13.43 μs(n=2048)，支持256～2048的不同参数n和13～32 bit模q的可重构配置，工作频率最高可达232 MHz。
- 格密码 /
- 多项式乘法 /
- 数论变换 /
- 硬件实现
Abstract: The performance of number theoretic transformation in lattice-based cryptography is insufficient, and the number theoretic transformation parameters are different. A Random Access Memory (RAM)-based reconfigurable multi-lanes number theoretic transform is proposed. In the design of number theory transformation unit, the multi-lanes architecture is improved on the time decimation operation architecture, and an optimized address allocation method is proposed. The number theory transform unit is implemented on Xilinx artix-7 Field Programmable Gate Array (FPGA) platform. The results show that the resource consumed by the unit is 1744 slices and 16 DSP, and the time to complete a polynomial multiplication is 2.01 μs (n=256), 3.57 μs (n=512), 6.71 μs (n=1024) and 13.43 μs (n=2048). The unit supports reconfigurable configurations of 256～2048 parameters n and 13～32-bit modulus q, and the maximum operating frequency is 232 MHz.
- Lattice-based cryptography /
- Polynomial multiplication /
- Number theoretic transform /
- Hardware implementation

HTML全文

图 1 基于RAM的模乘器结构图

下载: 全尺寸图片幻灯片

图 2 4通道NTT数据流向图

下载: 全尺寸图片幻灯片

图 3 可重构NTT结构图

下载: 全尺寸图片幻灯片

表 1 基2 DIT-NTT算法

输入：$ {\boldsymbol{A}}\left(x\right),{\omega }_{N} \in {R}_{q},N $
输出：$ {\boldsymbol{A}}\left(x\right) $
(1) for m = 1 to N/2 by m = 2m do
(2) for i = 0 to N–1 do
(3) for j = 0 to N–1 do
(4) $ u={\boldsymbol{A}}[k+j] $
(5) $ t={\boldsymbol{A}}[k+j+m]\times {\omega }_{N} $
(6) $ {\boldsymbol{A}}[k+j]\mathrm{ }=\mathrm{ }(u+t)\mathrm{ }\mathrm{m}\mathrm{o}\mathrm{d}\;q $
(7) $ {\boldsymbol{A}}[k+j+N]\mathrm{ }=\mathrm{ }(u-t)\mathrm{ }\mathrm{m}\mathrm{o}\mathrm{d}\;q $
(8) end for
(9) end for
(10) end for
(11) return $ A\left(x\right) $

下载: 导出CSV

表 2 可重构参数表

	n
	256	512	1024	2048	–	–	–	–
模q	7681	12289	40961	65537	786433	5767169	7640033	23068673
模q	104857601	16772161	469762049	998244353	1004535809	1998585857	2013265921	–

下载: 导出CSV

表 3 多项式乘法结果比较表

文献	n	q位数	LUTs	Slices	Memory(36 kB)	DSPs	MHz	周期数	时延(μs)	ATP¹⁾ (×10⁵)
NTT^[5]	1024	30	2317	997	11.5 BRAMs	4	194	21405	110.34	1.100
NTT^[5]	2048	57	3846	1310	22.5 BRAMs	16	161	45453	282.32	3.698
NTT^[6]	2048	17	2323	–	287820 Bits	22	228.99	10289	44.93	1.965²⁾
NTT^[11]	512	23	–	18 K	2.5 BRAMs	128	233.1	412	1.77	0.448³⁾
FFT^[12]	2048	22	–	4406	50 BRAMs	12	208.12	17402	83.615	3.684
本设计	256	32	4780	1744	24 BRAMs/655360 Bits	16	232	1627	7.00	0.122
	512							2798	12.03	0.210
	1024							5251	22.58	0.394
	2048							10455	44.96	0.784
¹⁾ATP是通过将消耗的Slices乘以时延来计算的。 ²⁾NTT^[3]由于未提供消耗的Slices数量，因此ATP的计算采用LUTs×Time，而本设计对应的ATP为2.149。 ³⁾为了合理地比较，多通道NTT^[7]的ATP计算公式为(1.5n×9+2n) / (n×9+2n) ×1.77×18k=0.448×10⁵。

下载: 导出CSV

参考文献(15)

[1]	SHOR P W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer[J]. SIAM Journal on Computing, 1997, 26(5): 1484–1509. doi: 10.1137/S0097539795293172
[2]	ARUTE F, ARYA K, BABBUSH R, et al. Quantum supremacy using a programmable superconducting processor[J]. Nature, 2019, 574(7779): 505–510. doi: 10.1038/s41586-019-1666-5
[3]	赵勇, 戚巍, 徐兵杰, 等. 量子安全技术白皮书(2020)[R]. 2020.
[4]	CHEN Zhaohui, MA Yuan, CHEN Tianyu, et al. Towards efficient kyber on FPGAs: A processor for vector of polynomials[C]. The 2020 25th Asia and South Pacific Design Automation Conference, Beijing, China, 2020. doi: 10.1109/ASP-DAC47756.2020.9045459.
[5]	PÖPPELMANN T and GÜNEYSU T. Towards efﬁcient arithmetic for lattice-based cryptography on reconﬁgurable hardware[C]. The 2nd International Conference on Cryptology and Information Security in Latin America, Santiago, Chile, 2012. doi: 10.1007/978-3-642-33481-8_8" target="_blank">href="http://dx.doi.org/10.1007/978-3-642-33481-8_8">10.1007/978-3-642-33481-8_8.
[6]	RENTERÍA-MEJÍA C P and VELASCO-MEDINA J. Hardware design of an NTT-Based polynomial multiplier[C]. The 2014 IX Southern Conference on Programmable Logic, Buenos Aires, Argentina, 2014: 1–5. doi: 10.1109/SPL.2014.7002209.
[7]	YE J H and SHIEH M D. High-performance NTT Architecture for large integer multiplication[C]. 2018 International Symposium on VLSI Design, Automation and Test, Hsinchu, China, 2018: 1–4. doi: 10.1109/VLSI-DAT.2018.8373254.
[8]	ZHANG Neng, QIN Qiao, YUAN Hang, et al. NTTU: An area-efficient low-power NTT-uncoupled architecture for NTT-based multiplication[J]. IEEE Transactions on Computers, 2020, 69(4): 520–533. doi: 10.1109/TC.2019.2958334
[9]	AYSU A, PATTERSON C, and SCHAUMONT P. Low-cost and area-efficient FPGA implementations of lattice-based cryptography[C]. Proceedings of 2013 IEEE International Symposium on Hardware-Oriented Security and Trust, Austin, USA, 2013.
[10]	RENTERÍA-MEJÍA C R and VELASCO-MEDINA J. Lattice-based cryptoprocessor for CCA-Secure identity-based encryption[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 67(7): 2331–2344. doi: 10.1109/TCSI.2020.2981089
[11]	FENG Xiang, LI Shuguo, and XU Sufen. RLWE-oriented high-speed polynomial multiplier utilizing multi-lane stockham NTT algorithm[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2020, 67(3): 556–559. doi: 10.1109/TCSII.2019.2917621
[12]	CHEN D D, MENTENS N, VERCAUTEREN F, et al. High-speed polynomial multiplication architecture for ring-LWE and SHE cryptosystems[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2015, 62(1): 157–166. doi: 10.1109/TCSI.2014.2350431
[13]	MERT A C, KARABULUT E, OZTURK E, et al. An extensive study of flexible design methods for the number theoretic transform[J/OL]. IEEE Transactions on Computers, 2020, 1–15.
[14]	LIU Dongsheng, ZHANG Cong, LIN Hui, et al. A resource-efficient and side-channel secure hardware implementation of ring-LWE cryptographic processor[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2019, 66(4): 1474–1483. doi: 10.1109/TCSI.2018.2883966
[15]	KIM S, LEE K, CHO W, et al. Hardware architecture of a number theoretic transform for a bootstrappable RNS-based homomorphic encryption scheme[C]. The 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines, Fayetteville, USA, 2020, 56–64. doi: 10.1109/FCCM48280.2020.00017.