高性能Ed25519算法硬件架构设计与实现

于斌; 黄海; 刘志伟; 赵石磊; 那宁

doi:10.11999/JEIT200876

高性能Ed25519算法硬件架构设计与实现

doi: 10.11999/JEIT200876 cstr: 32379.14.JEIT200876

哈尔滨理工大学计算机科学与技术学院哈尔滨 150080

基金项目: 黑龙江省自然科学基金(YQ2019F010)，黑龙江省博士后科研启动基金(LBH-Q18065)，中央引导地方科技发展专项(ZY20B11)

详细信息

作者简介:
于斌：男，1984年生，讲师，研究方向为密码算法、密码芯片设计和数字集成电路设计等

黄海：男，1982年生，硕士生导师，研究方向为信息安全、可重构技术、集成电路设计等

刘志伟：男，1987年生，讲师，研究方向为可重构计算、高速密码算法、并行加密技术、密码芯片的安全设计等

赵石磊：男，1979年生，硕士生导师，研究方向为信息安全、高速密码算法、密码芯片的安全设计等

那宁：男，1995年生，博士生，研究方向为信息安全和集成电路设计等

通讯作者:
黄海　ic@hrbust.edu.cn

中图分类号: TN918; TP309
计量
- 文章访问数: 2358
- HTML全文浏览量: 1383
- PDF下载量: 164
- 被引次数: 0
出版历程
- 收稿日期: 2020-10-12
- 修回日期: 2021-01-29
- 网络出版日期: 2021-03-01
- 刊出日期: 2021-07-10

High-performance Hardware Architecture Design and Implementation of Ed25519 Algorithm

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China

Funds: The Natural Science Foundation of Heilongjiang (YQ2019F010), Heilongjiang Postdoctoral Funds for Scientific Research Initiation (LBH-Q18065), The Science and Technology Development Special Project of Central Guide the Local Government of China (ZY20B11)

摘要

摘要: 针对签名验签速度难以满足特定应用领域需求的问题，该文设计了一种高性能Ed25519算法的硬件实现架构。采用宽度为2 bit的窗口法实现标量乘运算，减少了标量乘所需的总周期数；通过优化点加倍点操作步骤，提高了乘法器的硬件使用率；使用低计算复杂度的快速模约简实现模乘，提高了整体运算速度。为了使模L运算可复用标量乘中的快速模约简，该文提出一种基于Barrett约简的模L算法。通过优化解压过程中模幂操作过程，精简了步骤并使其可复用模乘。对所提架构做硬件实现，在TSMC的55 nm CMOS工艺下，面积为746×10³等效门，最高频率360 MHz，每秒能够执行公钥生成9.06×10⁴次、签名8.82×10⁴次和验签3.99×10⁴次。
- 椭圆曲线数字签名算法 /
- 爱德华兹曲线 /
- 硬件实现 /
- 标量乘 /
- 快速模约简
Abstract: The speed of existing signature and verification architecture is difficult to meet the requirement of the specific applications domain, to solve this problem a high-performance hardware architecture of Ed25519 algorithm is developed. The scalar multiplication algorithm is implemented by using the window method with 2 bit width to reduce the total cycle numbers of the algorithm significantly. By optimizing the order of operations of point addition and point doubling, the hardware utilization rate of multiplier is improved. The module multiplication is realized by using fast module reduction with low computational complexity, thus the overall operation speed is improved. The modular L algorithm based on Barrett reduction is proposed to reuse the fast modular reduction in scalar multiplications. By optimizing the modular power computation in the decompression process, the steps are simplified and the modular multiplication can be reused. Under the TSMC 55 nm CMOS process, the area of the proposed hardware architecture is 7.46×10⁵ equivalent gate, and the maximum frequency is up to 360 MHz. It can perform 9.06×10⁴ key generations, 8.82×10⁴ signatures and 3.99×10⁴ verifications per second.
- Elliptic Curve Digital Signature Algorithm (ECDSA) /
- Edwards-curve /
- Hardware implementation /
- Scalar multiplication /
- Fast modular reduction

HTML全文

图 1 Ed25519完整结构图

下载: 全尺寸图片幻灯片

图 2 模运算单元

下载: 全尺寸图片幻灯片

图 3 模约简整体结构

下载: 全尺寸图片幻灯片

图 4 b的2²⁵²–3次模幂操作步骤图

下载: 全尺寸图片幻灯片

表 1 Ed25519公钥生成算法流程

(1)选择256 bit私钥sk=(sk₂₅₅, sk₂₅₄,···, sk₁, sk₀)₂；
(2)对sk做SHA-512运算，即H(sk)= (h₅₁₁,h₅₁₀,···,h₁,h₀)₂；
(3)取H(sk)的低256 bit，并整理为s=(0,1,h₂₅₃,h₂₅₂,···,h₃,0,0,0)₂；
(4)做标量乘A=sB=(x,y)，其中x=(x₂₅₄,x₂₅₃,···,x₁,x₀)₂, 　　 y=(y₂₅₄,y₂₅₃,···,y₁,y₀)₂；
(5)压缩sB结果，得公钥pk=(x₀,y₂₅₄,y₂₅₃,···,y₁,y₀)₂。

下载: 导出CSV

表 2 Ed25519签名算法流程

(1)取H(sk)的高256 bit h=(h₅₁₁,h₅₁₀,···,h₂₅₇,h₂₅₆)₂；
(2)做SHA-512运算，r=H(h\|\|M) mod L；
(3)做标量乘R'=rB=(x,y)，其中x=(x₂₅₄,x₂₅₃,···,x₁,x₀)₂, y=(y₂₅₄, 　　 y₂₅₃,···,y₁,y₀)₂；
(4)压缩rB结果，得R=(x₀,y₂₅₄,y₂₅₃,···,y₁,y₀)₂；
(5)做SHA-512运算，k=H(R\|\|pk\|\|M) mod L；
(6)计算S= (r+ks) mod L；
(7)返回签名结果(R\|\|S)。

下载: 导出CSV

表 3 Ed25519验签算法流程

(1)解压R为点坐标R'；
(2)解压pk为点坐标A；
(3)做SHA-512运算，k=H(R\|\|pk\|\|M) mod L；
(4)验证SB=R'+kA是否成立，若成立则验签成功。

下载: 导出CSV

表 4 宽度为2的窗口法

输入：255 bit二进制数k={k₂₅₄k₂₅₃···k₁k₀}，点加标志位n, 　　　　n∈{0,1}，椭圆曲线上任意点P₁, P₂。
输出：标量乘Q=kP₁+nP₂
(1)预计算2P₁,3P₁，共2个点坐标；
(2)若k₂₅₄=0，则令Q=0，为无穷远点，若k₂₅₄=1，则令Q=P₁；
(3)对于i从126～0，循环计算：
Q=4Q；
Q=Q+{k₂_i₊₁,k₂_i}P₁；
(4)若n=1，则计算：Q=Q+P₂ //为适应验签算法；
(5)返回Q值。

下载: 导出CSV

表 5 点加和倍点操作流程^[14]

输入：P(X₁,Y₁,Z₁,T₁), Q(X₂,Y₂,Z₂,T₂)
输出：Q(X₃,Y₃,Z₃,T₃)=P+Q, Q(X₃,Y₃,Z₃,T₃)=2P
(1)	A=(Y₁–X₁)·(Y₂–X₂)	A=(X₁)²
(2)	B=(Y₁+X₁)·(Y₂+X₂)	B=(Y₁)²
(3)	C=T₁·2·d·T₂	C=2·(Z₁)²
(4)	D=Z₁·2·Z₂	H=A+B
(5)	E=B–A	E=H–(X₁+Y₁)²
(6)	F=D–C	G=A–B
(7)	G=D+C	F=C+G
(8)	H=B+A	X₃=E·F
(9)	X₃=E·F	Y₃=G·H
(10)	Y₃=G·H	T₃=E·H
(11)	T₃=E·H	Z₃=F·G
(12)	Z₃=F·G

下载: 导出CSV

表 6 倍点操作流程

输入：点坐标P(X₁,Y₁,Z₁,T₁), Q(X₂,Y₂,Z₂,T₂)
输出：Q(X₃,Y₃,Z₃,T₃)=2P
步骤	模乘	模加减
(1)	A=X₁·X₁
(2)	t₁=Z₁·Z₁
(3)	B=Y₁·Y₁	t₂=X₁+Y₁
(4)	t₂=t₂·t₂	C=t₁+t₁, H=A+B, G=A–B
(5)	Y₂=G·H	E=H–t₂, F=C+G
(6)	X₂=E·F
(7)	Z₂=F·G
(8)	T₂=E·H
(9)	等待T₂约简

下载: 导出CSV

表 7 点加操作流程

输入：点坐标P(X₁,Y₁,Z₁, T₁)，Q(X₂,Y₂,Z₂,T₂)
输出：Q(X₃,Y₃,Z₃,T₃)=P+Q
步骤	模乘	模加减
(1)		A₁←Y₁–X₁, B₁←Y₁+X₁
(2)		A₂←Y₂–X₂, B₂←Y₂+X₂
(3)	A←A₁·A₂
(4)	B←B₁·B₂	t₁=T₁+T₁
(5)	t₁←t₁·d	t₂=Z₁+Z₁
(6)	D←t₂·Z₂
(7)	C←t₁·T₂	E←B–A, H←B+A
(8)	X₃←E·H	F←D–C, G←D+C
(9)	Y₃←G·H
(10)	Z₃←F·G
(11)	T₃←E·H
(12)	等待Z₃约简

下载: 导出CSV

表 8 Ed25519快速约简算法

输入：A=(A₂₅₄\|\|A₂₅₃\|\|A₂₅₂\|\|···\|\|A₂\|\|A₁\|\|A₀)
输出：Q=A mod p
(1) T=(A₁₂₇\|\| A₁₂₆\|\| A₁₂₅\|\| ···\|\| A₂\|\| A₁\|\| A₀)
(2) S₁=(A₂₅₂\|\| A₂₅₁\|\| A₂₅₀\|\| ···\|\| A₁₂₉\|\| A₁₂₈\|\|5′h0)
(3) S₂=(A₂₅₃\|\| A₂₅₂\|\| A₂₅₁\|\| ···\|\| A₁₂₉\|\| A₁₂₈\|\|3′h0)
(4) S₃=(247′h0\|\| A₂₅₄\|\| A₂₅₄\|\|4′h0)
(5) S₄=(249′h0\|\| A₂₅₃\|\| A₂₅₃\|\|2′h0)
(6) S₅=(249′h0\|\| A₂₅₄\|\| 4′h0)
(7) D₁=(A₂₅₄\|\| A₂₅₃\|\| A₂₅₂\|\| ···\|\| A₁₂₉\|\| A₁₂₈\|\|1′h0)
(8) D₂=(253′h0\|\| A₂₅₄)
(9) D₃=(253′h0\|\| A₂₅₃)
(10) Q=(T+S₁+S₂+S₃+S₄+S₅–D₁–D₂–D₃)mod p

下载: 导出CSV

表 9 基于Barrett约简的模L算法

输入：素数L, 512 bit数值a, T=[2⁵¹²/(8L)]
输出：r=a mod L
(1)q₁=[a/2²⁵⁵]·T；
(2)q₂=[ q₁/2²⁵⁷]·(8L)；
(3)r=(a mod 2²⁵⁷)–(q₂ mod 2²⁵⁷)；
(4)若果r<0，则r=r+2²⁵⁷；
(5)如果r≥8L，则r=r–11L，否则r=r–3L；//可复用快速模约简
(6)计算r=r mod L并返回r值。

下载: 导出CSV

表 10 Ed25519性能参数

	时钟周期	主频	面积	等效门数
参数值	2.78 ns	360 MHz	1.433×10⁶	746×10³
注1：按工艺库提供的系数1.92从面积折算

下载: 导出CSV

表 11 周期及运算次数

	平均周期	理论最大周期	平均每秒运算次数	理论最慢每秒运算次数
公钥生成	3237	3975	111.2×10³	90.6×10³
签名	3345	4083	107.6×10³	88.2×10³
验签	7488	9020	48.1×10³	39.9×10³

下载: 导出CSV

表 12 Ed25519运算速度对比(μs)

	公钥生成	签名	验签
本文	9.0	9.3	20.8
文献[8]	2109.7	1610.3	3676.5

下载: 导出CSV

表 13 标量乘单元性能对比

	器件	硬件开销Slices/LUT/FF/DSP/BRAM	周期数Cycles	折算值Slices×Cycles×10^–6
本文	Zynq-7035	14759/52512/9342/225/0	3895	57.5
文献[8]	Zynq-7020	775/2707/962/15/0	120260	93.2
文献[9]¹⁾	Zynq-7000	2170/8680/3472/0/0	86348	187.4
文献[9]²⁾	Zynq-7000	2193/8770/3729/0/0	74783	189.3
文献[11]	Zynq-7020	6161/17939/21077/175/0	10465	64.5
文献[12]	Zynq-7020	1006/0/0/20/2	114980	115.7
文献[13]	Zynq-7020	1029/2783/3592/20/2	79400	81.7
1)：采用D&A算法；2)：采用NAF算法

下载: 导出CSV

参考文献(19)

[1]	RESCORLA E. IETF RFC 8446 The Transport Layer Security (TLS) protocol version 1.3[S]. 2018.
[2]	LANGLEY A, HAMBURG M, and TURNER S. IRTF RFC 7748 Elliptic curves for security[S]. 2016.
[3]	FAZ-HERNÁNDEZ A, LÓPEZ J, and DAHAB R. High-performance implementation of elliptic curve cryptography using vector instructions[J]. ACM Transactions on Mathematical Software, 2019, 45(3): 25.1–25.35. doi: 10.1145/3309759
[4]	ISLAM M M, HOSSAIN M S, HASAN M K, et al. FPGA implementation of high-speed area-efficient processor for elliptic curve point multiplication over prime field[J]. IEEE Access, 2019, 7: 178811–178826. doi: 10.1109/ACCESS.2019.2958491
[5]	戴紫彬, 易肃汶, 李伟, 等. 椭圆曲线密码处理器的高效并行处理架构研究与设计[J]. 电子与信息学报, 2017, 39(10): 2487–2494. doi: 10.11999/JEIT161380 DAI Zibin, YI Suwen, LI Wei, et al. Research and design of efficient parallel processing architecture for elliptic curve cryptographic processor[J]. Journal of Electronics &Information Technology, 2017, 39(10): 2487–2494. doi: 10.11999/JEIT161380
[6]	KIM J, PARK J H, KIM D C, et al. Complete addition law for Montgomery curves[C]. The 22nd International Conference on Information Security and Cryptology– ICISC 2019, Seoul, South Korea, 2019: 260–277. doi: 10.1007/978-3-030-40921-0_16.
[7]	SALARIFARD R and BAYAT-SARMADI S. An efficient low-latency point-multiplication over curve25519[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2019, 66(10): 3854–3862. doi: 10.1109/TCSI.2019.2914247
[8]	TURAN F and VERBAUWHEDE I. Compact and flexible FPGA implementation of Ed25519 and X25519[J]. ACM Transactions on Embedded Computing Systems, 2019, 18(3): 24. doi: 10.1145/3312742
[9]	MEHRABI M A and DOCHE C. Low-cost, low-power FPGA implementation of ED25519 and CURVE25519 point multiplication[J]. Information, 2019, 10(9): 285. doi: 10.3390/info10090285
[10]	魏伟, 陈佳哲, 李丹, 等. 椭圆曲线Diffie-Hellman密钥交换协议的比特安全性研究[J]. 电子与信息学报, 2020, 42(8): 1820–1827. doi: 10.11999/JEIT190845 WEI Wei, CHEN Jiazhe, LI Dan, et al. Research on the bit security of elliptic curve Diffie-Hellman[J]. Journal of Electronics &Information Technology, 2020, 42(8): 1820–1827. doi: 10.11999/JEIT190845
[11]	KOPPERMANN P, DE SANTIS F, HEYSZL J, et al. Low-latency X25519 hardware implementation: Breaking the 100 microseconds barrier[J]. Microprocessors and Microsystems, 2017, 52: 491–497. doi: 10.1016/j.micpro.2017.07.001
[12]	SASDRICH P and GÜNEYSU T. Exploring RFC 7748 for hardware implementation: Curve25519 and Curve448 with side-channel protection[J]. Journal of Hardware and Systems Security, 2018, 2(4): 297–313. doi: 10.1007/s41635-018-0048-z
[13]	SASDRICH P and GÜNEYSU T. Implementing Curve25519 for side-channel--protected elliptic curve cryptography[J]. ACM Transactions on Reconfigurable Technology and Systems, 2015, 9(1): 3. doi: 10.1145/2700834
[14]	JOSEFSSON S and LIUSVAARA I. IRTF RFC 8032 Edwards-curve digital signature algorithm (EdDSA)[S]. 2017.
[15]	VENGALA D V K, KAVITHA D, and KUMAR A P S. Secure data transmission on a distributed cloud server with the help of HMCA and data encryption using optimized CP-ABE-ECC[J]. Cluster Computing, 2020, 23(3): 1683–1696. doi: 10.1007/s10586-020-03114-1
[16]	LI Hui. Pseudo-random scalar multiplication based on group isomorphism[J]. Journal of Information Security and Applications, 2020, 53: 102534. doi: 10.1016/j.jisa.2020.102534
[17]	ZHANG Neng, YANG Bohan, CHEN Chen, et al. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT[J]. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2020, 2020(2): 49–72. doi: 10.13154/tches.v2020.i2.49-72
[18]	HOSSAIN M S, KONG Yinan, SAEEDI E, et al. High-performance elliptic curve cryptography processor over NIST prime fields[J]. IET Computers & Digital Techniques, 2017, 11(1): 33–42. doi: 10.1049/iet-cdt.2016.0033
[19]	KNEZEVIC M, VERCAUTEREN F, and VERBAUWHEDE I. Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods[J]. IEEE Transactions on Computers, 2010, 59(12): 1715–1721. doi: 10.1109/TC.2010.93