高吞吐率双模浮点可重构FFT处理器设计实现

魏星; 黄志洪; 杨海钢

doi:10.11999/JEIT180170

高吞吐率双模浮点可重构FFT处理器设计实现

doi: 10.11999/JEIT180170 cstr: 32379.14.JEIT180170

魏星^{1, 2},
黄志洪¹,
杨海钢^{1, 2, ,}

1.
中国科学院电子学研究所北京 100190
2.
中国科学院大学北京 100190

基金项目: 国家自然科学基金(61704173, 61474120)，北京市科技重大专项课题(Z171100000117019)

详细信息

作者简介:
魏星：男，1991年生，博士，研究方向为算法硬件加速设计、可重构计算芯片架构设计

黄志洪：男，1984年生，助理研究员，研究方向为可编程逻辑结构设计、新型卷积神经网络芯片体系架构开发

杨海钢：男，1960年生，研究员，研究方向为数模混合信号集成电路设计、超大规模集成电路设计等

通讯作者:
杨海钢　 yanghg@mail.ie.ac.cn

中图分类号: TN47
计量
- 文章访问数: 2526
- HTML全文浏览量: 1100
- PDF下载量: 62
- 被引次数: 0
出版历程
- 收稿日期: 2018-02-08
- 修回日期: 2018-07-05
- 网络出版日期: 2018-07-24
- 刊出日期: 2018-12-01

High Throughput Dual-mode Reconfigurable Floating-point FFT Processor

Xing WEI^{1, 2},
Zhihong HUANG¹,
Haigang YANG^{1, 2
, ,}

1.
Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
2.
University of Chinese Academy of Sciences, Beijing 100190, China

Funds: The National Natural Science Foundation of China (61704173, 61474120), The Major Program of Beijing Science and Technology (Z171100000117019)

摘要

摘要: 高吞吐浮点可灵活重构的快速傅里叶变换(FFT)处理器可满足尖端雷达实时成像和高精度科学计算等多种应用需求。与定点FFT相比，浮点运算复杂度更高，使得浮点型FFT的运算吞吐率与其实现面积、功耗之间的矛盾问题尤为突出。鉴于此，为降低运算复杂度，首先将大点数FFT分解成若干个小点数基2^k 级联子级实现，提出分别针对128/256/512/1024/2048点FFT的优化混合基算法。同时，结合所提出同时支持单通道单精度和双通道半精度两种浮点模式的新型融合加减与点乘运算单元，首次提出一款高吞吐率双模浮点可变点FFT处理器结构，并在28 nm标准CMOS工艺下进行设计并实现。实验结果表明，单通道单精度和双通道半精度浮点两种模式下的运算吞吐率和输出平均信号量化噪声比分别为3.478 GSample/s, 135 dB和6.957 GSample/s, 60 dB。归一化吞吐率面积比相比于现有其他浮点FFT实现可提高约12倍。
- 快速傅里叶变换 /
- 双模浮点 /
- 混合基 /
- 融合运算单元
Abstract: In the advanced applications of real-time radar imaging and high-precision scientific computing systems, the design of high throughput and reconfigurable Floating-Point (FP) FFT accelerator is significant. Achieving high throughput FP FFT with low area and power cost poses a greater challenge due to high complexity of FP operations in comparison to fixed-point implementations. To address these issues, a serial of mixed-radix algorithms for 128/256/512/1024/2048-point FFT are proposed by decomposing long FFT into short implementations with cascaded radix-2^k stages so that the complexity of multiplications can be significantly reduced. Besides, two novel fused FP add-subtract and dot-product units for dual-mode functionality are proposed, which can either compute on a pair of double precision operands or on two pairs of single precision operands in parallel. Thus, a high throughput dual-mode floating-point variable length FFT is designed. The proposed processor is implemented based on SMIC 28 nm CMOS technology. Simulation results show that the throughput and Signal-to-Quantization Noise Ratio (SQNR) in single-channel single precision and dual-channel half precision floating-point mode are 3.478 GSample/s, 135 dB and 6.957 GSample/s, 60 dB respectively. Compare to the other FP FFT, this processor can achieve 12 times improvement of normalized throughput-area ratio.
- Fast Fourier Transform (FFT) /
- Dual-mode floating point /
- Mixed-radix /
- Fused arithmetic unit

HTML全文

图 1 双模浮点128/256/512/1024/2048点FFT顶层结构框图

下载: 全尺寸图片幻灯片

图 2 双模浮点数据格式示意图

下载: 全尺寸图片幻灯片

图 3 双模浮点融合加减运算单元结构

下载: 全尺寸图片幻灯片

图 4 双模式前导零预测电路结构

下载: 全尺寸图片幻灯片

图 5 双模浮点融合点乘运算单元结构

下载: 全尺寸图片幻灯片

图 6 双模式4:2 CSA结构

下载: 全尺寸图片幻灯片

表 1 本文所提出的混合基算法

点数	优化算法	每个子级相应的基底
点数	优化算法	1	2	3	4	5	6	7	8	9	10
128	2³-2²-2²	4	8	128	4	16	4
256	2⁴-2²-2²	4	16	4	256	4	16	4
512	2⁴-2²-2³	4	16	4	512	4	32	4	8
1024	2⁵-2²-2³	4	8	32	4	1024	4	32	4	8
2048	2⁵-2³-2³	4	8	32	4	2048	4	8	64	4	8

下载: 导出CSV

表 2 双模浮点融合加减运算单元关键路径

模块名	流水级	延时(ns)
数据提取与尾数生成	1	0.43
指数比较与尾数交换	1	0.77
指数阶差与尾数对齐	2	1.98
尾数求和差	3	1.47
LZA	3	0.35
规格化左移	3	0.16

下载: 导出CSV

表 3 双模浮点融合加减运算单元性能比较结果

参数	文献[6]	参考结构T1	本文
工艺 (nm)	45	28	28
归一化面积 (μm²)	5226 (100%)	2665 (51%)	2317 (44%)
工作频率 (MHz)	1920	500	435
计算延迟 (ns)	1.0	6.0	6.9
功耗 (mW)	5.2	0.6	0.4
功耗×周期	2.70 (100%)	1.20 (44%)	0.84 (31%)

下载: 导出CSV

表 4 双模浮点融合点乘运算单元关键路径

模块名	流水级	延时(ns)
数据提取与指数比较	1	0.33
尾数部分积相乘	1	0.87
指数阶差与乘积对齐	2	1.25
双模4:2 CSA	2	0.20
双加法舍入路径	2	0.53
LZD与规格化左移	3	1.98

下载: 导出CSV

表 5 双模浮点融合点乘运算单元性能比较结果

参数	文献[5]	参考结构T2	本文
工艺 (nm)	45	28	28
归一化面积 (μm²)	12865 (100%)	10336 (80%)	10701 (83%)
工作频率 (MHz)	1493	500	435
计算延迟 (ns)	2.1	6.0	6.9
功耗 (mW)	16.9	2.7	2.5
功耗×周期	11.3 (100%)	5.3 (47%)	5.6 (50%)

下载: 导出CSV

表 6 双模浮点FFT整体性能对比

性能参数	文献[2]	文献[9]	文献[16]	本文
工艺 (nm)	90	65	55	28
FFT结构	基于存储器	Hybrid	MDF	MDF
FFT点数	512	1024	1024	128/256/512/1024/2048
并行度	8	1	1	8
数据类型：字长 (bit)	块浮点：12	浮点：32	定点：16	浮点：32/浮点：16
平均输出SQNR (dB)	57	139	55	SP：135/HP：60
时钟频率 (MHz)	324	400	200	435
计算时间 (μs)	0.3	2.6	5.1	0.2/0.3/0.6/1.1/2.2
运算吞吐率 (MSample/s)	2592	400	200	SP：3478/HP：6957
平均功耗 (mW)	42	417	8	104 @435 MHz
有效面积 (mm²)	0.93	1.19	0.15	1.41
归一化面积	10.0	22.1	1.9	16.0
归一化吞吐率面积比	259	18	103	SP：220/HP：440

下载: 导出CSV

参考文献(17)

吕倩, 苏涛. 基于改进型快速双线性参数估计的复杂运动目标ISAR成像[J]. 电子与信息学报, 2016, 38(9): 2301–2308 doi: 10.11999/JEIT151359

LÜ Qian and SU Tao. ISAR imaging of targets with complex motion based on the modified fast bilinear parameter estimation[J]. Journal of Electronics&Information Technology, 2016, 38(9): 2301–2308 doi: 10.11999/JEIT151359

HUANG Shenjui and CHEN S. A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems[J]. IEEE Transactions on Circuits and Systems Ⅰ:Regular Papers, 2012, 59(8): 1752–1765 doi: 10.1109/TCSI.2011.2180430

LAN G and FRANK H. Digital Processing of Synthetic Aperture Radar Data: Algorithms and Implementation[M]. Boston: Artech House Publishers, 2005: 154–210.

陈杰男, 费超, 袁建生, 等. 超高速全并行快速傅里叶变换器[J]. 电子与信息学报, 2016, 38(9): 2410–2414 doi: 10.11999/JEIT160036

CHEN Jienan, FEI Chao, YUAN Jiansheng, et al. An ultra-high-speed fully-parallel fast Fourier transform design[J]. Journal of Electronics&Information Technology, 2016, 38(9): 2410–2414 doi: 10.11999/JEIT160036

JONGWOOK S and EARL E. Improved architectures for a floating-point fused dot product unit[C]. IEEE Symposium on Computer Arithmetic, Austin, USA, 2013: 41–48. doi: 10.1109/ARITH.2013.26.

JONGWOOK S and EARL E. Improved architectures for a fused floating-point add-subtract unit[J]. IEEE Transactions on Circuits and Systems Ⅰ:Regular Papers, 2012, 59(10): 2285–2291 doi: 10.1109/TCSI.2012.2188955

CHO T and LEE H. A high-speed low-complexity modified radix-2⁵ FFT processor for high rate WPAN applications[J]. IEEE Transactions on Very Large Scale Integration(VLSI)Systems, 2013, 21(1): 187–191 doi: 10.1109/TVLSI.2011.2182068

WANG Chao, YAN Yuwei, and FU Xiaoyu. A high-throughput low-complexity radix-2⁴-2²-2³ FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad systems[J]. IEEE Transactions on Very Large Scale Integration(VLSI)Systems, 2015, 23(11): 2728–2732 doi: 10.1109/TVLSI.2014.2365586

WANG Mingyu and LI Zhaolin. A hybrid SDC/SDF architecture for area and power minimization of floating-point FFT computations[C]. IEEE International Symposium on Circuits and Systems, Montreal, Canada, 2016: 2170–2173. doi: 10.1109/ISCAS.2016.7539011.

EARL E and HANI H. FFT implementation with fused floating-point operations[J]. IEEE Transactions on Computers, 2012, 61(2): 284–288 doi: 10.1109/TC.2010.271

TANG S N, TSAI J W, and CHANG T Y. A 2.4-GS/s FFT processor for OFDM-based WPAN applications[J]. IEEE Transactions on Circuits and Systems Ⅱ:Express Briefs, 2010, 57(6): 451–455 doi: 10.1109/TCSII.2010.2048373

NIE Zedong, ZHANG Fengjuan, LI Jie, et al. Low-power digital ASIC for on-chip spectral analysis of low-frequency physiological signals[J]. Journal of Semiconductors, 2012, 33(6): 67–70 doi: 10.1088/1674-4926

IEEE 754-2008. IEEE Standard for Floating-Point Arithmetic[S]. 2008. doi: 10.1109/IEEESTD.2008.5976968.

PETER K. Correcting the normalization shift of redundant binary representations[J]. IEEE Transactions on Computers, 2009, 58(10): 1453–1439 doi: 10.1109/TC.2009.38

YANG C H, YU T H, and DEJAN M. Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example[J]. IEEE Journal of Solid-State Circuits, 2011, 47(3): 757–768 doi: 10.1109/JSSC.2011.2176163

MARIO G, HUANG S J, CHEN S G, et al. The serial commutator FFT[J]. IEEE Transactions on Circuits and Systems Ⅱ:Express Briefs, 2016, 63(10): 974–978 doi: 10.1109/TCSII.2016.2538119

YANG K J, TSAI S H, and CHUANG G. MDC FFT/IFFT processor with variable length for MIMO-OFDM systems[J]. IEEE Transactions on Very Large Scale Integration(VLSI)Systems, 2013, 21(4): 720–731 doi: 10.1109/TVLSI.2012.2194315

施引文献

资源附件(0)

访问统计