Optimization Algorithm of Dual-port Memory Mapping on FPGA

Yu XU, Yu LIN, Haigang YANG

Citation: Yu XU, Yu LIN, Haigang YANG. Optimization Algorithm of Dual-port Memory Mapping on FPGA[J]. Journal of Electronics & Information Technology, 2020, 42(10): 2549-2556. doi: 10.11999/JEIT190077


doi: 10.11999/JEIT190077
Funds: The National Natural Science Foundation of China (61474120, 61404140, 61704173)
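Details
    About the authors:

    Yu XU: male, born in 1990, Ph.D. candidate; research interest: FPGA software design automation

    Yu LIN: male, born in 1982, senior engineer; research interest: FPGA software design automation

    Haigang YANG: male, born in 1960, research fellow/professor and doctoral supervisor; research interests: microelectronics and integrated circuit technology

    Corresponding author:

    Haigang YANG yanghg@mail.ie.ac.cn

  • CLC number: TN43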

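  • Abstract: An FPGA memory mapping algorithm maps the user's logical memory requirements onto the distributed memory resources of the chip. Mapping algorithms for dual-port memories have received relatively little attention in prior work, and the mapping results of mature commercial EDA tools still leave considerable room for improvement. This paper proposes a dual-port memory mapping optimization algorithm for each of three common metrics (area, delay, and power) and gives concrete configuration schemes. Experiments show that for simple memory requirements the results match those of the commercial tool Vivado, while for complex memory requirements the area-optimized and power-optimized mappings improve on the commercial tool by at least 50%.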
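  • Figure 1  Illustration of the AlgoPower mapping algorithm (view ${V_k}$)

    Figure 2  Illustration of the AlgoDelay mapping algorithm (view ${V_k}$)

    Figure 3  Vertical memory block design for ${V_{{\rm{maxR}} }}$

    Figure 4  Configuration scheme of the AlgoArea mapping strategy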

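    Table 1  Virtex-4 memory block configurations

    | Index i            | 1   | 2   | 3  | 4  | 5  | 6  | 7   |
    |--------------------|-----|-----|----|----|----|----|-----|
    | Address depth d[i] | 256 | 512 | 1k | 2k | 4k | 8k | 16k |
    | Address width b[i] | 8   | 9   | 10 | 11 | 12 | 13 | 14  |
    | Data width w[i]    | 72  | 36  | 18 | 9  | 4  | 2  | 1   |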
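
As a rough illustration of how the configurations in Table 1 can be used, the sketch below encodes the seven entries and counts how many blocks a naive single-configuration tiling would need for a logical memory of a given depth and width. The names `CONFIGS`, `blocks_needed`, and `best_single_config` are hypothetical, and the tiling rule (ceil(depth / d[i]) × ceil(width / w[i])) is a common baseline assumed here for illustration, not the AlgoArea, AlgoDelay, or AlgoPower procedure of the paper.

```python
from math import ceil

# (address depth d[i], address width b[i], data width w[i]) for i = 1..7, per Table 1
CONFIGS = [
    (256, 8, 72),
    (512, 9, 36),
    (1024, 10, 18),
    (2048, 11, 9),
    (4096, 12, 4),
    (8192, 13, 2),
    (16384, 14, 1),
]


def blocks_needed(depth, width, config):
    """Blocks used when a depth x width logical memory is tiled with one configuration."""
    d, _, w = config
    return ceil(depth / d) * ceil(width / w)


def best_single_config(depth, width):
    """Return (block count, config) with the smallest naive tiling count.

    Ties are broken by the smaller address depth, because tuples compare
    element by element.
    """
    return min((blocks_needed(depth, width, c), c) for c in CONFIGS)


if __name__ == "__main__":
    # Hypothetical request: 10 address bits (depth 1024), 32 data bits.
    print(best_single_config(1024, 32))
```

Each entry satisfies d[i] = 2^b[i], so the address width column is redundant with the depth column; the sketch keeps both only to mirror the table.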

    Table 2  AlgoPower mapping strategy scheme

    | No. | View  | Configuration | Address port connection | Data port connection | Decode/select |
    |-----|-------|---------------|-------------------------|----------------------|---------------|
    | #1  | V1    | 1k×18         | a9 a8 ··· a1 a0         | d33 d32 ··· d17 d16  |               |
    | #1  | V2    | 1k×18         | a10 a9 ··· a2 a1        | d25 d24 ··· d9 d8    | (a0) = 0      |
    | #1  | V3 V4 | 1k×18         | a11 a10 ··· a3 a2       | d21 d20 ··· d5 d4    | (a1 a0) = 00  |
    | #2  | V1    | 1k×18         | a9 a8 ··· a1 a0         | d51 d50 ··· d35 d34  |               |
    | #2  | V2    | 1k×18         | a10 a9 ··· a2 a1        | d43 d42 ··· d27 d26  | (a0) = 0      |
    | #2  | V3 V4 | 1k×18         | a11 a10 ··· a3 a2       | d21 d20 ··· d5 d4    | (a1 a0) = 01  |
    | #3  | V1    | 1k×18         | a9 a8 ··· a1 a0         | d69 d68 ··· d53 d52  |               |
    | #3  | V2    | 1k×18         | a10 a9 ··· a2 a1        | d25 d24 ··· d9 d8    | (a0) = 1      |
    | #3  | V3 V4 | 1k×18         | a11 a10 ··· a3 a2       | d21 d20 ··· d5 d4    | (a1 a0) = 10  |
    | #4  | V1    | 1k×18         | a9 a8 ··· a1 a0         | d87 d86 ··· d71 d70  |               |
    | #4  | V2    | 1k×18         | a10 a9 ··· a2 a1        | d43 d42 ··· d27 d26  | (a0) = 1      |
    | #4  | V3 V4 | 1k×18         | a11 a10 ··· a3 a2       | d21 d20 ··· d5 d4    | (a1 a0) = 11  |
    | #5  | V1    | 1k×18         | a9 a8 ··· a1 a0         | d15 d14 ··· d1 d0    |               |
    | #5  | V2    | 2k×9          | a10 a9 ··· a1 a0        | d7 d6 ··· d1 d0      |               |
    | #5  | V3 V4 | 4k×4          | a11 a10 ··· a1 a0       | d3 d2 d1 d0          |               |
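
One way to read the decode/select column of Table 2 is as a per-view enable condition on each of the five blocks. The sketch below tabulates those conditions and returns the blocks that would be enabled for a given view and address. It assumes that an empty decode cell means the block is always enabled for that view and that two-bit patterns are written in the order (a1 a0); the names `DECODE` and `enabled_blocks` are hypothetical. This is an interpretation of the table, not code from the paper.

```python
# Decode/select conditions from Table 2:
#   block number -> {view: (number of low address bits, required value)}
# None encodes an empty decode cell, ASSUMED here to mean "always enabled".
DECODE = {
    1: {"V1": None, "V2": (1, 0b0), "V3": (2, 0b00), "V4": (2, 0b00)},
    2: {"V1": None, "V2": (1, 0b0), "V3": (2, 0b01), "V4": (2, 0b01)},
    3: {"V1": None, "V2": (1, 0b1), "V3": (2, 0b10), "V4": (2, 0b10)},
    4: {"V1": None, "V2": (1, 0b1), "V3": (2, 0b11), "V4": (2, 0b11)},
    5: {"V1": None, "V2": None, "V3": None, "V4": None},
}


def enabled_blocks(view, address):
    """List the block numbers whose decode condition matches `address` in `view`."""
    blocks = []
    for block, rules in DECODE.items():
        rule = rules[view]
        if rule is None:
            blocks.append(block)  # no decode condition listed for this view
            continue
        nbits, value = rule
        # Compare the lowest `nbits` address bits with the required pattern.
        if address & ((1 << nbits) - 1) == value:
            blocks.append(block)
    return blocks


if __name__ == "__main__":
    for view, address in [("V1", 0b111), ("V2", 0b10), ("V3", 0b110)]:
        print(view, bin(address), enabled_blocks(view, address))
```

Under these assumptions an access through V1 enables all five blocks, an access through V2 enables three, and an access through V3 or V4 enables two.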

    Table 3  AlgoDelay mapping strategy scheme

    | No. | View  | Configuration | Address port connection | Data port connection    | Decode/select      |
    |-----|-------|---------------|-------------------------|-------------------------|--------------------|
    | #1  | V1    | 8k×2          | a12 a11 ··· a1 a0       | d1 d0                   | (a′15 a′14) = 00   |
    | #1  | V2    | 4k×4          | a11 a10 ··· a1 a0       | d3 d2 d1 d0             | (a′14 a′13) = 00   |
    | #1  | V3 V4 | 2k×8          | a10 a9 ··· a1 a0        | d7 d6 d5 d4 d3 d2 d1 d0 | (a′13 a′12) = 00   |
    | #2  | V1    | 8k×2          | a12 a11 ··· a1 a0       | d1 d0                   | (a′15 a′14) = 11   |
    | #2  | V2    | 4k×4          | a11 a10 ··· a1 a0       | d3 d2 d1 d0             | (a′14 a′13) = 11   |
    | #2  | V3 V4 | 2k×8          | a10 a9 ··· a1 a0        | d7 d6 d5 d4 d3 d2 d1 d0 | (a′13 a′12) = 11   |
    | #3  | V1    | 16k×1         | a12 a11 ··· a1 a0       | d1                      | (a′15 a′14) = 01   |
    | #3  | V2    | 8k×2          | a11 a10 ··· a1 a0       | d3 d1                   | (a′14 a′13) = 01   |
    | #3  | V3 V4 | 4k×4          | a10 a9 ··· a1 a0        | d7 d5 d3 d1             | (a′13 a′12) = 01   |
    | #4  | V1    | 16k×1         | a12 a11 ··· a1 a0       | d0                      | (a′15 a′14) = 01   |
    | #4  | V2    | 8k×2          | a11 a10 ··· a1 a0       | d2 d0                   | (a′14 a′13) = 01   |
    | #4  | V3 V4 | 4k×4          | a10 a9 ··· a1 a0        | d6 d4 d2 d0             | (a′13 a′12) = 01   |
    | #5  | V1    | 4k×4          | a13 a12 ··· a1 a0       | d3 d2 d1 d0             | (a′15 a′14) = 10   |
    | #5  | V2    | 2k×8          | a12 a11 ··· a1 a0       | d7 d6 d5 d4 d3 d2 d1 d0 | (a′14 a′13) = 10   |
    | #5  | V3 V4 | 1k×16         | a11 a10 ··· a1 a0       | d15 d14 ··· d1 d0       | (a′13 a′12) = 10   |

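    Table 4  Power optimization experimental results

    | No. | Address 1 | Read 1 | Write 1 | Address 2 | Read 2 | Write 2 | Vivado average activation count | AlgoPower average activation count | Improvement (%) |
    |-----|-----------|--------|---------|-----------|--------|---------|---------------------------------|------------------------------------|-----------------|
    | 1   | 10        | 32     | 32      | 10        | 32     | 32      | 1                               | 1.00                               | 0               |
    | 2   | 10        | 32     | 64      | 10        | 32     | 256     | 8                               | 1.14                               | 85.7            |
    | 3   | 11        | 16     | 32      | 10        | 32     | 256     | 8                               | 1.09                               | 86.4            |
    | 4   | 11        | 16     | 32      | 10        | 32     | 128     | 4                               | 1.06                               | 73.5            |
    | 5   | 11        | 16     | 256     | 10        | 32     | 128     | 8                               | 1.33                               | 83.4            |
    | 6   | 11        | 32     | 32      | 10        | 64     | 64      | 2                               | 1.00                               | 50.0            |
    | 7   | 11        | 32     | 64      | 10        | 64     | 128     | 4                               | 1.11                               | 72.2            |
    | 8   | 11        | 16     | 32      | 11        | 16     | 128     | 4                               | 1.05                               | 73.8            |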
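
The improvement column of Table 4 is consistent with a relative reduction, (baseline − optimized) / baseline, expressed as a percentage. A small sketch under that assumption follows; the function name `improvement_percent` is hypothetical, and the inputs are the rounded values printed in the table, so the last digit can differ slightly from the published column.

```python
def improvement_percent(baseline, optimized):
    """Relative reduction of `optimized` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


# (Vivado average activation count, AlgoPower average activation count)
# copied from rows 1, 4, and 6 of Table 4.
ROWS = {1: (1, 1.00), 4: (4, 1.06), 6: (2, 1.00)}

if __name__ == "__main__":
    for number, (vivado, algo_power) in ROWS.items():
        print(number, round(improvement_percent(vivado, algo_power), 1))
```

For these three rows the sketch reproduces the published values 0, 73.5, and 50.0. The same formula also matches the area table, for example 8 blocks reduced to 4 gives 50%.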

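    Table 5  Delay optimization experimental results

    | No. | V1 (rd) | V2 (rd) | V3 (wr) | V4 (wr) | AlgoDelay select levels |
    |-----|---------|---------|---------|---------|-------------------------|
    | 1   | 36k×4   | 72k×2   | 18k×8   | 9k×16   | V1: 2, V2: 3            |
    | 2   | 81k×16  | 324k×4  | 162k×8  | 162k×8  | V1: 3, V2: 5            |
    | 3   | 18k×4   | 9k×8    | 36k×2   | 18k×4   | V1: 2, V2: 1            |
    | 4   | 4k×25   | 2k×50   | 1k×100  | 4k×25   | V1: 1, V2: 1            |
    | 5   | 16k×4   | 32k×2   | 8k×8    | 4k×16   | V1: 1, V2: 2            |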

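    Table 6  Area optimization experimental results

    | No. | Address 1 | Read 1 | Write 1 | Address 2 | Read 2 | Write 2 | Vivado resources used | AlgoArea resources used | Improvement (%) |
    |-----|-----------|--------|---------|-----------|--------|---------|-----------------------|-------------------------|-----------------|
    | 1   | 10        | 32     | 32      | 10        | 32     | 32      | 2                     | 2                       | 0               |
    | 2   | 10        | 32     | 64      | 10        | 32     | 256     | 8                     | 4                       | 50              |
    | 3   | 11        | 16     | 32      | 10        | 32     | 256     | 8                     | 4                       | 50              |
    | 4   | 11        | 16     | 32      | 10        | 32     | 128     | 4                     | 2                       | 50              |
    | 5   | 11        | 16     | 256     | 10        | 32     | 128     | 4                     | 4                       | 0               |
    | 6   | 11        | 32     | 32      | 10        | 64     | 64      | 4                     | 4                       | 0               |
    | 7   | 11        | 32     | 64      | 10        | 64     | 128     | 4                     | 4                       | 0               |
    | 8   | 11        | 16     | 32      | 11        | 16     | 128     | 4                     | 2                       | 50              |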
Publication history
  • Received:  2019-01-28
  • Revised:  2020-01-20
  • Available online:  2020-07-20
  • Published:  2020-10-13
