A Novel Adaptive Optimization Strategy for High-Performance CPU Clock Trees

FAN Lingyan (樊凌雁); ZHANG Zhe (张哲); HUANG Cankun (黄灿坤); LUO Jianping (骆建平); LIU Hailuan (刘海銮)

doi:10.11999/JEIT240811

cstr: 32379.14.JEIT240811

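1. Microelectronics Research Institute, Hangzhou Dianzi University, Hangzhou 310018, China
2. Shanghai ESWIN Computing Technology Co., Ltd., Shanghai 200131, China

Funds: The National Natural Science Foundation of China (U22A2071), The National Key Research and Development Program (GG20210104)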

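Author biographies:
FAN Lingyan: female, research fellow. Research interests: data storage and high-speed interface integrated circuit design.

ZHANG Zhe: male, master's student. Research interest: physical design of high-performance CPUs.

HUANG Cankun: male, master's student. Research interest: physical design of high-performance CPUs.

LUO Jianping: male, senior engineer. Research interest: high-performance digital chip design.

LIU Hailuan: male, professor-level senior engineer. Research interests: data storage and high-speed interface integrated circuit design.

Corresponding author:
LIU Hailuan, lloyd.liu@sage-micro.com.cn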

CLC number: TN4; TN431.2
Publication history
- Received: 2024-09-23
- Revised: 2025-03-10
- Available online: 2025-03-19
- Published: 2025-04-01

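Abstract

This paper proposes a novel ADaptive Full-Flow (ADFF) clock tree optimization method based on the RISC-V reduced instruction set architecture. The method makes efficient use of useful skew to optimize the clock tree of a high-performance CPU, meeting the market's dual demand for high performance and low power. Critical paths are selected, and useful skew is created by combining the theoretical delay with inserted buffers. Through loop iteration, the method adaptively repairs, at different stages of the flow, the setup violations and hold violations that the conventional flow cannot resolve. To keep power as low as possible while improving performance, the inserted delay cells are merged, which achieves joint optimization of power and timing. The method is verified on a RISC-V CPU core. The results show that, at a reasonable power cost, the method markedly improves timing and almost completely eliminates the total negative slack.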
Abstract:

Objective  With continuous advances in Integrated Circuit (IC) process technology, chip integration levels have steadily increased, and the market demands ever higher performance. In the era of intelligence and digitalization, an inherent challenge arises: as the number of logic gates grows, both clock frequency and power consumption rise, which imposes stricter requirements on digital IC designers. Existing Electronic Design Automation (EDA) tools already optimize timing by applying useful skew in the clock tree, but the built-in technique has notable limitations. To address them, a novel adaptive full-flow method for correcting clock tree timing violations is proposed. The method corrects timing violations left unresolved by conventional flows while reducing power consumption and improving performance, meeting the market’s dual demand for high-performance, low-power chips.

Methods  The ADaptive Full Flow (ADFF) clock tree optimization method is based on the RISC-V CPU architecture. As an open-source architecture, RISC-V offers openness and flexibility, and it is widely used in high-performance, low-power processor design. The method exploits imbalances in critical-path logic depth: useful skew is introduced to redistribute logic delay between adjacent paths, improving overall performance. Timing feedback is incorporated at multiple stages, forming a joint optimization strategy for power consumption and timing that improves clock tree quality and reduces chip load. Feedback optimization is integrated into both the Clock Tree Synthesis (CTS) and routing stages. In the CTS stage, timing paths are traversed to gather feedback, which is returned to the pre-CTS stage for early intervention. Adaptive iteration identifies the critical paths and resolves setup time violations. In the routing stage, targeted strategies address hold time violations, and the merging method reduces power consumption while preserving the timing correction. This enables full-flow correction of clock tree timing violations while improving power efficiency.

Results and Discussions  The ADFF clock tree optimization strategy is implemented with Synopsys IC Compiler II for placement and routing, establishing an adaptive full-flow framework for correcting clock tree timing violations (Fig. 5). For setup time violations in the reg2reg group, a loop iteration algorithm dynamically adjusts path delays and updates the CTS guidance files to iteratively optimize critical path timing (Fig. 6). With the ADFF method, total timing violations are nearly eliminated, achieving a 55.6% efficiency improvement over the built-in auto useful skew function (Table 2). For hold time violations in the reg2mem group (Figs. 9 and 10), 125 buffers are inserted. Across 50 critical paths, the hold timing margin improves significantly: the hold TNS falls from –362.2 ns to –0.2 ns and the hold WNS from –0.199 ns to –0.024 ns (Table 3). To further optimize timing, when a clock signal is delivered to two physically close registers and buffers in the final clock path stage are used for timing correction, a merging plan consolidates the buffers of multiple registers into a single delay cell (Fig. 11). Through a rigorous filtering mechanism in the script, nearly 700 clock delay cells are removed while timing integrity is maintained (the clock cell count falls from 3754 to 3095, Table 4). The number of clock network nets is also reduced (Table 4), improving clock tree quality, achieving timing convergence, and enhancing overall design efficiency.

Conclusions  This paper proposes a novel ADFF clock tree optimization strategy that integrates loop iteration adjustment with IC Compiler II and leverages useful skew for adaptive, automatic, full-flow correction of setup and hold time violations. The method extends the traditional concept and has demonstrated significant results on a high-performance RISC-V-based CPU, achieving a main clock frequency of 800 MHz. Compared with conventional timing optimization methods, the strategy resolves timing violations that the standard place-and-route flow cannot address, significantly improving timing convergence. Through joint optimization of power consumption and timing, the method reduces the cost and power overhead associated with useful skew optimization and is applicable to CPU pipelines, providing a valuable reference for chip design. Future research may refine the filtering conditions, optimize the script traversal statements, and incorporate mainstream tool techniques to improve path selection efficiency while minimizing runtime overhead in large designs. Further refinement of the mathematical model could also help identify more suitable targets for power optimization, improving overall performance.
Key words: Clock tree; Useful skew; Adaptation; Timing violation; Joint optimization
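
The setup/hold trade-off that useful skew exploits can be written out with the standard single-cycle timing constraints. This is a textbook sketch rather than a formula from the paper, and all symbols are assumed names: $T_{clk}$ is the clock period, $t_{cq}$ the clock-to-Q delay, $t_{\max}$ and $t_{\min}$ the longest and shortest logic delay of the path, $t_{su}$ and $t_{h}$ the register setup and hold times, and $\delta$ the capture clock arrival minus the launch clock arrival.

```latex
\begin{aligned}
\text{slack}_{\text{setup}} &= T_{clk} + \delta - t_{cq} - t_{\max} - t_{su} \\
\text{slack}_{\text{hold}}  &= t_{cq} + t_{\min} - t_{h} - \delta \\
\delta \to \delta + \Delta:\quad
\text{slack}_{\text{setup}} &\to \text{slack}_{\text{setup}} + \Delta, \qquad
\text{slack}_{\text{hold}} \to \text{slack}_{\text{hold}} - \Delta
\end{aligned}
```

Under this model, delaying the capture clock by $\Delta$ buys setup margin on the path at the cost of the same amount of hold margin, and paths launched from the delayed register lose $\Delta$ of setup margin. This is consistent with the flow treating setup fixes and hold fixes at separate stages.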

Fig. 1 Distribution of critical-path logic depth for different architectures

Fig. 2 Basic hierarchical structure of a clock tree

Fig. 3 Basic model of a synchronous sequential circuit

Fig. 4 Setup time and hold time

Fig. 5 Flowchart of the novel ADFF clock tree optimization method

Fig. 6 Loop flowchart for fixing setup violations with useful skew

Fig. 7 Detailed timing of the path

Fig. 8 Example model of manual insert_buffer

Fig. 9 Model for fixing hold violations with useful skew

Fig. 10 Model for fixing hold violations in the reg2mem group

Fig. 11 Physical implementation of merging delay cells

Fig. 12 insert_buffer on the net branch to reg_63

Fig. 13 insert_buffer on the net branch to reg_10

Fig. 14 Layout before and after the merge operation

Table 1 Clock tree quality and timing for different postpone values set in the tool

	Clock QoR			Timing QoR
Postpone value of the tool's auto useful skew (ps)	Clock cells (count)	Nets (count)	Latency (ns)	WNS (ns)	TNS (ns)
0	2547	264888	0.386	–0.260	–855.5
100	2920	265203	0.411	–0.228	–782.5
250	3251	265526	0.504	–0.193	–690.3

Algorithm 1 Adaptive timing optimization of the critical-path list based on clock tree synthesis

For i = 1 to n
    If (S_i > S_th and H_i > H_th) Then
        If (P_i ∉ DelayList) Then
            Δ = D₀
        Else
            D_i = D₀ + Δ
            Δ = D_i
        DelayList ← DelayList ∪ {(P_i, D_i)}
        WriteToGuidanceFile(P_i, D_i)
Run CTS
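
The loop in Algorithm 1 can be illustrated with a small Python sketch. This is one possible reading, not the authors' script: it assumes that a path's capture-clock delay grows by a fixed step per iteration, that both thresholds are zero, and that a linear slack model (hypothetical, standing in for rerunning CTS and timing analysis) is accurate enough. The `Path` class, `plan_useful_skew`, and the toy slack values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One critical path; slacks are in picoseconds (hypothetical toy data)."""
    name: str
    setup_slack: int  # negative means a setup violation
    hold_slack: int   # positive means spare hold margin


def plan_useful_skew(paths, step=20, max_iters=50):
    """Return {path name: extra capture-clock delay in ps}.

    Assumed linear model: delaying the capture clock by d adds d to the
    path's setup slack and subtracts d from its hold slack. A real flow
    would rerun CTS and timing analysis instead of using this model.
    """
    delay_list = {}  # plays the role of DelayList / the guidance file
    for _ in range(max_iters):
        changed = False
        for p in paths:
            d = delay_list.get(p.name, 0)
            setup_now = p.setup_slack + d
            hold_now = p.hold_slack - d
            # Grow the delay only while setup still fails and the hold
            # margin can absorb one more step.
            if setup_now < 0 and hold_now - step >= 0:
                delay_list[p.name] = d + step
                changed = True
        if not changed:  # converged: no path wants more delay
            break
    return delay_list


paths = [Path("a", -50, 100), Path("b", 10, 200), Path("c", -100, 40)]
plan = plan_useful_skew(paths)
print(plan)
```

In this toy run, path `a` receives enough delay to clear its setup violation, path `b` is left alone, and path `c` stops early because its hold margin is used up before the setup violation is fully repaired. That residual case is the kind of path a later hold-aware stage would have to handle.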

Algorithm 2 Suffix-based filtering of timing paths suitable for joint optimization

nets ← get_nets -of [get_cells indelaybuffer -hier]
For i = 1 to n
    pins ← get_pins -of net_i
    select net_i if sizeof_collection(pins) = 2
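
The filter in Algorithm 2 can be mimicked outside the EDA tool with a plain Python sketch. It assumes the netlist is available as a dictionary from net name to pin names, and that delay cells inserted by the flow carry the tag `indelaybuffer` in their instance name. The function name, the data layout, and the sample netlist are hypothetical, and the sketch only reproduces the "exactly two pins" test, not the merge itself.

```python
def single_load_delay_nets(net_pins, tag="indelaybuffer"):
    """Return nets that touch a delay cell and have exactly two pins.

    net_pins maps a net name to its pin names, written "cell/path/PIN".
    """
    selected = []
    for net, pins in net_pins.items():
        # Strip the pin name to recover the hierarchical cell name.
        cells = [pin.rsplit("/", 1)[0] for pin in pins]
        touches_delay_cell = any(tag in cell for cell in cells)
        # Two pins means one driver and one load on this net.
        if touches_delay_cell and len(pins) == 2:
            selected.append(net)
    return sorted(selected)


# Hypothetical toy netlist for illustration only.
netlist = {
    "n1": ["u_core/indelaybuffer_1/Y", "u_core/reg_63/CK"],
    "n2": ["u_core/indelaybuffer_2/Y", "u_core/reg_10/CK", "u_core/reg_11/CK"],
    "n3": ["u_core/clkbuf_7/Y", "u_core/reg_5/CK"],
}
candidates = single_load_delay_nets(netlist)
print(candidates)
```

Only `n1` passes: `n2` already fans out to two registers, and `n3` is not attached to a tagged delay cell.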

Table 2 Timing results of the reg2reg group under different optimization methods (ns)

Optimization strategy	Timing path group	WNS	TNS	WNS (hold)	TNS (hold)
balance case	reg2reg (func_N40CSSG0P81_cworst_postcts)	–0.252	–876.3	–0.187	–45.8
auto useful skew	reg2reg (func_N40CSSG0P81_cworst_postcts)	–0.219	–597.2	–0.192	–54.1
ADFF (this work)	reg2reg (func_N40CSSG0P81_cworst_postcts)	–0.079	–9.100	–0.212	–72.7
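
The relative improvements implied by Table 2 can be checked with a few lines of Python. The definition of "reduction" used here (shrinkage of the violation magnitude relative to the balance case) is an assumption; the source does not state how its 55.6% figure was computed, although the gap between the two WNS reductions under this definition comes out close to that number.

```python
# Values copied from Table 2 (reg2reg group, ns).
wns = {"balance": -0.252, "auto": -0.219, "adff": -0.079}
tns = {"balance": -876.3, "auto": -597.2, "adff": -9.100}


def reduction_pct(before, after):
    """Percentage by which the violation magnitude shrinks."""
    return (abs(before) - abs(after)) / abs(before) * 100


wns_adff = reduction_pct(wns["balance"], wns["adff"])
wns_auto = reduction_pct(wns["balance"], wns["auto"])
tns_adff = reduction_pct(tns["balance"], tns["adff"])

print(f"WNS reduction vs balance: ADFF {wns_adff:.1f}%, auto {wns_auto:.1f}%")
print(f"difference: {wns_adff - wns_auto:.1f} percentage points")
print(f"TNS reduction vs balance: ADFF {tns_adff:.1f}%")
```

Table 2 also shows the hold columns of the reg2reg group getting worse under ADFF (hold TNS from –45.8 ns to –72.7 ns), which is the expected side effect of trading hold margin for setup margin.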

Table 3 Timing results of the reg2mem group under different optimization methods (ns)

Optimization strategy	Timing path group	WNS (hold)	TNS (hold)
balance case	reg2mem (func_125CFFG0P99_rcbest_postcts)	–0.199	–362.2
auto useful skew	reg2mem (func_125CFFG0P99_rcbest_postcts)	–0.162	–237.4
ADFF (this work)	reg2mem (func_125CFFG0P99_rcbest_postcts)	–0.024	–0.2

Table 4 Clock tree quality and timing results in different cases

	Clock QoR			Timing QoR
Optimization strategy	Clock cells (count)	Nets (count)	Latency (ns)	WNS (ns)	TNS (ns)	WNS (hold) (ns)	TNS (hold) (ns)
balance case	2962	272003	0.386	–0.252	–876.3	–0.238	–45.8
ADFF (without merge)	3754	280599	0.411	–0.044	–9.100	–0.020	–0.201
ADFF (with merge)	3095	274423	0.399	–0.040	–8.112	–0.014	–0.122
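
The cost of the inserted delay cells and the saving from the merge step can be read off Table 4 with simple arithmetic. The sketch below only restates the table's counts; the derived percentages are not quoted in the source, and the variable names are illustrative.

```python
# Clock cell and net counts copied from Table 4.
cells = {"balance": 2962, "adff": 3754, "adff_merge": 3095}
nets = {"balance": 272003, "adff": 280599, "adff_merge": 274423}

cells_saved = cells["adff"] - cells["adff_merge"]   # removed by merging
nets_saved = nets["adff"] - nets["adff_merge"]
cells_added = cells["adff"] - cells["balance"]      # cost before merging

# Share of the added clock cells that merging takes back out.
recovered_pct = cells_saved / cells_added * 100
# Remaining clock cell overhead relative to the balance case.
overhead_pct = (cells["adff_merge"] - cells["balance"]) / cells["balance"] * 100

print(f"merge removes {cells_saved} clock cells and {nets_saved} nets")
print(f"recovered {recovered_pct:.1f}% of the added cells")
print(f"remaining overhead vs balance case: {overhead_pct:.1f}%")
```

On these numbers, merging takes back most of the clock cells that the timing fixes added, while the setup and hold figures in the table stay at least as good as without merging.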
