A High-Throughput Hardware Design for AV1 Rough Mode Decision

SHENG Qinghua; TAO Zehao; HUANG Xiaofang; LAI Changcai; HUANG Xiaofeng; YIN Haibing; DONG Zhekang

doi:10.11999/JEIT240823

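1. School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
2. School of Information Engineering, Hangzhou Dianzi University, Hangzhou 311305, China
3. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China

Funds: The National Key R&D Program of China (2023YFB4502804)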

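Details

About the authors:
SHENG Qinghua: male, associate professor. Research interests include video coding, FPGA hardware acceleration, and electronic system integration.

TAO Zehao: male, master's student. Research interests include video encoding and decoding and FPGA hardware acceleration.

HUANG Xiaofang: female, lecturer. Research interests include video coding and embedded applications.

LAI Changcai: male, senior engineer. Research interests include image and video compression, intelligent processing, and their software and hardware acceleration.

HUANG Xiaofeng: male, associate professor. Research interests include video encoding and decoding and chip architecture design.

YIN Haibing: male, professor. Research interests include digital video encoding and decoding, multimedia signal processing, and chip architecture design and verification.

DONG Zhekang: male, associate professor. Research interests include memristors and memristive systems and artificial neural networks.

Corresponding author:
HUANG Xiaofang　20221016@hdu.edu.cn

CLC number: TN919.8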
Publication history
- Received: 2024-09-27
- Revised: 2025-01-02
- Published online: 2025-01-09

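Abstract

As video coding standards continue to evolve, the Alliance for Open Media (AOM) has released its latest video coding standard, AV1. Its intra coding adopts a richer set of prediction modes to improve prediction efficiency, expanding the number of modes from 10 in VP9 to 61. To cope with this increase and to raise hardware processing throughput, this paper proposes a hardware architecture for AV1 Rough Mode Decision (RMD) based on a fully pipelined structure. At the algorithm level, a 4×4 block is the minimum processing unit, and RMD is performed in Z-order on Prediction Units (PUs) of different sizes within a 64×64 Coding Tree Unit (CTU); the costs of 1:2, 1:4, 2:1, and 4:1 PUs are obtained with a cost accumulation approximation based on 1:1 PUs to reduce computational complexity. At the hardware level, a single RMD circuit compatible with PU sizes from 4×4 to 32×32 replaces separate circuits for each PU size, which effectively reduces idle logic resources. Experimental results show that, under the All-Intra (AI) configuration, the improved algorithm saves 45.78% of the time on average compared with the AV1 standard algorithm, with a 1.94% increase in BD-Rate. The proposed hardware architecture completes RMD for a 64×64 CTU within 1057 clock cycles. Synthesized with Synopsys Design Compiler 2016 and the UMC 28 nm process library, the design can process 8k@50.6fps video in real time at an operating frequency of 432.7 MHz.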
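The Z-order traversal of 4×4 blocks described in the abstract can be illustrated with a short sketch. It assumes the common Morton (bit-interleaved) definition of Z-order over the 16×16 grid of 4×4 blocks in a 64×64 CTU. The function names are hypothetical, and the paper's exact scan details are not given on this page.

```python
def z_order_index(bx: int, by: int, bits: int = 4) -> int:
    """Interleave the bits of block column bx and block row by (Morton code)."""
    idx = 0
    for i in range(bits):
        idx |= ((bx >> i) & 1) << (2 * i)       # x bits go to even positions
        idx |= ((by >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return idx


def z_order_scan(ctu_size: int = 64, block: int = 4):
    """Return the pixel offsets (x, y) of all blocks in a CTU, in Z-order."""
    n = ctu_size // block                        # 16 blocks per side for 64/4
    bits = max(1, (n - 1).bit_length())
    blocks = [(bx, by) for by in range(n) for bx in range(n)]
    blocks.sort(key=lambda b: z_order_index(b[0], b[1], bits))
    return [(bx * block, by * block) for bx, by in blocks]


scan = z_order_scan()
print(len(scan))   # 256 blocks of 4x4 in a 64x64 CTU
print(scan[:4])    # [(0, 0), (4, 0), (0, 4), (4, 4)]: the first 8x8 region
```

With this ordering, the four 4×4 blocks of each 8×8 region are visited consecutively, then the four 8×8 regions of each 16×16 region, and so on, which is what makes it natural to build larger PUs from smaller ones as the scan proceeds.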
Abstract:
Objective: As demand for 4K and 8K Ultra High Definition (UHD) video increases, the latest generation of video coding standards has been developed to meet the growing need for UHD video transmission. UHD video coding requires processing more pixels and details, which significantly increases computational complexity and resource consumption. Optimizing algorithms and implementing hardware acceleration are essential for real-time encoding and decoding of UHD video. Alliance for Open Media Video 1 (AV1) introduces richer intra-prediction modes, expanding the number of modes from 10 in VP9 to 61 and thereby increasing computational complexity. To address this added complexity and enhance hardware processing throughput, a hardware design for AV1 Rough Mode Decision (RMD) based on a fully pipelined architecture is proposed.

Methods: At the algorithm level, a 4×4 block is used as the minimum processing unit. RMD is applied to Prediction Units (PUs) of various sizes within a 64×64 Coding Tree Unit (CTU) following Z-order scanning, so that large blocks are processed as a sequence of smaller units. To reduce computational complexity, the SATD costs of PUs with 1:2, 1:4, 2:1, and 4:1 aspect ratios are obtained with a cost accumulation approximation based on 1:1 PUs, which avoids recalculating costs for every PU shape. At the hardware level, the architecture supports RMD for PU sizes from 4×4 to 32×32 within a 64×64 CTU. Unlike traditional designs that use a separate circuit for each PU size, a single compatible circuit is shared across sizes, which reduces idle logic resources. The design uses a 28-stage pipeline that processes intra-prediction modes in parallel and performs RMD for at least 16 pixels per clock cycle, significantly enhancing throughput. The design also emphasizes circuit compatibility and reuse across PU sizes to reduce redundancy and raise hardware resource utilization.

Results and Discussions: Software analysis shows that, under the All-Intra (AI) configuration, the proposed RMD algorithm reduces processing time by an average of 45.78% compared with the standard AV1 algorithm, at the cost of a 1.94% increase in BD-Rate. The testing platform is an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz with 16.0 GB of DRAM. Compared with existing methods, the algorithm trades a slight BD-Rate loss for a substantial reduction in encoding time. Hardware analysis shows that the proposed architecture has a total circuit area of 0.556 mm² after synthesis and a maximum operating frequency of 432.7 MHz, enabling real-time encoding of 8k@50.6fps video. Although the circuit area is slightly larger than in existing designs, the architecture improves processing speed and supported video resolution, providing a balanced trade-off between hardware resource usage and throughput per unit area.

Conclusions: This paper presents a high-throughput hardware design for AV1 RMD that processes all PU sizes with 56 directional and 5 non-directional prediction modes. The design employs a 28-stage pipeline that processes intra-prediction modes in parallel and performs RMD for at least 16 pixels per clock cycle. Techniques such as false-reconstructed reference pixels, Z-order scanning, PMCM circuit structures, and circuit reuse address the increased hardware resource demands of parallel processing. Experimental results show that the proposed algorithm reduces processing time by an average of 45.78% with a 1.94% increase in BD-Rate compared with the AV1 standard algorithm. Circuit synthesis confirms that the architecture can process 8k@50.6fps video in real time, which meets the throughput demands of UHD video encoding.
- Alliance for Open Media Video 1 (AV1) /
- Intra prediction /
- Rough Mode Decision (RMD) /
- Video coding /
- Pipeline
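The SATD cost and the cost accumulation approximation mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: SATD is taken as the sum of absolute values of an unnormalized 4×4 Hadamard transform of the residual, and the cost of a non-square PU for a given mode is approximated as the plain sum of the costs of the 1:1 PUs it covers. The helper names `satd4x4` and `accumulated_cost` are hypothetical, and the paper's exact normalization and accumulation rule are not given on this page.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1)
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]


def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def satd4x4(orig, pred):
    """Unnormalized SATD of a 4x4 block: sum |H * (orig - pred) * H^T|."""
    resid = [[orig[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
    h_t = [list(col) for col in zip(*H4)]
    coeffs = matmul(matmul(H4, resid), h_t)
    return sum(abs(v) for row in coeffs for v in row)


def accumulated_cost(square_pu_costs):
    """Assumed approximation: non-square PU cost = sum of covered 1:1 PU costs."""
    return sum(square_pu_costs)


flat = lambda v: [[v] * 4 for _ in range(4)]
cost_a = satd4x4(flat(10), flat(8))   # constant residual of 2 -> 32
cost_b = satd4x4(flat(10), flat(9))   # constant residual of 1 -> 16
print(accumulated_cost([cost_a, cost_b]))   # 48
```

The appeal of such an approximation is that the costs of the square PUs are already available from the Z-order pass, so non-square PUs need only additions rather than a second prediction and transform.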

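Figure 1 Overall RMD hardware architecture

Figure 2 Flowchart of the hardware RMD implementation

Figure 3 Space-time diagram of the overall architecture

Figure 4 Reference pixel padding for a 4×4 PU

Figure 5 Input order

Figure 6 Hardware design for directional modes

Figure 7 Hardware design for DC mode

Figure 8 Hardware design for smooth modes

Figure 9 PMCM hardware design for smooth-mode weights

Figure 10 Hardware design for Paeth mode

Figure 11 Hardware design of SATD cost calculation for a 4×4 PU

Figure 12 Example of bitonic sorting of an unordered sequence of length 8

Figure 13 Hardware design of bitonic sorting for an input sequence of length 8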
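Figures 12 and 13 refer to a bitonic sort over 8 inputs. As a rough software analogue, the sketch below implements the standard bitonic sorting network for any power-of-two length. The page does not state what the hardware sorts, so the input values here are purely illustrative, and the function name is hypothetical.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence ascending with a bitonic network."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # compare-exchange distance within a merge
            for i in range(n):
                partner = i ^ j
                if partner > i:   # handle each pair once
                    ascending = (i & k) == 0
                    if (ascending and a[i] > a[partner]) or \
                       (not ascending and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([5, 3, 8, 1, 7, 2, 6, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

For 8 inputs the network has 6 compare-exchange stages of 4 comparators each, and the comparisons within a stage are independent, which is why this structure maps well onto pipelined hardware.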

Table 1 Performance comparison between the improved algorithm and the AV1 standard algorithm (%)

Test sequence	BD-Rate	TS
A1 (UHD 4K)	2.21	49.2
A2 (UHD 4K)	1.77	46.4
B (1080P)	1.93	48.1
C (480P)	2.23	38.4
E (720P)	1.56	46.8
Average	1.94	45.78
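The averages in Table 1 can be reproduced from the per-class rows. The sketch below also includes a time-saving helper that assumes the usual definition TS = (T_ref − T_test) / T_ref × 100; the page does not spell out the formula, so that definition and the function name are assumptions.

```python
def time_saving(t_ref: float, t_test: float) -> float:
    """Time saving in percent, assuming TS = (T_ref - T_test) / T_ref * 100."""
    return (t_ref - t_test) / t_ref * 100.0


# (BD-Rate %, TS %) per test class, copied from Table 1
table1 = {
    "A1 (UHD 4K)": (2.21, 49.2),
    "A2 (UHD 4K)": (1.77, 46.4),
    "B (1080P)": (1.93, 48.1),
    "C (480P)": (2.23, 38.4),
    "E (720P)": (1.56, 46.8),
}

avg_bd_rate = sum(bd for bd, _ in table1.values()) / len(table1)
avg_ts = sum(ts for _, ts in table1.values()) / len(table1)
print(f"{avg_bd_rate:.2f} {avg_ts:.2f}")   # 1.94 45.78
```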

Table 2 Comparison of the proposed algorithm with existing work (%)

Reference	BD-Rate	TS
[33]	1.28	29.80
[34]	7.41	50.19
[35]	0.60	15.36
This work	1.94	45.78

Table 3 Comparison of ASIC-based RMD-related hardware designs

Metric	Ref. [36]	Ref. [37]	Ref. [38]	Ref. [39]	This work
Process	TSMC 40 nm	TSMC 40 nm	TSMC 40 nm	TSMC 40 nm	UMC 28 nm
Gate count (Kgates)	455.8	821.8	584.8	128.5	1011.3
Operating frequency (MHz)	1,296	1,902	1,296	648	432.7
Clock cycles (Cycle)	7104	7104	7104	7104	1057
Power (mW)	40.9	1613.3	4110.0	65.5	1891.6
Throughput	4k@60fps	4k@60fps	4k@60fps	4k@30fps	8k@50.6fps
Throughput/area (px/gate)	1091.85	605.55	850.93	1936.44	1660.03
Non-directional prediction	×	×	×	√	√
Directional prediction	√	√	√	×	√
Mode decision	×	×	×	×	√
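The throughput figures in Table 3 can be sanity-checked with simple arithmetic. The sketch below assumes 8K means 7680×4320 and 4K means 3840×2160, ignores any padding of frames to whole CTUs, and assumes that the throughput/area metric is pixels per second divided by gate count. That last assumption is inferred from the numbers rather than stated on this page; it reproduces the entries for [36], [39], and this work.

```python
CTU_PIXELS = 64 * 64


def frames_per_second(freq_hz, cycles_per_ctu, width, height):
    """Frame rate if one 64x64 CTU is finished every cycles_per_ctu cycles."""
    pixels_per_second = freq_hz / cycles_per_ctu * CTU_PIXELS
    return pixels_per_second / (width * height)


def throughput_per_gate(width, height, fps, kgates):
    """Pixels per second divided by gate count (assumed meaning of px/gate)."""
    return width * height * fps / (kgates * 1000)


fps = frames_per_second(432.7e6, 1057, 7680, 4320)
print(f"{fps:.1f} fps")

print(f"{throughput_per_gate(7680, 4320, 50.6, 1011.3):.2f}")   # 1660.03 (this work)
print(f"{throughput_per_gate(3840, 2160, 60, 455.8):.2f}")      # 1091.85 (Ref. [36])
print(f"{throughput_per_gate(3840, 2160, 30, 128.5):.2f}")      # 1936.44 (Ref. [39])
```

Under these assumptions the frame-rate estimate comes out at about 50.5 fps, close to the reported 8k@50.6fps; the small gap presumably reflects details of the cycle count or frame size that are not given on this page.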

References (39)

[1]	BENDER I, BORGES A, AGOSTINI L, et al. Complexity and compression efficiency analysis of libaom AV1 video codec[J]. Journal of Real-Time Image Processing, 2023, 20(3): 50. doi: 10.1007/s11554-023-01308-5.
[2]	REN Huiwen, WANG Shanshe, MA Siwei, et al. SVT-AVS3: An open-source high-performance AVS3 encoder with scalable video technology[J]. IEEE Transactions on Multimedia, 2024, 26: 3291–3301. doi: 10.1109/TMM.2023.3309549.
[3]	LEE M, SONG H J, PARK J, et al. Overview of versatile video coding (H.266/VVC) and its coding performance analysis[J]. IEIE Transactions on Smart Processing & Computing, 2023, 12(2): 122–154. doi: 10.5573/IEIESPC.2023.12.2.122.
[4]	MUKHERJEE D, HAN Jingning, BANKOSKI J, et al. A technical overview of VP9—the latest open-source video codec[J]. SMPTE Motion Imaging Journal, 2015, 124(1): 44–54. doi: 10.5594/j18499.
[5]	LIN Hao and RAO Feng. Analysis on the development trend of AV1 video coding standard in China[J]. Radio & Television Information, 2023, 30(2): 62–64. doi: 10.16045/j.cnki.rti.2023.02.022.
[6]	DU Hongqing. Research on high efficiency intra prediction algorithm for next generation video coding[D]. [Master dissertation], Xidian University, 2023. doi: 10.27389/d.cnki.gxadu.2023.001917.
[7]	GROIS D, GILADI A, CHOI K, et al. Performance comparison of emerging EVC and VVC video coding standards with HEVC and AV1[J]. SMPTE Motion Imaging Journal, 2021, 130(4): 1–12. doi: 10.5594/JMI.2021.3065442.
[8]	UHRINA M, SEVCIK L, BIENIK J, et al. Performance comparison of VVC, AV1, HEVC, and AVC for high resolutions[J]. Electronics, 2024, 13(5): 953. doi: 10.3390/electronics13050953.
[9]	LIU Chang, JIA Kebin, and LIU Pengyu. Fast partition algorithm in depth map intra-frame coding unit based on multi-branch network[J]. Journal of Electronics & Information Technology, 2022, 44(12): 4357–4366. doi: 10.11999/JEIT211010.
[10]	WANG Yizhao, ZHANG Chaobo, and SUN Songlin. Intra prediction fast algorithm in AVS3 based on image texture characteristics[C]. 2021 20th International Symposium on Communications and Information Technologies, Tottori, Japan, 2021: 6–10. doi: 10.1109/ISCIT52804.2021.9590620.
[11]	ZHANG Yongfei, LI Zhe, LI Bo, et al. Gradient-based fast decision for intra prediction in HEVC[C]. 2012 Visual Communications and Image Processing, San Diego, USA, 2012: 1–6. doi: 10.1109/VCIP.2012.6410739.
[12]	ZHU Linwei, ZHANG Yun, LI Na, et al. Deep learning-based intra mode derivation for versatile video coding[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(2s): 96. doi: 10.1145/356369.
[13]	DUARTE A, ZATT B, CORREA G, et al. Fast intra mode decision using machine learning for the versatile video coding standard[C]. 2023 IEEE International Symposium on Circuits and Systems, Monterey, USA, 2023: 1–5. doi: 10.1109/ISCAS46773.2023.10181769.
[14]	STORCH I, ROMA N, PALOMINO D, et al. GPU acceleration of MIP intra prediction in VVC[C]. 2023 31st European Signal Processing Conference, Helsinki, Finland, 2023: 600–604. doi: 10.23919/EUSIPCO58844.2023.10290037.
[15]	HAN Xu, WANG Shanshe, MA Siwei, et al. Optimization of motion compensation based on GPU and CPU for VVC decoding[C]. 2020 IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 2020: 1196–1200. doi: 10.1109/ICIP40778.2020.9190708.
[16]	CORRÊA M, WASKOW B, ZATT B, et al. High throughput hardware design for AV1 Paeth and smooth intra modes[C]. 2019 IEEE International Symposium on Circuits and Systems, Sapporo, Japan, 2019: 1–5. doi: 10.1109/ISCAS.2019.8702258.
[17]	CAI Zhanyuan and GAO Wei. Efficient fast algorithm and parallel hardware architecture for intra prediction of AVS3[C]. 2021 IEEE International Symposium on Circuits and Systems, Daegu, South Korea, 2021: 1–5. doi: 10.1109/ISCAS51556.2021.9401121.
[18]	HUANG Xiaofeng, JIA Huizhu, CAI Binbin, et al. Fast algorithms and VLSI architecture design for HEVC intra-mode decision[J]. Journal of Real-Time Image Processing, 2016, 12(2): 285–302. doi: 10.1007/s11554-015-0549-8.
[19]	CORRÊA M, WASKOW B, GOEBEL J, et al. A high throughput hardware architecture targeting the AV1 Paeth intra predictor[C]. 2019 IEEE 10th Latin American Symposium on Circuits & System, Armenia, Colombia, 2019: 93–96. doi: 10.1109/LASCAS.2019.8667544.
[20]	LIU Pengyu, ZHANG Yue, JIA Kebin, et al. Adaptive video frame type decision algorithm based on local luminance histogram[J]. Journal of Electronics & Information Technology, 2023, 45(1): 300–307. doi: 10.11999/JEIT211199.
[21]	SU Weitong, XIANG Guoqing, HUANG Xiaofeng, et al. Fast algorithm and VLSI architecture design of rough mode decision for AVS3[C]. 2023 IEEE International Conference on Consumer Electronic, Las Vegas, USA, 2023: 1–4. doi: 10.1109/ICCE56470.2023.10043565.
[22]	QI Meibin, CHEN Xiuli, YANG Yanfang, et al. Fast coding unit splitting algorithm for high efficiency video coding intra prediction[J]. Journal of Electronics & Information Technology, 2014, 36(7): 1699–1705. doi: 10.3724/SP.J.1146.2013.01148.
[23]	CHEN Yue, MUKHERJEE D, HAN Jingning, et al. An overview of coding tools in AV1: The first video codec from the alliance for open media[J]. APSIPA Transactions on Signal and Information Processing, 2020, 9(1): e6. doi: 10.1017/ATSIP.2020.2.
[24]	HAKKENNES E A and VASSILIADIS S. Hardwired Paeth codec for portable network graphics (PNG)[C]. Proceedings 25th EUROMICRO Conference. Informatics: Theory and Practice for the New Millennium, Milan, Italy, 1999: 318–325. doi: 10.1109/EURMIC.1999.794796.
[25]	PAETH A W. Image file compression made easy[M]. ARVO J. Graphics Gems II. Amsterdam: Elsevier, 1991: 93–100. doi: 10.1016/B978-0-08-050754-5.50029-3.
[26]	STORCH I, ROMA N, PALOMINO D, et al. Alternative reference samples to improve coding efficiency for parallel intra prediction solutions[C]. 2024 IEEE 15th Latin America Symposium on Circuits and Systems, Punta del Este, Uruguay, 2024: 1–5. doi: 10.1109/LASCAS60203.2024.10506142.
[27]	KUMM M. Multiple Constant Multiplication Optimizations for Field Programmable Gate Arrays[M]. Wiesbaden: Springer, 2016. doi: 10.1007/978-3-658-13323-8.
[28]	LIACHA A, OUDJIDA A K, BAKIRI M, et al. Radix-2^r recoding with common subexpression elimination for multiple constant multiplication[J]. IET Circuits, Devices & Systems, 2020, 14(7): 990–994. doi: 10.1049/iet-cds.2020.0213.
[29]	MOHAMED H, ELLIETHY A, ABDELAZIZ A, et al. Real-time motion estimation based video steganography with preserved consistency and local optimality[J]. Multimedia Tools and Applications, 2024: 1–24. doi: 10.1007/s11042-024-18651-9.
[30]	CHEN Shushi, HUANG Leilei, LIU Jiahao, et al. An error-surface-based fractional motion estimation algorithm and hardware implementation for VVC[C]. 2023 IEEE International Symposium on Circuits and Systems, Monterey, USA, 2023: 1–5. doi: 10.1109/ISCAS46773.2023.10182170.
[31]	YANG Mouzhi, ZHANG Peng, FANG Jianbin, et al. thSORT: An efficient parallel sorting algorithm on multi-core DSPs[J]. CCF Transactions on High Performance Computing, 2024, 6(5): 503–518. doi: 10.1007/s42514-023-00175-7.
[32]	ESMAILI-DOKHT P, GUIOT M, RADOJKOVIĆ P, et al. O(n) key–value sort with active compute memory[J]. IEEE Transactions on Computers, 2024, 73(5): 1341–1356. doi: 10.1109/TC.2024.3371773.
[33]	CORRÊA M M. Heuristic-based algorithms and hardware designs for fast intra-picture prediction in AV1 video coding[D]. [Ph. D. dissertation], Universidade Federal de Pelotas, 2023.
[34]	ROSA P, PALOMINO D, PORTO M, et al. GM-RF: An AV1 intra-frame fast decision based on random forest[C]. 2022 IEEE International Conference on Image Processing, Bordeaux, France, 2022: 3556–3560. doi: 10.1109/ICIP46576.2022.9897488.
[35]	CORRÊA M, ROMA N, PALOMINO D, et al. Mode-adaptive subsampling of SAD/SSE operations for intra prediction cost reduction[C]. 2022 IEEE International Symposium on Circuits and Systems, Austin, USA, 2022: 1808–1812. doi: 10.1109/ISCAS48785.2022.9937507.
[36]	CORRÊA M, NETO L, PALOMINO D, et al. ASIC solution for the directional intra prediction of the AV1 encoder targeting UHD 4K videos[C]. 2020 IEEE International Symposium on Circuits and Systems, Seville, Spain, 2020: 1–5. doi: 10.1109/ISCAS45731.2020.9180526.
[37]	NETO L, CORRÊA M, PALOMINO D, et al. Directional intra frame prediction architecture with edge filter and upsampling for AV1 video coding[C]. 2020 33rd Symposium on Integrated Circuits and Systems Design, Campinas, Brazil, 2020: 1–6. doi: 10.1109/SBCCI50935.2020.9189902.
[38]	NETO L, CORRÊA M, PALOMINO D, et al. Exploring operation sharing in directional intra frame prediction of AV1 video coding[C]. 2021 IEEE 12th Latin America Symposium on Circuits and System, Arequipa, Peru, 2021: 1–4. doi: 10.1109/LASCAS51355.2021.9459136.
[39]	CORRÊA M M, WASKOW B H, GOEBEL J W, et al. A high-throughput hardware architecture for AV1 non-directional intra modes[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 67(5): 1481–1494. doi: 10.1109/TCSI.2020.2973031.