SHENG Qinghua, TAO Zehao, HUANG Xiaofang, LAI Changcai, HUANG Xiaofeng, YIN Haibin, DONG Zhekang. A High-Throughput Hardware Design for AV1 Rough Mode Decision[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT240823

A High-Throughput Hardware Design for AV1 Rough Mode Decision

doi: 10.11999/JEIT240823
Funds:  The National Key R&D Program of China (2023YFB4502804)
  • Received Date: 2024-09-27
  • Rev Recd Date: 2025-01-02
  • Available Online: 2025-01-09
  • Objective: As demand for 4K and 8K Ultra High Definition (UHD) video grows, the latest generation of video coding standards has been developed to support UHD video transmission. UHD coding must process far more pixels and detail, which sharply increases computational complexity and resource consumption, so algorithm optimization and hardware acceleration are essential for real-time encoding and decoding. Alliance for Open Media Video 1 (AV1) introduces richer intra-prediction modes, expanding the number of modes from 10 in VP9 to 61 and raising computational complexity accordingly. To handle this added complexity and increase hardware throughput, a fully pipelined hardware design for AV1 Rough Mode Decision (RMD) is proposed.
  • Methods: At the algorithm level, a 4×4 block is the minimum processing unit, and RMD is applied to Prediction Units (PUs) of various sizes within a 64×64 Coding Tree Unit (CTU) in Z-order, so that large blocks are processed as a sequence of smaller units. To reduce computational complexity, the Sum of Absolute Transformed Differences (SATD) costs of non-square PUs (aspect ratios 1:2, 1:4, 2:1, and 4:1) are approximated by accumulating the costs of 1:1 PUs, which avoids recomputing the cost for every PU shape. At the hardware level, a single architecture supports RMD for PU sizes from 4×4 to 32×32 within a 64×64 CTU, unlike traditional designs that use a separate circuit for each PU size; this saves logic resources and reduces idle time. The design uses a 28-stage pipeline that processes intra-prediction modes in parallel and completes RMD for at least 16 pixels per clock cycle. Circuits are shared and reused across PU sizes, which reduces redundancy and improves hardware utilization.
  • Results and Discussions: Software analysis shows that, under the All-Intra (AI) configuration, the proposed RMD algorithm reduces processing time by 45.78% on average compared with the standard AV1 algorithm, at the cost of a 1.94% increase in BD-Rate. The test platform is an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz with 16.0 GB of DRAM. Compared with existing methods, the algorithm trades a slight BD-Rate loss for a substantial reduction in encoding time. Hardware analysis shows that, after synthesis, the proposed architecture has a total circuit area of 0.556 mm² and a maximum operating frequency of 432.7 MHz, enabling real-time encoding of 8K@50.6 fps video. Although the circuit area is slightly larger than that of existing designs, the architecture achieves higher processing speed and supports higher video resolution, giving a balanced trade-off between hardware resource usage and throughput.
  • Conclusions: This paper presents a high-throughput hardware design for AV1 RMD that handles all PU sizes with 56 directional and 5 non-directional prediction modes. A 28-stage pipeline processes intra-prediction modes in parallel and sustains RMD for at least 16 pixels per clock cycle. Techniques such as false-reconstructed reference pixels, Z-order scanning, PMCM circuit structures, and circuit reuse offset the additional hardware resources that parallel processing requires. Experimental results show that the proposed algorithm reduces processing time by 45.78% on average with a 1.94% BD-Rate increase compared with the standard AV1 algorithm. Circuit synthesis confirms that the architecture can process 8K@50.6 fps video in real time.
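The Z-order scanning of 4×4 blocks inside a 64×64 CTU mentioned in the Methods can be illustrated with a short sketch. The abstract does not give the exact index mapping, so the standard Morton bit-interleaving is assumed here, and the function names `z_order_index` and `z_order_blocks` are hypothetical.

```python
def z_order_index(bx: int, by: int, bits: int = 4) -> int:
    """Interleave the bits of block column bx and block row by (Morton code).

    Assumption: x bits occupy the even positions and y bits the odd ones,
    so the scan goes left-to-right before top-to-bottom inside each quad.
    """
    idx = 0
    for i in range(bits):
        idx |= ((bx >> i) & 1) << (2 * i)
        idx |= ((by >> i) & 1) << (2 * i + 1)
    return idx


def z_order_blocks(ctu_size: int = 64, block_size: int = 4):
    """Return (bx, by) block coordinates of a CTU sorted in Z-order."""
    n = ctu_size // block_size           # 16 blocks per side for 64/4
    bits = (n - 1).bit_length()          # 4 bits per coordinate
    coords = [(bx, by) for by in range(n) for bx in range(n)]
    return sorted(coords, key=lambda c: z_order_index(c[0], c[1], bits))


order = z_order_blocks()
print(len(order))    # 256 blocks of 4x4 pixels
print(order[:8])     # first two 8x8 quads, each visited as a small "Z"
```

With this ordering, the four 4×4 blocks of an 8×8 PU, the sixteen blocks of a 16×16 PU, and so on are always visited consecutively, which is what makes the scan convenient for evaluating several PU sizes in one pass.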
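As a rough sanity check on the reported figures, the cycle budget per CTU implied by 432.7 MHz and 8K@50.6 fps can be computed directly. The frame size of 7680×4320 and the rounding up to whole 64×64 CTUs are assumptions; the abstract does not state how the cycle budget is spent.

```python
import math

FREQ_HZ = 432.7e6          # maximum operating frequency from the abstract
FPS = 50.6                 # reported real-time frame rate
WIDTH, HEIGHT = 7680, 4320 # assumption: "8K" means 7680x4320
CTU = 64                   # CTU side length in pixels
PIXELS_PER_CYCLE = 16      # minimum throughput stated in the abstract

# Frames are padded to whole CTUs (assumption), so round up in each dimension.
ctus_per_frame = math.ceil(WIDTH / CTU) * math.ceil(HEIGHT / CTU)

# Clock cycles available for each CTU at the reported frame rate.
cycles_per_ctu = FREQ_HZ / (FPS * ctus_per_frame)

# Cycles needed to stream one CTU through once at 16 pixels per cycle.
single_pass_cycles = CTU * CTU // PIXELS_PER_CYCLE

print(ctus_per_frame)
print(round(cycles_per_ctu, 1))
print(single_pass_cycles)
print(round(cycles_per_ctu / single_pass_cycles, 2))
```

Under these assumptions the budget comes to a little over one thousand cycles per CTU, or roughly four single passes at 16 pixels per cycle.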
    Figures(13)  / Tables(3)
