XIE Lifan, WEI Songtao, YAO Peng, WU Dong, TANG Jianshi, QIAN He, GAO Bin, WU Huaqiang. A fast and accurate programming strategy for analog in-memory computing validated with a transposable RRAM macro and 0.64% fully-parallel RMS error[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251174

A fast and accurate programming strategy for analog in-memory computing validated with a transposable RRAM macro and 0.64% fully-parallel RMS error

doi: 10.11999/JEIT251174 cstr: 32379.14.JEIT251174
Funds:  NSFC (92064001, 62025111)
  • Accepted Date: 2026-02-06
  • Rev Recd Date: 2026-02-06
  • Available Online: 2026-02-28
  • Objective  Non-volatile-memory (NVM) based compute-in-memory (CIM) is a promising candidate for next-generation AI accelerators owing to its high energy efficiency and instant wake-up capability [13]. However, the conventional write-and-verify (W&V) scheme cannot satisfy the speed or precision requirements of highly parallel CIM macros. The bottleneck is the inefficient verification step: cell-by-cell reading must be repeated over the entire array, and switching from the "verify" state (only one row active) to the "compute" state (all rows active) introduces systematic errors such as reference drift and IR-drop-induced weight inaccuracy. Moreover, analog CIM macros with on-chip programming must tolerate large, non-uniform offsets under massive parallelism. In this work we propose: (1) a back-propagation-assisted programming (BPAP) scheme that rapidly and accurately locates failing cells without full-array verification; (2) an analog-domain offset-cancelling structure (AOSC) that compensates channel-wise offsets in situ; and (3) a transposable RRAM macro equipped with parallel two-channel current-domain ADCs (TC-ADCs) that double the effective sampling rate with only 15% ADC-area overhead.
  • Methods  As shown in Fig. 2, the transposable RRAM macro consists of two processing elements (PEs) and a shared backward-processing ADC (BP-ADC). Each PE includes an input loader (IL), a DAC array, a BL buffer and switch array, and 32 TC-ADCs, supporting fully parallel forward calculation. An error loader (EL) and SL buffer feed an error input vector for the transposed MVM. Figure 3 presents the flow diagram of the BPAP scheme. The forward calculation is first performed after AOSC. Differences between the expected outputs (y_exp) and the measured outputs (y_real) are then computed on chip and used as inputs to the subsequent back-propagation phase, and the derivatives with respect to the RRAM weights are calculated after feeding a few validation patterns. This training-like scheme adapts to the real RRAM states and detects failures in the high-parallelism computing state. Weights whose derivatives exceed the error threshold are selected for remapping, which enables accurate programming while avoiding cell-by-cell verification over the entire array. In the forward phase (Fig. 4a), the 2T2R cell is configured as a signed weight, with the SLs clamped at VCM by the TC-ADCs. For each PE, a fully parallel (320 active 2T2R rows) 4b-IN/4b-W MVM is completed with all 32 ADCs converting simultaneously. In the backward phase (Fig. 4b), only the upper half of the reference voltages drives the SL buffers and the weight is configured in 1T1R mode; the differential calculation over the positive and negative 1T1R cells is performed by an external processor. Fig. 5 shows the AOSC scheme: redundant rows in the RRAM array are programmed to compensate the offset of the analog computation in situ. The offset currents are obtained by applying an all-zero pattern to the regular weights; the redundant RRAM weights are then programmed to minimize these offset currents under a constant input voltage, and in computing mode the programmed redundant weights are driven with that same input voltage. The macro supports AOSC with only 1% extra array area. Fig. 6 presents the structure of the TC-ADC. Adding a class-AB output stage with its associated switches and capacitors enables two-channel conversion that halves the computing latency, achieving a 2× sampling rate at only 15% extra ADC area.
  • Conclusions  We demonstrate that replacing traditional W&V with BPAP, augmented by AOSC and the TC-ADC, enables reliable, high-precision programming of analog RRAM-CIM macros under massive parallelism. The measured 96.5% MNIST accuracy and 4.8% ImageNet accuracy improvement validate the approach. The proposed techniques are compatible with standard 2T2R/1T1R RRAM bit cells and readily extend to larger arrays and deeper neural networks.
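The BPAP failure detection and AOSC offset cancellation described above can be sketched numerically. The following is a minimal NumPy sketch under our own simplifying assumptions (an ideal linear MVM model, Gaussian programming noise); the function names `bpap_select` and `aosc_compensate`, the error threshold, and all numeric values are ours for illustration, not from the macro itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def bpap_select(w_target, w_real, x_val, err_thresh):
    """Flag cells whose error-gradient magnitude exceeds err_thresh."""
    y_exp = x_val @ w_target               # expected MVM outputs
    y_real = x_val @ w_real                # measured fully-parallel outputs
    err = y_real - y_exp                   # output error, computed on chip
    grad = x_val.T @ err / len(x_val)      # d(0.5*||err||^2)/dW via backprop
    return np.abs(grad) > err_thresh       # cells selected for remapping

def aosc_compensate(offset_current, v_const):
    """Redundant-row weights that cancel channel offsets at input v_const."""
    return -offset_current / v_const

# Toy demo: one badly programmed cell plus small Gaussian write noise.
w_t = rng.normal(size=(320, 32))           # intended weights, 320 rows
w_r = w_t + rng.normal(scale=0.01, size=w_t.shape)
w_r[5, 7] += 1.5                           # inject one failing cell
x = rng.normal(size=(64, 320))             # a few validation patterns
mask = bpap_select(w_t, w_r, x, err_thresh=0.5)
print("failing cell flagged:", bool(mask[5, 7]))

offsets = rng.normal(scale=0.1, size=32)   # measured with all-zero input
w_red = aosc_compensate(offsets, v_const=0.2)
residual = offsets + 0.2 * w_red           # channel output after AOSC
print("max residual offset:", float(np.abs(residual).max()))
```

Note the gradient step only needs the fully parallel forward MVM plus one transposed MVM (x_val.T @ err), which is exactly what the transposable macro provides in hardware; no cell-by-cell read of the array is required.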
  • [1]
    XU Xiaowei, DING Yukun, HU S X, et al. Scaling for edge inference of deep neural networks[J]. Nature Electronics, 2018, 1(4): 216–222. doi: 10.1038/s41928-018-0059-3.
    [2]
    RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[R]. San Francisco: OpenAI, 2019.
    [3]
    XU Xiaowei, DING Yukun, HU S X, et al. Scaling for edge inference of deep neural networks[J]. Nature Electronics, 2018, 1(4): 216–222. doi: 10.1038/s41928-018-0059-3.
    [4]
    FUJIWARA H, MORI H, ZHAO Weichang, et al. A 3nm, 32.5TOPS/W, 55.0TOPS/mm2 and 3.78Mb/mm2 fully-digital compute-in-memory macro supporting INT12 × INT12 with a parallel-MAC architecture and foundry 6T-SRAM bit cell[C]. Proceedings of 2024 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2024: 572–573. doi: 10.1109/ISSCC49657.2024.10454556.
    [5]
    GHOLAMI A, YAO Zhewei, KIM S, et al. AI and memory wall[J]. IEEE Micro, 2024, 44(3): 33–39. doi: 10.1109/MM.2024.3373763.
    [6]
    BÜCHEL J, VASILOPOULOS A, KERSTING B, et al. Gradient descent-based programming of analog in-memory computing cores[C]. Proceedings of 2022 International Electron Devices Meeting, San Francisco, USA, 2022: 779–782. doi: 10.1109/IEDM45625.2022.10019486.
    [7]
    JACOB B, KLIGYS S, CHEN Bo, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[EB/OL]. https://doi.org/10.48550/arXiv.1712.05877, 2017.
    [8]
    RADHAKRISHNAN J, BELMONTE A, CLIMA S, et al. Improving post-cycling low resistance state retention in resistive RAM with combined oxygen vacancy and copper filament[J]. IEEE Electron Device Letters, 2019, 40(7): 1072–1075. doi: 10.1109/LED.2019.2917553.
    [9]
    SHIM W, MENG Jian, PENG Xiaochen, et al. Impact of multilevel retention characteristics on RRAM based DNN inference engine[C]. Proceedings of 2021 IEEE International Reliability Physics Symposium, Monterey, USA, 2021: 1–4. doi: 10.1109/IRPS46558.2021.9405210.
    [10]
    CHIU Y C, KHWA W S, LI C Y, et al. A 22nm 8Mb STT-MRAM near-memory-computing macro with 8b-precision and 46.4–160.1TOPS/W for edge-AI devices[C]. Proceedings of 2023 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2023: 496–497. doi: 10.1109/ISSCC42615.2023.10067563.
    [11]
    YOU Deqi, KHWA W S, WU J J, et al. A 22nm nonvolatile AI-edge processor with 21.4TFLOPS/W using 47.25Mb lossless-compressed-computing STT-MRAM near-memory-compute macro[C]. Proceedings of 2024 IEEE Symposium on VLSI Technology and Circuits, Honolulu, USA, 2024: 1–2. doi: 10.1109/VLSITechnologyandCir46783.2024.10631408.
    [12]
    WANG Yang, YANG Xiaolong, QIN Yubin, et al. A 28nm 83.23TFLOPS/W POSIT-based compute-in-memory macro for high-accuracy AI applications[C]. Proceedings of 2024 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2024: 566–567. doi: 10.1109/ISSCC49657.2024.10454567.
    [13]
    KHWA W S, WU Pingchun, WU J J, et al. A 16nm 96Kb integer/floating-point dual-mode-gain-cell-computing-in-memory macro achieving 73.3–163.3TOPS/W and 33.2–91.2TFLOPS/W for AI edge-devices[C]. Proceedings of 2024 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2024: 568–569. doi: 10.1109/ISSCC49657.2024.10454447.
    [14]
    MORI H, ZHAO Weichang, LEE C E, et al. A 4nm 6163-TOPS/W/b 4790-TOPS/mm2/b SRAM based digital-computing-in-memory macro supporting bit-width flexibility and simultaneous MAC and weight update[C]. Proceedings of 2023 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2023: 132–133. doi: 10.1109/ISSCC42615.2023.10067555.
    [15]
    KHADDAM-ALJAMEH R, STANISAVLJEVIC M, FORNT MAS J, et al. HERMES core – A 14nm CMOS and PCM-based in-memory compute core using an array of 300ps/LSB linearized CCO-based ADCs and local digital processing[C]. Proceedings of 2021 Symposium on VLSI Circuits, Kyoto, Japan, 2021: 1–2. doi: 10.23919/VLSICircuits52068.2021.9492362.
    [16]
    HUNG J M, HUANG Y H, HUANG S P, et al. An 8-Mb DC-current-free binary-to-8b precision ReRAM nonvolatile computing-in-memory macro using time-space-readout with 1286.4–21.6TOPS/W for edge-AI devices[C]. Proceedings of 2022 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2022: 1–3. doi: 10.1109/ISSCC42614.2022.9731715.
    [17]
    HUANG W H, WEN Taihao, HUNG J M, et al. A nonvolatile Al-edge processor with 4MB SLC-MLC hybrid-mode ReRAM compute-in-memory macro and 51.4–251TOPS/W[C]. Proceedings of 2023 IEEE International Solid-State Circuits Conference, San Francisco, USA, 2023: 15–17. doi: 10.1109/ISSCC42615.2023.10067610.

    Figures(9)  / Tables(1)
