Volume 47, Issue 11, Nov. 2025
WANG Zehao, ZHU Zhenhua, XIE Tongxin, WANG Yu. A Survey on System and Architecture Optimization Techniques for Mixture-of-Experts Large Language Models[J]. Journal of Electronics & Information Technology, 2025, 47(11): 4055-4078. doi: 10.11999/JEIT250407

A Survey on System and Architecture Optimization Techniques for Mixture-of-Experts Large Language Models

doi: 10.11999/JEIT250407 cstr: 32379.14.JEIT250407
Funds:  The National Natural Science Foundation of China (62325405), Beijing National Research Center for Information Science and Technology (BNR2024TD03001)
  • Received Date: 2025-05-13
  • Rev Recd Date: 2025-08-20
  • Available Online: 2025-08-27
  • Publish Date: 2025-11-10
The Mixture-of-Experts (MoE) framework has become a pivotal approach for enhancing the knowledge capacity and inference efficiency of Large Language Models (LLMs). Conventional methods for scaling dense LLMs face significant limitations in training and inference due to computational and memory constraints. MoE addresses these challenges by distributing knowledge representation across specialized expert sub-networks, enabling parameter expansion while maintaining efficiency through sparse expert activation during inference. However, the dynamic nature of expert activation introduces substantial challenges in resource management and scheduling, necessitating targeted optimization at both the system and architectural levels. This survey focuses on the deployment of MoE-based LLMs. It first reviews the definitions and developmental trajectory of MoE, followed by an in-depth analysis of current system-level optimization strategies and architectural innovations tailored to MoE. The paper concludes by summarizing key findings and proposing prospective optimization techniques for MoE-based LLMs.

Significance  The MoE mechanism offers a promising solution to the computational and memory limitations of dense LLMs. By distributing knowledge representation across specialized expert sub-networks, MoE facilitates model scaling without incurring prohibitive computational costs. This architecture alleviates the bottlenecks associated with training and inference in traditional dense models, marking a notable advance in LLM research. Nonetheless, the dynamic expert activation patterns inherent to MoE introduce new challenges in resource scheduling and management. Overcoming these challenges requires targeted system- and architecture-level optimizations to fully harness the potential of MoE-based LLMs.

Progress  Recent advancements in MoE-based LLMs have led to the development of various optimization strategies. At the system level, approaches such as automatic parallelism, communication-computation pipelining, and communication operator fusion have been adopted to reduce communication overhead. Memory management has been improved through expert prefetching, caching mechanisms, and queue scheduling policies. To address computational load imbalance, both offline scheduling methods and runtime expert allocation strategies have been proposed, including designs that leverage heterogeneous CPU-GPU architectures. In terms of hardware architecture, innovations include dynamic adaptation to expert activation patterns, techniques to overcome bandwidth limitations, and near-memory computing schemes that improve deployment efficiency. In parallel, the open-source community has developed supporting tools and frameworks that facilitate the practical deployment and optimization of MoE-based models.

Conclusions  This survey presents a comprehensive review of system and architectural optimization techniques for MoE-based LLMs. It highlights the importance of reconciling parameter scalability with computational efficiency through the MoE framework. The dynamic nature of expert activation poses significant challenges in scheduling and resource management, which are systematically addressed in this survey. By evaluating current optimization techniques across both system and hardware layers, the paper offers key insights into the state of the field. It also proposes directions for future work, providing a reference for researchers and practitioners seeking to improve the performance and scalability of MoE-based models. The findings emphasize the need for continued innovation across algorithm development, system engineering, and architectural design to fully realize the potential of MoE in real-world applications.

Prospects  Future research on MoE-based LLMs is expected to advance the integration of algorithm design, system optimization, and hardware co-design. Key research directions include resolving load imbalance and maximizing resource utilization through adaptive expert scheduling algorithms, refining system frameworks to support dynamic sparse computation more effectively, and exploring hardware paradigms such as near-memory computing and hierarchical memory architectures. These developments aim to deliver more efficient and scalable MoE model deployments by fostering deeper synergy between software and hardware components.
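To make the mechanism behind these scheduling challenges concrete, the sketch below shows top-k sparse expert routing in PyTorch. It is an illustrative example written for this page, not code from any of the surveyed systems; the names SparseMoELayer, num_experts, and top_k are assumptions introduced here, and production implementations replace the per-expert dispatch loop with fused batched GPU kernels, all-to-all communication, and auxiliary load-balancing losses.

```python
# Minimal sketch of sparse top-k expert routing: the input-dependent expert
# activation shown here is what creates the load-imbalance and scheduling
# problems the survey discusses. Names are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward sub-networks holding the extra
        # parameters; only top_k of them are evaluated per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), a flattened batch of token activations.
        scores = self.router(x)                               # (tokens, experts)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Dispatch: group tokens by expert so each expert runs one dense GEMM.
        # The per-expert token counts are input-dependent, so some experts may
        # be overloaded while others sit idle within the same batch.
        for e, expert in enumerate(self.experts):
            token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert receives no tokens in this batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=64, d_hidden=256, num_experts=8, top_k=2)
    tokens = torch.randn(32, 64)
    print(layer(tokens).shape)  # torch.Size([32, 64])
```

In distributed deployments the per-expert token counts computed in this dispatch loop translate directly into all-to-all communication volume and per-device load, which is what the system-level techniques surveyed here (expert placement, prefetching and caching, pipelining, runtime expert allocation) aim to balance.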