MG-MoE: Routed Multi-Granularity Expert Ensemble

XIAN Fengyu; JIAN Haifang; XIE Zihui; DU Jun; ZHANG Yuanyuan; NING Xin; DONG Miaomiao; WANG Hongchang

doi:10.11999/JEIT260219

XIAN Fengyu, JIAN Haifang, XIE Zihui, DU Jun, ZHANG Yuanyuan, NING Xin, DONG Miaomiao, WANG Hongchang. MG-MoE: Routed Multi-Granularity Expert Ensemble[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260219

cstr: 32379.14.JEIT260219

XIAN Fengyu^{1},
JIAN Haifang^{2, 5},
XIE Zihui^{1},
DU Jun^{1},
ZHANG Yuanyuan^{3},
NING Xin^{4, 5},
DONG Miaomiao^{2, 5},
WANG Hongchang^{2, 5}

1. School of Communication and Electronic Engineering, Shandong Normal University, Jinan 250358, China
2. Laboratory of Solid State Optoelectronics Information Technology, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
3. Beijing Milu Ecological Research Center, Beijing 100076, China
4. Laboratory of Artificial Intelligence and High Speed Circuit, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
5. University of Chinese Academy of Sciences, Beijing 100049, China

Funds: The National Key Research and Development Program of China (2024YFE0210600)

Received Date: 2026-03-02
Revised Date: 2026-04-22
Accepted Date: 2026-04-23

Available Online: 2026-05-23

Abstract

Objective: Fine-Grained Image Recognition (FGIR) aims to distinguish visually similar subcategories that differ only in subtle local patterns, while remaining robust to large intra-class variations caused by pose changes, occlusion, illumination shifts, and complex backgrounds. In real-world scenarios, these challenges are further intensified by long-tailed category distributions: for rare or difficult classes, models are more likely to overfit spurious contextual cues and to form unstable decision boundaries. A conditional computation paradigm is therefore needed, in which complementary inductive biases are separated into specialized expert branches and adaptively combined for each sample. This work aims to develop a routed multi-granularity mixture-of-experts framework that improves discriminative performance under controllable inference cost and enhances robustness for difficult samples and long-tailed categories through adaptive sparse expert activation.

Methods: A Multi-Granularity Mixture-of-Experts (MG-MoE) model is proposed. It is a routed ensemble architecture composed of a shared backbone, four heterogeneous experts, and a learnable router that predicts input-conditioned expert weights (Fig. 2). The experts are designed with complementary inductive biases to address key factors in FGIR: MPSA emphasizes global structure and contour-level semantics; PMG captures fine local details through multi-granularity part modeling; TransFG focuses on pose and deformation modeling; and PIM improves robustness in cluttered backgrounds through background suppression. To limit interference and reduce unnecessary computation, MG-MoE adopts sparse fusion: only the Top-K experts (K=2 by default) contribute to the final prediction during inference. To improve routing stability and generalization, a two-stage optimization strategy is designed. In the first stage, dynamic cluster-level training is performed.
A cluster-level soft teacher distribution is constructed from validation-set statistics and imposed through Kullback-Leibler (KL) divergence regularization. This process stabilizes routing behavior and promotes effective expert specialization. In the second stage, residual fine-tuning is conducted. The feature-driven routing mechanism is kept unchanged, while the classification heads of the Top-2 experts associated with each cluster are selectively unfrozen. The router and expert heads are then jointly optimized with grouped learning rates. This design reduces fusion bias and strengthens discrimination for difficult samples and long-tailed categories.

Results and Discussions: MG-MoE achieves strong performance on standard FGIR benchmarks. On CUB-200-2011, it obtains 92.89% Top-1 accuracy. This is higher than the accuracy of the expert models used individually, including MPSA (91.23%), PIM (91.17%), and TransFG (90.49%), and it also exceeds the multi-granularity baseline PMG (88.32%) (Table 1). On the Bird-1445 sampled set, MG-MoE achieves 96.80% Top-1 accuracy and consistently improves over strong baselines (Table 2). These results indicate that routed multi-expert specialization remains effective in data-limited and highly similar fine-grained scenarios. The efficiency-accuracy trade-off is summarized in Table 3. With Top-2 sparse routing, MG-MoE reaches 92.89% accuracy at a compute budget of 143.9 GFLOPs; because only the Top-2 experts are selected for each sample, dense expert activation is avoided during inference, which gives a favorable balance between accuracy and efficiency. Ablation experiments show that increasing K beyond 2 does not yield consistent gains, which suggests that indiscriminate fusion can dilute discriminative evidence. Top-2 fusion produces the best performance, whereas Top-1 fusion is more sensitive to routing errors and larger K values may introduce noise and reduce accuracy (Table 4).
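The Top-K sparse fusion discussed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names `softmax` and `top_k_fusion`, the renormalization of router weights over the selected experts, the choice to fuse class-probability vectors (rather than logits or features), and all numeric values are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_fusion(
    router_logits: Sequence[float],
    expert_probs: Sequence[Sequence[float]],
    k: int = 2,
) -> Tuple[List[float], List[int]]:
    """Fuse the class probabilities of the k experts with the largest router weights.

    router_logits: one score per expert for a single input sample.
    expert_probs:  one class-probability vector per expert.
    Returns the fused probability vector and the indices of the selected experts.
    Assumption: router weights are renormalized over the selected experts.
    """
    weights = softmax(router_logits)
    # Indices of the k largest router weights, highest first.
    selected = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    norm = sum(weights[i] for i in selected)
    num_classes = len(expert_probs[0])
    fused = [0.0] * num_classes
    for i in selected:
        w = weights[i] / norm
        for c in range(num_classes):
            fused[c] += w * expert_probs[i][c]
    return fused, selected


# Made-up example: four experts, three classes.
EXPERTS = ["MPSA", "PMG", "TransFG", "PIM"]
router_logits = [2.0, 0.1, 1.2, -0.5]
expert_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]

for k in (1, 2, 4):
    fused, selected = top_k_fusion(router_logits, expert_probs, k=k)
    print(k, [EXPERTS[i] for i in selected], [round(p, 3) for p in fused])
```

With K=1 the prediction depends entirely on a single routing decision, while K=4 mixes in low-weight experts; the sketch only makes that mechanism concrete and does not reproduce the reported ablation numbers.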
The role of expert diversity and composition is also analyzed. Two- and three-expert variants generally underperform the full four-expert configuration, indicating that each inductive bias contributes to different fine-grained difficulty factors. In contrast, adding homogeneous experts without new functional diversity brings diminishing or negative gains, which is consistent with increased routing ambiguity and limited expert complementarity (Table 5). These results support the use of a compact set of heterogeneous experts combined with sparse routing. To interpret the learned specialization, category-wise routing statistics are visualized. The expert-category heatmap shows that MPSA receives dominant routing weights across many categories, reflecting the central role of global structure in fine-grained discrimination. PIM and TransFG show higher activation for specific difficult categories, which is consistent with their roles in background suppression and pose and deformation modeling (Fig. 3). Finally, t-SNE visualizations illustrate the qualitative effect of expert fusion on class separability. Shared backbone features show stronger inter-class entanglement among visually similar subcategories. In contrast, fused outputs form clearer clusters with better between-class separation and within-class compactness, indicating a more reliable decision space shaped by routed expert aggregation (Fig. 4).

Conclusions: MG-MoE is a multi-granularity routed mixture-of-experts framework for fine-grained recognition. By combining four complementary experts, Top-2 sparse fusion, and a two-stage optimization strategy for stable routing and calibrated fusion, MG-MoE improves recognition accuracy on CUB-200-2011 and the Bird-1445 sampled set. It also provides interpretable evidence of expert specialization (Table 1, Table 2, Fig. 3, Fig. 4). Ablation results confirm that controlled Top-2 fusion and heterogeneous expert design are key to the observed performance gains.
Overly dense fusion or homogeneous expert expansion provides limited benefit (Table 4, Table 5).

Keywords:
- Fine-grained image recognition
- Mixture-of-experts
- Multi-granularity learning
- Routing mechanism
- Sparse activation
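
The cluster-level KL regularization mentioned in the Methods part of the abstract can be illustrated with a small Python sketch. The abstract does not specify how the teacher distribution is built or which KL direction is used, so everything here is an assumption for illustration: the hypothetical `cluster_teacher` turns per-expert validation accuracy on one cluster into a soft distribution with a temperature softmax, the loss uses KL(teacher || router), and `kl_weight`, `task_loss`, and all numbers are made up.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_teacher(val_accuracy: Sequence[float], temperature: float = 0.1) -> List[float]:
    """Hypothetical teacher: soften per-expert validation accuracy for one cluster."""
    return softmax(val_accuracy, temperature)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Made-up validation accuracies of the four experts on one cluster.
val_accuracy = [0.93, 0.85, 0.90, 0.91]
teacher = cluster_teacher(val_accuracy)

# Router output for a sample assigned to that cluster (made-up logits).
router_probs = softmax([0.2, 0.1, 0.0, -0.1])

task_loss = 1.25  # placeholder standing in for the classification loss
kl_weight = 0.5   # hypothetical regularization strength
total_loss = task_loss + kl_weight * kl_divergence(teacher, router_probs)
print([round(t, 3) for t in teacher], round(total_loss, 4))
```

The KL term is zero when the router already matches the teacher and grows as the router drifts away from it, which is one way such a regularizer could stabilize routing within a cluster.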

