Citation: KONG Xiangyan, GAO YuLong, WANG Gang. Multimodal Pedestrian Trajectory Prediction with Multi-Scale Spatio-Temporal Group Modeling and Diffusion[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250900

Multimodal Pedestrian Trajectory Prediction with Multi-Scale Spatio-Temporal Group Modeling and Diffusion

doi: 10.11999/JEIT250900 cstr: 32379.14.JEIT250900
  • Received Date: 2025-09-09
  • Accepted Date: 2026-01-04
  • Revised Date: 2026-01-04
  • Available Online: 2026-01-15
Objective: The rapid development of autonomous driving and social robotics has increased the need for accurate pedestrian trajectory prediction to improve safety and interaction efficiency. Existing group-based methods mainly emphasize local spatial interaction and often overlook latent grouping characteristics across time. This study proposes a multi-scale spatiotemporal feature construction method that separates trajectory shape from absolute spatiotemporal coordinates, which enables the model to capture latent group associations across different temporal intervals. A spatiotemporal interaction three-element encoding mechanism is incorporated to extract dynamic relationships between individuals and groups. By integrating the reverse process length mechanism of diffusion models, the system progressively reduces prediction uncertainty. The approach targets multimodal trajectory prediction in complex, crowded scenes and offers theoretical support for improving the accuracy and stability of long-range trajectory forecasting.

Methods: The algorithm models pedestrian trajectories through multi-scale spatiotemporal group modeling in three components: group construction, interaction modeling, and trajectory generation. First, to address the limitations of methods that focus on local spatiotemporal patterns but overlook cross-dimensional latent characteristics, a multi-scale trajectory grouping model is developed. Its core design extracts trajectory offsets to represent trajectory shapes, separating motion features from absolute positions. This allows the system to identify latent group associations among agents who follow similar motion patterns in different periods. Second, a spatiotemporal interaction three-element encoding method is proposed. By defining neural interaction strength, interaction categories, and category functions, the method captures both detailed individual interactions and the global dynamic evolution of collective behavior. Finally, a diffusion model is introduced for multimodal prediction. Through the reverse process length mechanism, the model converges gradually, reduces uncertainty, and transforms a diffuse prediction space into plausible future trajectories.

Results and Discussions: The proposed model, MSGD, was evaluated against 11 state-of-the-art baselines on the NBA dataset (Table 1). The results show clear advantages in minADE20. MSGD achieves substantial gains over GroupNet+CVAE in long-term prediction, improving minADE20 and minFDE20 by 0.18 and 0.36, respectively, at the 4-second horizon. Although it is slightly inferior to MID in long-term trend prediction, possibly because group dynamics shift rapidly and intensely in NBA scenarios, the model maintains strong instantaneous accuracy. This supports the effectiveness of the multi-scale grouping strategy, which uses historical trajectories to capture complex dynamic interactions. On the ETH/UCY datasets (Table 2), MSGD provides consistent improvements across all five sub-scenes. In the dense and highly interactive UNIV scene, the method exceeds all baselines by leveraging the strengths of multi-scale modeling. Although MSGD is marginally behind PPT in long-distance endpoint constraints, it maintains a lead in minADE20. It also outperforms Trajectron++ in velocity smoothness and directional coherence (standard deviation: 0.7012) (Table 3), indicating that the generated trajectories retain the natural smoothness of human motion. Ablation studies verify the independent effects of the diffusion model, spatiotemporal feature extraction, and multi-scale grouping modules (Table 4). Grouping sensitivity analysis on the NBA dataset shows that full-court grouping (group size 11) enhances long-term stability, reducing minFDE20 by 0.026–0.03 at 4 seconds (Table 5). Configurations with group sizes of 5 or 2 further support the importance of team formations and "one-on-one" local offensive and defensive dynamics (Table 6). Diffusion-step and training-epoch sensitivity analysis reveals a complementary relationship: moderate diffusion steps (30–40) refine denoising and improve accuracy, whereas excessive steps may cause overfitting (Table 7). Qualitative visualization confirms that MSGD generates multimodal trajectories that overlap closely with the ground truth (Fig. 2).

Conclusions: This study presents a trajectory prediction algorithm that improves performance in two primary ways: (1) it captures pedestrian interactions by extracting spatiotemporal features, and (2) it strengthens collective behavior modeling through multi-scale grouping. Experiments show that the method achieves state-of-the-art performance on the NBA and ETH/UCY datasets, and ablation studies confirm the effectiveness of all modules. Two limitations remain. First, explicit environmental information, such as maps or obstacles, is not yet incorporated. Second, the diffusion model incurs substantial computational cost during inference. Future research will address these issues.
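Two ideas in the abstract can be made concrete with a small sketch: representing a trajectory by its step-to-step offsets so that shape is independent of absolute position, and scoring a set of K sampled futures with best-of-K errors (minADE20 and minFDE20 correspond to K = 20). The code below is not the authors' implementation. The function names `to_offsets` and `min_ade_fde` and the array layouts are assumptions made for illustration, and taking the minimum for ADE and FDE independently over the K samples is one common convention that the paper may or may not follow.

```python
from typing import Tuple

import numpy as np


def to_offsets(traj: np.ndarray) -> np.ndarray:
    """Per-step displacements of a (T, 2) trajectory, shape (T - 1, 2).

    Hypothetical helper: offsets describe the shape of the motion and are
    unchanged when the whole trajectory is translated.
    """
    return np.diff(traj, axis=0)


def min_ade_fde(samples: np.ndarray, truth: np.ndarray) -> Tuple[float, float]:
    """Best-of-K displacement errors.

    samples: (K, T, 2) predicted futures; truth: (T, 2) ground-truth future.
    Assumption: the minimum is taken separately for ADE and FDE.
    """
    dist = np.linalg.norm(samples - truth[None], axis=-1)  # (K, T) point-wise errors
    ade = dist.mean(axis=1)  # average error of each sample over time
    fde = dist[:, -1]        # error of each sample at the final step
    return float(ade.min()), float(fde.min())


# Two agents walking the same pattern at different places share the same offsets.
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = a + np.array([5.0, 3.0])
same_shape = np.allclose(to_offsets(a), to_offsets(b))

# Two sampled futures (K = 2 here for brevity) scored against the truth.
truth = a
samples = np.stack([
    truth + np.array([0.0, 1.0]),                   # constant 1.0 error at every step
    np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]])  # exact until the last step
])
best_ade, best_fde = min_ade_fde(samples, truth)
print(same_shape, best_ade, best_fde)
```

In this toy case the best average error comes from the second sample and the best endpoint error from the first, which illustrates why a method can lead on minADE20 while trailing on an endpoint measure.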