Citation: LIN Zheng, HU Haiying, DI Peng, ZHU Yongsheng, ZHOU Meijiang. Research on Proximal Policy Optimization for Autonomous Long-Distance Rapid Rendezvous of Spacecraft[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250844

Research on Proximal Policy Optimization for Autonomous Long-Distance Rapid Rendezvous of Spacecraft

doi: 10.11999/JEIT250844 cstr: 32379.14.JEIT250844
  • Accepted Date: 2025-12-09
  • Rev Recd Date: 2025-12-09
  • Available Online: 2025-12-13

Objective: With the increasing demands of deep-space exploration, on-orbit servicing, and space debris removal missions, autonomous long-range rapid rendezvous capability has become critical for future space operations. Traditional trajectory planning approaches based on analytical methods or heuristic optimization often exhibit limitations when dealing with complex dynamics, strong disturbances, and uncertainties, making it difficult to balance efficiency and robustness. Deep Reinforcement Learning (DRL), by combining the approximation capability of deep neural networks with the decision-making strengths of reinforcement learning, enables adaptive learning and real-time decision-making in high-dimensional continuous state and action spaces. In particular, the Proximal Policy Optimization (PPO) algorithm, with its training stability, sample efficiency, and ease of implementation, has emerged as a representative policy-gradient method that enhances policy exploration while ensuring stable policy updates. Integrating DRL with PPO into spacecraft long-range rapid rendezvous tasks therefore not only overcomes the limitations of conventional methods but also provides an intelligent, efficient, and robust solution for autonomous guidance in complex orbital environments.

Methods: This study first establishes a spacecraft orbital dynamics model incorporating the effects of J2 perturbation, while also modeling uncertainties such as position and velocity measurement errors and actuator deviations during on-orbit operations. The long-range rapid rendezvous problem is then formulated as a Markov Decision Process (MDP), with the state space defined by variables including position, velocity, and relative distance, and the action space characterized by impulse duration and direction; the formulation further incorporates fuel consumption and terminal position and velocity constraints. On this basis, a DRL framework built on PPO is constructed, in which the policy network outputs maneuver command distributions and the value network estimates state values to improve training stability. To address the convergence difficulties arising from sparse rewards, an enhanced dense reward function is designed that combines a position potential function with a velocity-guidance function, guiding the agent toward the target while gradually decelerating and preserving fuel efficiency. Finally, the optimal maneuver strategy for the spacecraft is obtained through simulation-based training, and its robustness is validated under various uncertainty conditions.
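
A minimal sketch of a dense reward in the spirit of this design is given below, combining a potential-based position term, a velocity-guidance term, a per-impulse fuel penalty, and a terminal bonus. The functional forms, gains (k_pos, k_vel, k_fuel), reference deceleration profile, and terminal tolerances are illustrative assumptions and do not reproduce the paper's exact reward.

```python
import numpy as np

def improved_dense_reward(rel_pos, rel_vel, prev_rel_pos, dv_mag,
                          k_pos=1.0, k_vel=50.0, k_fuel=0.1,
                          pos_tol=25.0, vel_tol=0.01, terminal_bonus=100.0):
    """One-step reward for the rendezvous MDP (illustrative only).

    rel_pos, prev_rel_pos : chaser position relative to the target [km]
    rel_vel               : relative velocity [km/s]
    dv_mag                : magnitude of the impulse applied this step [km/s]
    """
    d = np.linalg.norm(rel_pos)
    d_prev = np.linalg.norm(prev_rel_pos)

    # Position potential term: difference of a distance-based potential,
    # rewarding every step that reduces the relative distance.
    r_position = k_pos * (d_prev - d)

    # Velocity-guidance term: penalize closing speed in excess of a
    # distance-dependent reference profile, so the agent decelerates
    # progressively instead of braking only at the terminal phase.
    closing_speed = -np.dot(rel_pos, rel_vel) / max(d, 1e-8)
    v_ref = 0.002 * np.sqrt(d)                  # assumed reference profile [km/s]
    r_velocity = -k_vel * max(closing_speed - v_ref, 0.0)

    # Fuel penalty proportional to the applied impulse magnitude.
    r_fuel = -k_fuel * dv_mag

    # Terminal bonus once both terminal constraints are satisfied.
    done = d < pos_tol and np.linalg.norm(rel_vel) < vel_tol
    r_terminal = terminal_bonus if done else 0.0

    return r_position + r_velocity + r_fuel + r_terminal, done
```

In a potential-difference form like this, the position term mainly reshapes the learning signal toward the target, while the velocity term supplies the mid-course deceleration feedback that sparse and conventional dense rewards lack.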

Results and Discussions: Based on the DRL framework described above, comprehensive simulations were conducted to evaluate the effectiveness and robustness of the proposed improved algorithm. In Case 1, three reward structures were tested: a sparse reward, a traditional dense reward, and the improved dense reward integrating a relative-position potential function and a velocity-guidance term. The results indicate that the design of the reward function significantly affects convergence behavior and policy stability. With the sparse reward, the agent lacks process feedback, which hinders effective exploration of feasible actions. The traditional dense reward provides continuous feedback and allows gradual convergence toward local optima; however, terminal velocity deviations remain uncorrected in the later stages, leading to suboptimal convergence and incomplete satisfaction of the terminal constraints. In contrast, the improved dense reward guides the agent toward favorable behaviors from the early training stages while penalizing undesirable actions at each step, thereby accelerating convergence and enhancing robustness. The velocity-guidance term enables the agent to anticipate necessary adjustments during the mid-to-late phases of the approach, rather than postponing corrections until the terminal phase, resulting in more fuel-efficient maneuvers. The simulation results confirm this performance: the maneuvering spacecraft executed 10 impulsive maneuvers throughout the mission, achieving a terminal relative distance of 21.326 km, a relative velocity of 0.0050 km/s, and a total fuel consumption of 111.2123 kg. To further validate the robustness of the trained model against realistic uncertainties in orbital operations, 1000 Monte Carlo simulations were performed. As presented in Table 5, the mission success rate reached 63.40%, with fuel consumption in all trials remaining within acceptable bounds. Finally, to verify the advantage of the PPO algorithm, its performance was compared with that of DDPG in a multi-impulse fast-approach rendezvous mission in Case 2. With PPO, the maneuvering spacecraft performed 5 impulsive maneuvers, achieving a terminal separation of 2.2818 km, a relative velocity of 0.0038 km/s, and a total fuel consumption of 4.1486 kg. With DDPG, the spacecraft consumed 4.3225 kg of fuel, achieving a final separation of 4.2731 km and a relative velocity of 0.0020 km/s. Both algorithms fulfill the mission requirements with comparable fuel usage. However, DDPG required a training duration of 9 hours and 23 minutes and incurred significant computational cost, whereas PPO converged within 6 hours and 4 minutes. Therefore, although DDPG exhibits higher sample efficiency, its longer training cycle and greater computational burden make it less efficient than PPO in practical applications. The comparative analysis demonstrates that the proposed PPO with the improved dense reward significantly enhances learning efficiency, policy stability, and robustness.
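
The Monte Carlo robustness test described above can be organized as a simple evaluation loop; the sketch below shows one possible shape, in which the rollout_fn interface, the sampled error magnitudes, and the success tolerances are assumptions made for illustration rather than the study's exact settings.

```python
import numpy as np

def monte_carlo_eval(rollout_fn, n_trials=1000,
                     pos_tol=25.0, vel_tol=0.01, seed=0):
    """Estimate the mission success rate of a trained policy under
    randomized measurement and actuation errors.

    rollout_fn(pos_err, vel_err, thrust_scale) is assumed to simulate one
    rendezvous episode with the trained policy and return the terminal
    relative distance [km], relative speed [km/s], and fuel used [kg].
    """
    rng = np.random.default_rng(seed)
    successes, fuel = 0, []
    for _ in range(n_trials):
        pos_err = rng.normal(0.0, 0.1, size=3)       # position measurement error [km]
        vel_err = rng.normal(0.0, 1e-4, size=3)      # velocity measurement error [km/s]
        thrust_scale = 1.0 + rng.normal(0.0, 0.01)   # actuator magnitude deviation
        dist, speed, used = rollout_fn(pos_err, vel_err, thrust_scale)
        if dist < pos_tol and speed < vel_tol:
            successes += 1
        fuel.append(used)
    return successes / n_trials, float(np.mean(fuel)), float(np.max(fuel))
```

A success rate such as the 63.40% reported in Table 5 corresponds to the fraction of the 1000 perturbed episodes that satisfy both terminal constraints, with the fuel statistics confirming that consumption stays within acceptable bounds.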

Conclusions: This study addressed the problem of autonomous long-range rapid rendezvous for spacecraft under J2 perturbation and uncertainties, and proposed a PPO-based trajectory optimization method. The results demonstrate that the proposed approach can generate maneuver trajectories satisfying the terminal constraints under limited fuel and transfer time, while outperforming conventional methods in convergence speed, fuel efficiency, and robustness. The main contributions of this work are: (1) the development of an orbital dynamics framework that incorporates J2 perturbation and uncertainty modeling, and the formulation of the rendezvous problem as an MDP; (2) the design of an enhanced dense reward function combining a position potential function and a velocity-guidance function, which effectively improves training stability and convergence efficiency; (3) simulation-based validation of PPO's applicability and robustness in complex orbital environments, providing a feasible solution for future autonomous rendezvous and on-orbit servicing missions. Future work will consider sensor noise, environmental disturbances, and multi-spacecraft cooperative rendezvous in complex mission scenarios, aiming to enhance the algorithm's practical applicability and generalization to real-world operations.