A Model-Assisted Federated Reinforcement Learning Method for Multi-UAV Path Planning

LU Yin, LIU Jinzhi, ZHANG Min

Citation: LU Yin, LIU Jinzhi, ZHANG Min. A Model-Assisted Federated Reinforcement Learning Method for Multi-UAV Path Planning[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT241055


doi: 10.11999/JEIT241055
Funds: The National Natural Science Foundation of China (62401290)
Article information
    About the authors:

    LU Yin: male, researcher; research interests include wireless communication, UAV communication, Internet of Things technology and applications, and artificial intelligence

    LIU Jinzhi: male, master's student; research interest is UAV communication

    ZHANG Min: male, lecturer; research interests include UAV communication and networking, and space-air-ground integrated intelligent networks

    Corresponding author:

    LU Yin, luyin@njupt.edu.cn

  • CLC number: TN929.5

  • Abstract: To meet the data-collection needs of multiple UAVs in scenarios where device positions are only partially known, such as environmental-monitoring sensor networks and emergency communication nodes for disaster response, this paper proposes a model-assisted federated reinforcement learning method for multi-UAV path planning. Within a federated learning framework, the method combines maximum-entropy reinforcement learning with monotonic value-function factorization and introduces a dynamic entropy temperature parameter and an attention mechanism, improving the exploration efficiency and policy stability of multi-UAV cooperation. In addition, a hybrid simulated-environment construction method based on channel modeling and position estimation is designed, in which an improved particle swarm optimization algorithm quickly estimates the positions of unknown devices, significantly reducing the cost of interacting with the real environment. Simulation results show that the proposed algorithm achieves efficient data collection: compared with conventional multi-agent reinforcement learning methods, the data collection rate is improved by 4.4% and the path length is reduced by 8.4%, verifying the effectiveness and superiority of the proposed algorithm.
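    The position-estimation step mentioned above can be pictured with a small numerical sketch. The code below is a minimal, illustrative particle swarm optimizer that fits an unknown device position to received-signal-strength measurements collected by a UAV under a log-distance path-loss model. The channel model, constants, and function names are assumptions for illustration only; they do not reproduce the paper's improved PSO or its channel model.

```python
# Illustrative sketch (not the paper's exact method): estimate an unknown device
# position from RSS measurements with a plain particle swarm optimizer.
# uav_positions: (K, 2) array of UAV measurement locations; rss_db: (K,) measurements.
import numpy as np

def path_loss_db(uav_xy, dev_xy, p0_db=-40.0, gamma=2.3):
    """Log-distance path loss predicted at UAV positions for candidate device positions."""
    d = np.linalg.norm(uav_xy - dev_xy, axis=-1) + 1e-9
    return p0_db - 10.0 * gamma * np.log10(d)

def estimate_device_position(uav_positions, rss_db, n_particles=64, iters=100,
                             bounds=((0.0, 1000.0), (0.0, 1000.0)), seed=0):
    """PSO over candidate (x, y) device positions; fitness = mean squared RSS residual."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds]); hi = np.array([b[1] for b in bounds])
    x = rng.uniform(lo, hi, size=(n_particles, 2))   # particle positions (candidate device locations)
    v = np.zeros_like(x)                             # particle velocities

    def fitness(pos):
        pred = path_loss_db(uav_positions[None, :, :], pos[:, None, :])
        return np.mean((pred - rss_db[None, :]) ** 2, axis=1)

    pbest, pbest_f = x.copy(), fitness(x)
    gbest = pbest[np.argmin(pbest_f)]
    w, c1, c2 = 0.7, 1.5, 1.5                        # inertia / cognitive / social weights (assumed)
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = fitness(x)
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[np.argmin(pbest_f)]
    return gbest                                     # estimated (x, y) of the device
```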
  • Figure 1  System model

    Figure 2  Multi-UAV data-collection trajectories

    Figure 3  Variation of the UAV data collection rate

    Figure 4  Loss curves

    Figure 5  Total UAV path length curves

    Figure 6  Data collection rate under different initial energy levels

    Figure 7  UAV path length under different initial energy levels

    Algorithm 1  Model-assisted federated maximum-entropy reinforcement learning algorithm

     (1) Initialize the set of M UAVs, the replay buffers $ \mathcal{D}^{m} $, the QMIX parameters θ, the local QMIX parameters θm = 0, the target network parameters $ \hat{\theta}^{m}=\theta^{m} $, the target network update period Ntarget, and the aggregation period Nfreq
     (2) for e = 0, 1, ···, Emax−1 do
     (3)  (a) Real-world data collection
     (4)  for each UAV m ∈ M do
     (5)   t = 0, initialize state s0
     (6)   while $ E_{t}^{m}\ge 0 $ do
     (7)    Sample an action $ a_{t}^{m}\sim \pi_{\theta}(a_{t}^{m}\mid s_{t}^{m}) $
     (8)    Compute the Q value $ Q(s_{t}^{m},a_{t}^{m}) $ with the critic network
     (9)    Execute $ a_{t}^{m} $ through the safety controller, observe the new state $ s_{t+1}^{m} $ and the reward $ r_{t}^{m} $
     (10)    Store $ (s_{t}^{m},a_{t}^{m},r_{t}^{m},s_{t+1}^{m}) $ in $ \mathcal{D}^{m} $
     (11)    t = t + 1
     (12)   end while
     (13) end for
     (14) (b) Learn the channel model, estimate the unknown device positions, and build the simulated environment
     (15) (c) Training in the simulated environment
     (16) for episode = 0, 1, ···, N−1 do
     (17)  for each parallel UAV m ∈ M do
     (18)   t = 0, initialize state s0
     (19)   while $ E_{t}^{j}\ge 0,\ \forall j=1,2,\cdots,M $ do
     (20)    for each simulated agent $ j=1,2,\cdots,M $ do
     (21)     $ \tau_{t}^{j}=\tau_{t-1}^{j}\cup \{o_{t}^{j},a_{t-1}^{j}\} $
     (22)     Select an action $ a_{t}^{j}\sim \pi_{\theta}(a_{t}^{j}\mid s_{t}^{j}) $ with the SAC policy
     (23)     Compute the Q value $ Q(s_{t}^{j},a_{t}^{j}) $ with the critic network and update the policy
     (24)    end for
     (25)    Take the joint action $ \times_{j} a_{t}^{j} $, observe $ \times_{j} o_{t+1}^{j} $, and obtain the reward rt and the next state st+1
     (26)    Store $ (s_{t},\times_{j} o_{t}^{j},\times_{j} a_{t}^{j},r_{t},s_{t+1},\times_{j} o_{t+1}^{j}) $ in $ \mathcal{D}^{m} $
     (27)    t = t + 1
     (28)   end while
     (29)  end for
     (30)  Randomly sample a batch of B episodes from $ \mathcal{D}^{m} $, compute Qtot, and compute the target Qtot with the target network $ \hat{\theta}^{m} $
     (31)  Update the critic network parameters, maximizing the expected reward with entropy regularization
     (32)  $ \theta^{m}\leftarrow \theta^{m}-\alpha \nabla \mathcal{L}(\theta^{m}) $
     (33)  if mod(episode, Ntarget) = 0 then
     (34)   Reset $ \hat{\theta}^{m}=\theta^{m} $
     (35)  end if
     (36)  end for
     (37)  if mod(episode, Nfreq) = 0 then
     (38)   Update $ \theta=\sum_{m=1}^{M} w_{m}\theta^{m} $ and set $ \theta^{m}\leftarrow \theta,\ \forall m\in M $
     (39)  end if
     (40) end for
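    Steps (22) and (31) rely on the maximum-entropy (SAC) objective with a dynamic entropy temperature. The following is a minimal sketch, assuming a PyTorch setup, of how an automatic temperature update is commonly implemented; the variable names are illustrative, the target entropy and learning rate are the Table 1 values, and this is not claimed to be the paper's exact update rule.

```python
# Minimal sketch of a dynamic entropy temperature update (SAC-style), assumed setup.
import torch

log_alpha = torch.zeros(1, requires_grad=True)        # learnable log temperature
alpha_optim = torch.optim.Adam([log_alpha], lr=2e-4)   # temperature learning rate from Table 1
target_entropy = 0.3                                   # target entropy from Table 1

def update_temperature(log_probs):
    """log_probs: log pi(a_t | s_t) for a sampled batch of actions (tensor)."""
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return log_alpha.exp().item()                      # current temperature alpha
```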
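    Step (38) aggregates the local parameters into a global model, θ = Σ_m w_m θ^m, and broadcasts it back to every UAV. Below is a minimal PyTorch-style sketch of that aggregation; weighting each UAV by its number of collected transitions is an illustrative choice, and the code assumes all parameters are floating-point tensors.

```python
# Minimal sketch of the periodic federated aggregation in step (38), assumed PyTorch models.
import copy
import torch

@torch.no_grad()
def federated_aggregate(local_models, sample_counts):
    """Weighted average of local parameters theta^m into a global theta (FedAvg-style)."""
    total = float(sum(sample_counts))
    weights = [n / total for n in sample_counts]       # w_m, summing to 1 (illustrative weighting)
    global_state = copy.deepcopy(local_models[0].state_dict())
    for key in global_state:
        global_state[key] = sum(w * m.state_dict()[key]
                                for w, m in zip(weights, local_models))
    return global_state

def broadcast(global_state, local_models):
    """Set theta^m <- theta for every UAV, as in step (38)."""
    for m in local_models:
        m.load_state_dict(global_state)
```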

    Table 1  Training hyperparameters

    Training parameter                      Value
    Discount factor γ                       0.99
    Soft update rate τ                      0.005
    Replay buffer size $ \mathcal{D} $      5000
    Batch size B                            32
    Network learning rate lr                0.0003
    Actor network learning rate lactor      0.0003
    Temperature learning rate lα            0.0002
    Target entropy                          0.3
    Number of attention heads n             4
    Optimizer                               Adam
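    For reference, the Table 1 values can be gathered into a single configuration object. This is a minimal sketch with illustrative field names; only the values are taken from Table 1.

```python
# Training configuration mirroring Table 1 (field names are illustrative).
from dataclasses import dataclass

@dataclass
class TrainConfig:
    gamma: float = 0.99          # discount factor
    tau: float = 0.005           # soft update rate
    buffer_size: int = 5000      # replay buffer capacity
    batch_size: int = 32         # number of sampled episodes B
    lr: float = 3e-4             # network learning rate
    lr_actor: float = 3e-4       # actor network learning rate
    lr_alpha: float = 2e-4       # entropy temperature learning rate
    target_entropy: float = 0.3  # target entropy for the dynamic temperature
    n_heads: int = 4             # number of attention heads
    optimizer: str = "Adam"
```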
Publication history
  • Received: 2024-12-02
  • Revised: 2025-04-30
  • Published online: 2025-05-09
