A Dynamic Pre-Deployment Strategy of UAVs Based on Multi-Agent Deep Reinforcement Learning
Keywords:
- UAV communication
- Dynamic deployment
- Partially observable Markov game
- Multi-agent deep reinforcement learning
Abstract: Traditional optimization algorithms are ill-suited to the long-term dynamic deployment of communication Unmanned Aerial Vehicles (UAVs): their complexity becomes prohibitive over long time horizons, and they struggle to track dynamic environment information. To address these shortcomings, a dynamic pre-deployment strategy for UAVs based on Multi-Agent Deep Reinforcement Learning (MADRL) is proposed. First, a deep spatio-temporal network model predicts the expected rate demand of users in the coverage area, thereby capturing the dynamic environment information; the concept of user satisfaction is defined to characterize the fairness of the service users receive from the UAVs; and an optimization problem is formulated that maximizes the long-term overall user satisfaction while minimizing the movement and transmission energy consumption of the UAVs. Second, this problem is recast as a Partially Observable Markov Game (POMG), and an MADRL-based algorithm, H-MADDPG, is proposed to solve for the optimal trajectory planning, user association, and power allocation decisions in the POMG. H-MADDPG uses a hybrid network structure to extract features from multi-modal inputs, and adopts a centralized-training, distributed-execution mechanism for efficient training and decision execution. Finally, simulation results verify the effectiveness of the proposed algorithm.
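The exact definition of user satisfaction appears later in the paper and is not reproduced in this excerpt. As a rough illustration only, the sketch below assumes per-user satisfaction is the delivered fraction of the predicted rate demand, capped at 1, so that over-serving one user cannot mask starving another; all names (`user_satisfaction`, `achieved`, `demanded`) are illustrative, not the paper's notation.

```python
import numpy as np

def user_satisfaction(achieved_rate, demanded_rate):
    """Per-user satisfaction as the delivered fraction of the predicted
    rate demand, capped at 1 (an assumed form; see the paper for the
    exact definition)."""
    return np.minimum(achieved_rate / np.maximum(demanded_rate, 1e-9), 1.0)

# Toy example: 4 users, rates in Mbit/s.
achieved = np.array([3.0, 1.5, 4.0, 0.5])
demanded = np.array([2.0, 3.0, 4.0, 2.0])
sat = user_satisfaction(achieved, demanded)
print(sat)          # per-user satisfaction in [0, 1]
print(sat.mean())   # overall satisfaction, the quantity the objective rewards
```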
Algorithm 1 The H-MADDPG algorithm
Input: $ {{\hat {\boldsymbol X}}^{{\text{Tr}}}}(t) $, $ {{\hat {\boldsymbol X}}^{\text{U}}}(t) $ ($ t \in {\mathcal{T}} $), maximum number of episodes E, maximum number of time steps T, $ \gamma $, $ \tau $, I, maximum number of epochs K
Output: $ {\omega _m} $, $ {\omega '_m} $, $ {\theta _m} $, $ {\theta '_m} $
1. Randomly initialize the online/target critic networks and the online/target actor networks of all agents
2. for episode = 1 to E:
3.  Initialize the global state s and every agent's experience replay buffer
4.  for t = 1 to T:
5.   Every agent executes an action based on its observed state
6.   The global state transitions from s to $ s' $; every agent receives its reward and stores the sample in its replay buffer
7.   if the replay buffer is full:
8.    for m = 1 to M:
9.     for epoch = 1 to K:
10.     Repeatedly sample batches of I samples until every sample has taken part in training
11.     For each batch, update $ {\omega _m} $ via Eq. (12) and Eq. (13), and update $ {\theta _m} $ via Eq. (14) and Eq. (15)
12.    end for
13.   end for
14.   Empty the replay buffers
15.  $ s \leftarrow s' $
16.  Update $ {\omega '_m} $ and $ {\theta '_m} $ via Eq. (16) and Eq. (17)
17. end for
18. end for
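Equations (12)–(17) referenced by Algorithm 1 are not reproduced in this excerpt, so the skeleton below assumes the standard MADDPG critic/actor losses and soft target updates with rate τ. Plain MLPs and a random environment stand in for the paper's hybrid networks (Table 2) and UAV simulator; this is a minimal sketch of the centralized-training, distributed-execution loop, not the authors' implementation.

```python
import random
import torch
import torch.nn as nn

# Illustrative sizes and hyperparameters (E, T, K, I, TAU mirror Algorithm 1;
# the dimensions, buffer size, and random environment are placeholders).
OBS_DIM, ACT_DIM, M = 8, 2, 3
E, T, K, I, TAU, GAMMA = 10, 50, 2, 5, 0.1, 0.95
BUFFER_CAP = 100

def mlp(inp, out):
    # Plain MLPs stand in for the paper's hybrid conv+FC networks.
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

# Line 1: online/target actors and critics for every agent; targets start as copies.
actors    = [mlp(OBS_DIM, ACT_DIM) for _ in range(M)]
critics   = [mlp(M * (OBS_DIM + ACT_DIM), 1) for _ in range(M)]
t_actors  = [mlp(OBS_DIM, ACT_DIM) for _ in range(M)]
t_critics = [mlp(M * (OBS_DIM + ACT_DIM), 1) for _ in range(M)]
for src, dst in zip(actors + critics, t_actors + t_critics):
    dst.load_state_dict(src.state_dict())
a_opt = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
c_opt = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def env_step(obs, acts):
    """Stand-in for the UAV environment: random next observations and rewards."""
    return torch.randn(M, OBS_DIM), torch.randn(M)

def soft_update(online, target, tau):
    # Line 16: assumed soft update, theta' <- tau*theta + (1 - tau)*theta'.
    for p, tp in zip(online.parameters(), target.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)

buffer = []
for episode in range(E):                                   # line 2
    obs = torch.randn(M, OBS_DIM)                          # line 3
    for t in range(T):                                     # line 4
        with torch.no_grad():                              # line 5: decentralized acting
            acts = torch.stack([actors[m](obs[m]) for m in range(M)])
        nxt, rew = env_step(obs, acts)                     # line 6
        buffer.append((obs, acts, rew, nxt))
        if len(buffer) >= BUFFER_CAP:                      # line 7
            for m in range(M):                             # line 8
                for _ in range(K):                         # line 9
                    random.shuffle(buffer)                 # line 10: cover every sample
                    for i in range(0, len(buffer), I):
                        o, a, r, n = map(torch.stack, zip(*buffer[i:i + I]))
                        # Critic update with centralized inputs (assumed Eqs. (12)-(13)).
                        with torch.no_grad():
                            na = torch.stack([t_actors[j](n[:, j]) for j in range(M)], 1)
                            y = r[:, m:m + 1] + GAMMA * t_critics[m](
                                torch.cat([n.flatten(1), na.flatten(1)], 1))
                        q = critics[m](torch.cat([o.flatten(1), a.flatten(1)], 1))
                        c_loss = ((q - y) ** 2).mean()
                        c_opt[m].zero_grad(); c_loss.backward(); c_opt[m].step()
                        # Actor update via the deterministic policy gradient
                        # (assumed Eqs. (14)-(15)).
                        a2 = a.clone()
                        a2[:, m] = actors[m](o[:, m])
                        a_loss = -critics[m](
                            torch.cat([o.flatten(1), a2.flatten(1)], 1)).mean()
                        a_opt[m].zero_grad(); a_loss.backward(); a_opt[m].step()
            buffer.clear()                                 # line 14
        obs = nxt                                          # line 15
        for m in range(M):                                 # line 16: soft target updates
            soft_update(actors[m], t_actors[m], TAU)
            soft_update(critics[m], t_critics[m], TAU)
```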
Table 1 Simulation parameter settings

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Carrier frequency fc | 5 GHz | Environment constants a/b | 9.6/0.2 |
| Antenna gain G | 10 dB | Weight coefficients $ \varphi $/$ \lambda $/$ \beta $ | $ 10^{-1} $/$ 10^{-3} $/$ 10^{-1} $ |
| Total bandwidth B | 10 MHz | Penalty coefficients $ {\eta _1} $/$ {\eta _2} $/$ {\eta _3} $ | $ 10^{-2} $/$ 10^{-1} $/$ 10^{2} $ |
| Noise power spectral density N0 | –174 dBm/Hz | Number of UAVs M | 3 |
| $ {\mu _{{\text{LoS}}}} $/$ {\sigma _{{\text{LoS}}}} $/$ {\mu _{{\text{NLoS}}}} $/$ {\sigma _{{\text{NLoS}}}} $ | 1.6/8.41/23/33.78 | Pmax/dmax/dmin | 30 W/1000 m/100 m |
| Area length L / width W | 10 (×200 m) / 10 (×200 m) | Training parameters E/T/K/I/τ | 1000/200/100/5/0.1 |
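For readers reproducing the setup, the values in Table 1 translate directly into a configuration object; the sketch below just collects them in one place (the dictionary keys are illustrative names, not the paper's notation).

```python
# Simulation constants from Table 1 (keys are illustrative names).
SIM_CONFIG = {
    "carrier_freq_hz": 5e9,          # fc
    "antenna_gain_db": 10.0,         # G
    "bandwidth_hz": 10e6,            # B
    "noise_psd_dbm_per_hz": -174.0,  # N0
    "env_constants": (9.6, 0.2),     # a, b
    "weights": {"phi": 1e-1, "lambda": 1e-3, "beta": 1e-1},
    "penalties": {"eta1": 1e-2, "eta2": 1e-1, "eta3": 1e2},
    "num_uavs": 3,                   # M
    "mu_los": 1.6, "sigma_los": 8.41, "mu_nlos": 23.0, "sigma_nlos": 33.78,
    "p_max_w": 30.0, "d_max_m": 1000.0, "d_min_m": 100.0,
    "area_cells": (10, 10), "cell_size_m": 200.0,        # L x W grid of 200 m cells
    "training": {"episodes": 1000, "steps": 200, "epochs": 100,
                 "batch": 5, "tau": 0.1},                # E, T, K, I, tau
}
```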
Table 2 Comparison of the H-MADDPG and MADDPG network structures

| Network | Layer | H-MADDPG | MADDPG |
|---|---|---|---|
| Actor | Conv layer 1 | 32 3×3 kernels | none |
| Actor | Pooling layer 1 | 2×2 average pooling | none |
| Actor | Conv layer 2 | 16 3×3 kernels | none |
| Actor | Pooling layer 2 | 2×2 average pooling | none |
| Actor | Fully connected layer 1 | 256 neurons | 512 neurons |
| Actor | Fully connected layer 2 | 128 neurons | 256 neurons |
| Actor | Fully connected layer 3 | none | 50 neurons |
| Critic | Conv layer 1 | 32 3×3 kernels | none |
| Critic | Pooling layer 1 | 2×2 average pooling | none |
| Critic | Conv layer 2 | 16 3×3 kernels | none |
| Critic | Pooling layer 2 | 2×2 average pooling | none |
| Critic | Fully connected layer 1 | 512 neurons | 1024 neurons |
| Critic | Fully connected layer 2 | 256 neurons | 512 neurons |
| Critic | Fully connected layer 3 | 128 neurons | 200 neurons |
| Critic | Fully connected layer 4 | none | 20 neurons |
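Table 2's H-MADDPG actor maps directly onto a small convolutional network. The sketch below follows the table's layer sizes, assuming the convolutional branch consumes the grid-shaped rate-demand prediction while the fully connected head also takes a flat state vector and produces the action; the input channels, 10×10 grid, vector size, action size, and Tanh output head are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class HybridActor(nn.Module):
    """Actor following Table 2's H-MADDPG column: two conv + average-pooling
    stages for the image-like input, then two fully connected layers.
    Dimensions other than the layer widths in Table 2 are assumptions."""
    def __init__(self, grid_ch=1, grid_size=10, vec_dim=4, act_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(grid_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                      # pooling layer 1: 2x2 average
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                      # pooling layer 2: 2x2 average
        )
        conv_out = 16 * (grid_size // 4) ** 2     # feature size after two 2x2 pools
        self.fc = nn.Sequential(
            nn.Linear(conv_out + vec_dim, 256), nn.ReLU(),   # FC layer 1
            nn.Linear(256, 128), nn.ReLU(),                  # FC layer 2
            nn.Linear(128, act_dim), nn.Tanh(),              # action head (assumed)
        )

    def forward(self, grid, vec):
        # grid: (B, C, H, W) predicted demand map; vec: (B, vec_dim) UAV state.
        h = self.conv(grid).flatten(1)
        return self.fc(torch.cat([h, vec], dim=1))

# Toy forward pass on a 10x10 demand grid.
actor = HybridActor()
a = actor(torch.randn(8, 1, 10, 10), torch.randn(8, 4))
print(a.shape)  # torch.Size([8, 2])
```

The critic would follow the same pattern with its own widths from Table 2 (512/256/128), taking the joint observations and actions of all agents as input.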
Table 3 Weight coefficients used by each algorithm

| Algorithm | Overall user satisfaction weight $ \varphi $ | UAV unit movement power weight $ \beta $ |
|---|---|---|
| H-MADDPG1 | $ 1 \times 10^{-1} $ | $ 1 \times 10^{-3} $ |
| H-MADDPG2 | $ 0.9 \times 10^{-1} $ | $ 1.05 \times 10^{-3} $ |
| H-MADDPG3 | $ 0.7 \times 10^{-1} $ | $ 1.1 \times 10^{-3} $ |
| EED | $ 1 \times 10^{-1} $ | $ 1 \times 10^{-3} $ |