A Resource Allocation Algorithm for Space-Air-Ground Integrated Network Based on Deep Reinforcement Learning
Abstract: The Space-Air-Ground Integrated Network (SAGIN) can effectively meet the communication needs of diverse service types by improving the resource utilization of the ground network, but the adaptability and robustness of the system and the Quality of Service (QoS) of different users have been neglected. To address this problem, this paper proposes a Deep Reinforcement Learning (DRL) resource allocation algorithm for urban and suburban communications under the SAGIN architecture. Based on the Reference Signal Received Power (RSRP) defined in the 3rd Generation Partnership Project (3GPP) standard, and taking ground co-channel interference into account, an optimization problem that maximizes the downlink throughput of system users is formulated, with the time-frequency resources of base stations in different domains as constraints. When the Deep Q-Network (DQN) algorithm is used to solve this optimization problem, a reward function is defined that jointly accounts for users' QoS requirements, system adaptability, and system robustness. Simulation results show that, when the service requirements of unmanned vehicles, immersive services, and ordinary mobile terminal communications are considered together, the reward function value that characterizes system performance is 39.1% higher than that of the greedy algorithm after 2 000 iterations. For the unmanned vehicle service, resource allocation with the DQN algorithm lowers packet loss by 38.07% on average and delay by 6.05% relative to the greedy algorithm.
Algorithm 1 DQN resource allocation procedure under SAGIN
Input: replay pool $D$ initialized with capacity $N$; estimation network $Q$ with random parameters $\theta$; target network $Q'$ with parameters $\theta'$, $\theta' = \theta$; discount factor $\gamma$
Output: action vector $\boldsymbol{a}_t$
for episode $= 1, \ldots, M$ do:
    Initialize the environment state vector $\boldsymbol{s}_t$
    for $t = 1, \ldots, T$ do:
        With probability $\varepsilon$ select a random action $\boldsymbol{a}_t$; otherwise, with probability $1 - \varepsilon$, select $\boldsymbol{a}_t = \arg\max_{\boldsymbol{a}} Q(\boldsymbol{s}_t, \boldsymbol{a}; \theta)$
        Execute action $\boldsymbol{a}_t$, reach state $\boldsymbol{s}_{t+1}$, and obtain reward $r_t$
        Store $(\boldsymbol{s}_t, \boldsymbol{a}_t, r_t, \boldsymbol{s}_{t+1})$ in the replay pool $D$
        Sample uniformly at random a mini-batch of transitions $(\boldsymbol{s}_{t'}, \boldsymbol{a}_{t'}, r_{t'}, \boldsymbol{s}_{t'+1})$ from $D$
        Set $y_{t'} = \begin{cases} r_{t'}, & \text{if the episode ends at } t'+1 \\ r_{t'} + \gamma \max_{\boldsymbol{a}} Q'(\boldsymbol{s}_{t'+1}, \boldsymbol{a}; \theta'), & \text{otherwise} \end{cases}$
        Update the network parameters $\theta$ by gradient descent on the loss $L(\theta) = \left(y_{t'} - Q(\boldsymbol{s}_{t'}, \boldsymbol{a}_{t'}; \theta)\right)^2$
        Update the target network: $Q' = Q$
    end for
end for

Table 1 Main simulation parameters for SAGIN resource allocation
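| Parameter | Value |
|---|---|
| Satellite carrier frequency $f_{\text{c}}$ (GHz) | 28.4 |
| Satellite bandwidth $B_{\text{w}}$ (MHz) | 220 |
| Satellite effective isotropic radiated power $\text{EIRP}$ (dBW) | 62 |
| Satellite path loss $\text{PL}$ (dB) | 188.4 |
| Satellite atmospheric loss $\text{AL}$ (dB) | 0.1 |
| Satellite $G/T$ (dB/K) | 9.7 |
| UAV carrier frequency $f_{\text{c}}$ (MHz) | 1000 |
| UAV bandwidth $B_{\text{w}}$ (MHz) | 30 |
| UAV antenna gain $G$ (dBi) | 16 |
| UAV transmitter antenna height $h_{\text{b}}$ (m) | 50 |
| UAV subcarrier frequency $P$ (dB) | 20 |
| UAV feeder loss FL (dB) | 4 |
| Ground base station carrier frequency $f_{\text{c}}$ (MHz) | 1700 |
| Ground base station antenna gain $G$ (dBi) | 5 |
| Ground base station subcarrier frequency $P$ (dB) | 20 |
| Ground base station feeder loss FL (dB) | 1 |
| Ground base station transmitter antenna height $h_{\text{b}}$ (m) | 40 |
| User receiver antenna height $h_{\text{m}}$ (m) | 1 |

Table 2 DQN algorithm parameters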
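| Parameter | Value |
|---|---|
| $t_{\text{duration}}$ (s) | 20 |
| $t_{\text{sample}}$ (s) | 0.01 |
| episodes | 2 000 |
| ITER | 2 000 |
| Learning rate | 1e-3 |
| Discount factor $\gamma$ | 0.95 |
| batch size | 100 |
| memory size | 5e5 |

Table 3 User classification for the SAGIN resource allocation simulation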
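| Service | Label | User speed (m/s) | Service characteristics | Location | Downlink rate requirement (Mbit/s) | Service level $\alpha$ |
|---|---|---|---|---|---|---|
| Immersive services (e.g. AR, HD video) | $\text{UE}^{0}, \text{UE}^{1}, \text{UE}^{2}$ | 0 | High bandwidth, fixed | Urban | 15 | 1 |
| Unmanned vehicles | $\text{UE}^{3}$ | 20 | Very high bandwidth, high mobility | Urban | 25 | 3 |
| Ordinary mobile terminal communication | $\text{UE}^{4}, \text{UE}^{5}$ | 1.2 | Low bandwidth, low mobility | Suburban | 3 | 1 |

Table 4 Users allocated the most resources by each base station under different algorithms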
| Base station | $\text{BS}^{0}$ | $\text{BS}^{1}$ | $\text{BS}^{2}$ | $\text{BS}^{3}$ |
|---|---|---|---|---|
| DQN algorithm | $\text{UE}^{5}$ | $\text{UE}^{4}$ | $\text{UE}^{1}, \text{UE}^{3}$ | $\text{UE}^{0}, \text{UE}^{2}$ |
| Greedy algorithm | $\text{UE}^{4}, \text{UE}^{5}$ | $-$ | $\text{UE}^{1}, \text{UE}^{2}, \text{UE}^{3}$ | $\text{UE}^{0}$ |
| Random algorithm | $\text{UE}^{0}, \text{UE}^{5}$ | $\text{UE}^{4}$ | $\text{UE}^{3}$ | $\text{UE}^{1}, \text{UE}^{2}$ |

Table 5 Decrease in system reward after adding co-channel interference under different algorithms
| Algorithm | Decrease in system reward $R$ |
|---|---|
| DQN algorithm | 289.97 |
| Greedy algorithm | 455.16 |
| Random algorithm | 967.49 |
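
The DQN procedure listed above (epsilon-greedy action selection, replay pool, target value $y_{t'}$, squared-error loss, target network update) can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: a small Q-table stands in for the neural network, `toy_step` is a hypothetical one-state environment, and `epsilon = 0.1` is an assumed value, since the source does not state it. The discount factor, batch size, learning rate, and memory size are taken from the DQN parameter table.

```python
import random
from collections import deque

GAMMA = 0.95        # discount factor (DQN parameter table)
BATCH_SIZE = 100    # mini-batch size (DQN parameter table)
LR = 1e-3           # learning rate (DQN parameter table)
MEMORY = 500_000    # replay pool capacity, 5e5 (DQN parameter table)


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """y = r if the episode ends at t'+1, otherwise r + gamma * max_a Q'(s', a)."""
    return reward if done else reward + gamma * max(next_q_values)


def train(env_step, n_states, n_actions, episodes, steps, epsilon=0.1, seed=0):
    """Tabular stand-in for the DQN loop; epsilon is an assumed value."""
    rng = random.Random(seed)
    replay = deque(maxlen=MEMORY)                       # replay pool D
    q = [[0.0] * n_actions for _ in range(n_states)]    # estimation "network" Q
    q_target = [row[:] for row in q]                    # target network Q' = Q
    for _ in range(episodes):
        state = 0                                       # initial environment state
        for t in range(steps):
            action = select_action(q[state], epsilon, rng)
            next_state, reward = env_step(state, action)
            done = t == steps - 1                       # last step ends the episode
            replay.append((state, action, reward, next_state, done))
            # Uniform random mini-batch from the replay pool
            batch = rng.sample(list(replay), min(BATCH_SIZE, len(replay)))
            for s, a, r, s2, d in batch:
                y = td_target(r, q_target[s2], d)
                # Gradient descent step on L = (y - Q(s, a))^2 for the entry Q(s, a)
                q[s][a] += LR * 2.0 * (y - q[s][a])
            q_target = [row[:] for row in q]            # update Q' = Q
            state = next_state
    return q


def toy_step(state, action):
    """Hypothetical one-state environment: the reward equals the action index."""
    return 0, float(action)


if __name__ == "__main__":
    print(train(toy_step, n_states=1, n_actions=2, episodes=20, steps=20))
```

In a real SAGIN setting the state would encode user RSRP and resource occupancy, the action would be a resource allocation vector, and the table update would be replaced by a gradient step on the network parameters $\theta$; those mappings are not specified in this excerpt and are left out of the sketch.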
