A Deep Reinforcement Learning Communication Jamming Resource Allocation Algorithm Fused with Noise Network

PENG Xiang, XU Hua, JIANG Lei, RAO Ning, SONG Bailin

Citation: PENG Xiang, XU Hua, JIANG Lei, RAO Ning, SONG Bailin. A Deep Reinforcement Learning Communication Jamming Resource Allocation Algorithm Fused with Noise Network[J]. Journal of Electronics & Information Technology, 2023, 45(3): 1043-1054. doi: 10.11999/JEIT220066

doi: 10.11999/JEIT220066
Article information
    Author biographies:

    PENG Xiang: male, Ph.D. candidate; research interest: intelligent decision-making for communication countermeasures

    XU Hua: male, Ph.D., professor; research interests: communication signal processing, intelligent communication countermeasures

    JIANG Lei: male, Ph.D., associate professor; research interests: communication anti-jamming, intelligent communication countermeasures

    RAO Ning: male, Ph.D. candidate; research interest: intelligent decision-making for communication countermeasures

    SONG Bailin: male, M.S.; research interest: intelligent decision-making for communication countermeasures

    Corresponding author:

    PENG Xiang, pengxiang0538@163.com

  • CLC number: TN975


  • Abstract: Traditional jamming resource allocation algorithms require fairly complete prior information when handling nonlinear combinatorial optimization problems, and their decision dimensionality is small, so they cannot meet the requirements of modern communication countermeasures. To address this, this paper proposes a deep reinforcement learning communication jamming resource allocation algorithm Fused with a Noise Network (FNNDRL). Drawing on the idea of noisy networks, the algorithm designs twin noisy evaluation (critic) networks: while avoiding Q-value overestimation, the added randomness of the evaluation networks guarantees exploration during training. Based on the physical meaning of probability entropy, a policy-network loss function improved with the entropy of the policy distribution is designed, which maximizes the cumulative reward and the policy-distribution entropy at the same time, avoiding convergence to local optima during policy optimization. Simulation results show that the proposed algorithm outperforms the compared average-allocation and reinforcement learning methods on the jamming resource allocation problem, with high stability and strong adaptability to high-dimensional decision spaces.
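The noisy-network idea referenced in the abstract replaces deterministic linear layers with layers whose weights are perturbed by learnable, parameterized Gaussian noise ($\theta = \mu + \sigma \circ \xi$). Below is a minimal PyTorch sketch of such a layer, using the factorized Gaussian noise of Fortunato et al.; the class name, initialization constants, and sizes are illustrative assumptions, not taken from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable Gaussian noise on weights and biases:
    theta = mu + sigma * xi, where xi is resampled via reset_noise()."""
    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Learnable mean and noise-scale parameters
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise samples are buffers, not trained
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma_init * bound)
        nn.init.constant_(self.bias_sigma, sigma_init * bound)
        self.reset_noise()

    @staticmethod
    def _scaled_noise(size: int) -> torch.Tensor:
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self) -> None:
        # Factorized Gaussian noise: outer product of input/output noise vectors
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight_mu + self.weight_sigma * self.weight_eps
        bias = self.bias_mu + self.bias_sigma * self.bias_eps
        return F.linear(x, weight, bias)
```

Because the noise scales $\sigma$ are trained along with the means, the network can learn how much exploration noise to keep in each layer rather than relying on an externally scheduled exploration parameter.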
  • Figure 1  Typical communication countermeasure scenario

    Figure 2  Physical meaning and spatial relationships of the communication jamming equation

    Figure 3  Schematic of the noisy network

    Figure 4  Algorithm framework

    Figure 5  Training reward curves under different entropy coefficients (3 vs 5)

    Figure 6  Full-suppression performance under different entropy coefficients (3 vs 5)

    Figure 7  Performance in jamming more than 80% of the communication links under different entropy coefficients (3 vs 5)

    Figure 8  Test performance under different entropy coefficients (3 vs 5)

    Figure 9  Performance in successfully jamming more than 80% of the communication links (3 vs 5)

    Figure 10  Full-suppression performance (3 vs 5)

    Figure 11  Cumulative reward (3 vs 5)

    Figure 12  Algorithm test performance (3 vs 5)

    Figure 13  Performance in successfully jamming more than 80% of the communication links (5 vs 8)

    Figure 14  Full-suppression performance (5 vs 8)

    Figure 15  Cumulative reward (5 vs 8)

    Figure 16  Suppression-percentage test performance (5 vs 8)

    Algorithm 1  Deep reinforcement learning communication jamming resource allocation algorithm fused with a noise network
     Step 1  Randomly initialize the evaluation networks Noisy Q1 and Noisy Q2 with parameters $ {\theta _1} $ and $ {\theta _2} $ ($ \theta \triangleq \mu + \sigma \circ \xi $, where $ \xi $ is Gaussian noise);
     Step 2  Randomly initialize the policy network (Policy Network) with parameters $ \varphi $;
     Step 3  Initialize the target Noisy Q1 and target Noisy Q2 networks with parameters $ {\bar \theta _1} \leftarrow {\theta _1},\ {\bar \theta _2} \leftarrow {\theta _2} $;
     Step 4  Initialize the experience replay buffer D;
     Step 5  For episodes in 100 do:
          Initialize the state $S_0$
          For steps in 500 do:
           (1) Select action $a$ according to the current state, execute $a$, and obtain the reward $R$ and the next state ${\boldsymbol{S}}'$;
           (2) Store $\left( {{\boldsymbol{S}},a,R,{\boldsymbol{S}}'} \right)$ in the replay buffer D, ${\boldsymbol{S}} \leftarrow {\boldsymbol{S}}'$;
           If the current size of the replay buffer > C:
            (a) Randomly sample a mini-batch $\left( {{S_i},{a_i},{R_i},{S_{i + 1}}} \right)$ from the replay buffer;
            (b) Compute the target value: $ {y_i} = {R_i} + \gamma \mathop {\min }\limits_{j = 1,2} {\bar Q_j}\left( {{S_{i + 1}},\bar a',{\xi _{i + 1}}|\bar \theta } \right) $;
            (c) Minimize the evaluation-network loss: $L\left( \theta \right) = \dfrac{1}{B}\displaystyle\sum\nolimits_i {\left( {{R_i} + \gamma \mathop {\min }\limits_{j = 1,2} {{\bar Q}_j}\left( {{S_{i + 1}},\bar a',{\xi _{i + 1}}|\bar \theta } \right) - Q\left( {{S_i},{a_i},{\xi _i}|\theta } \right)} \right)^2}$;
            Update the evaluation-network parameters $ {\theta _1} $ and $ {\theta _2} $ by gradient descent: $\theta \leftarrow \theta - {\alpha _\theta } \cdot {\nabla _\theta }L\left( \theta \right)$;
            (d) Minimize the policy-network loss: $ J\left( \varphi \right) = \dfrac{1}{B}\displaystyle\sum\nolimits_i {\left( {\mathop {\min }\limits_{j = 1,2} {Q_j}\left( {{S_i},\bar a,{\xi _i}|\theta } \right) - \alpha \lg {\pi _\varphi }\left( {\bar a|{S_i}} \right)} \right)^2} $;
            Update the policy-network parameters $ \varphi $ by gradient descent: $\varphi \leftarrow \varphi - {\alpha _\varphi } \cdot {\nabla _\varphi }J\left( \varphi \right)$;
            (e) Soft-update the target Noisy Q1 and target Noisy Q2 network parameters at every step: $ {\bar \theta _1} \leftarrow \tau \cdot {\theta _1} + \left( {1 - \tau } \right){\bar \theta _1} $, $ {\bar \theta _2} \leftarrow \tau \cdot {\theta _2} + \left( {1 - \tau } \right){\bar \theta _2} $;
          End for
        End for
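A compact PyTorch sketch of one inner-loop update (steps (a)-(e) of Algorithm 1) could look like the following. The network objects, the single optimizer over both critics, the entropy coefficient `alpha`, and the assumption that the policy returns an action together with its log-probability are all illustrative choices; the loss expressions mirror steps (c) and (d) as written above.

```python
import torch

def fnndrl_update(batch, q1, q2, q1_tgt, q2_tgt, policy,
                  q_optim, pi_optim, gamma=0.99, alpha=0.2, tau=0.01):
    """One update step; assumes the Q networks are noisy-net modules exposing
    reset_noise() and that policy(S) returns (action, log_prob)."""
    S, a, R, S_next = batch                               # mini-batch of B transitions

    # (b) Target value using the minimum of the two target noisy critics
    with torch.no_grad():
        a_next, _ = policy(S_next)
        q_next = torch.min(q1_tgt(S_next, a_next), q2_tgt(S_next, a_next))
        y = R + gamma * q_next

    # (c) Critic loss: mean squared TD error for both evaluation networks
    q_loss = ((q1(S, a) - y) ** 2).mean() + ((q2(S, a) - y) ** 2).mean()
    q_optim.zero_grad(); q_loss.backward(); q_optim.step()

    # (d) Policy loss following the squared form of step (d):
    # (min_j Q_j(s, a~) - alpha * log pi(a~|s))^2 averaged over the batch
    a_new, log_pi = policy(S)
    q_min = torch.min(q1(S, a_new), q2(S, a_new))
    pi_loss = ((q_min - alpha * log_pi) ** 2).mean()
    pi_optim.zero_grad(); pi_loss.backward(); pi_optim.step()

    # (e) Soft update of both target critics
    for net, tgt in ((q1, q1_tgt), (q2, q2_tgt)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

    # Resample the layer noise so the next step explores with fresh xi
    for net in (q1, q2, q1_tgt, q2_tgt):
        for m in net.modules():
            if hasattr(m, "reset_noise"):
                m.reset_noise()
```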

    Table 1  Model parameters

    Communication link antenna gain ${G_{\rm{t}}}$: 5 dB
    Jamming link antenna gain ${G_{\rm{J}}}$: 8 dB
    Maximum jamming power of the jammer ${P_{{\rm{J}}\max } }$: 90 dBm
    Communication signal transmit power ${P_{\rm{t}}}$: 55 dBm
    Jamming signal propagation distance ${r_j}$: 300 km
    Center frequency $ f $: 225 MHz
    Symbol error rate threshold $ \kappa $: 0.05
    Mini-batch size B: 256
    Experience replay buffer capacity C: $2^{17}$
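Table 1's physical parameters feed the paper's jamming equation (illustrated in Figure 2), which is not reproduced in this excerpt. Purely as a generic illustration of how such dB/dBm quantities combine, a free-space link-budget helper is sketched below; the free-space path-loss model and the omission of the receive antenna gain are assumptions, and the authors' actual propagation model may differ.

```python
import math

def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss in dB, distance in km and frequency in MHz."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44

def received_jamming_power_dbm(p_j_dbm: float, g_j_db: float,
                               distance_km: float, freq_mhz: float) -> float:
    """Jamming power arriving at the receiver (receive antenna gain omitted)."""
    return p_j_dbm + g_j_db - fspl_db(distance_km, freq_mhz)

# Example with the Table 1 values: 90 dBm, 8 dB, 300 km, 225 MHz
print(received_jamming_power_dbm(90, 8, 300, 225))   # about -31 dBm
```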

    Table 2  FNNDRL algorithm parameters (columns: evaluation networks Noisy Q1/Q2 | target Noisy Q1/Q2 | Policy Network)

    Learning rate: 0.005 | 0.005 | 0.003
    Input layer: Linear, 30 (80) | Linear, 30 (80) | Linear, 15 (40)
    Hidden layer 1: Noisy Linear, 256, ReLU | Noisy Linear, 256, ReLU | Linear, 256, ReLU
    Hidden layer 2: Noisy Linear, 256, ReLU | Noisy Linear, 256, ReLU | Linear, 256, ReLU
    Output layer: Linear, 1 | Linear, 1 | 30 (80), Tanh
    Soft-update coefficient $\tau$: 0.01 (0.001) | 0.01 (0.001) | ——
    Dropout probability: p = 0.2 | p = 0.2 | p = 0.2
    Discount factor: 0.99 (Policy Network: ——)
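Using the NoisyLinear layer sketched earlier, a critic matching the Table 2 layout (input Linear layer, two Noisy Linear 256 + ReLU hidden layers, Linear output of 1, dropout p = 0.2) might be assembled as follows. The placement of dropout, the mapping of the parenthesized dimensions to the 5 vs 8 scenario, and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# NoisyLinear is the layer sketched after the abstract.

class NoisyCritic(nn.Module):
    """Evaluation (Q) network following the Table 2 layout.
    Defaults match the 3 vs 5 scenario (state 15, action 15 -> input 30);
    the 5 vs 8 scenario would presumably use 40/40 -> input 80."""
    def __init__(self, state_dim=15, action_dim=15, hidden=256, p_drop=0.2):
        super().__init__()
        self.inp = nn.Linear(state_dim + action_dim, hidden)   # input layer: Linear, 30 (80)
        self.h1 = NoisyLinear(hidden, hidden)                   # hidden layer 1: Noisy Linear, 256
        self.h2 = NoisyLinear(hidden, hidden)                   # hidden layer 2: Noisy Linear, 256
        self.out = nn.Linear(hidden, 1)                         # output layer: Linear, 1
        self.drop = nn.Dropout(p_drop)                          # dropout p = 0.2

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)
        x = torch.relu(self.inp(x))
        x = self.drop(torch.relu(self.h1(x)))
        x = self.drop(torch.relu(self.h2(x)))
        return self.out(x)

    def reset_noise(self) -> None:
        self.h1.reset_noise()
        self.h2.reset_noise()
```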

    Table 3  DDPG algorithm parameters (columns: Q network | target Q network | policy network | target policy network)

    Learning rate: 0.001 (Q network) | 0.001 (policy network)
    Input layer: Linear, 30 (80) | Linear, 30 (80) | Linear, 15 (40) | Linear, 15 (40)
    Hidden layer: Linear, 300 (256), ReLU | Linear, 300 (256), ReLU | Linear, 300 (256), ReLU | Linear, 300 (256), ReLU
    Output layer: Linear, 1 | Linear, 1 | 15 (40), Tanh | 15 (40), Tanh
    Soft-update coefficient $\tau$: 0.01 | 0.01 | 0.01 | 0.01
    Discount factor: 0.9

    Table 4  A2C algorithm parameters (columns: Critic network | Actor network)

    Learning rate: 0.001 | 0.0001
    Input layer: Linear, 15 (40) | Linear, 15 (40)
    Hidden layer: Linear, 300 (256), ReLU | Linear, 300 (256), ReLU
    Output layer: Linear, 1 | 15 (40), Tanh
    Discount factor: 0.99
Publication history
  • Received: 2022-01-13
  • Revised: 2022-07-12
  • Published online: 2022-07-15
  • Issue date: 2023-03-10
