A Deep Reinforcement Learning Communication Jamming Resource Allocation Algorithm Fused with Noise Network

PENG Xiang, XU Hua, JIANG Lei, RAO Ning, SONG Bailin

Citation: PENG Xiang, XU Hua, JIANG Lei, RAO Ning, SONG Bailin. A Deep Reinforcement Learning Communication Jamming Resource Allocation Algorithm Fused with Noise Network[J]. Journal of Electronics & Information Technology, 2023, 45(3): 1043-1054. doi: 10.11999/JEIT220066

doi: 10.11999/JEIT220066
Article information
    Author biographies:

    PENG Xiang: male, Ph.D. candidate; research interest: intelligent decision-making for communication countermeasures

    XU Hua: male, Ph.D., professor; research interests: communication signal processing, intelligent communication countermeasures

    JIANG Lei: male, Ph.D., associate professor; research interests: communication anti-jamming, intelligent communication countermeasures

    RAO Ning: male, Ph.D. candidate; research interest: intelligent decision-making for communication countermeasures

    SONG Bailin: male, M.S.; research interest: intelligent decision-making for communication countermeasures

    Corresponding author:

    PENG Xiang, pengxiang0538@163.com

  • CLC number: TN975


  • Abstract: Traditional jamming resource allocation algorithms require fairly complete prior information when handling nonlinear combinatorial optimization problems, and their decision dimensionality is small, so they cannot meet the requirements of modern communication countermeasures. To address this, this paper proposes a deep reinforcement learning communication jamming resource allocation algorithm Fused with a Noise Network (FNNDRL). Drawing on the idea of noisy networks, the algorithm designs twin noisy evaluation (critic) networks: while avoiding Q-value overestimation, the added randomness of the evaluation networks guarantees exploration during training. Based on the physical meaning of probability entropy, a policy-network loss function improved with the entropy of the policy distribution is designed, which maximizes the cumulative reward and the policy-distribution entropy at the same time, avoiding convergence to local optima during policy optimization. Simulation results show that the proposed algorithm outperforms the compared average-allocation and reinforcement learning methods on the jamming resource allocation problem, with high stability and strong adaptability to high-dimensional decision spaces.
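The noisy-network idea referenced in the abstract replaces deterministic linear layers with layers whose weights are perturbed by learnable, parameterized Gaussian noise ($\theta = \mu + \sigma \circ \xi$). Below is a minimal PyTorch sketch of such a layer, using the factorized Gaussian noise of Fortunato et al.; the class name, initialization constants, and sizes are illustrative assumptions, not taken from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable Gaussian noise on weights and biases:
    theta = mu + sigma * xi, where xi is resampled via reset_noise()."""
    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Learnable mean and noise-scale parameters
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise samples are buffers, not trained
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma_init * bound)
        nn.init.constant_(self.bias_sigma, sigma_init * bound)
        self.reset_noise()

    @staticmethod
    def _scaled_noise(size: int) -> torch.Tensor:
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self) -> None:
        # Factorized Gaussian noise: outer product of input/output noise vectors
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight_mu + self.weight_sigma * self.weight_eps
        bias = self.bias_mu + self.bias_sigma * self.bias_eps
        return F.linear(x, weight, bias)
```

Because the noise scales $\sigma$ are trained along with the means, the network can learn how much exploration noise to keep in each layer rather than relying on an externally scheduled exploration parameter.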
  • Figure 1  Typical communication countermeasure scenario

    Figure 2  Physical meaning and spatial relationships of the communication jamming equation

    Figure 3  Schematic of the noisy network

    Figure 4  Algorithm framework

    Figure 5  Training reward curves under different entropy coefficients (3 vs 5)

    Figure 6  Full-suppression performance under different entropy coefficients (3 vs 5)

    Figure 7  Performance in jamming more than 80% of the communication links under different entropy coefficients (3 vs 5)

    Figure 8  Test performance under different entropy coefficients (3 vs 5)

    Figure 9  Performance in successfully jamming more than 80% of the communication links (3 vs 5)

    Figure 10  Full-suppression performance (3 vs 5)

    Figure 11  Cumulative reward (3 vs 5)

    Figure 12  Algorithm test performance (3 vs 5)

    Figure 13  Performance in successfully jamming more than 80% of the communication links (5 vs 8)

    Figure 14  Full-suppression performance (5 vs 8)

    Figure 15  Cumulative reward (5 vs 8)

    Figure 16  Suppression-percentage test performance (5 vs 8)

    Algorithm 1  Deep reinforcement learning communication jamming resource allocation algorithm fused with a noise network
     Step 1  Randomly initialize the evaluation networks Noisy Q1 and Noisy Q2 with parameters $ {\theta _1} $ and $ {\theta _2} $ ($ \theta \triangleq \mu + \sigma \circ \xi $, where $ \xi $ is Gaussian noise);
     Step 2  Randomly initialize the policy network (Policy Network) with parameters $ \varphi $;
     Step 3  Initialize the target Noisy Q1 and target Noisy Q2 networks with parameters $ {\bar \theta _1} \leftarrow {\theta _1},\ {\bar \theta _2} \leftarrow {\theta _2} $;
     Step 4  Initialize the experience replay buffer D;
     Step 5  For episodes in 100 do:
          Initialize the state $S_0$
          For steps in 500 do:
           (1) Select action $a$ according to the current state, execute $a$, and obtain the reward $R$ and the next state ${\boldsymbol{S}}'$;
           (2) Store $\left( {{\boldsymbol{S}},a,R,{\boldsymbol{S}}'} \right)$ in the replay buffer D, ${\boldsymbol{S}} \leftarrow {\boldsymbol{S}}'$;
           If the current size of the replay buffer > C:
            (a) Randomly sample a mini-batch $\left( {{S_i},{a_i},{R_i},{S_{i + 1}}} \right)$ from the replay buffer;
            (b) Compute the target value: $ {y_i} = {R_i} + \gamma \mathop {\min }\limits_{j = 1,2} {\bar Q_j}\left( {{S_{i + 1}},\bar a',{\xi _{i + 1}}|\bar \theta } \right) $;
            (c) Minimize the evaluation-network loss: $L\left( \theta \right) = \dfrac{1}{B}\displaystyle\sum\nolimits_i {\left( {{R_i} + \gamma \mathop {\min }\limits_{j = 1,2} {{\bar Q}_j}\left( {{S_{i + 1}},\bar a',{\xi _{i + 1}}|\bar \theta } \right) - Q\left( {{S_i},{a_i},{\xi _i}|\theta } \right)} \right)^2}$;
            Update the evaluation-network parameters $ {\theta _1} $ and $ {\theta _2} $ by gradient descent: $\theta \leftarrow \theta - {\alpha _\theta } \cdot {\nabla _\theta }L\left( \theta \right)$;
            (d) Minimize the policy-network loss: $ J\left( \varphi \right) = \dfrac{1}{B}\displaystyle\sum\nolimits_i {\left( {\mathop {\min }\limits_{j = 1,2} {Q_j}\left( {{S_i},\bar a,{\xi _i}|\theta } \right) - \alpha \lg {\pi _\varphi }\left( {\bar a|{S_i}} \right)} \right)^2} $;
            Update the policy-network parameters $ \varphi $ by gradient descent: $\varphi \leftarrow \varphi - {\alpha _\varphi } \cdot {\nabla _\varphi }J\left( \varphi \right)$;
            (e) Soft-update the target Noisy Q1 and target Noisy Q2 network parameters at every step: $ {\bar \theta _1} \leftarrow \tau \cdot {\theta _1} + \left( {1 - \tau } \right){\bar \theta _1} $, $ {\bar \theta _2} \leftarrow \tau \cdot {\theta _2} + \left( {1 - \tau } \right){\bar \theta _2} $;
          End for
        End for
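A compact PyTorch sketch of one inner-loop update (steps (a)-(e) of Algorithm 1) could look like the following. The network objects, the single optimizer over both critics, the entropy coefficient `alpha`, and the assumption that the policy returns an action together with its log-probability are all illustrative choices; the loss expressions mirror steps (c) and (d) as written above.

```python
import torch

def fnndrl_update(batch, q1, q2, q1_tgt, q2_tgt, policy,
                  q_optim, pi_optim, gamma=0.99, alpha=0.2, tau=0.01):
    """One update step; assumes the Q networks are noisy-net modules exposing
    reset_noise() and that policy(S) returns (action, log_prob)."""
    S, a, R, S_next = batch                               # mini-batch of B transitions

    # (b) Target value using the minimum of the two target noisy critics
    with torch.no_grad():
        a_next, _ = policy(S_next)
        q_next = torch.min(q1_tgt(S_next, a_next), q2_tgt(S_next, a_next))
        y = R + gamma * q_next

    # (c) Critic loss: mean squared TD error for both evaluation networks
    q_loss = ((q1(S, a) - y) ** 2).mean() + ((q2(S, a) - y) ** 2).mean()
    q_optim.zero_grad(); q_loss.backward(); q_optim.step()

    # (d) Policy loss following the squared form of step (d):
    # (min_j Q_j(s, a~) - alpha * log pi(a~|s))^2 averaged over the batch
    a_new, log_pi = policy(S)
    q_min = torch.min(q1(S, a_new), q2(S, a_new))
    pi_loss = ((q_min - alpha * log_pi) ** 2).mean()
    pi_optim.zero_grad(); pi_loss.backward(); pi_optim.step()

    # (e) Soft update of both target critics
    for net, tgt in ((q1, q1_tgt), (q2, q2_tgt)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

    # Resample the layer noise so the next step explores with fresh xi
    for net in (q1, q2, q1_tgt, q2_tgt):
        for m in net.modules():
            if hasattr(m, "reset_noise"):
                m.reset_noise()
```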

    Table 1  Model parameters

    Communication link antenna gain ${G_{\rm{t}}}$: 5 dB
    Jamming link antenna gain ${G_{\rm{J}}}$: 8 dB
    Maximum jamming power of the jammer ${P_{{\rm{J}}\max } }$: 90 dBm
    Communication signal transmit power ${P_{\rm{t}}}$: 55 dBm
    Jamming signal propagation distance ${r_j}$: 300 km
    Center frequency $ f $: 225 MHz
    Symbol error rate threshold $ \kappa $: 0.05
    Mini-batch size B: 256
    Experience replay buffer capacity C: $2^{17}$
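Table 1's physical parameters feed the paper's jamming equation (illustrated in Figure 2), which is not reproduced in this excerpt. Purely as a generic illustration of how such dB/dBm quantities combine, a free-space link-budget helper is sketched below; the free-space path-loss model and the omission of the receive antenna gain are assumptions, and the authors' actual propagation model may differ.

```python
import math

def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss in dB, distance in km and frequency in MHz."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44

def received_jamming_power_dbm(p_j_dbm: float, g_j_db: float,
                               distance_km: float, freq_mhz: float) -> float:
    """Jamming power arriving at the receiver (receive antenna gain omitted)."""
    return p_j_dbm + g_j_db - fspl_db(distance_km, freq_mhz)

# Example with the Table 1 values: 90 dBm, 8 dB, 300 km, 225 MHz
print(received_jamming_power_dbm(90, 8, 300, 225))   # about -31 dBm
```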

    Table 2  FNNDRL algorithm parameters (columns: evaluation networks Noisy Q1/Q2 | target Noisy Q1/Q2 | Policy Network)

    Learning rate: 0.005 | 0.005 | 0.003
    Input layer: Linear, 30 (80) | Linear, 30 (80) | Linear, 15 (40)
    Hidden layer 1: Noisy Linear, 256, ReLU | Noisy Linear, 256, ReLU | Linear, 256, ReLU
    Hidden layer 2: Noisy Linear, 256, ReLU | Noisy Linear, 256, ReLU | Linear, 256, ReLU
    Output layer: Linear, 1 | Linear, 1 | 30 (80), Tanh
    Soft-update coefficient $\tau$: 0.01 (0.001) | 0.01 (0.001) | ——
    Dropout probability: p = 0.2 | p = 0.2 | p = 0.2
    Discount factor: 0.99 (Policy Network: ——)
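Using the NoisyLinear layer sketched earlier, a critic matching the Table 2 layout (input Linear layer, two Noisy Linear 256 + ReLU hidden layers, Linear output of 1, dropout p = 0.2) might be assembled as follows. The placement of dropout, the mapping of the parenthesized dimensions to the 5 vs 8 scenario, and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# NoisyLinear is the layer sketched after the abstract.

class NoisyCritic(nn.Module):
    """Evaluation (Q) network following the Table 2 layout.
    Defaults match the 3 vs 5 scenario (state 15, action 15 -> input 30);
    the 5 vs 8 scenario would presumably use 40/40 -> input 80."""
    def __init__(self, state_dim=15, action_dim=15, hidden=256, p_drop=0.2):
        super().__init__()
        self.inp = nn.Linear(state_dim + action_dim, hidden)   # input layer: Linear, 30 (80)
        self.h1 = NoisyLinear(hidden, hidden)                   # hidden layer 1: Noisy Linear, 256
        self.h2 = NoisyLinear(hidden, hidden)                   # hidden layer 2: Noisy Linear, 256
        self.out = nn.Linear(hidden, 1)                         # output layer: Linear, 1
        self.drop = nn.Dropout(p_drop)                          # dropout p = 0.2

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)
        x = torch.relu(self.inp(x))
        x = self.drop(torch.relu(self.h1(x)))
        x = self.drop(torch.relu(self.h2(x)))
        return self.out(x)

    def reset_noise(self) -> None:
        self.h1.reset_noise()
        self.h2.reset_noise()
```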

    Table 3  DDPG algorithm parameters (columns: Q network | target Q network | policy network | target policy network)

    Learning rate: 0.001 (Q network) | 0.001 (policy network)
    Input layer: Linear, 30 (80) | Linear, 30 (80) | Linear, 15 (40) | Linear, 15 (40)
    Hidden layer: Linear, 300 (256), ReLU | Linear, 300 (256), ReLU | Linear, 300 (256), ReLU | Linear, 300 (256), ReLU
    Output layer: Linear, 1 | Linear, 1 | 15 (40), Tanh | 15 (40), Tanh
    Soft-update coefficient $\tau$: 0.01 | 0.01 | 0.01 | 0.01
    Discount factor: 0.9

    Table 4  A2C algorithm parameters (columns: Critic network | Actor network)

    Learning rate: 0.001 | 0.0001
    Input layer: Linear, 15 (40) | Linear, 15 (40)
    Hidden layer: Linear, 300 (256), ReLU | Linear, 300 (256), ReLU
    Output layer: Linear, 1 | 15 (40), Tanh
    Discount factor: 0.99
Publication history
  • Received: 2022-01-13
  • Revised: 2022-07-12
  • Published online: 2022-07-15
  • Issue date: 2023-03-10
