Chinese Semantic Communication System Based on Word-level and Sentence-level Semantics
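Abstract (translated from Chinese): Semantic communication, a new communication paradigm, can improve the effectiveness and reliability of communication at the semantic level. However, most existing research on semantic communication systems is based on English corpora, and studies of systems for Chinese corpora are scarce. This paper therefore proposes a Chinese semantic communication system with a modular design that is compatible with existing digital communication technology. At the transmitter, a lexical (part-of-speech) coding method for Chinese text is proposed, which markedly improves the effectiveness of the system. At the receiver, a joint context decoding mechanism based on word-level and sentence-level semantics is proposed and combined with a candidate set mechanism and a recursive algorithm, further improving reliability. Simulation results show that word-level and sentence-level semantics markedly improve the effectiveness and reliability of the system, and that the proposed system performs well overall on both measures.

Abstract: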
Objective: To address the mismatch between limited communication resources and growing service demands, semantic communication, a novel paradigm, has been proposed and is expected to offer an effective solution. Unlike traditional approaches that focus on accurate symbol transmission, semantic communication operates at the semantic level, aiming to convey intended meaning by leveraging shared background knowledge at both the transmitter and receiver. Advances in semantic information theory provide a theoretical basis for this paradigm, while the development of artificial intelligence techniques for semantic extraction and understanding supports practical system implementation. Most existing semantic communication systems for textual data are based on English corpora; however, Chinese text differs markedly in word segmentation, lexical annotation, and syntactic structure. Systems tailored for Chinese corpora remain underexplored. Furthermore, current lexical code-based systems primarily focus on word-level semantics and fail to fully capture sentence-level semantics. This study addresses these limitations by mining and processing lexical and contextual semantics specific to Chinese text. A semantic communication system is proposed that uses Chinese corpora to learn and extract both word-level and sentence-level semantic associations. Lexical coding is performed at the transmitter, and joint context decoding is realized at the receiver, thereby improving the effectiveness and reliability of the communication process.

Methods: A Chinese semantic communication system is designed to capture both word-level and sentence-level semantics, leveraging the unique characteristics of Chinese text to enable efficient and reliable transmission of meaning. At the transmitter, a lexical coding method is proposed that encodes words based on their combined lexical semantic features. At the receiver, a two-stage decoding process is implemented.
First, the Continuous Bag-of-Words (CBOW) model is used to learn word-level semantics from shared knowledge, estimating the conditional probability of the next word based on preceding words. Second, the Bidirectional Encoder Representations from Transformers (BERT) model is applied to capture sentence-level semantics, using Chinese characters as the fundamental processing unit to compute the probability distribution of words at each position in the sentence. Upon receiving the bit sequence, Huffman decoding is performed with a candidate code list mechanism to generate a set of candidate words. A recursive memoization algorithm then selects the most probable words based on word-level semantics. Finally, sentence-level semantics are applied to correct potential errors in the sentence, producing the recovered text.

Results and Discussions: The proposed semantic communication system improves effectiveness by encoding combined phrases during lexical coding, thereby reducing the number of coding objects. Reliability is enhanced by leveraging contextual associations during feature learning and joint decoding. For effectiveness, the average code length of the Huffman coding dictionary is 10.61, while the lexical coding dictionary for four categories achieves an average of 8.98. This represents an 18.15% increase in average coding rate. Experiments conducted on 100 randomly selected texts across different corpus sizes yield consistent results (Table 3, Fig. 5), validating the effectiveness of lexical coding. For reliability, system performance is first evaluated under varying parameter settings. The optimal values for context window size, lexical category count, and Hamming distance threshold are identified (Figs. 6–10). Comparative analysis across different systems is then conducted.
Under an AWGN channel, the lexical+word-level+sentence-level semantic system achieves higher BLEU scores than the Huffman-only system when the Signal-to-Noise Ratio (SNR) is ≤6 dB, and matches the performance of DeepSC between –3 dB and 3 dB. At SNR ≥9 dB, its BLEU scores are slightly lower than those of the Huffman-only system but significantly higher than those of DeepSC. Across all SNR ranges, the lexical+word-level+sentence-level system outperforms the lexical+word-level system. The BLEU scores of the Huffman+word-level and Huffman+sentence-level systems are similar and consistently exceed those of the Huffman-only system. Similar trends are observed on Rayleigh and Rician fading channels and with METEOR scores (Figs. 11, 12). These results indicate that combining word-level and sentence-level semantics with a candidate set mechanism for joint context decoding substantially enhances transmission reliability at the receiver.

Conclusions: A Chinese semantic communication system based on word-level and sentence-level semantics is proposed. First, a lexical grouping and coding method based on LAC segmentation is developed by analyzing lexical features in Chinese text, which improves the effectiveness of the communication system. Second, the receiver models context co-occurrence probabilities by extracting word-level and sentence-level semantic features, enabling joint decoding through word selection and sentence-level error correction, thereby enhancing reliability. Simulation results show that the average code length of the Huffman coding dictionary is 10.61, while the lexical coding dictionary for four categories achieves an average of 8.98, resulting in an 18.15% increase in coding rate. On the AWGN channel, the proposed lexical+word-level+sentence-level system outperforms the Huffman-only system at low SNR and the DeepSC system at high SNR.
The Huffman+word-level and Huffman+sentence-level systems yield similar reliability scores, both consistently higher than the Huffman-only system. These findings confirm that incorporating both word-level and sentence-level semantics significantly enhances system reliability.
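The 18.15% figure quoted in the abstract can be checked with a short calculation. The sketch below assumes that coding rate is inversely proportional to average code length, so the gain is the baseline length divided by the new length, minus one. That definition is an assumption, not stated in the abstract, but it reproduces the reported value for the four-category dictionary. The function name `coding_rate_gain` is hypothetical.

```python
def coding_rate_gain(baseline_len: float, new_len: float) -> float:
    """Percentage gain in coding rate when the average code length
    drops from baseline_len to new_len (assumes rate ~ 1 / length)."""
    return (baseline_len - new_len) / new_len * 100.0


# Average code lengths quoted in the abstract.
huffman_len = 10.61   # Huffman coding dictionary
lexical_len = 8.98    # lexical coding dictionary, four categories

gain = coding_rate_gain(huffman_len, lexical_len)
print(f"{gain:.2f}%")  # prints 18.15%
```

Normalising by the baseline length instead would give about 15.4%, which does not match the abstract, so the new length appears to be the denominator used.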
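Table 1 Lexical (part-of-speech) categories

| Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
|---|---|---|---|---|---|---|---|
| n | common noun | f | locative noun | s | place noun | nw | work title |
| nz | other proper noun | v | common verb | vd | adverbial verb | vn | nominal verb |
| a | adjective | ad | adverbial adjective | an | nominal adjective | d | adverb |
| m | numeral | q | measure word | r | pronoun | p | preposition |
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| P | person name | L | place name | O | organization name | T | time |

Table 2 Examples of lexical coding results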
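| Word | Word group | Code | Word | Word group | Code |
|---|---|---|---|---|---|
| 这种 | (情况,来,最大,这种) | 00111110 | 部队 | (部队,作为,是否,二) | 00010101 |
| 图像 | (图像,指出,繁多,以至) | 100110000001 | 已 | (指标,生产,已,及) | 1001111 |
| 的 | (装备,是,主要,的) | 010 | 装备 | (装备,是,主要,的) | 010 |
| 数据处理 | (核弹头,数据处理,纯粹,近些) | 11001111100010 | 各类 | (军,联合,均,各类) | 110000011 |
| 格式 | (格式,寻求,极高,一重要) | 1010001101111 | 战术 | (战术,通信,迅速,而是) | 1111010010 |
| 与 | (中,为,新,与) | 110111 | 系统 | (系统,进行,重要,在) | 00101 |
| 美军 | (美军,出,准确,之一) | 100101000 | 相 | (规划,合作,相,第二) | 1001101000 |
| 地面 | (地面,输出,初始,其次) | 0011001000 | 兼容 | (文章,兼容,艰巨,第一支) | 1101010000000 |

Table 3 Effectiveness of Huffman coding versus lexical coding with different numbers of categories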
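| Coding method | Average code length (dictionary) | Coding rate gain, dictionary (%) | Average code length (100 texts) | Coding rate gain, 100 texts (%) |
|---|---|---|---|---|
| Huffman coding | 10.61 | / | 10.58 | / |
| Lexical coding (3 categories) | 9.15 | 16.02 | 9.17 | 16.01 |
| Lexical coding (4 categories) | 8.98 | 18.15 | 8.95 | 18.21 |
| Lexical coding (5 categories) | 8.78 | 20.83 | 8.79 | 20.95 |

Table 4 Example comparison of the decoded text of each communication system with the transmitted text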
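| Communication system | Received text (excerpt) | Total erroneous words |
|---|---|---|
| Segmented transmitted text | 这种图像的数据处理格式与美军地面部队已装备的各类战术系统相兼容 | / |
| | 因而可直接下行链接到美国陆军各级指挥部门 | |
| | 该雷达还可为地形系统与情报系统提供多光谱图像 | |
| Huffman system | 这种图像的数据处理格式与美军地面部队已装备的各类AMASK系统相兼容 | 179 |
| | 因而可出下行链接AMASK美国陆军各级指挥被 | |
| | 该雷达还可为地形系统与情报系统提供多光谱图像 | |
| Huffman + word-level semantics | 这种图像的数据处理格式与美军地面部队已装备的各类战术系统相兼容 | 74 |
| | 因而可出下行链接到美国陆军各级指挥被 | |
| | 该雷达还可为地形系统与情报系统提供多光谱图像 | |
| Huffman + sentence-level semantics | 这种图像的数据处理格式与美军地面部队已装备的各类战术系统相兼容 | 72 |
| | 因而可以用下行链接到美国陆军各级指挥部门 | |
| | 该雷达还可为地形系统与情报系统提供多光谱图像 | |
| DeepSC system | 这种图像的环节格式与美军地面部队已装备的各类战术系统相兼容 | 262 |
| | 因而可直接梳理监督管理到美国陆军各级指挥部门 | |
| | 该雷达还图形运动系统与情报系统提供多检索图像 | |
| Lexical + word-level semantics | 这种图像的数据处理格式与美军地面部队已装备的各类战术系统相兼容 | 181 |
| | 开始并直接下行链接到美国陆军各级指挥部门 | |
| | 该雷达还可为地形系统与情报系统时多光谱图像 | |
| Lexical + word-level + sentence-level semantics | 这种图像的数据处理格式与美军地面部队已装备的各类战术系统相兼容 | 114 |
| | 开始并直接下行链接到美国陆军各级指挥部门 | |
| | 该雷达还可为地形系统与情报系统提供多光谱图像 | |

English gloss of the transmitted text: "The data processing format of this imagery is compatible with the various tactical systems already fielded by US Army ground forces, so it can be downlinked directly to US Army command departments at all levels. The radar can also provide multispectral imagery for terrain systems and intelligence systems."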
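The difference between the Huffman row and the Huffman + word-level row in Table 4 (for example "AMASK" recovered as "战术") comes from choosing among candidate words by context. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a toy prefix-free codebook, a hand-written bigram table standing in for the CBOW-based word-level model, and a Hamming distance threshold of 1. All names (`CODEBOOK`, `BIGRAM`, `FLOOR`, `MAX_HAMMING`, `decode`) and all values are hypothetical.

```python
import math
from functools import lru_cache

# Toy prefix-free codebook (hypothetical; the real dictionary is far larger).
CODEBOOK = {"的": "0", "系统": "10", "战术": "110", "图像": "111"}

# Hypothetical bigram probabilities P(word | previous word); a stand-in for
# the word-level semantic model, which the paper learns with CBOW.
BIGRAM = {
    ("<s>", "战术"): 0.4, ("<s>", "图像"): 0.3,
    ("战术", "系统"): 0.8, ("图像", "系统"): 0.1,
    ("系统", "的"): 0.5,
}
FLOOR = 1e-3      # probability assigned to unseen word pairs (assumption)
MAX_HAMMING = 1   # candidate threshold in bit flips per codeword (assumption)


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def candidates(bits: str, pos: int):
    """Yield (word, next_pos) for codewords within MAX_HAMMING of bits[pos:]."""
    for word, code in CODEBOOK.items():
        chunk = bits[pos:pos + len(code)]
        if len(chunk) == len(code) and hamming(chunk, code) <= MAX_HAMMING:
            yield word, pos + len(code)


def decode(bits: str) -> tuple:
    """Return the word sequence with the highest bigram log-probability."""

    @lru_cache(maxsize=None)   # memoise on (position, previous word)
    def best(pos: int, prev: str):
        if pos == len(bits):
            return 0.0, ()
        top = (-math.inf, ())
        for word, nxt in candidates(bits, pos):
            tail_score, tail = best(nxt, word)
            score = math.log(BIGRAM.get((prev, word), FLOOR)) + tail_score
            if score > top[0]:
                top = (score, (word,) + tail)
        return top

    return best(0, "<s>")[1]


# "战术 系统" is sent as 110 + 10; the third bit is flipped in the channel.
print(decode("11110"))  # ('战术', '系统')
```

Plain Huffman decoding of `11110` would read `111` then `10` and output "图像 系统". Here the candidate set keeps "战术" alive because its codeword is one bit away, and the context score then prefers it. Memoising on the pair (position, previous word) means each subproblem is solved once, even though many segmentations of the bit string share suffixes.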