Paraphrase Based Data Augmentation For Chinese-English Medical Machine Translation

AN Bo, LONG Congjun

Citation: AN Bo, LONG Congjun. Paraphrase Based Data Augmentation For Chinese-English Medical Machine Translation[J]. Journal of Electronics & Information Technology, 2022, 44(1): 118-126. doi: 10.11999/JEIT210926

doi: 10.11999/JEIT210926
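Funds: The National Natural Science Foundation of China (62076233), The Major Innovation Project of Chinese Academy of Social Sciences (2020YZDZX01-2)
Details
    About the authors:

    AN Bo: born in 1986, male, associate research fellow and master's supervisor. His research interests include natural language processing and machine translation.

    LONG Congjun: born in 1978, male, associate research fellow and master's supervisor. His research interests include minority language processing and natural language processing.

    Corresponding author:

    AN Bo, anbo724@163.com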

  • 1) https://github.com/casnlu/Chinese-PPDB
  • 2) https://ai.baidu.com/broad/subordinate?dataset=paraphrasing
  • 3) https://ai.tencent.com/ailab/nlp/zh/embedding.html
  • 4) https://github.com/lancopku/pkuseg-python
  • 5) https://github.com/PaddlePaddle
  • 6) https://translate.google.cn/
  • 7) http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
  • CLC number: TN912.3; TP393

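  • Abstract: Medical machine translation is of great value for applications such as cross-border medical care and the translation of medical literature. Chinese-English neural machine translation has made considerable progress thanks to the strong modeling capacity of deep learning and to large-scale bilingual parallel data, and translation models are usually trained on large numbers of parallel sentence pairs. At present, however, Chinese-English translation data come mainly from domains such as news and policy, and medical-domain data are scarce, so Chinese-English machine translation performs poorly on medical text. To address the shortage of training data in this vertical domain, this paper proposes using paraphrase generation to augment Chinese-English medical machine translation data and thereby enlarge the training set. Experiments with several mainstream neural machine translation models show that augmenting the data through paraphrase generation effectively improves translation performance: on mainstream models such as RNNSearch and Transformer, BLEU improves by more than 6% (Table 3 reports the gains as percentages relative to the unaugmented baseline), which supports the effectiveness of paraphrase augmentation for domain-specific machine translation. In addition, paraphrase generation based on large-scale pre-trained language models such as MT5 further improves translation performance.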
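  • Figure 1  Schematic of machine translation data augmentation based on paraphrase generation

    Figure 2  Overall framework of paraphrase generation

    Figure 3  Method for extracting Chinese-English medical machine translation data from bilingual e-books

    Figure 4  Framework of the paraphrase-augmented machine translation method

    Figure 5  Effect of the number of paraphrases on machine translation performance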
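
    The augmentation idea named in the captions of Figures 1 and 4 can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the Chinese source side of each sentence pair is paraphrased and that every paraphrase is paired with the unchanged English target; this page does not spell out those details. The names `augment_parallel_corpus`, `toy_paraphraser`, and `num_paraphrases` are illustrative, and the toy paraphraser is only a stand-in for a trained paraphrase generation model.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (Chinese source, English target)
Paraphraser = Callable[[str, int], List[str]]


def augment_parallel_corpus(
    pairs: List[SentencePair],
    paraphrase: Paraphraser,
    num_paraphrases: int = 2,
) -> List[SentencePair]:
    """Return the original pairs plus paraphrase-augmented pairs.

    Assumption: only the source side is paraphrased, and each paraphrase
    inherits the original target sentence.
    """
    augmented: List[SentencePair] = []
    for source, target in pairs:
        augmented.append((source, target))  # always keep the original pair
        seen = {source}
        for candidate in paraphrase(source, num_paraphrases):
            # Skip empty outputs and paraphrases that repeat earlier ones.
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, target))
    return augmented


def toy_paraphraser(sentence: str, k: int) -> List[str]:
    """Hypothetical stand-in for a trained paraphrase model.

    Applies a tiny hand-written synonym table, one substitution per output.
    """
    synonyms = {"患者": "病人", "头痛": "头疼"}
    outputs = []
    for old, new in synonyms.items():
        if old in sentence:
            outputs.append(sentence.replace(old, new))
    return outputs[:k]


if __name__ == "__main__":
    corpus = [("患者因头痛入院", "The patient was admitted with a headache.")]
    for zh, en in augment_parallel_corpus(corpus, toy_paraphraser, 2):
        print(zh, "=>", en)
```

    Varying `num_paraphrases` in such a loop corresponds to the kind of comparison summarized in Figure 5, although the actual experimental procedure may differ.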

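    Table 1  Chinese-English medical machine translation dataset

    Training set   Validation set   Test set   Avg. Chinese length (characters)   Avg. English length (words)
    85000          7500             7500       14.3                               11.2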
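
    Table 1 measures Chinese length in characters and English length in words. The sketch below shows one way such averages could be computed. It assumes whitespace tokenization for English and counts every non-space character for Chinese, which may differ from the authors' preprocessing; `average_lengths` is an illustrative name.

```python
from typing import List, Tuple


def average_lengths(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (mean Chinese characters, mean English words) per sentence."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    # Assumption: whitespace is ignored when counting Chinese characters.
    zh_chars = sum(len("".join(zh.split())) for zh, _ in pairs)
    # Assumption: English words are separated by whitespace.
    en_words = sum(len(en.split()) for _, en in pairs)
    return zh_chars / len(pairs), en_words / len(pairs)


if __name__ == "__main__":
    sample = [("患者头痛", "The patient has a headache"), ("入院", "admitted")]
    print(average_lengths(sample))
```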

    Table 2  Model parameter settings

    Model         Parameter         Value
    Seq2Seq       Embedding size    300
                  Beam size         50
                  Batch size        64
                  Sentence length   256
                  Learning rate     0.01
                  Optimizer         Adam
                  RNN cell          LSTM
                  Dropout           0.2
    RNNSearch     Embedding size    300
                  Beam size         50
                  Batch size        64
                  Sentence length   256
                  Learning rate     0.01
                  Optimizer         Adam
                  RNN cell          LSTM
                  Dropout           0.2
    Transformer   Embedding size    300
                  Beam size         50
                  Batch size        64
                  Sentence length   256
                  Learning rate     0.1
                  Optimizer         Adam
                  RNN cell          LSTM
                  Dropout           0.2
                  Num head          8

    Table 3  Chinese-English medical machine translation results

    MT model      Augmentation model   BLEU    Improvement (%)
    Seq2Seq       -                    31.99   -
                  WordRep              32.12   0.41
                  BiLSTM-para          33.45   4.56
                  Transformer-para     35.23   10.13
                  Bert-para            35.28   10.28
                  MT5-para             35.74   11.72
    RNNSearch     -                    41.28   -
                  WordRep              40.98   -0.73
                  BiLSTM-para          43.25   4.77
                  Transformer-para     44.12   6.88
                  Bert-para            44.67   8.21
                  MT5-para             44.97   8.94
    Transformer   -                    48.21   -
                  WordRep              48.29   0.17
                  BiLSTM-para          49.86   3.42
                  Transformer-para     51.32   6.45
                  Bert-para            51.36   6.53
                  MT5-para             51.97   7.80
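
    The improvement column of Table 3 is consistent with the relative BLEU gain over the unaugmented baseline of the same translation model, that is, (augmented - baseline) / baseline x 100. The short check below recomputes a few rows under that reading; the function name `relative_gain` is illustrative, and the BLEU values are copied from Table 3.

```python
def relative_gain(baseline_bleu: float, augmented_bleu: float) -> float:
    """Relative BLEU improvement over the baseline, in percent."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (augmented_bleu - baseline_bleu) / baseline_bleu * 100.0


# (baseline BLEU, augmented BLEU, improvement reported in Table 3)
rows = {
    "Seq2Seq + MT5-para": (31.99, 35.74, 11.72),
    "RNNSearch + WordRep": (41.28, 40.98, -0.73),
    "Transformer + MT5-para": (48.21, 51.97, 7.80),
}

if __name__ == "__main__":
    for name, (base, aug, reported) in rows.items():
        computed = relative_gain(base, aug)
        print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

    Read this way, the absolute BLEU gains for the best augmentation model (MT5-para) are 3.75, 3.69, and 3.76 points for Seq2Seq, RNNSearch, and Transformer respectively.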

    Table 4  Example of Chinese-English medical machine translation

    Chinese sentence: 患者男,31岁,因中重度反复头痛18天入院,表现为枕部至双额部逐渐发作,呈搏动性,发作持续超过4h,并持续加重。
    Baidu: The patient, a 31 year old male, was hospitalized for 18 days due to moderate and severe recurrent headache. He showed a gradual attack from the occipital part to the double frontal part, which was pulsatile. The attack lasted for more than 4 hours and continued to worsen.
    Google: A 31-year-old male patient was admitted to the hospital for 18 days with moderate to severe recurrent headaches. The manifestations were pulsatile attacks from the occiput to the forehead. The attacks lasted more than 4 hours and gradually worsened.
    This paper: A 31-year-old man was admitted with an 18-day history of a moderate to severe recurrent headache, presenting gradual onset from occipital to bifrontal regions, pulsatile, in episodes lasting beyond four hours, and progressive worsening.
Publication history
  • Received: 2021-09-01
  • Revised: 2021-11-30
  • Accepted: 2021-12-24
  • Published online: 2021-12-29
  • Issue date: 2022-01-10
