Paraphrase-Based Data Augmentation for Chinese-English Medical Machine Translation
-
Abstract: Medical machine translation is of great value for applications such as cross-border healthcare and the translation of medical literature. Chinese-English neural machine translation has made remarkable progress, driven by the powerful modeling capability of deep learning and large-scale bilingual parallel data. Neural machine translation usually relies on large-scale parallel sentence pairs to train translation models. At present, Chinese-English translation data come mainly from domains such as news and policy; the lack of medical-domain data leads to poor Chinese-English translation quality in the medical domain. To address the shortage of training data for machine translation in the medical vertical domain, this paper proposes to augment Chinese-English medical machine translation data with paraphrase generation, thereby enlarging the training corpus. Experimental results with several mainstream neural machine translation models show that paraphrase-based data augmentation effectively improves translation performance, yielding BLEU gains of more than 6 points on mainstream models such as RNNSearch and Transformer and verifying the effectiveness of paraphrase augmentation for domain-specific machine translation. Moreover, large-scale pre-trained language models such as MT5 can further improve machine translation performance.
-
Keywords:
- Neural machine translation
- Chinese-English translation
- Paraphrase generation
- Data augmentation
- Large-scale pre-trained language models
Abstract: Medical machine translation is of great value for cross-border healthcare and the translation of medical literature. Chinese-English neural machine translation has made great progress, driven by the powerful modeling capability of deep learning and large-scale bilingual parallel data. Neural machine translation usually relies on large-scale parallel sentence pairs to train translation models. At present, Chinese-English translation data come mainly from fields such as news and policy. Due to the lack of parallel data in the medical field, the performance of Chinese-English machine translation in the medical field is unsatisfactory. To address the shortage of parallel data for training medical machine translation models, this paper proposes a paraphrase-based data augmentation mechanism. Experimental results on a variety of neural machine translation models show that paraphrase-based data augmentation can effectively improve the performance of medical machine translation, achieving consistent improvements on mainstream models such as RNNSearch and Transformer, which verifies the effectiveness of the paraphrase augmentation method for domain machine translation. Meanwhile, medical machine translation performance can be further improved with large-scale pre-trained language models such as MT5.
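As a rough, hypothetical sketch of the augmentation idea described in the abstract, the Python code below enlarges a Chinese-English parallel corpus by pairing paraphrases of the Chinese source side with the unchanged English reference. The paraphrase generator and similarity scorer are passed in as callables, since the paper's actual models (e.g., an MT5-based paraphraser) are not reproduced here; the function name, the similarity filter, and the threshold are illustrative assumptions rather than the authors' exact pipeline.

```python
from typing import Callable, Iterable, List, Tuple

def augment_with_paraphrases(
    pairs: Iterable[Tuple[str, str]],
    paraphrase_fn: Callable[[str], List[str]],
    similarity_fn: Callable[[str, str], float],
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    """Pair paraphrases of the Chinese source with the unchanged English
    reference to enlarge a (zh, en) parallel corpus."""
    augmented: List[Tuple[str, str]] = []
    for zh, en in pairs:
        augmented.append((zh, en))  # always keep the original pair
        for zh_para in paraphrase_fn(zh):
            # keep only paraphrases that differ but stay semantically close
            if zh_para != zh and similarity_fn(zh, zh_para) >= threshold:
                augmented.append((zh_para, en))
    return augmented

if __name__ == "__main__":
    corpus = [("患者出现持续性头痛。", "The patient presented with a persistent headache.")]
    # Toy stand-ins for a neural paraphraser and a sentence-similarity model.
    toy_paraphraser = lambda s: [s.replace("持续性", "反复")]
    toy_similarity = lambda a, b: 0.9
    print(augment_with_paraphrases(corpus, toy_paraphraser, toy_similarity))
```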
Table 1 Chinese-English medical machine translation dataset

| Training set | Validation set | Test set | Avg. Chinese characters per sentence | Avg. English words per sentence |
| --- | --- | --- | --- | --- |
| 85000 | 7500 | 7500 | 14.3 | 11.2 |
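For reference, the per-sentence length statistics reported in Table 1 can be computed with a simple pass over the corpus; the sketch below assumes character-level counting for Chinese and whitespace tokenisation for English, which may differ from the preprocessing the authors actually used.

```python
from typing import List, Tuple

def corpus_length_stats(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Average Chinese characters and English words per sentence pair."""
    zh_chars = sum(len(zh.replace(" ", "")) for zh, _ in pairs)
    en_words = sum(len(en.split()) for _, en in pairs)
    return zh_chars / len(pairs), en_words / len(pairs)

pairs = [("患者出现持续性头痛。", "The patient presented with a persistent headache.")]
print(corpus_length_stats(pairs))  # Table 1 reports 14.3 / 11.2 on the full dataset
```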
Table 2 Model parameter settings

| Machine translation model | Parameter | Value |
| --- | --- | --- |
| Seq2Seq | Embedding size | 300 |
| | Beam size | 50 |
| | Batch size | 64 |
| | Sentence length | 256 |
| | Learning rate | 0.01 |
| | Optimizer | Adam |
| | RNN cell | LSTM |
| | Dropout | 0.2 |
| RNNSearch | Embedding size | 300 |
| | Beam size | 50 |
| | Batch size | 64 |
| | Sentence length | 256 |
| | Learning rate | 0.01 |
| | Optimizer | Adam |
| | RNN cell | LSTM |
| | Dropout | 0.2 |
| Transformer | Embedding size | 300 |
| | Beam size | 50 |
| | Batch size | 64 |
| | Sentence length | 256 |
| | Learning rate | 0.1 |
| | Optimizer | Adam |
| | RNN cell | LSTM |
| | Dropout | 0.2 |
| | Num heads | 8 |
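The settings in Table 2 map directly onto a training configuration. The dictionaries below merely restate those values for convenience; the key names are illustrative rather than the authors' actual configuration format, and the RNN cell entry (listed for all models in the table) is attached only to the recurrent models here.

```python
# Common settings shared by all three models in Table 2.
COMMON = {
    "embedding_size": 300,
    "beam_size": 50,
    "batch_size": 64,
    "max_sentence_length": 256,
    "optimizer": "Adam",
    "dropout": 0.2,
}

# Model-specific overrides from Table 2.
CONFIGS = {
    "Seq2Seq":     {**COMMON, "learning_rate": 0.01, "rnn_cell": "LSTM"},
    "RNNSearch":   {**COMMON, "learning_rate": 0.01, "rnn_cell": "LSTM"},
    "Transformer": {**COMMON, "learning_rate": 0.1,  "num_heads": 8},
}

for name, cfg in CONFIGS.items():
    print(name, cfg)
```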
Table 3 Chinese-English medical machine translation results

| Machine translation model | Data augmentation model | BLEU | Improvement (%) |
| --- | --- | --- | --- |
| Seq2Seq | – | 31.99 | – |
| | WordRep | 32.12 | 0.41 |
| | BiLSTM-para | 33.45 | 4.56 |
| | Transformer-para | 35.23 | 10.13 |
| | Bert-para | 35.28 | 10.28 |
| | MT5-para | 35.74 | 11.72 |
| RNNSearch | – | 41.28 | – |
| | WordRep | 40.98 | -0.73 |
| | BiLSTM-para | 43.25 | 4.77 |
| | Transformer-para | 44.12 | 6.88 |
| | Bert-para | 44.67 | 8.21 |
| | MT5-para | 44.97 | 8.94 |
| Transformer | – | 48.21 | – |
| | WordRep | 48.29 | 0.17 |
| | BiLSTM-para | 49.86 | 3.42 |
| | Transformer-para | 51.32 | 6.45 |
| | Bert-para | 51.36 | 6.53 |
| | MT5-para | 51.97 | 7.80 |
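The improvement column in Table 3 is the relative BLEU gain over each model's un-augmented baseline; the snippet below reproduces a few of the reported percentages.

```python
def relative_gain(baseline_bleu: float, augmented_bleu: float) -> float:
    """Relative BLEU improvement over the un-augmented baseline, in percent."""
    return (augmented_bleu - baseline_bleu) / baseline_bleu * 100

print(round(relative_gain(48.21, 51.97), 2))  # 7.8   -> Transformer + MT5-para
print(round(relative_gain(41.28, 40.98), 2))  # -0.73 -> RNNSearch + WordRep
print(round(relative_gain(31.99, 35.74), 2))  # 11.72 -> Seq2Seq + MT5-para
```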
Table 4 Chinese-English medical machine translation examples

| System | Translation |
| --- | --- |
| Source sentence (Chinese) | 患者男,31岁,因中重度反复头痛18天入院,表现为枕部至双额部逐渐发作,呈搏动性,发作持续超过4h,并持续加重。 |
| Baidu | The patient, a 31 year old male, was hospitalized for 18 days due to moderate and severe recurrent headache. He showed a gradual attack from the occipital part to the double frontal part, which was pulsatile. The attack lasted for more than 4 hours and continued to worsen. |
| Google | A 31-year-old male patient was admitted to the hospital for 18 days with moderate to severe recurrent headaches. The manifestations were pulsatile attacks from the occiput to the forehead. The attacks lasted more than 4 hours and gradually worsened. |
| Ours | A 31-year-old man was admitted with an 18-day history of a moderate to severe recurrent headache, presenting gradual onset from occipital to bifrontal regions, pulsatile, in episodes lasting beyond four hours, and progressive worsening. |