A Coarse-to-Fine Method for Accelerating Inference in Neural Machine Translation Systems

Abstract

In recent years, the multi-layer attention networks of the Transformer model have substantially improved translation quality, but the large number of attention operations also makes the model's overall inference relatively slow. To address this, we propose a coarse-to-fine (CTF) method that compresses information representations at fine granularity according to differences in the amount of information carried by the attention weights, thereby accelerating inference. Experiments on the NIST Chinese-English and WMT English-German translation tasks show that the method improves inference speed by 13.9% and 12.8%, respectively, while preserving model performance. We further analyze the information differences of attention operations under different representation granularities, which supports the rationality of the coarse-to-fine method.
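The abstract describes compressing representations according to how much information the attention weights assign to each position. As an illustration only (not the paper's actual algorithm), the sketch below shows one plausible reading of such a scheme: positions that receive high attention mass are kept at fine granularity, while the remaining low-information positions are pooled into a single coarse vector, shrinking the sequence that later attention operations must cover. The function name, the top-k selection rule, and the mean-pooling step are all assumptions made for this sketch.

```python
import numpy as np

def coarse_to_fine_compress(hidden, attn_mass, keep_ratio=0.5):
    """Illustrative sketch (not the paper's method): keep the positions
    receiving the most attention mass at fine granularity, and mean-pool
    the remaining low-information positions into one coarse vector."""
    seq_len, _ = hidden.shape
    k = max(1, int(seq_len * keep_ratio))
    # Rank positions by total attention mass received.
    order = np.argsort(-attn_mass)
    fine_idx = np.sort(order[:k])      # high-information positions, original order
    coarse_idx = np.sort(order[k:])    # low-information positions
    fine = hidden[fine_idx]
    if len(coarse_idx) > 0:
        coarse = hidden[coarse_idx].mean(axis=0, keepdims=True)
        return np.concatenate([fine, coarse], axis=0)
    return fine

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))            # 8 token representations of width 4
w = rng.random(8)                      # attention mass per position
out = coarse_to_fine_compress(h, w, keep_ratio=0.5)
print(out.shape)  # (5, 4): 4 fine positions + 1 pooled coarse vector
```

Because subsequent attention layers scale with sequence length, reducing 8 positions to 5 in this toy example is the kind of shrinkage that would translate into faster decoding.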

Citation

厦门大学学报 (Journal of Xiamen University, Natural Science Edition), 2020, 59(02): 175-184.
