Neural Machine Translation Speed

Machine translation can be computationally expensive, leading to the term TPU core century. To encourage computational efficiency, the Conference on Machine Translation has an efficiency shared task. Since 2020, I organize the task. Somewhat corruptly, my group also participates.

These are preliminary results using automatic metrics. Microsoft has agreed to conduct a human evaluation. Participants built machine translation systems from English to German using the WMT 2021 news constrained data condition. Then we measured their performance translating 1 million sentences.

The task focuses on the quality and cost of deploying translation systems:

How good are the translations?
We will be conducting a human evaluation. For now, quality is approximated by automatic metrics. Choose COMET 1.0.0rc2, chrF, or sacrebleu. All computed on references A, C, and D of WMT 2021's test set. (Reference B was excluded because the agency tried to pass off post-edited DeepL as original translations.) COMET and chrF do not support multiple references, so the scores are averaged over the references.
How fast?
Speed on a NVIDIA A100 GPU or dual socket Intel Xeon Gold 6354 CPU. More specifically, we used Oracle Cloud BM.GPU4.8 for GPU (but limited participants to one GPU) or BM.Optimized3.36 for CPU.
How big?
The size of the model on disk and how much RAM it consumes while running. There's also Docker image size, but this mostly reflects how much of Ubuntu teams threw into their Docker image.

Results

There is no single "best" system but rather a range of trade-offs between quality and efficiency. Hence we highlight the submissions that have the best quality for a given cost (or equivalently the best cost for a given quality). These are the systems that appear on the Pareto frontier: the black staircase shown on the plots. Anything below the Pareto frontier is worse than another submission according to the metrics on the plot (but may have optimized for something else). The happy face 😊 shows the direction an ideal system should go.

This year, we offered throughput and latency options. In throughput, participants get the entire file upfront so they can do batching. In latency, a script spoon feeds one source sentence, waits for the translation, provides next source sentence, and so on. The latency option was intended to recruit non-autoregressive machine translation participants who currently use weak baselines to claim speedups. All of the latency submissions were autoregressive.

These graphs are based on statistics collected; contact me for raw output and logs. They are generated with Gnuplot 5.2 so you can hover to show docker image names, left click to toggle cursor coordinates, drag, zoom with the wheel, click on the key to erase a participant, and right click to reset.

Throughput Speed

Speed is measured in terms of the words per second translating 1 million sentences from English to German.

We ran the evaluation on Oracle Cloud, namely the BM.GPU4.8 for GPU that costs $3.05/hr/GPU or the BM.Optimized3.36 for CPU that costs $2.7/hr with all 36 cores. This allows us to compare the cost of using a GPU versus all CPU cores in one graph:

For throughput, translating on GPUs is more cost-effective with these instances.

Latency

Huawei and Edinburgh submitted to latency. All submissions on the Pareto frontier have latencies of 16.85 ms or less.

Size

We also looked at how large the models are, encouraging participants to make small models. This is size at rest on disk; participants were permitted to decompress before running, including the use of normal compression tools. Model size includes parameters, word segmentation models, and anything data-dependent. This is all submissions regardless of hardware platform. The x axis is log scale.

RAM

RAM consumption is mostly driven by the batch size, which is typically larger to optimize for speed. So systems that optimized for speed may not have optimized for RAM. On the CPU, some participants ran multiple processes with memory pinning, which increases speed but also increases memory usage. This shows all tasks; single-CPU throughput and GPU latency are boring with only one participant. The x axis is log scale.

Docker size

We also measured Docker image size, though participants were encouraged to prioritize packaging multiple submissions over size. Dockers were saved as an image or tar file then compressed with xz before measuring. Huawei is the clear winner.

Acknowledgments

Qianqian Zhu collaborated on measuring systems. The University of Edinburgh's submissions were made by Maximiliana Behnke, Nikolay Bogoychev, Alham Fikri Aji, Kenneth Heafield, Graeme Nail, Qianqian Zhu, Svetlana Tchistiakova, Jelmer van der Linde, Pinzhen Chen, Sidharth Kashyap and Roman Grundkiewicz. Thanks to Rawn Henry for optimizing on GPUs.

Thanks to the other participants! We know it's a lot of work. Graham Neubig and Yusuke Oda organized past evaluations, providing code and commentary.

European Union
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303. This project was funded by the Connecting Europe Facility under grant agreement No INEA/CEF/ICT/A2019/1927024 (User-Focused Marian).
Intel
Intel Corporation has supported organization.
Oracle for Research
Oracle has contributed cloud credits under the Oracle for Research program.
Microsoft
Microsoft is supporting human evaluation.