Neural Machine Translation Speed

Machine translation can be computationally expensive, so much so that the term "TPU core century" exists. To encourage computational efficiency, the Workshop on Neural Generation and Translation runs a recurring efficiency shared task. I took over the 2020 task, and it is now open for rolling submissions.

Participants built machine translation systems from English to German using the WMT 2019 news data condition. Then I measured their performance translating 1 million sentences.

The original evaluation had three participants: OpenNMT, NiuTrans, and the University of Edinburgh (also the organizer). The graphs on this page include the original submissions and those made since the evaluation.

The task focuses on the quality and cost of deploying translation systems:

How good are the translations?
Approximated by sacrebleu. Specifically, the average sacrebleu score on WMT11 and WMT13-WMT19, which I call WMT1*. Averaging over multiple test sets created a bit of surprise for participants and discouraged overfitting to a single test set. BLEU is not as good as human evaluation, so we submitted two fast Czech systems for human evaluation in WMT20.
How fast?
Speed on an Intel Xeon Platinum 8270 CPU and NVIDIA T4 GPU.
How big?
The size of the model on disk and how much RAM it consumes while running. There is also Docker image size, but this mostly reflects how much of Ubuntu the teams threw into their Docker images.
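The WMT1* quality metric above is just an average over eight test sets. A minimal sketch, assuming the per-set BLEU scores have already been computed with sacrebleu (the `scores_by_set` dictionary and its keys are hypothetical names for illustration):

```python
def wmt1star(scores_by_set):
    """Average BLEU over WMT11 and WMT13-WMT19, the "WMT1*" metric.

    scores_by_set maps a test set name (e.g. "wmt11") to its sacrebleu
    BLEU score. Raises if any of the eight sets is missing.
    """
    sets = ["wmt11"] + [f"wmt{y}" for y in range(13, 20)]
    missing = [s for s in sets if s not in scores_by_set]
    if missing:
        raise ValueError(f"missing test sets: {missing}")
    return sum(scores_by_set[s] for s in sets) / len(sets)
```

The unweighted mean treats each test set equally regardless of its size, which matches the description of the metric as a plain average.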

Results

There is no single "best" system but rather a range of trade-offs between quality and efficiency. Hence we highlight the submissions that have the best quality for a given cost (or equivalently the best cost for a given quality). These are the systems that appear on the Pareto frontier: the black staircase shown on the plots. Anything below the Pareto frontier is worse than another submission according to the metrics on the plot (but may have optimized for something else). The happy face 😊 shows where an ideal system would be.
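The Pareto frontier shown on the plots can be computed with a simple dominance check. A sketch, assuming each submission is reduced to a (cost, quality) pair where lower cost and higher quality are better:

```python
def pareto_frontier(points):
    """Return the points not dominated on (cost, quality).

    points: list of (cost, quality) pairs; lower cost and higher quality
    are better. A point is dominated if some other point is at least as
    good on both axes and strictly better on at least one.
    """
    frontier = []
    for i, (c, q) in enumerate(points):
        dominated = any(
            (c2 <= c and q2 >= q) and (c2 < c or q2 > q)
            for j, (c2, q2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((c, q))
    # Sorting by cost yields the staircase drawn on the plots.
    return sorted(frontier)
```

Anything returned by this function lies on the black staircase; everything else is worse than some submission on both plotted metrics.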

Speed

Speed is measured in words per second while translating 1 million sentences from English to German.

Some of the slower Edinburgh submissions ran a buggy version with a memory leak; the fixed versions also appear.

[Plots: single CPU speed, all cores speed, GPU speed]

We ran the evaluation on Amazon Web Services machines, namely the c5.metal for CPU that costs $4.08/hr and the g4dn.xlarge for one GPU that costs $0.526/hr. This allows us to compare the cost of using all CPU cores with a GPU in one graph:
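Converting throughput and hourly instance price into a common dollar cost is simple arithmetic. A sketch using the AWS prices above; the throughput numbers here are made up for illustration, not actual submission results:

```python
def dollars_per_million_words(words_per_second, price_per_hour):
    """Cost of translating one million words at a steady throughput."""
    seconds = 1_000_000 / words_per_second
    return price_per_hour * seconds / 3600.0

# Illustrative throughputs only, not measured results.
cpu = dollars_per_million_words(10_000, 4.08)   # c5.metal, all CPU cores
gpu = dollars_per_million_words(20_000, 0.526)  # g4dn.xlarge, one T4 GPU
```

With these made-up throughputs the GPU comes out far cheaper; in the actual evaluation the gap was narrower for some submissions, as noted below the plot.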

Cost plot

Currently translating on GPUs is more cost-effective, though the CPU is close in some cases.

Size

We also looked at how large the models are, encouraging participants to make small models. Here, it's better to be near the top left of the graph. Edinburgh's systems dominate the entire Pareto frontier, partly due to 4-bit log compression. This is the size at rest on disk; participants were permitted to decompress before running, including with normal compression tools. Model size includes parameters, word segmentation models, and anything data-dependent. The plot shows all submissions regardless of hardware platform.
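To give a flavor of log-scale compression, here is a toy sketch in which each weight is stored as a sign plus a small exponent code, so it decodes to sign * scale * 2^(-k). This is an illustration of the general idea only, not Edinburgh's actual 4-bit scheme:

```python
import math

def log_quantize(weights, bits=4):
    """Toy log-scale quantization: encode each weight as (sign, k) so it
    decodes to sign * scale * 2**(-k), with k in (bits - 1) bits.
    Zeros are mapped to the smallest representable magnitude."""
    levels = 2 ** (bits - 1)                 # exponent codes per sign
    scale = max(abs(w) for w in weights)
    codes = []
    for w in weights:
        if w == 0.0:
            codes.append((1, levels - 1))    # smallest magnitude
            continue
        k = round(-math.log2(abs(w) / scale))
        k = min(max(k, 0), levels - 1)       # clamp to the code range
        codes.append((1 if w >= 0 else -1, k))
    return scale, codes

def log_dequantize(scale, codes):
    """Decode (sign, k) pairs back to approximate weights."""
    return [sign * scale * 2.0 ** (-k) for sign, k in codes]
```

Because magnitudes are snapped to powers of two of a shared scale, each weight needs only a sign bit and a few exponent bits, which is why such schemes shrink models so aggressively.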

Model size

RAM

RAM consumption is mostly driven by batch size, which participants typically made large to optimize for speed. So systems that optimized for speed did not necessarily optimize for RAM.

The worst-performing Edinburgh systems had a memory leak, which has since been fixed. In the 48-core CPU setting, OpenNMT shared memory across processes while Edinburgh ran separate processes with core pinning; the separate processes did not share memory. Threads would be better for RAM consumption, but Edinburgh optimized for speed.

[Plots: single core RAM usage, all core RAM usage, GPU RAM usage]

Commentary

While the NVIDIA T4 GPU supports 8-bit integer Tensor Core operations, none of the participants used them because their code was not ready; instead, they used 16-bit floating-point Tensor Cores. Two participants did use 8-bit integers on the CPU. We are working on 8-bit integer support on GPUs in Marian.

Latency

The original evaluation had batching enabled for speed, but sometimes the goal is to translate one sentence at a time, so I measured latency. This wasn't an official task; we just ran our fork of Marian with batch size 1. If there is enough interest in making this a task, we will create a test harness that feeds one sentence at a time. Where 4 CPU cores are used, that refers to OMP parallelization within a sentence; the system still translated one sentence at a time.
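A latency harness of this kind amounts to timing one translation call per sentence and summarizing the distribution. A minimal sketch, where `translate` is a hypothetical stand-in for any single-sentence translation function:

```python
import statistics
import time

def measure_latency(translate, sentences):
    """Feed sentences one at a time (batch size 1) and time each call.

    `translate` is any callable taking a single sentence; timings use a
    monotonic high-resolution clock.
    """
    times = []
    for sentence in sentences:
        start = time.perf_counter()
        translate(sentence)
        times.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(times),
        "p50": statistics.median(times),
        "max": max(times),
    }
```

Reporting percentiles as well as the mean matters for latency, since batching-oriented systems often show a long tail on their first or longest sentences.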

Latency plot

Acknowledgments

The University of Edinburgh's submissions were made by Nikolay Bogoychev, Roman Grundkiewicz, Alham Fikri Aji, Maximiliana Behnke, Kenneth Heafield, Sidharth Kashyap, Emmanouil-Ioannis Farsarakis, and Mateusz Chudyk. Sidharth Kashyap and Emmanouil-Ioannis Farsarakis are affiliated with Intel Corporation. Mateusz Chudyk is affiliated with Samsung R&D Institute Poland. Intel Corporation has provided funding and hardware.

Thanks to the other participants! We know it's a lot of work. Graham Neubig and Yusuke Oda organized past evaluations, providing code and commentary.

EU flag

In the Bergamot project, we're adding client-side machine translation to desktops, so it needs to be efficient. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303 (Bergamot).

In User-Focused Marian, we're making Marian easier to use and more efficient on GPUs. This project was funded by the Connecting Europe Facility under grant agreement No INEA/CEF/ICT/A2019/1927024 (User-Focused Marian).

Amazon provided $2000 in AWS credits for evaluation.