ONNX Runtime is an open source project that is designed to accelerate machine learning across a wide range of frameworks, operating systems, and hardware platforms. It is used extensively in Microsoft products, like Office 365 and Bing, delivering over 20 billion inferences every day and up to 17 times faster inferencing.
Today we are introducing significant updates to ONNX Runtime. In addition to improvements for model inferencing, we’re announcing the preview of training acceleration.
ONNX Runtime for training
ONNX Runtime now supports accelerated training of transformer models. Transformer models have become the building blocks for advanced language processing and generation. These models contain hundreds of millions of parameters and training them can consume many clusters of GPUs over days. Reducing the total training time can help enable rapid improvements in, and thus faster deployment of, these models.
- Office 365 uses ONNX Runtime to accelerate pre-training of the Turing Natural Language Representation (T-NLR) model, a transformer model with more than 400 million parameters, powering rich end-user features like Suggested Replies, Smart Find, and Inside Look. Using ONNX Runtime has reduced training time by 45% on a cluster of 64 NVIDIA V100 Tensor Core GPUs in Azure Machine Learning.
- Bing uses large transformer models with more than 500 million parameters to train and service task-specific models. These models use ONNX Runtime to accelerate pre-training and fine-tuning throughput, cutting training time by 44%.
- Visual Studio uses ONNX Runtime to accelerate pre-training a model, similar to GPT-2 Medium, with more than 300 million parameters to power code autocompletion in the IntelliCode feature.
To further accelerate training, we built custom kernels and graph optimizations to eliminate redundant operations. Additionally, ONNX Runtime enables larger batch sizes on the same 32GB memory of NVIDIA V100 Tensor Core GPUs. We tested ONNX Runtime by pretraining BERT-Large, reusing the training scripts and datasets from benchmarking tests by NVIDIA.
In the table below, you’ll see the relative training time improvements for pre-training the BERT-Large model on a 4 node NVIDIA DGX-2 cluster. The batch sizes reflect the Phase-1 and Phase-2 stages for the training experiment, using the datasets as detailed in NVIDIA repo. The detailed test report is here.
(64x V100 32GB)
with NGC 20.03-py3
with ONNX Runtime
with ONNX Runtime
|Phase 1 time (hours)||11.12||9.99||10.16%|
|Phase 2 time (hours)||6.62||5.77||12.84%|
|Total time (hours)||17.74||15.76||11.16%|
Developers can use the sample for pretraining BERT-Large with ONNX Runtime and fine-tune to their datasets as needed. We have also published a ready-to-use sample to start experiments in Azure Machine Learning. To use in custom environments, developers can build from the source code using the instructions published here.
ONNX Runtime for inferencing
We continue to improve inference acceleration with ONNX Runtime and are now partnering with Hugging Face to make it easy to accelerate popular transformer models.
We have seen gains from using ONNX Runtime with transformer models and are excited to release functionality that makes it easy to inference Hugging Face Transformer models with ONNX Runtime.
Clément Delangue, CEO of Hugging Face.
Today, we are also releasing multiple updates to ONNX Runtime for inferencing. The new ONNX Runtime inference version 1.3 includes:
- Compatibility with the new ONNX v1.7 spec
- DirectML execution provider on Windows 10 platform generally available (GA)
- Python package for ARM64 CPU for Ubuntu, CentOS, and variants
- Preview release for Execution Provider 2.0 with support for the latest Intel® Distribution of OpenVINO™ tool kit
- Preview release for integration with the VitisAI for acceleration on the Xilinx U250 FPGA platform
- Preview release for integration with Rockchip RK1808 AIoT chipset
Questions or feedback? Please let us know in the comments below.