This blog is co-authored by Fuheng Wu, Principal Machine Learning Tech Lead, Oracle Cloud AI Services, Oracle Inc.

Logos for Oracle and ONNX Runtime.

Enabling scenarios through the use of Deep Neural Network (DNN) models is critical to our AI strategy at Oracle, and our Cloud AI Services team has built a solution to serve DNN models for customers in the healthcare sector. In this blog post, we’ll share the challenges our team faced and how ONNX Runtime solves them as the backbone of our high-performance inferencing solution.

Challenge 1: Models from different training frameworks

To provide the best solutions for specific AI tasks, Oracle Cloud AI supports a variety of machine learning models trained from different frameworks, including PyTorch, TensorFlow, PaddlePaddle, and Scikit-learn. While each of these frameworks has its own built-in serving solutions, maintaining so many different serving frameworks would be a nightmare in practice. Therefore, one of our biggest priorities was to find a versatile unified serving solution to streamline maintenance.

Challenge 2: High performance across diverse hardware ecosystem

For Oracle Cloud AI services, low latency and high accuracy are crucial for meeting customers’ requirements. The DNN model servers are hosted in Oracle Cloud Compute clusters that span different CPUs (Intel, AMD, and ARM) and operating systems. We needed a solution that would run well on all the different Oracle compute shapes while remaining easy to maintain.

Solution: ONNX Runtime

In our search for the best DNN inference engine to support our diverse models and perform well across our hardware portfolio, ONNX Runtime caught our eye and stood out from alternatives.

ONNX Runtime is a high-performance, cross-platform accelerator for machine learning models. Because ONNX Runtime supports the Open Neural Network Exchange (ONNX), models trained from different frameworks can be converted to the ONNX format and run on all platforms supported by ONNX Runtime. This makes it easy to deploy machine learning models across different environments, including cloud, edge, and mobile devices. ONNX Runtime supports all the Oracle Cloud compute shapes including VM.Standard.A1.Flex (ARM CPU), VM.Standard.3/E3/4.Flex (AMD and Intel CPU), and VM.Optimized3.Flex (Intel CPU). Not only does ONNX Runtime run on a variety of hardware, but its execution provider interface also allows it to efficiently utilize accelerators specific to each hardware.
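
As a concrete illustration, here is a minimal sketch (not Oracle's production code) of this workflow: a PyTorch model is exported to the ONNX format and then served with ONNX Runtime on CPU. The model, the file name model.onnx, and the input name "input" are placeholders chosen for this example.

```python
# Minimal sketch: export a PyTorch model to ONNX, then run it with ONNX Runtime.
import numpy as np
import torch
import torchvision
import onnxruntime as ort

model = torchvision.models.resnet18(weights=None).eval()  # any nn.Module works
dummy = torch.randn(1, 3, 224, 224)

# Export to the ONNX format; dynamic_axes keeps the batch dimension flexible.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# The same ONNX file can then run on any platform ONNX Runtime supports.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
result = session.run(None, {"input": dummy.numpy()})
print(result[0].shape)  # (1, 1000) for this model
```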

Validating ONNX Runtime

Based on our evaluation, we were optimistic about using ONNX Runtime as our model inferencing solution, and the next step was to verify its compatibility and performance to ensure it could meet our targets.

It was relatively easy to verify hardware, operating system, and model compatibility by just launching the model servers with ONNX Runtime in the cloud. To systematically measure and compare ONNX Runtime’s performance and accuracy to alternative solutions, we developed a pipeline system. ONNX Runtime’s extensibility simplified the benchmarking process, as it allowed us to seamlessly integrate other inference engines by compiling them as different execution providers (EP) for ONNX Runtime. Thus, ONNX Runtime served not only as a runtime engine but as a platform where we could support many inference engines and choose the best one to suit our needs at runtime.
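
To give a flavor of what such a pipeline looks like, here is a hedged sketch (our actual pipeline is more involved) that times the same ONNX model against whichever execution providers are available in the installed ONNX Runtime build. The provider names, model path, and input name are illustrative and depend on how ONNX Runtime was compiled.

```python
# Sketch of a benchmarking loop over the execution providers available in this build.
import time
import numpy as np
import onnxruntime as ort

candidates = ["OpenVINOExecutionProvider", "DnnlExecutionProvider",
              "TvmExecutionProvider", "CPUExecutionProvider"]
available = [p for p in candidates if p in ort.get_available_providers()]

feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
for provider in available:
    session = ort.InferenceSession("model.onnx", providers=[provider])
    session.run(None, feed)                     # warm-up
    start = time.perf_counter()
    for _ in range(100):
        session.run(None, feed)
    elapsed = (time.perf_counter() - start) / 100
    print(f"{provider}: {elapsed * 1000:.2f} ms per inference")
```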

We compiled TVM, OneDNN, and OpenVINO into ONNX Runtime, and it was very convenient to switch between these different inference engines through a unified programming interface. For example, on Oracle’s VM.Optimized3.Flex and BM.Optimized3.36 compute instances, which use the Intel(R) Xeon(R) Gold 6354 CPU, OpenVINO ran faster than the other inference engines by a large margin thanks to its AVX VNNI instruction set support. We didn’t want to change our model serving code to fit different serving engines, and ONNX Runtime’s EP feature conveniently allowed us to write the code once and run it with different inference engines.
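
The snippet below sketches this "write once, switch engines at runtime" idea under the same assumptions as above: the serving code stays identical, only the preferred provider list changes, and ONNX Runtime falls back to the default CPU provider when the requested engine is unavailable.

```python
# The serving code is unchanged; only the provider preference differs per deployment.
import onnxruntime as ort

preferred = ["OpenVINOExecutionProvider", "CPUExecutionProvider"]  # CPU as fallback
session = ort.InferenceSession("model.onnx", providers=preferred)
print(session.get_providers())  # providers actually selected for this session
```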

Benchmarking ONNX Runtime with alternative inference engines

With our pipeline configured to test all relevant inference engines, we began the benchmarking process for different models and environments. In our tests, ONNX Runtime was the clear winner, outperforming the alternatives by a wide margin: it measured 30 to 300 percent faster than the original PyTorch inference engine, regardless of whether just-in-time (JIT) compilation was enabled.
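
For reference, here is a rough sketch of how such a comparison can be set up, assuming the same exported ResNet-18 model used in the earlier example; absolute numbers will of course vary with hardware and model.

```python
# Sketch: compare PyTorch eager, PyTorch JIT (traced), and ONNX Runtime latency.
import time
import numpy as np
import torch
import torchvision
import onnxruntime as ort

def time_it(fn, runs=100):
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000  # average ms per run

model = torchvision.models.resnet18(weights=None).eval()
x = torch.randn(1, 3, 224, 224)
jit_model = torch.jit.trace(model, x)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
feed = {"input": x.numpy()}

with torch.no_grad():
    print("PyTorch eager:", time_it(lambda: model(x)), "ms")
    print("PyTorch JIT:  ", time_it(lambda: jit_model(x)), "ms")
print("ONNX Runtime: ", time_it(lambda: session.run(None, feed)), "ms")
```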

ONNX Runtime on CPU was also the best solution compared to DNN compilers such as TVM, OneDNN (formerly known as Intel MKL-DNN), and MLIR. OneDNN came closest to ONNX Runtime but was still 20 to 80 percent slower in most cases. MLIR was not as mature as ONNX Runtime two years ago, and that conclusion still holds at the time of this writing: it does not support models with dynamic input shapes and covers only a limited set of ONNX operators. TVM also performed well for static-shape model inference, but for accuracy reasons most of our models use dynamic input shapes, and TVM raised exceptions for those models. Even with static-shape models, we found TVM to be slower than ONNX Runtime.

We investigated the reason for ONNX Runtime’s strong performance and found it to be extremely well optimized for CPU servers. All the core algorithms, such as the crucial 2D convolution, transposed convolution, and pooling kernels, are carefully hand-written in assembly and statically compiled into the binary. It even outperformed TVM’s auto-tuning without any extra preprocessing or tuning. OneDNN’s JIT is designed to be flexible and extensible and can dynamically generate machine code for DNN primitives on the fly. However, it still lost to ONNX Runtime in our benchmark tests because ONNX Runtime statically compiles the primitives ahead of time. Theoretically, the DNN primitives have several tunable parameters, so in some cases, such as edge devices with different register files and CPU cache sizes, there might be better algorithms or implementations with different parameter choices. However, for the DNN models in Oracle Cloud Compute CPU clusters, ONNX Runtime is a match made in heaven and is the fastest inference engine we have ever used.

Conclusion

We really appreciate the ONNX Runtime team for open-sourcing this amazing software and continuously improving it. This enables Oracle Cloud AI Services to provide a performant DNN model serving solution to our customers, and we hope that others will also find our experience helpful.

Learn more about ONNX Runtime