With the growing trend towards deep learning techniques in AI, there has been significant investment in accelerating neural network models using GPUs and other specialized hardware. However, many models used in production are still based on traditional machine learning libraries, or sometimes a combination of traditional machine learning (ML) and DNNs. We’ve previously shared the performance gains that ONNX Runtime provides for popular DNN models such as BERT, quantized GPT-2, and other Hugging Face Transformers models. Now, by utilizing Hummingbird with ONNX Runtime, you can also capture the benefits of GPU acceleration for traditional ML models.

This capability is enabled through the recently added integration of Hummingbird with the LightGBM converter in ONNXMLTools, an open-source library that converts models to the interoperable ONNX format. LightGBM is a gradient boosting framework that uses tree-based learning algorithms, designed for fast training speed and low memory usage. By simply setting a flag, you can feed a LightGBM model to the converter to produce an ONNX model that uses neural network operators rather than traditional ML operators. This Hummingbird integration allows LightGBM users to take advantage of GPU acceleration typically available only to neural networks.

What is Hummingbird?

Hummingbird is a library for compiling trained traditional ML models into tensor computations, with the goal of accelerating inference (scoring/prediction) for traditional machine learning models. You can learn more about Hummingbird in our introductory blog post, but we’ll present a short summary here.

  • Traditional ML libraries and toolkits are usually developed to run in CPU environments. For example, LightGBM does not support using GPUs for inference, only for training. Traditional ML models (such as DecisionTrees and LinearRegressors) also do not support hardware acceleration.
  • Hummingbird addresses this gap, allowing users to seamlessly leverage hardware acceleration without having to re-engineer their models. It does this by reconfiguring the algorithmic operators in traditional ML pipelines into tensor computations that are amenable to GPU execution (see the sketch after this list).
  • Hummingbird is competitive with, and can even outperform, hand-crafted kernels on micro-benchmarks, while enabling seamless end-to-end acceleration of ML pipelines. We’ll show an example of this speedup below.
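Concretely, Hummingbird exposes a small convert API of its own (separate from the ONNXMLTools integration used in the rest of this post). Here is a minimal sketch; the scikit-learn RandomForestClassifier, the random data, and the "cuda" device are illustrative assumptions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

# Train an ordinary scikit-learn model on random data.
X = np.random.rand(1000, 30).astype(np.float32)
y = np.random.randint(2, size=1000)
skl_model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Compile the trained model into tensor operations (PyTorch backend).
hb_model = convert(skl_model, "pytorch")
hb_model.to("cuda")          # move to the GPU, assuming one is available
preds = hb_model.predict(X)  # same predict() API as the original model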

Why use ONNX Runtime?

The integration of Hummingbird with ONNXMLTools allows users to take advantage of the flexibility and performance benefits of ONNX Runtime. ONNX Runtime provides a consistent API across platforms and architectures, with bindings in Python, C++, C#, Java, and more. This allows models trained in Python to be used in a variety of production environments. ONNX Runtime also provides an abstraction layer for hardware accelerators, such as NVIDIA CUDA and TensorRT, Intel OpenVINO, Windows DirectML, and others. This gives users the flexibility to deploy on their hardware of choice with minimal changes to the runtime integration and no changes in the converted model.
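For example, switching accelerators is just a matter of choosing an execution provider when the session is created. The snippet below is a sketch, assuming a recent onnxruntime-gpu build and a converted model saved as model.onnx:

import onnxruntime as ort

# Prefer the CUDA provider; fall back to CPU if it is unavailable.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(session.get_providers())  # the providers actually in use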

While ONNX Runtime natively supports both DNNs and traditional ML models, the Hummingbird integration provides performance improvements by using the neural network form of LightGBM models for inference. This may be particularly useful for those already utilizing GPUs to accelerate other DNNs. Let’s take a look at this in action.

Code and performance

Import

import numpy as np
import lightgbm as lgb
import timeit
 
import onnxruntime as ort
from onnxmltools.convert import convert_lightgbm
from onnxconverter_common.data_types import FloatTensorType

Create some random data for binary classification

max_depth = 8
num_classes = 2
n_estimators = 1000
n_features = 30
n_fit = 1000
n_pred = 10000
X = np.random.rand(n_fit, n_features).astype(np.float32)
y = np.random.randint(num_classes, size=n_fit)
test_data = np.random.rand(n_pred, n_features).astype(np.float32)

Create and train a LightGBM model

model = lgb.LGBMClassifier(n_estimators=n_estimators, max_depth=max_depth, pred_early_stop=False)
model.fit(X, y)

Use ONNXMLTools to convert the model to ONNX-ML

input_types = [("input", FloatTensorType([n_pred, n_features))] # Define the inputs for the ONNX
onnx_ml_model = convert_lightgbm(model, initial_types=input_types)

Predict with LightGBM

lgbm_time = timeit.timeit("model.predict_proba(test_data)", number=7,
                          setup="from __main__ import model, test_data")
print("LightGBM (CPU): {}".format(lgbm_time))

Predict with ONNX ML model

sessionml = ort.InferenceSession(onnx_ml_model.SerializeToString())
# Output 0 is the predicted label; output 1 holds the class probabilities.
onnxml_time = timeit.timeit(
    "sessionml.run([sessionml.get_outputs()[1].name], "
    "{sessionml.get_inputs()[0].name: test_data})",
    number=7, setup="from __main__ import sessionml, test_data")
print("LGBM->ONNXML (CPU): {}".format(onnxml_time))

The result is the following:

LightGBM (CPU): 1.1157575770048425
LGBM->ONNXML (CPU) 1.0180995319969952

Not bad! Now let’s see Hummingbird in action. The only change to the conversion code above is the addition of without_onnx_ml=True.

Use ONNXMLTools to generate an ONNX model (without any traditional ML operators) using Hummingbird

input_types = [("input", FloatTensorType([n_pred, n_features))] # Define the inputs for the ONNX
onnx_model = convert_lightgbm(model, initial_types=input_types, without_onnx_ml=True)

We can now pip install onnxruntime-gpu and run the prediction over the onnx_model.
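As a quick sanity check before timing anything, you can confirm that the GPU build of ONNX Runtime is active (a minimal sketch, assuming onnxruntime-gpu is the installed package):

import onnxruntime as ort

# "GPU" indicates the CUDA-enabled build of ONNX Runtime is in use.
print(ort.get_device())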

Predict with the ONNX model (on GPU)

sess_options = ort.SessionOptions()
session = ort.InferenceSession(onnx_model.SerializeToString(), sess_options)
onnx_time = timeit.timeit(
    "session.run([session.get_outputs()[1].name], "
    "{session.get_inputs()[0].name: test_data})",
    number=7, setup="from __main__ import session, test_data")
print("LGBM->ONNX (GPU): {}".format(onnx_time))

And we get:

LGBM->ONNXML->ONNX (GPU): 0.2364534509833902

That is an approximately 5x improvement over the CPU implementation. Additionally, the ONNX model can take advantage of any new optimizations in future releases of ONNX Runtime (ORT), and it can run on any hardware accelerator supported by ORT.

Going forward

Hummingbird currently supports converters for ONNX, scikit-learn, XGBoost, and LightGBM. In the future, we plan to bring similar integrations to other converters in the ONNXMLTools family, such as those for XGBoost and scikit-learn. If there are additional operators or integrations you would like to see, please file an issue. We would love to hear how Hummingbird can help speed up your workloads, and we look forward to adding more features!