Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs 

5 min read

Mohit Ayani, Solutions Architect, NVIDIA Shang Zhang, Senior AI Developer Technology Engineer, NVIDIA Jay Rodge, Product Marketing Manager-AI, NVIDIA Transformer-based models have revolutionized the natural language processing (NLP) domain. Ever since its inception, transformer architecture has been integrated into models like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) for performing tasks Read more

Journey to optimize large scale transformer model inference with ONNX Runtime 

7 min read

“With its resource-efficient and high-performance nature, ONNX Runtime helped us meet the need of deploying a large-scale multi-layer generative transformer model for code, a.k.a., GPT-C, to empower IntelliCode with the whole line of code completion suggestions in Visual Studio and Visual Studio Code.” Large-scale transformer models, such as GPT-2 and GPT-3, are among the most Read more