The development of deep learning models has shifted from shallow networks to deeper, more complex ones. For example, GPT-3, a language model that can generate strikingly fluent articles, has 175 billion parameters. This has created a need for more sophisticated methods of running such models in data centers. ONNC Compiler can partition large models into smaller shards, and ONNC Runtime streams those shards across several heterogeneous systems. Together, ONNC Compiler and Runtime make inference of large language models (LLMs) not only possible but also more efficient.
Deep learning models are becoming larger and larger. For example, the recently announced Megatron-Turing NLG 530B generative language model is made up of 530 billion parameters. The computational power required to run a system of this size presents some unique challenges.
For one, fitting this kind of deep learning model on a single server is either impossible or yields insufficient throughput, because the limited available memory caps the batch size. The Megatron-Turing NLG 530B, for example, requires 493 GB of RAM just to hold its weights, and there's no single GPU with such a large memory capacity.
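To see why a single device falls short, a back-of-the-envelope estimate helps: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is ours, not part of ONNC; the one-byte-per-parameter figure (e.g., int8 weights) is an assumption that happens to reproduce the 493 GB number quoted above.

```python
def model_memory_gb(num_params, bytes_per_param=2):
    """Estimate the memory (in GiB) needed just to hold model weights.

    bytes_per_param: 2 for fp16, 1 for int8/fp8 (assumed precisions).
    """
    return num_params * bytes_per_param / 1024**3

# Megatron-Turing NLG 530B at 1 byte per parameter (assumed int8):
print(round(model_memory_gb(530e9, bytes_per_param=1), 1))  # → 493.6

# GPT-3 (175B parameters) at fp16:
print(round(model_memory_gb(175e9), 1))  # → 326.0
```

Either figure dwarfs the memory of any single accelerator card, before even counting activations and the KV cache, which is why the model must be split across devices.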
One option for overcoming this obstacle is running multiple servers together in parallel, but designing the infrastructure needed to orchestrate this kind of system can be complicated, time-consuming, and expensive.
Model partitioning for heterogeneous multi-card systems can solve these problems. ONNC Compiler can partition a large model into several shards targeting different hardware architectures. ONNC Runtime then dispatches these shards to different devices, such as CPUs, GPUs, and deep learning accelerator cards, and streams them together.
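The core idea can be illustrated with a minimal sketch: assign consecutive layers to devices until each device's memory budget is full. This is our own greedy illustration under assumed layer sizes and budgets, not ONNC Compiler's actual partitioning algorithm, which targets specific hardware architectures.

```python
def partition_layers(layer_sizes_gb, device_budgets_gb):
    """Greedily assign consecutive layers to devices, filling each
    device's memory budget before moving to the next.

    Returns one shard (a list of layer indices) per device.
    """
    shards = [[] for _ in device_budgets_gb]
    dev, used = 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        # Advance to the next device once the current budget is full.
        if shards[dev] and used + size > device_budgets_gb[dev]:
            dev += 1
            used = 0.0
        if dev >= len(device_budgets_gb):
            raise ValueError("model does not fit on the given devices")
        shards[dev].append(i)
        used += size
    return shards

# Hypothetical example: six 4 GB layers spread across one large-memory
# device and two smaller accelerator cards.
print(partition_layers([4, 4, 4, 4, 4, 4], [16, 8, 8]))
# → [[0, 1, 2, 3], [4, 5], []]
```

Note that with heterogeneous budgets the shards come out unequal in size, which is exactly the point: each shard is tailored to the device that will run it.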
The ONNC software stack, including the Compiler and Runtime, is able to stream inference across heterogeneous multi-card and multi-server systems. In other words, the runtime gains access to multiple cards and servers, even ones with completely different hardware architectures, and combines their computational power to meet the load demands of deep learning models.
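Streaming shards this way forms a pipeline: while one device computes its shard for batch N, the previous device is already working on batch N+1. The toy sketch below mimics that with threads and queues standing in for devices; the two lambda "shards" are placeholders, not real model computation, and this is not ONNC Runtime's actual mechanism.

```python
import queue
import threading

def run_shard(stage, inbox, outbox):
    """Worker that applies one shard's computation to each incoming
    batch and forwards the result to the next stage."""
    while True:
        batch = inbox.get()
        if batch is None:      # sentinel: shut down and propagate
            outbox.put(None)
            return
        outbox.put(stage(batch))

# Hypothetical two-shard pipeline: each stage stands in for one device
# executing its shard of the partitioned model.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=run_shard, args=(lambda x: x + 1, q0, q1)).start()
threading.Thread(target=run_shard, args=(lambda x: x * 2, q1, q2)).start()

for batch in [1, 2, 3]:        # feed batches in; stages overlap in time
    q0.put(batch)
q0.put(None)

results = []
while (out := q2.get()) is not None:
    results.append(out)
print(results)  # → [4, 6, 8]
```

Because each stage only ever holds its own shard, no single device needs to fit the whole model, and throughput scales with the number of overlapping batches in flight.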
Furthermore, ONNC Compiler and Runtime enable customers to run a single deep learning model on:
Multiple chips in a card
Multiple cards in a server
Multiple servers in a rack
This enables AI chip vendors to compete with tech giants and gain market share in the data center industry. They can do so by letting end-users choose the hardware they feel works best and by boosting processing power with task-specific accelerator cards. With the power of ONNC, vendors overcome the challenge of orchestrating the servers and cards tasked with serving a deep learning model's demands.
Skymizer provides ONNC modularized components that serve as building blocks to adapt, extend, and improve existing system software, including a compiler, calibrator, runtime, and virtual platform for deep learning processing hardware. And because Skymizer's components are modular and reusable, vendors can reduce mass production risk and shorten time-to-market. Each of Skymizer's components has been battle-tested and proven to withstand whatever various deep learning models throw at it.
Also, the process of getting what vendors need with Skymizer is straightforward. Once the vendor describes the specifications they require, Skymizer provides consultancy to optimize hardware via ONNC, either based on the existing system software or by using ONNC as the fundamental software stack.
Further, Skymizer’s solution has already been used by top-tier providers, verifying its effectiveness in meeting the demands of intensive AI systems. And because Skymizer’s software stack easily integrates within the existing system software, customers don’t have to worry about compatibility issues. They’re empowered to optimize their accelerator cards to manage even the most demanding loads in data centers.
Reach out today to see how Skymizer can meet your AI processing needs.