For the modern data center to keep up with the surge in artificial intelligence and deep learning applications, it needs to have the power to process unprecedented amounts of data in a fraction of the time. This is where a deep learning accelerator can play a vital role, enabling data centers to house complex computational models that natural language processors, chatbots, and more depend on.
Deep learning models are becoming larger and larger. For example, the recently announced Megatron-Turing NLG 530B generative language model is made up of 530 billion parameters. The computational power it takes to make this kind of system presents some unique challenges.
For one, trying to fit this kind of deep learning model on a single server would either be impossible or result in insufficient throughput. For example, the Megatron-Turing NLG 530B requires 493 GB of RAM, and there’s no single GPU with such a large memory capacity. Even when a model does fit, the relatively small amount of memory left over limits the batch size, which in turn limits throughput. One option for overcoming this obstacle is running multiple servers together in parallel, but designing the infrastructure needed to orchestrate this kind of system can be complicated, time-consuming, and expensive.
Heterogeneous multi-card partitioning can solve these problems. The Open Neural Network Compiler (ONNC) can distribute a large model across different hardware architectures. For example, parts of the model can be sent to different brands of GPUs, CPUs, and deep learning accelerator cards.
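To make the memory gap concrete, here is a rough back-of-the-envelope sketch in Python. The 493 GB figure is the one quoted above; the per-card capacities (16, 40, and 80 GB) are illustrative assumptions, and the estimate counts only the memory needed to hold the model, ignoring activations, batching, and runtime overhead.

```python
import math

# Memory figure quoted in the article for Megatron-Turing NLG 530B.
MODEL_MEMORY_GB = 493


def min_devices(model_gb: float, device_gb: float) -> int:
    """Lower bound on the number of cards needed just to hold the model."""
    return math.ceil(model_gb / device_gb)


# Hypothetical per-card memory capacities, chosen only for illustration.
for device_gb in (16, 40, 80):
    count = min_devices(MODEL_MEMORY_GB, device_gb)
    print(f"{device_gb} GB cards: at least {count} needed")
```

Even with the largest capacity assumed here, the model has to be spread over several cards, which is why partitioning becomes necessary.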
A central part of the solution relies on model partitioning technology. ONNC partitions a large deep learning model, dividing it up into smaller pieces. The runtime distributes these smaller models onto multiple accelerators and streams them together as a pipeline. The end result? Many “hands” make the throughput higher.
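The idea of partitioning and pipelining can be sketched in a few lines of Python. This is a simplified illustration, not ONNC's actual algorithm or API: the `Layer` type, the greedy `partition` function, the layer sizes, the card capacities, and the stage latencies are all hypothetical. It splits a model into contiguous pieces that each fit on one card, then estimates steady-state throughput, which in a pipeline is set by the slowest stage.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    memory_gb: float


def partition(layers, capacities_gb):
    """Greedily assign contiguous layers to cards, in order, within each card's memory."""
    stages = [[] for _ in capacities_gb]
    device, used = 0, 0.0
    for layer in layers:
        # Move on to the next card until the current layer fits.
        while device < len(capacities_gb) and used + layer.memory_gb > capacities_gb[device]:
            device += 1
            used = 0.0
        if device == len(capacities_gb):
            raise ValueError(f"not enough card memory to place {layer.name}")
        stages[device].append(layer.name)
        used += layer.memory_gb
    return stages


def pipeline_throughput(stage_latencies_s):
    """Steady-state inferences per second: one result per slowest-stage interval."""
    return 1.0 / max(stage_latencies_s)


# Hypothetical model: eight layers of 10 GB each, on three cards of different sizes.
layers = [Layer(f"layer{i}", 10.0) for i in range(8)]
stages = partition(layers, capacities_gb=[32.0, 24.0, 40.0])
for card, names in enumerate(stages):
    print(f"card {card}: {names}")

# Assumed per-stage latencies in seconds, for illustration only.
latencies = [0.03, 0.02, 0.03]
print(f"one card at a time: {1.0 / sum(latencies):.1f} inferences/s")
print(f"pipelined:          {pipeline_throughput(latencies):.1f} inferences/s")
```

With these made-up numbers, the three cards working as a pipeline finish one inference every 0.03 seconds instead of one every 0.08 seconds, because each card starts on the next input as soon as it hands off its piece of the current one.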
The ONNC software stack, including compiler and runtime, is able to stream inference on heterogeneous multi-card and multi-server systems. In other words, the runtime gets access to multiple cards and servers, even ones with completely different hardware architectures, combining their computational power to handle the load that deep learning models demand.
Furthermore, distributed runtime enables customers to run a single deep learning model on:
Multiple chips in a card
Multiple cards in a server
Multiple servers in a rack
Despite these performance upgrades, the drop in accuracy is less than 1%. What it boils down to is this: You can run deep learning models on regular IoT devices without significantly changing their architecture.
Skymizer provides modular ONNC components, including a compiler, calibrator, runtime, and virtual platform for deep learning processing hardware, that serve as building blocks to adapt, extend, and improve existing system software. Also, because Skymizer’s components are modular and reusable, vendors can reduce mass production risk and shorten time-to-market. Each of Skymizer’s components has been battle-tested and proven to withstand what various deep learning models throw at it.
Also, the process of getting what vendors need with Skymizer is straightforward. Once the vendor describes the specifications they require, Skymizer provides consultancy to optimize hardware via ONNC, either based on the existing system software or by using ONNC as the fundamental software stack.
Further, Skymizer’s solution has already been used by top-tier providers, which speaks to its effectiveness in meeting the demands of intensive AI systems. And because Skymizer’s software stack integrates easily with existing system software, customers don’t have to worry about compatibility issues. They’re empowered to optimize their accelerator cards to manage even the most demanding loads in data centers. Reach out today to see how Skymizer can meet your AI processing needs.