Skymizer Taiwan Inc., a leading AI system software supplier, collaborated with Andes Technology, a leading RISC-V CPU IP supplier, to develop a new neural network compiler – Tiny ONNC – for embedded software engineers running modern PyTorch models on RISC-V platforms. Andes Technology will announce the result at Andes RISC-V Conf, 2021. Here, let’s unveil the details of this powerful tool and see how it benefits embedded software engineers.

Tiny ONNC: Tiny and Faster on RISC-V

Tiny ONNC is an easy-to-use tool that converts popular PyTorch models into a function made up of a series of CMSIS-NN or Andes LibNN function calls. With the powerful Open Neural Network Collection (ONNC) inside, Tiny ONNC supports a rich set of PyTorch models designed for MCUs. Compared with TensorFlow Lite for Microcontrollers (TFLM), the converted program is both fast and tiny: it runs 3x faster, uses only half the SRAM footprint, and its code size is less than 1/7.

Best Support for MCU NN Libraries – Andes LibNN and CMSIS-NN

Tiny ONNC bridges the gap between data scientists and embedded software engineers. With Tiny ONNC, data scientists can directly transform PyTorch models into quality C source code. This lets embedded engineers focus on implementing AI applications instead of handling the semantics and precision of neural network models.

Tiny ONNC is based on the most recent stable PyTorch version, 1.8.1, and uses Open Neural Network Collection (ONNC) as its backend toolchain. Tiny ONNC first exports a neural network model into an ONNX file. The ONNC calibrator then imports the ONNX file and automatically quantizes the floating-point weights and activation data to Qm.n format. A simple calibration command suffices; engineers have no need to write scripts. After calibration, the ONNC compiler completes the transformation and generates C source code that contains a series of NN library calls.
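To make the Qm.n step concrete, here is a minimal sketch of fixed-point quantization for an 8-bit (q7-style) word. It is not the ONNC calibrator's algorithm; the function names `choose_frac_bits`, `quantize`, and `dequantize` are hypothetical, and the rule for picking n (use the largest magnitude in the tensor) is an assumption made for illustration.

```python
import math

def choose_frac_bits(values, total_bits=8):
    """Pick n (fractional bits) so the largest magnitude fits a signed Qm.n word.

    Assumption for this sketch: m is never negative, so n is at most total_bits - 1.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    # Integer bits needed for max_abs, not counting the sign bit.
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return total_bits - 1 - int_bits

def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round to nearest, and saturate to the word range."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [min(hi, max(lo, round(v * scale))) for v in values]

def dequantize(qvalues, frac_bits):
    """Map fixed-point integers back to floating point."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in qvalues]

if __name__ == "__main__":
    weights = [0.5, -0.25, 0.9]          # made-up example data
    n = choose_frac_bits(weights)
    q = quantize(weights, n)
    print("n =", n, "q =", q, "restored =", dequantize(q, n))
```

Values just below a power of two can round up past the top of the range, which is why the sketch saturates instead of wrapping.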

To make Tiny ONNC a joy to work with, we have put our full effort into supporting the popular NN libraries for MCUs. Tiny ONNC can currently transform PyTorch models into two types of NN library calls: CMSIS-NN for ARM Cortex-M and Andes LibNN for Andes RISC-V. Andes LibNN is comparable to CMSIS-NN: every function in the CMSIS-NN library has a corresponding interface in Andes LibNN. Like CMSIS-NN, Andes LibNN collects more than 60 popular functions for various data types: fixed-point (fractional q7, q15, q31) and single-precision floating-point (32-bit). The library is optimized for Andes RVP (SIMD/DSP instruction set) and Andes RVV (Vector instruction set).

Performance, SRAM consumption, and flash consumption are the primary factors to consider when evaluating a new tool for a successful MCU application. We used the Tiny MLPerf v0.1 benchmark for the evaluation, which was conducted on two platforms: a cycle-approximate RISC-V simulator in Andes AndeSight v5.0.0 Beta and an ARM Cortex-M4 evaluation board (EVB) from Nuvoton. We also compared the results with TensorFlow Lite for Microcontrollers (TFLM).

Table 1 – experimental environment

The following table shows a precision comparison between Tiny ONNC and TensorFlow Lite for Microcontrollers (TFLM). Because the AndeSight simulator can’t run TFLM, all precision tests ran on the ARM Cortex-M platform only. The precision tests come from two sources. Model L1/L5 is a classic Caffe model from the ARM CMSIS-NN GitHub repository. The rest of the models are from Tiny MLPerf v0.1. We don’t have a LeNet result for TFLM because TFLM cannot run a Caffe model directly, and we found no way to convert the Caffe model into a TensorFlow Lite model.

The experimental results show that Tiny ONNC delivers precision similar to that of TFLM.

Table 2 – target neural network models on ARM Cortex-M4

3x Faster with RVP and RVV Support

To evaluate the performance of Tiny ONNC and TFLM, we first ran the two tools on the same ARM Cortex-M4 platform. The results are in the last two rows of Figure 1: TPT CM4 and TFLM CM4. Figure 1 shows the inference time of each model; a shorter execution time means better performance.

Figure 1 – performance comparison between Tiny PyTorch for RISC-V (D25), Tiny PyTorch for ARM Cortex-M4 and TFLM for Cortex-M4.

On the same Cortex-M4 platform, Tiny ONNC delivers performance similar to that of TFLM. In VWW96, KWS, and AD, the performance gap between Tiny ONNC and TFLM is tiny, but in ResNet, Tiny ONNC outperforms TFLM by up to 24%. We have not investigated this in depth; our guess is that the gap comes from ONNC’s better memory allocation algorithm.

When we compared the CMSIS-NN library with the Andes LibNN library, we found that models using Andes LibNN are 3x faster. In VWW96 and KWS, depthwise convolution operators account for most of the execution time, while in ResNet the majority are ordinary convolution operators. Figure 1 shows that depthwise convolution operators in Andes LibNN are ~3x faster than in CMSIS-NN, and the ordinary convolution operator is almost 4x faster. The performance comes from better use of RVP (DSP/SIMD instructions) and RVV (Vector instructions).

Smaller SRAM Consumption

Most neural networks have a large number of parameters. As a result, SRAM size dominates the cost of an MCU for neural network applications. Tiny ONNC leverages the ONNC compiler as one tool in its backend toolchain. From the very beginning of designing ONNC, we knew that memory size is a major constraint, and we paid most of our attention to designing better memory allocation algorithms. The ONNC compiler has sophisticated algorithms to split big tensors into small pieces and to reuse memory space smartly.
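The idea of reusing memory space can be illustrated with a small lifetime-based planner. This is only a sketch of one common greedy technique (place the largest tensors first, at the lowest offset that does not collide with any tensor alive at the same time); it is not ONNC's actual allocator, and the function name `plan_memory`, the tensor names, and the sizes are all hypothetical.

```python
def plan_memory(tensors):
    """Assign a buffer offset to each tensor so that tensors alive at the
    same time never overlap, while tensors with disjoint lifetimes may share space.

    tensors: list of (name, size, first_use, last_use), where first_use and
    last_use are operator indices (inclusive).
    Returns (offsets, peak), with peak being the total buffer size needed.
    """
    placed = []   # (offset, size, first_use, last_use) of tensors already placed
    offsets = {}
    # Greedy heuristic: handle the largest tensors first.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Address ranges occupied by tensors whose lifetime overlaps this one.
        busy = sorted(
            (o, o + s) for o, s, f, l in placed if not (l < first or last < f)
        )
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break                    # fits in the gap before this range
            offset = max(offset, end)    # otherwise skip past it
        placed.append((offset, size, first, last))
        offsets[name] = offset
    peak = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, peak

if __name__ == "__main__":
    # Made-up activations of a four-step chain: each tensor is produced at
    # first_use and consumed by the next operator at last_use.
    chain = [
        ("input", 100, 0, 1),
        ("conv1", 200, 1, 2),
        ("conv2", 200, 2, 3),
        ("fc",     10, 3, 4),
    ]
    offsets, peak = plan_memory(chain)
    naive = sum(size for _, size, _, _ in chain)
    print(offsets, "peak:", peak, "without reuse:", naive)
```

In this toy chain the buffers for "input" and "conv2" share the same region because they are never alive at the same time, which is the effect the paragraph above describes.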

Figure 2 – SRAM consumption (data+bss) between Tiny ONNC and TFLM on ARM Cortex-M4

We executed Tiny MLPerf v0.1 with TFLM and ONNC. Because TFLM requires developers to set the memory size manually, we wrote a binary search tool to find the smallest memory size for each model. ONNC, in contrast, finds the smallest memory size automatically, so users don’t need any extra scripts for memory settings. Figure 2 shows the final experimental results: ONNC uses less SRAM in all cases. It is worth mentioning that ONNC did not enable any tensor splitting or operator splitting algorithms in this experiment. In many cases, ONNC can save 46%~57% of global SRAM across various accelerators when tensor splitting and operator splitting are enabled.
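The binary search over memory sizes can be sketched as follows. This is not the tool used in the experiment: `smallest_arena` and `runs_ok` are hypothetical names, and the sketch assumes that if a model runs with a given size it also runs with any larger size. In a real setup, `runs_ok` would build and run the TFLM program with the given memory size and report whether inference succeeded.

```python
def smallest_arena(runs_ok, lo, hi):
    """Return the smallest size in [lo, hi] for which runs_ok(size) is True,
    or None if even hi is too small.

    Assumes runs_ok is monotonic: once it is True for a size, it stays True
    for every larger size.
    """
    if not runs_ok(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if runs_ok(mid):
            hi = mid          # mid works, so the answer is mid or smaller
        else:
            lo = mid + 1      # mid fails, so the answer is larger than mid
    return lo

if __name__ == "__main__":
    REQUIRED = 53_248                       # made-up requirement for the demo

    def runs_ok(size):
        # Stand-in for "build, flash, and run the model with this memory size".
        return size >= REQUIRED

    print(smallest_arena(runs_ok, 0, 1 << 20))
```

Each probe halves the search range, so a 1 MiB range needs only about 20 trial runs per model.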

Smaller Code Size

Trimming down code size is like an eternal war between software engineers and embedded microcontrollers. Using TFLM on microcontrollers has its price. TFLM is a sort of interpreter, so it must bring in the entire NN library at link time and run time. In contrast, Tiny ONNC generates C source code directly, and only the functions used in the generated C source code are linked into the program. An optimizing linker should strip the unneeded functions in the NN library for you automatically.

Figure 3 – Flash consumption (text+data) between Tiny ONNC and TFLM on ARM Cortex-M4

We compiled the generated C code into a runnable program and compared its code size with that of TFLM. As TFLM requires, we embedded each model in the TFLM program for code size optimization. The experimental results show that, in terms of code size, Tiny ONNC has up to a 10x advantage over TFLM. We could say that Tiny ONNC has arrived at the smallest code size a neural network program could have.

Tiny ONNC will be available for licensing after June 2021. For more details on Tiny ONNC™ features, please see the product page or contact us.

About Skymizer Taiwan, Inc

Skymizer Taiwan, Inc. (Skymizer) is a leading AI system software supplier that helps IC design houses shorten the distance from AI models to ASICs with its unique compiler technology.

Overall, Skymizer can be characterized as an embedded system software (ESS) tool and intellectual property (IP) business specializing in providing an AI system development environment, including starter kits, reference designs, and turn-key solutions.



PyTorch, the PyTorch logo, and any related marks are trademarks of Facebook, Inc.
TensorFlow, the TensorFlow logo, and any related marks are trademarks of Google Inc.