YOLO-NAS Uncovered: Essential Insights and Implementation Techniques for Machine Learning Engineers

The latest state-of-the-art Object Detection architecture has excited the computer vision world.

13 min read · Jun 12, 2023

Introduction

Computer systems’ ability to understand objects in the real world has resulted in a number of applications in domains such as pose estimation (tracking key body joints), object detection, image classification, and more. These applications drive the core functionality in use cases such as surveillance cameras, autonomous vehicles, and robotics.

The plethora of research papers and applications in the world of computer vision is a testament to the feat our collective efforts and accumulated knowledge can achieve.

For example, PapersWithCode brings together over 900 methodologies, more than 200 datasets, and thousands of academic papers, all focused on object detection and the broader landscape of computer vision.

This article presents key information on another novel neural network architecture from the research team at Deci that pushes state-of-the-art performance on object detection tasks and advances the frontier of computer vision application to research and real-world use cases.

Deci offers a deep learning platform that supports the processes involved in working with deep learning models, such as their management, implementation, deployment and maintenance. To create YOLO-NAS, the latest deep neural network architecture that improves on previous object detection models, Deci’s research team leveraged their proprietary software, industry experience, and knowledge gained from working with prominent organizations within the computer vision domain.

In this article, we explore the following:

Brief summary of object detection tasks within the field of computer vision.
The importance of the YOLO (You Only Look Once) architecture.
The concept of Neural Architecture Search (NAS).
Deci’s Automated Neural Architecture Construction (AutoNAC).
Discovery of YOLO-NAS using AutoNAC.
The application of Hybrid Quantization in YOLO-NAS to reduce the size of the network for real-time operation on edge devices.
Techniques employed in YOLO-NAS for improved training and performance, such as Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and Quantization Aware Blocks (QAB).

Primer: Object Detection Overview

For many machine learning practitioners exploring computer vision (CV), object detection is the typical entry-point task because it translates directly into real-world value: practitioners can quickly apply CV techniques in applications deployed to solve actual problems.

What is Object Detection?

Object detection is the task of identifying and localizing objects present in an image, multiple images, or a sequence of images (video). The outputs of deep learning models designed for object detection are bounding boxes or masks around the identified and localized objects.

YOLO Object detection inference result of bounding boxes and labels layered on top of objects of interest in the image

Object detection has risen in popularity over the decades as an area of research and of industrial and commercial application. Numerous research teams continually release innovative and intuitive architectures that push state-of-the-art benchmarks. At the same time, companies tackle real-world problems with object detection solutions in industries such as automotive (autonomous vehicles), healthcare (AI medical imaging), robotics, and defence.

YOLO (You Only Look Once) is a prominent deep learning neural network architecture that tackles object detection. It has led to variants of the initial architecture that enable object detection in real time on edge devices (mobile phones, offshore equipment) with high accuracy.

Be Conversation Ready: Key takeaways from YOLO-NAS

NAS (Neural Architecture Search)

To understand YOLO-NAS, exploring how the research team at Deci discovered the latest state-of-the-art architecture is necessary. And that involves a brief introduction to the concept of NAS and AutoNAC.

NAS (Neural Architecture Search) is a subfield within AutoML (automated machine learning). NAS involves the development of algorithms that automatically design and configure deep neural network architectures, with the aim of outperforming counterparts that are designed, configured and built manually by human researchers.

NAS research advancement facilitates efficient discovery of high-performing neural network architectures through a systemized process that considers factors such as speed of inference, available compute resources, architectural complexity and prediction accuracy. Successful efforts of novel architecture design and development produced through NAS techniques have surpassed human-engineered architectures in tasks like natural language processing, object detection, and image classification.

A significant benefit of NAS is that automating parts of the research process reduces the human hours involved. NAS also addresses scarce or missing domain expertise: a search system can optimize for several objectives at once and systematically explore many internal network configurations across a vast search space, eventually arriving at configurations that would otherwise require human expert knowledge, creativity and intuition.

NAS techniques are even more relevant today, as they focus on optimizing factors such as computational resources, efficiency, accuracy, power consumption, and memory usage, which are all crucial for adapting architectures to edge devices (such as smartphones) or real-time scenarios.

Notably, models designed to run on edge devices and in compute-scarce environments, such as MobileNetV3 and EfficientDet, were discovered by combining novel architecture design with an automated search for architectures that improved on state-of-the-art performance in computer vision tasks.

NAS provides significant benefits for researchers. But the ability to traverse through a search space of possible deep neural network architectures and configurations consisting of millions of parameters, combined with factors such as compute resources, target accuracy metrics and more, requires extensive computing resources. The large computing resources and domain expertise required to execute NAS techniques limit its effective utilization to a small number of organizations.
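
To make the idea of searching an architecture space concrete, here is a minimal sketch of the simplest possible strategy, random search. Everything in it is an assumption for illustration: the three-dimensional search space (48 combinations), the `proxy_score` function that stands in for training and benchmarking a candidate, and the latency budget. Real NAS systems, including AutoNAC, use far larger spaces and far more sophisticated search strategies.

```python
import random

# Hypothetical search space: the option values below are illustrative only.
SEARCH_SPACE = {
    "block_type": ["conv", "residual", "bottleneck"],
    "num_blocks": [2, 4, 6, 8],
    "channels": [32, 64, 128, 256],
}


def sample_architecture(rng):
    """Draw one candidate configuration uniformly at random."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def proxy_score(arch, latency_budget=500.0):
    """Made-up stand-in for 'train the candidate and measure it'.

    Larger networks get a higher proxy accuracy and a higher proxy latency;
    candidates that exceed the latency budget are rejected.
    """
    size = arch["num_blocks"] * arch["channels"]
    bonus = {"conv": 0.0, "residual": 0.02, "bottleneck": 0.01}[arch["block_type"]]
    accuracy = size / (size + 400.0) + bonus
    latency = 0.4 * size
    if latency > latency_budget:
        return float("-inf")
    return accuracy


def random_search(num_trials=200, seed=0):
    """Keep the best-scoring candidate seen over num_trials random samples."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = proxy_score(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search()
print(best_arch, round(best_score, 3))
```

Even in this toy, the expensive part is the scoring step. Replacing `proxy_score` with real training runs is what makes NAS so compute-hungry in practice.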

Automated Neural Architecture Construction (AutoNAC)

Automated Neural Architecture Construction (AutoNAC) is Deci’s proprietary NAS technology that efficiently explores a vast search space of diverse architectural configurations and structures while considering the encompassing block types, the number of blocks, and channel allocations.

AutoNAC facilitated the discovery of the YOLO-NAS architecture and its variants (YOLO-NAS-S, YOLO-NAS-M, and YOLO-NAS-L) by searching for an optimal model architecture that combined the fundamental architectural contributions of YOLO variants with several of the Deci research team’s own neural components, which enable optimized training and inference. This brought the number of possible neural network architectures to 10¹⁴. To put the vastness of the search space into perspective, this is greater than the number of stars in our galaxy.

The search space that led to the discovery of the YOLO-NAS architecture incorporated principles, design, and compute considerations that prioritized efficiency, scalability, robustness and interpretability.

Let’s explore some of these considerations.

Depiction of factors taken into consideration in the development of YOLO-NAS — Image by Author

Efficiency refers to the requirement that modern deep learning models fit the compute and storage limits of edge devices such as smartphones while still meeting performance requirements for inference speed and accuracy. The YOLO-NAS architecture delivers high-accuracy performance for object detection tasks without requiring high computing resources. AutoNAC technology traverses the vast architectural search space for architectures that balance latency (the time taken to receive an inference result) and throughput (the number of image frames processed within a given time period).

The robustness of the YOLO-NAS model is evident in its resilience to changes in input data, its ability to handle noise or uncertainty, and its ability to maintain high accuracy even after post-training quantization. Furthermore, the principles that are integral to YOLO architectures (single-stage detection, grid division of the image, bounding box prediction, multi-scale predictions, and non-max suppression) equip YOLO-NAS with the robustness to detect objects of various sizes effectively across different scenarios.
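
Of the principles listed above, non-max suppression is the easiest to show in a few lines. The sketch below is a generic greedy version, not code taken from YOLO-NAS: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the 0.5 IoU threshold and the example boxes are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.82, so it is discarded as a duplicate, while the third box is far away and survives.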

And now we’ve reached the ‘Efficiency Frontier’. The efficiency frontier is the set of architectures in the search space that offer the best available trade-off between latency, throughput and accuracy: for a model on the frontier, no other candidate is better on one of these measures without being worse on another. AutoNAC facilitates the discovery of novel neural network architectures by considering hardware availability, performance targets, quantization and similar factors. The efficiency frontier contains the YOLO-NAS variants YOLO-NAS-S, YOLO-NAS-M, and YOLO-NAS-L; these model variants cater to different hardware and computational constraints, which lets the YOLO-NAS model scale with the available computational resources.
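
One way to picture the frontier is as a Pareto filter over measured candidates. The sketch below assumes each candidate is summarized by just two numbers, latency and accuracy, and the model names and measurements are hypothetical rather than real benchmark results.

```python
def pareto_frontier(candidates):
    """Keep candidates that no other candidate beats on both axes.

    Each candidate is (name, latency_ms, accuracy); lower latency and
    higher accuracy are better.
    """
    frontier = []
    for name, lat, acc in candidates:
        dominated = any(
            o_lat <= lat and o_acc >= acc and (o_lat < lat or o_acc > acc)
            for _, o_lat, o_acc in candidates
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda c: c[1])


# Hypothetical measurements, not real benchmark numbers.
candidates = [
    ("model-a", 3.0, 0.45),
    ("model-b", 5.0, 0.50),
    ("model-c", 6.0, 0.48),  # slower and less accurate than model-b
    ("model-d", 8.0, 0.52),
]
frontier = pareto_frontier(candidates)
print([name for name, _, _ in frontier])
```

Here `model-c` drops out because `model-b` is both faster and more accurate. The remaining three models each represent a different, equally defensible trade-off.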

This Efficiency Frontier graph presents a comparison between the YOLO-NAS architecture and other YOLO architectures based on object detection performance on the COCO2017 validation dataset.

Hybrid Quantization Method

To get large deep learning model architectures to operate in real time with high accuracy in edge environments, the network has to be made smaller. Making a deep learning model ‘smaller’ here means reducing the number of bits used to represent each number in the network. The technical term for this is Quantization.

Quantization is the process of reducing the size of a neural network by lowering the precision, and therefore the number of bits, used to store the weights, biases and activations within the network. For example, weights in neural networks are typically stored as 32-bit floating-point numbers (also known as single precision), whose wide range and fine resolution assist with stable and accurate training. After quantization, the numerical values in the network are stored in a lower n-bit representation, for example 16, 8, 4 or 2 bits.
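
As an illustration of what reducing precision means numerically, here is a minimal sketch of affine (scale and zero-point) quantization to signed 8-bit integers. It is one common scheme rather than the specific method used for YOLO-NAS, and the example weights are made up.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with one scale and zero point."""
    qmin, qmax = -128, 127
    # The represented range must include 0 so that 0.0 maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the stored integers."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.82, -0.1, 0.0, 0.33, 1.27]  # made-up FP32 weights
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each restored weight differs from the original by at most about half a quantization step (`scale / 2`), which is the information loss the next paragraphs refer to.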

Understanding Precision Formats: FP32, FP16, and INT8 — Image by Author

Quantization provides several benefits, such as:

Reducing the memory required to store a network
Reducing the computation needed to run inference on a trained model
Enabling compute-constrained edge devices such as smartphones and embedded systems to run models
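
The storage benefit is simple arithmetic: the size of the weights scales linearly with the bits per value. The parameter count below is a hypothetical round number, not the size of any YOLO-NAS variant.

```python
def model_size_mb(num_parameters, bits_per_value):
    """Storage needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_parameters * bits_per_value / 8 / 1_000_000


NUM_PARAMETERS = 20_000_000  # hypothetical model size, chosen for round numbers

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{label}: {model_size_mb(NUM_PARAMETERS, bits):.0f} MB")
```

Going from FP32 to INT8 cuts the weight storage to a quarter (80 MB to 20 MB in this example).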

But the apparent limitation of quantization is that there can be significant information loss due to the reduction of the number of bits used to represent numerical values, which consequently causes a loss in accuracy and performance of the quantized version of the model.

Deci’s research team utilized a novel hybrid quantization method to combat information loss whilst producing an architecture that is small enough to run on edge devices with high accuracy and performance. The hybrid quantization method applies quantization non-uniformly across a network architecture, selecting which parts of the model to quantize and how aggressively. The key consideration is determining which layers must keep a higher-precision representation (more bits, such as 16 or 32) and which layers tolerate aggressive quantization to a lower-precision representation (fewer bits, such as 4 or 8).

Conducting hybrid quantization requires understanding the neural network’s configuration, including which layers are sensitive to quantization. The Deci algorithm used to design, develop and discover the YOLO-NAS architecture analyzes the architecture to determine which layers should be quantized so that information loss stays low. Subsequently, Deci’s software provides the tools required to quantize large models (FP32) into smaller versions (FP16 or INT8) without significantly reducing accuracy and performance.
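
The selection step can be sketched as follows, under loudly stated assumptions. The code uses weight reconstruction error as a stand-in for layer sensitivity, whereas a real pipeline would more likely measure the effect on model accuracy. The layer names, weights, bit widths and tolerance are all hypothetical, and this is not Deci’s algorithm.

```python
def quantization_error(weights, bits):
    """Mean squared error after symmetric uniform quantization to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    return sum((w - round(w / scale) * scale) ** 2 for w in weights) / len(weights)


def assign_precision(layers, low_bits=8, high_bits=16, tolerance=1e-5):
    """Use low_bits where the error stays under tolerance, else high_bits."""
    plan = {}
    for name, weights in layers.items():
        error = quantization_error(weights, low_bits)
        plan[name] = low_bits if error <= tolerance else high_bits
    return plan


# Hypothetical layers: "head.conv" has one large outlier that stretches the
# quantization range and wipes out its small weights at 8 bits.
layers = {
    "backbone.conv1": [0.11, -0.52, 0.37, -0.08, 0.44],
    "head.conv": [0.013, -0.021, 0.007, 0.018, 6.0],
}
plan = assign_precision(layers)
print(plan)
```

The well-behaved layer is assigned 8 bits, while the layer with the outlier is kept at 16 bits, which is the non-uniform treatment described above.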

Post-Training Quantization (PTQ), Quantization-Aware Training (QAT) and Quantization Aware Blocks (QAB)

Some of the key contributions of YOLO-NAS are around its quantization strategy, so these are worth exploring. This section introduces three concepts that enable the efficient training, fine-tuning and inference performance of YOLO-NAS and its variants.

Post-Training Quantization (PTQ)
Quantization-Aware Training (QAT)
Quantization Aware Blocks (QAB)

Post-Training Quantization (PTQ), as the name implies, is conducted after a neural network architecture has been trained.

The training of the YOLO-NAS architecture was conducted with Objects365 and COCO pseudo-labelled data. After training, the YOLO-NAS model was put through a post-training quantization procedure, reducing the model size by decreasing the precision of the model’s weights from floating point (FP32) to a low-precision integer representation (INT8). This procedure reduces the computational resources YOLO-NAS requires for inference while keeping the accuracy loss caused by the lower-precision weights to a minimum. As a result of the PTQ process, the quantized versions of YOLO-NAS have lower inference latency and are designed to run on edge devices in applications that require real-time object detection.

The YOLO-NAS quantized versions are YOLO-NAS-INT8-S, YOLO-NAS-INT8-M, and YOLO-NAS-INT8-L. Their mAP (mean average precision) decreases only slightly after quantization: by 0.51, 0.65, and 0.45 points for the S, M, and L variants respectively.

After the PTQ process is carried out on the pre-trained YOLO-NAS model, the quantized variants of the YOLO-NAS architecture go through a fine-tuning procedure that uses Quantization-Aware Training (QAT) techniques.

The QAT technique works by simulating the information and accuracy loss caused by quantization during training, which enables the model to learn to compensate for that loss and retain its initial accuracy. The result is a model that has lower latency, is smaller, and can be deployed on edge devices. Quantization-aware training is supported by the Quantization Aware Blocks (QAB) built into the architecture, which are components within the neural network designed to adapt to the reduced precision of the weights and activations.
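
A toy sketch of the simulation idea follows. It fits a single weight with a "fake quantize" step in the forward pass and a straight-through estimator in the backward pass, which is a common way to implement QAT but is an assumption here rather than a description of the YOLO-NAS recipe. The data, step size and learning rate are made up.

```python
def fake_quantize(w, scale):
    """Round w to the nearest multiple of scale (simulated low precision)."""
    return round(w / scale) * scale


def train_with_fake_quant(xs, ys, scale=0.25, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized w."""
    w = 0.0  # full-precision "shadow" weight that the optimizer updates
    for _ in range(steps):
        wq = fake_quantize(w, scale)
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(wq)/d(w) as 1.
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.6 * x for x in xs]  # target weight 1.6 is not a multiple of 0.25
w_float, w_quant = train_with_fake_quant(xs, ys)
print(w_float, w_quant)
```

Because the loss is always computed with the quantized weight, training settles on a grid point next to the ideal value instead of a value that only works at full precision.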

Depiction of the quantisation process involved in the development of the YOLO-NAS quantized variants — Image by Author

Attention Mechanism

As a machine learning practitioner, you are most likely familiar with the attention mechanism, a technique made popular by the paper that introduced the Transformer neural network: Attention Is All You Need.

The YOLO-NAS architecture incorporates the attention mechanism and leverages it to selectively focus on certain parts of an image containing target object(s) relevant to the problem domain or use case.

Incorporating the attention mechanism into the YOLO-NAS architecture enables the model to prioritize areas of an image that contain a target object, effectively reducing the influence of irrelevant information such as non-target objects and image background. This application of attention refines the model’s focus and significantly boosts its object detection capabilities.
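
For readers who want a reminder of the underlying computation, the sketch below shows generic scaled dot-product attention for a single query, as defined in the Transformer paper. It is not claimed to be the form of attention used inside YOLO-NAS, and the vectors are toy values.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return output, weights


query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]  # aligned, unrelated, opposed
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
output, weights = attention(query, keys, values)
print(weights)
```

The key that matches the query receives most of the weight, so its value dominates the output. This is the "selective focus" behaviour described above.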

Training YOLO-NAS

The YOLO-NAS training procedure is comprehensive and leverages multiple datasets, both labelled and unlabelled, together with supervised and unsupervised training procedures.

An overview of the YOLO-NAS training process — Image by Author

The pre-training phase of the YOLO-NAS architecture first uses the Objects365 dataset, a collection of 2 million images with 365 categories and 30 million bounding boxes, all suitable for object detection research. The COCO pseudo-labelled dataset then contributes a further 123k images on which the YOLO-NAS architecture is trained. Pseudo-labelling is a semi-supervised technique that infers annotations for unlabelled data using an accurate model trained on a similar, annotated dataset. The pseudo-labelled dataset augments the initial Objects365 dataset, creating more training data and resulting in better-performing, more robust models through diverse training examples.
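
The pseudo-labelling step can be sketched in a few lines. The `toy_teacher` function, the file names, the detection format and the 0.6 confidence threshold are all hypothetical stand-ins, since the article does not describe the actual pipeline in that detail.

```python
def pseudo_label(unlabeled_images, teacher, min_confidence=0.6):
    """Keep teacher detections that clear the confidence threshold."""
    labeled = []
    for image in unlabeled_images:
        detections = [d for d in teacher(image) if d["score"] >= min_confidence]
        if detections:
            labeled.append({"image": image, "annotations": detections})
    return labeled


def toy_teacher(image):
    """Hypothetical stand-in for a trained detector; returns canned outputs."""
    canned = {
        "img_001.jpg": [
            {"label": "car", "box": (10, 20, 110, 90), "score": 0.93},
            {"label": "dog", "box": (5, 5, 30, 30), "score": 0.31},
        ],
        "img_002.jpg": [{"label": "bus", "box": (0, 0, 200, 150), "score": 0.42}],
    }
    return canned.get(image, [])


dataset = pseudo_label(["img_001.jpg", "img_002.jpg"], toy_teacher)
print(dataset)
```

Only the confident `car` detection becomes a training annotation. Low-confidence predictions are dropped so that the teacher’s mistakes are less likely to be learned by the new model.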

The training phase of YOLO-NAS architecture incorporates Knowledge Distillation (KD) and Distribution Focal Loss (DFL) to improve the performance and accuracy of the training model for the object detection task.

Knowledge Distillation (KD) is a machine learning technique that reduces the computational resources a model requires by training a simpler version of the original model (the student) to approach the accuracy of the original (the teacher) at a fraction of the teacher’s computational requirement and memory footprint.

In this scenario, the less complex model is trained to align its predictions to those of the more sophisticated teacher model. The YOLO-NAS student model, shaped by knowledge distillation, is better optimized for devices with limited memory or processing capabilities, like smartphones and other low compute devices.
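
The phrase "align its predictions" is usually made precise with a divergence between the teacher’s and the student’s output distributions. Below is a minimal sketch of the classic temperature-softened formulation for class scores. Distillation for object detection is more involved, and the logits and temperature here are illustrative assumptions rather than details of the YOLO-NAS setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # T^2 keeps gradient magnitudes comparable


teacher = [6.0, 2.0, -1.0]
close_student = [5.5, 2.2, -0.8]  # nearly agrees with the teacher
far_student = [0.0, 3.0, 3.0]     # disagrees with the teacher

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher’s distribution exactly and grows as the two diverge, so minimizing it pulls the student toward the teacher.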

YOLO-NAS is designed to be a versatile neural network architecture for several object detection tasks, specifically scenarios requiring efficient low-latency prediction. Object detection models today are expected to be robust and applicable to a wide range of real-world scenarios and use cases. For example, object detection tasks can vary significantly in scope and scale, from identifying galaxies or planets in astronomical images to detecting microscopic organisms in biomedical research through microscopy.

Applying Distribution Focal Loss (DFL) in the YOLO-NAS design enables the model to handle variability in the size and position of target objects, which keeps the model applicable across various scenarios. DFL builds on the focal loss function, which was originally crafted to address class imbalance in object detection tasks. DFL extends the idea to box regression: instead of predicting a single value for each bounding box coordinate, the model predicts a probability distribution over a discretized range of candidate values.

Let’s think about detecting varying types of vehicles in a traffic scene. We have large buses and smaller cars, motorcycles, and bicycles here. The conventional approach might be inaccurate when placing bounding boxes on smaller or larger vehicles. But by employing DFL, YOLO-NAS predicts a spread of bounding box sizes, thereby detecting each vehicle type with enhanced accuracy, regardless of its size.

In this way, DFL lets YOLO-NAS predict a distribution over bounding box values rather than committing to a single point estimate, which yields more precise localization while maintaining YOLO-NAS’s fast performance. Incorporating DFL is one of the ways YOLO-NAS adapts to a variety of object detection tasks.
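
A small numeric sketch may help. It follows the usual formulation of DFL, in which a box-side distance is represented by a softmax over integer bins and the loss pushes probability onto the two bins that bracket the true value. The bin count, logits and target are made-up numbers, and the code is not taken from the YOLO-NAS implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits):
    """Decode a distance as the mean of the predicted distribution over bins 0..n-1."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


def distribution_focal_loss(logits, target):
    """Cross-entropy toward the two integer bins that bracket `target`."""
    probs = softmax(logits)
    left = min(int(math.floor(target)), len(logits) - 2)
    right = left + 1
    weight_left = right - target   # closer bin gets the larger weight
    weight_right = target - left
    return -(weight_left * math.log(probs[left]) + weight_right * math.log(probs[right]))


target = 2.3  # true distance, measured in bin units
good_logits = [-4.0, -4.0, 3.0, 2.0, -4.0]  # mass on bins 2 and 3
bad_logits = [3.0, 2.0, -4.0, -4.0, -4.0]   # mass on bins 0 and 1

print(expected_offset(good_logits), distribution_focal_loss(good_logits, target))
print(expected_offset(bad_logits), distribution_focal_loss(bad_logits, target))
```

The distribution concentrated around bins 2 and 3 decodes to a value close to 2.3 and receives a much lower loss than the distribution concentrated on the wrong bins.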

Finally, the trained YOLO-NAS model is tested on Roboflow’s Roboflow-100 benchmark, a diverse collection that contains over 200,000 annotated images with 829 classes.

Average mAP on Roboflow-100 for YOLO-NAS variants and other YOLO variants

Conclusion

The YOLO-NAS architecture created by Deci’s research and engineering team sets a new state of the art in object detection. Its variants allow the model to be used in low-compute environments such as edge devices while providing real-time, high-accuracy performance across a number of object detection tasks.

Although the training process of the architecture is multi-phased and presumably expensive, the resulting models operate with low latency, provide highly accurate object detection results and enable ease of fine-tuning for more intricate object detection tasks.

The field of computer vision is ever-evolving, and the utilization of automated neural architecture search procedures and algorithms has solidified the presence of AutoML in deep learning research efforts. Advanced techniques introduced by AutoNAC and Deci’s novel neural components provide opportunities for machine learning practitioners and teams to leverage a technology that provides high performance out of the box and enables fine-tuning procedures to be carried out easily by utilising the SuperGradients library.

Where to next?

The team at Deci have put together a Google Colab notebook that introduces the SuperGradients library, provides a brief overview of YOLO-NAS and gives step-by-step instructions on how to perform inference on images and videos using the YOLO-NAS model.
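
For a sense of what that looks like, here is a minimal sketch of loading a pretrained YOLO-NAS model with SuperGradients and running it on one image. It assumes the `super-gradients` package is installed and that the `models.get` and `predict` calls behave as they did at the time of writing; the image path is a placeholder, and the notebook remains the authoritative reference.

```python
MODEL_NAME = "yolo_nas_s"  # small variant; "yolo_nas_m" and "yolo_nas_l" also exist


def run_yolo_nas_inference(image_path, confidence=0.4):
    """Load pretrained YOLO-NAS weights and run detection on one image."""
    # Imported lazily so this file can be read without the package installed.
    from super_gradients.training import models

    model = models.get(MODEL_NAME, pretrained_weights="coco")
    return model.predict(image_path, conf=confidence)


# Example call (requires `pip install super-gradients` and a network
# connection to download the weights); the path is a placeholder:
# predictions = run_yolo_nas_inference("path/to/image.jpg")
# predictions.show()
```

Swapping `MODEL_NAME` for the medium or large variant trades latency for accuracy, in line with the efficiency frontier discussed earlier.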

Thanks for reading