Real-Time Object Detection with Transformer-Based Model for Next-Gen Computing Applications
Keywords:
Real-time object detection, Vision Transformer, Edge AI, Lightweight deep learning, Embedded inference, Self-attention mechanism, Next-gen computing, ONNX optimization, Smart surveillance, Transformer-based vision models

Abstract
Real-time object detection is an essential component of many next-generation computing applications, including autonomous driving, intelligent surveillance systems, robotics, and augmented/virtual reality interfaces. Convolutional neural networks (CNNs) such as YOLO and SSD have achieved notable detection speed and accuracy, but their reliance on local receptive fields and limited global context prevents them from effectively modeling complex object relations, particularly in cluttered or dynamic scenes. Recently, vision transformers (ViTs) have emerged as an effective alternative owing to their ability to capture long-range dependencies through self-attention. However, conventional transformer-based models incur high computational overhead and are difficult to deploy on edge and resource-constrained devices. To address this limitation, we introduce TransDetect, a transformer-based object detection framework optimized for efficiency, low inference latency, and compatibility with edge AI platforms. TransDetect employs lightweight convolutional tokenization, hierarchical multi-head self-attention, and context-aware decoding layers to balance detection accuracy against computational cost. The model is further optimized through quantization-aware training, structural pruning, and ONNX-based deployment, enabling it to meet real-time requirements on devices such as the Nvidia Jetson Nano and Raspberry Pi 4. Extensive experiments on the widely used MS COCO and Pascal VOC datasets show that TransDetect achieves 81.3% mean Average Precision (mAP@0.5), exceeding comparable lightweight CNN-based models while maintaining an inference latency below 30 milliseconds. These findings highlight the potential of transformer-based architectures for real-time vision tasks under constrained conditions.
The proposed model is scalable and deployable for edge inference, enabling intelligent computing systems to perform robust object detection with minimal hardware and energy consumption.
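To make the core mechanism concrete, the scaled dot-product self-attention that underlies transformer-based detectors like the one summarized above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration only; the dimensions, random weights, and function name are illustrative assumptions, not part of the TransDetect implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # Scaled dot-product self-attention over a sequence of tokens.
    # q, k, v: query/key/value projections of the input tokens.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ v                                # context-aware token features

# Toy example (illustrative sizes): 4 image tokens with 8-dim embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Because every token attends to every other token, the output at each position aggregates global context, which is the property the abstract contrasts against the local receptive fields of CNNs.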