Transformer-Powered Object Detection for Smart and Real-Time Computing Platforms
Keywords:
Transformer-based object detection, Vision Transformer (ViT), real-time inference, smart computing platforms, edge AI, embedded systems, lightweight transformer models, deep learning, object recognition, attention mechanism, quantization-aware training (QAT), resource-constrained environments.

Abstract
This paper presents a high-performance Transformer-based object detection framework tailored to smart, real-time, edge, and embedded computing platforms. As visual recognition workloads increasingly demand high accuracy in latency-sensitive applications such as autonomous systems, intelligent surveillance, augmented reality, and industrial automation, existing CNN-based detectors struggle with scalability, computational efficiency, and context modeling. To address these difficulties, we adopt a simple yet powerful architecture built on a Vision Transformer (ViT) backbone, whose self-attention layers capture long-range dependencies and encode global features. The proposed model combines patch embedding, multi-head self-attention, and an anchor-free detection head with quantization-aware training (QAT) and knowledge distillation to reduce model size and improve deployment performance. Trained on the COCO and PASCAL VOC benchmarks, the framework outperforms state-of-the-art CNN-based detectors such as YOLOv5 and Faster R-CNN in mean average precision (mAP), inference latency, and model size. Our results indicate that the approach reaches above 94 percent mAP and an inference rate greater than 30 FPS on edge devices such as the NVIDIA Jetson Nano, with a much smaller computational footprint. The architecture delivers real-time object detection with low CPU load, power consumption, and memory overhead, making it well suited to resource-constrained smart environments. Moreover, an extensive ablation study confirms the contribution of Transformer-specific components and optimization techniques to detection performance. This work supports the deployment of high-performance object detection under real-world constraints such as low-power real-time systems, and paves the way for broader use of Transformer models beyond cloud settings, enabling efficient, scalable, and intelligent visual perception in next-generation computing environments.
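For concreteness, the sketch below illustrates the kind of pipeline the abstract describes: a patch-embedding stage, multi-head self-attention encoder blocks, and an anchor-free detection head. It is a minimal, hypothetical PyTorch sketch, not the paper's actual implementation; all dimensions, layer counts, and the FCOS-style box parameterization are assumptions, and positional embeddings, QAT, and knowledge distillation are omitted for brevity.

```python
# Illustrative sketch of a ViT-backbone, anchor-free detector (not the authors' exact model).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to embed_dim."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                  # x: (B, 3, H, W)
        x = self.proj(x)                                   # (B, C, H/ps, W/ps)
        b, c, h, w = x.shape
        return x.flatten(2).transpose(1, 2), (h, w)        # tokens: (B, N, C)

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block with multi-head self-attention."""
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
                                 nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)                   # global (long-range) token mixing
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

class ViTDetector(nn.Module):
    """ViT backbone plus anchor-free head: per-token class scores, box offsets, center-ness."""
    def __init__(self, num_classes=80, dim=256, depth=6, num_heads=8):
        super().__init__()
        self.patch_embed = PatchEmbed(embed_dim=dim)
        self.blocks = nn.ModuleList([EncoderBlock(dim, num_heads) for _ in range(depth)])
        self.cls_head = nn.Linear(dim, num_classes)        # per-location classification
        self.box_head = nn.Linear(dim, 4)                  # l, t, r, b distances (FCOS-style)
        self.ctr_head = nn.Linear(dim, 1)                  # center-ness / objectness

    def forward(self, x):
        tokens, _ = self.patch_embed(x)
        for blk in self.blocks:
            tokens = blk(tokens)
        return self.cls_head(tokens), self.box_head(tokens), self.ctr_head(tokens)

if __name__ == "__main__":
    model = ViTDetector()
    cls, box, ctr = model(torch.randn(1, 3, 224, 224))
    print(cls.shape, box.shape, ctr.shape)                 # (1, 196, 80) (1, 196, 4) (1, 196, 1)
```

In practice, quantization-aware training and knowledge distillation would be layered on top of such a backbone during training to shrink the deployed model, as the abstract describes.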