IoT and edge devices, capable of capturing data from their surroundings, are becoming increasingly popular. However, onboard analysis of the acquired data is usually limited by their computational capabilities. Consequently, the most recent and accurate deep learning models, such as Vision Transformers (ViTs) and their hybrid (hViT) variants, are typically too cumbersome for onboard inference. The purpose of this work is therefore to analyze and investigate the impact of efficient ViT methodologies on the monocular depth estimation (MDE) task, which computes a depth map from a single RGB image and is a critical capability for autonomous and robotic systems that must perceive their surrounding environment. More specifically, this work leverages recent solutions designed to reduce the computational cost of self-attention, the fundamental building block of ViTs, and applies them to METER, a lightweight architecture designed for the MDE task that leaves room for further optimization. The proposed efficient variants, namely Meta-METER and Pyra-METER, achieve average speedups of 41.4% and 34.4%, respectively, across a variety of edge devices compared with the original model, while incurring only limited degradation of the estimation capabilities when tested on the indoor NYU dataset.
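The abstract does not spell out which efficient attention mechanisms are used, but the names Meta-METER and Pyra-METER suggest the MetaFormer/PoolFormer pooling token mixer and the Pyramid Vision Transformer's spatial-reduction attention. The PyTorch sketch below is purely illustrative under that assumption; the class names, hyperparameters (pool_size, num_heads, sr_ratio), and tensor shapes are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class PoolingTokenMixer(nn.Module):
    """Parameter-free token mixer in the spirit of MetaFormer/PoolFormer:
    local average pooling replaces self-attention, so token mixing cost
    grows linearly with the number of tokens instead of quadratically."""

    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map.
        # Subtracting the input keeps the block a residual refinement,
        # matching the PoolFormer formulation.
        return self.pool(x) - x


class SpatialReductionAttention(nn.Module):
    """PVT-style attention: keys/values are computed on a spatially
    downsampled grid, cutting cost from O(N^2) to O(N * N / sr_ratio^2)."""

    def __init__(self, dim: int, num_heads: int = 2, sr_ratio: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Strided conv shrinks the key/value token grid by sr_ratio per axis.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, n_tokens, dim) with n_tokens == h * w.
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, c // self.num_heads)
        q = q.transpose(1, 2)                        # (b, heads, n, head_dim)

        x_ = x.transpose(1, 2).reshape(b, c, h, w)   # back to a 2D grid
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)
        x_ = self.norm(x_)                           # (b, n / sr^2, dim)

        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)             # each (b, heads, n', head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)               # hypothetical encoder features
    print(PoolingTokenMixer()(feats).shape)          # torch.Size([1, 64, 32, 32])

    tokens = feats.flatten(2).transpose(1, 2)        # (1, 1024, 64)
    sra = SpatialReductionAttention(dim=64)
    print(sra(tokens, h=32, w=32).shape)             # torch.Size([1, 1024, 64])
```

Both modules attack the quadratic cost of vanilla self-attention over N tokens: the pooling mixer is linear in N and parameter-free, while spatial-reduction attention shrinks the key/value set by a factor of sr_ratio squared before the dot product. Swaps of this kind are what make on-device speedups such as those reported above plausible.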
Publication detail
2024, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14365, pp. 383-394
Optimize Vision Transformer Architecture via Efficient Attention Modules: A Study on the Monocular Depth Estimation Task (04b Conference paper in volume)
Schiavella C., Cirillo L., Papa L., Russo P., Amerini I.
ISBN: 9783031510229; 9783031510236