EdgeTAM: On-Device Track Anything Model
On top of the Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and achieves remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim to make SAM 2 much more efficient, so that it even runs on mobile devices while maintaining comparable performance. Although several works have optimized SAM for better efficiency, we find they are not sufficient for SAM 2, because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries.
Given that video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves performance without any inference overhead. EdgeTAM is evaluated on DAVIS 2017, MOSE, SA-V val, and SA-V test, and runs at 16 FPS on iPhone 15 Pro Max. SAM 2 extends SAM to handle both image and video inputs with a memory bank mechanism, and is trained on a new large-scale, multi-grained video tracking dataset (SA-V). Despite achieving astonishing performance compared with previous video object segmentation (VOS) models and allowing more diverse user prompts, SAM 2, as a server-side foundation model, is not efficient for on-device inference. Latency is benchmarked on an iPhone 15 Pro Max with its CPU and NPU; throughout the paper, we use iPhone and iPhone 15 Pro Max interchangeably for simplicity. Existing works that optimize SAM for better efficiency only consider compressing its image encoder, because the mask decoder is extremely lightweight. This is not sufficient for SAM 2. Specifically, SAM 2 encodes past frames with a memory encoder, and these frame-level memories, together with object-level pointers (obtained from the mask decoder), serve as the memory bank.
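The following is a minimal PyTorch sketch of such a memory bank. The class name, FIFO length, and feature dimensions are illustrative assumptions rather than the actual SAM 2 implementation, but it shows how frame-level memories from the memory encoder and object-level pointers from the mask decoder could be accumulated and gathered for the subsequent memory attention.

```python
# Minimal sketch of a SAM 2-style memory bank (hypothetical names and shapes,
# not the released SAM 2 code).
from collections import deque

import torch


class MemoryBank:
    def __init__(self, max_frames: int = 7, mem_dim: int = 64, ptr_dim: int = 256):
        # Keep only the most recent frames, FIFO-style.
        self.frame_memories = deque(maxlen=max_frames)   # each: (H*W, mem_dim)
        self.object_pointers = deque(maxlen=max_frames)  # each: (1, ptr_dim)

    def add(self, frame_memory: torch.Tensor, object_pointer: torch.Tensor):
        # frame_memory comes from the memory encoder,
        # object_pointer from the mask decoder.
        self.frame_memories.append(frame_memory)
        self.object_pointers.append(object_pointer)

    def gather(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Concatenate all stored tokens; these become the keys/values
        # that the memory attention later attends to.
        mem = torch.cat(list(self.frame_memories), dim=0)    # (N_frames * H*W, mem_dim)
        ptrs = torch.cat(list(self.object_pointers), dim=0)  # (N_frames, ptr_dim)
        return mem, ptrs


# Example: a 64x64 memory map per frame already yields 4096 tokens per frame,
# so several stored frames produce tens of thousands of keys for cross-attention.
bank = MemoryBank()
for _ in range(7):
    bank.add(torch.randn(64 * 64, 64), torch.randn(1, 256))
mem, ptrs = bank.gather()
print(mem.shape, ptrs.shape)  # torch.Size([28672, 64]) torch.Size([7, 256])
```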
These are then fused with the features of the current frame via memory attention blocks. Because these memories are densely encoded, this results in a huge matrix multiplication during the cross-attention between current-frame features and memory features. Therefore, despite containing relatively few parameters compared with the image encoder, the memory attention is not affordable in computational complexity for on-device inference. This hypothesis is further supported by Fig. 2, where reducing the number of memory attention blocks almost linearly cuts down the overall decoding latency, and within each memory attention block, removing the cross-attention gives the most significant speed-up. To make such a video-based tracking model run on device, EdgeTAM exploits the redundancy in videos. In practice, we propose to compress the raw frame-level memories before performing memory attention. We start with naïve spatial pooling and observe a significant performance degradation, especially when using low-capacity backbones.
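To make the cost concrete, here is an illustrative sketch of the cross-attention score matrix that memory attention has to compute, and of the naïve average-pooling baseline mentioned above. The feature-map size, number of memory frames, and pooling factor are assumed values for illustration, not figures taken from the paper.

```python
# Illustrative sketch (not the actual SAM 2 memory attention) of why the
# cross-attention dominates latency, and how naive spatial pooling shrinks it.
import torch
import torch.nn.functional as F

B, H, W, C = 1, 64, 64, 256   # assumed current-frame feature map size
n_frames = 7                  # assumed number of stored memory frames

q = torch.randn(B, H * W, C)               # current-frame queries: 4096 tokens
mem = torch.randn(B, n_frames * H * W, C)  # memory keys/values: 28672 tokens

# The score matrix alone is (4096 x 28672) per head: a huge matmul
# (and a ~470 MB float32 tensor) before softmax and value aggregation.
scores = q @ mem.transpose(1, 2) / C ** 0.5
print(scores.shape)  # torch.Size([1, 4096, 28672])

# Naive compression: 4x4 average pooling of each frame-level memory map cuts
# the number of memory tokens by 16x, but degrades accuracy, especially with
# low-capacity backbones.
mem_2d = mem.view(B * n_frames, H, W, C).permute(0, 3, 1, 2)  # (7, 256, 64, 64)
pooled = F.avg_pool2d(mem_2d, kernel_size=4)                  # (7, 256, 16, 16)
pooled_tokens = pooled.flatten(2).permute(0, 2, 1).reshape(B, -1, C)
print(pooled_tokens.shape)  # torch.Size([1, 1792, 256])
```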
On the other hand, naïvely incorporating a Perceiver also leads to a severe drop in performance. We hypothesize that, as a dense prediction task, video segmentation requires preserving the spatial structure of the memory bank, which a naïve Perceiver discards. Given these observations, we propose a novel lightweight module, named 2D Spatial Perceiver, that compresses frame-level memory feature maps while preserving their 2D spatial structure. Specifically, we split the learnable queries into two groups. One group functions similarly to the original Perceiver: each query performs global attention over the input features and outputs a single vector as a frame-level summarization. In the other group, the queries have 2D priors, i.e., each query is only responsible for compressing a non-overlapping local patch, so the output maintains the spatial structure while reducing the total number of tokens. In addition to the architectural improvement, we further propose a distillation pipeline that transfers the knowledge of the powerful teacher SAM 2 to our student model, which improves accuracy at no cost in inference overhead.
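A minimal sketch of this two-group design follows, assuming illustrative query counts, patch size, feature dimensions, and a standard multi-head attention layer; the released EdgeTAM code may differ in all of these details.

```python
# Sketch of the 2D Spatial Perceiver idea: global queries summarize the whole
# frame, patch queries each compress one non-overlapping local patch.
import torch
import torch.nn as nn


class SpatialPerceiver2D(nn.Module):
    def __init__(self, dim: int = 256, n_global: int = 256, patch: int = 4, heads: int = 8):
        super().__init__()
        self.patch = patch
        # Group 1: global queries attending to the whole frame-level memory map.
        self.global_queries = nn.Parameter(torch.randn(n_global, dim))
        # Group 2: a single learnable query reused per patch (its 2D prior comes
        # from restricting attention to that patch).
        self.patch_query = nn.Parameter(torch.randn(1, dim))
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.patch_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: (B, C, H, W) frame-level memory from the memory encoder.
        B, C, H, W = mem.shape
        tokens = mem.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Global group: each query attends over all tokens, yielding one vector.
        g = self.global_queries.unsqueeze(0).expand(B, -1, -1)
        g, _ = self.global_attn(g, tokens, tokens)  # (B, n_global, C)

        # Patch group: attention is restricted to each non-overlapping patch,
        # so the compressed output keeps an (H/p, W/p) spatial layout.
        p = self.patch
        patches = mem.unfold(2, p, p).unfold(3, p, p)             # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 4, 5, 1).reshape(
            B * (H // p) * (W // p), p * p, C)
        pq = self.patch_query.unsqueeze(0).expand(patches.shape[0], -1, -1)
        local, _ = self.patch_attn(pq, patches, patches)          # (B*(H/p)*(W/p), 1, C)
        local = local.reshape(B, (H // p) * (W // p), C)          # spatial structure kept

        return torch.cat([g, local], dim=1)  # far fewer tokens than H*W


compressed = SpatialPerceiver2D()(torch.randn(1, 256, 64, 64))
print(compressed.shape)  # torch.Size([1, 512, 256]) vs. 4096 raw tokens per frame
```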
We find that, in both stages, aligning the features from the image encoders of the original SAM 2 and our efficient variant benefits performance. In addition, in the second stage we align the feature output of the memory attention between the teacher SAM 2 and our student model, so that the memory-related modules, and not only the image encoder, also receive supervision signals from the SAM 2 teacher. This improves the results on SA-V val and test by 1.3 and 3.3, respectively. Putting everything together, we propose EdgeTAM (Track Anything Model for Edge devices), which adopts a 2D Spatial Perceiver for efficiency and knowledge distillation for accuracy. Through a comprehensive benchmark, we show that the latency bottleneck lies in the memory attention module. Given this latency analysis, we propose a 2D Spatial Perceiver that significantly cuts down the memory attention computational cost with comparable performance, and it can be integrated with any SAM 2 variant. We experiment with a distillation pipeline that performs feature-wise alignment with the original SAM 2 in both the image and video segmentation stages, and observe performance improvements without any additional cost during inference.
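As a rough sketch of the distillation objective described above, the function below assumes an MSE feature-alignment loss and hypothetical `image_encoder`/`memory_attention` interfaces on the teacher and student models; the paper only specifies feature-wise alignment, not the exact loss or module signatures.

```python
# Sketch of two-stage feature-alignment distillation from a frozen SAM 2 teacher.
import torch
import torch.nn.functional as F


def distill_losses(student, teacher, frames, memory, stage: int = 2) -> torch.Tensor:
    """Align student features with the frozen SAM 2 teacher (hypothetical interfaces)."""
    with torch.no_grad():
        t_img = teacher.image_encoder(frames)   # teacher image-encoder features
    s_img = student.image_encoder(frames)       # student image-encoder features
    # Both stages: align image-encoder features (MSE is an assumed choice).
    loss = F.mse_loss(s_img, t_img)

    if stage == 2:
        # Second (video) stage only: also align memory-attention outputs, so the
        # memory-related modules receive supervision from the teacher as well.
        with torch.no_grad():
            t_mem = teacher.memory_attention(t_img, memory)
        s_mem = student.memory_attention(s_img, memory)
        loss = loss + F.mse_loss(s_mem, t_mem)
    return loss
```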