Do We Really Need Frame-by-Frame Annotation Datasets for Object Tracking?

Publisher:
ACM
Publication Type:
Conference Proceeding
Citation:
MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4949-4957
Issue Date:
2021-10-17
There has been increasing emphasis on building large-scale datasets as the driver of deep learning-based trackers' success. However, accurately annotating tracking data is highly labor-intensive and expensive, making it infeasible for many real-world applications. In this study, we investigate whether large-scale training data is necessary to ensure tracking algorithms' performance. To this end, we introduce the FAT (Few-Annotation Tracking) benchmark, constructed by sampling one or a few frames per video from existing tracking datasets. The proposed benchmark can be used to evaluate both the data efficiency of tracking algorithms and new data augmentation approaches for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy that enables learning high-performing trackers from small-scale datasets. AMMC first cuts out the tracked targets and applies a sequence of transformations to simulate the appearance changes caused by object motion. The transformed targets are then pasted onto the inpainted background images, and the composites are further jointly augmented to mimic the variability caused by camera motion. Compared with standard augmentation methods, AMMC explicitly accounts for the characteristics of tracking data and thus synthesizes more valid training data for object tracking. We extensively evaluate our approach with two popular trackers on the FAT datasets. Experiments show that our method allows these trackers, even when trained on a dataset requiring far less annotation, to achieve performance comparable or even superior to that obtained on the fully annotated dataset. These results imply that complete video annotation may not be necessary for object tracking when motion-driven data augmentations are leveraged during training.
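To make the described pipeline concrete, below is a minimal sketch of one AMMC-style augmentation step, assuming OpenCV and NumPy. The function name ammc_augment, the choice of transformations (scale, rotation, translation), and all parameter ranges are illustrative assumptions, not the authors' implementation; the paper's method may segment targets, use a learned inpainter, and apply a richer transformation set.

```python
import numpy as np
import cv2


def ammc_augment(frame, bbox, rng=None):
    """Sketch of an AMMC-style augmentation on a single annotated frame.

    frame: HxWx3 uint8 image; bbox: (x, y, w, h) target box.
    Returns an augmented frame and the transformed bbox.
    Parameter ranges below are illustrative, not from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    x, y, w, h = bbox

    # 1. Cut out the tracked target (a rectangular crop stands in
    #    for a properly segmented target here).
    target = frame[y:y + h, x:x + w].copy()

    # 2. Inpaint the hole left behind to obtain a clean background.
    mask = np.zeros(frame.shape[:2], np.uint8)
    mask[y:y + h, x:x + w] = 255
    background = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)

    # 3. Transform the target to mimic appearance change caused by
    #    object motion (scale and in-plane rotation in this sketch).
    scale = rng.uniform(0.8, 1.25)
    angle = rng.uniform(-15, 15)
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    target = cv2.resize(target, (new_w, new_h))
    M = cv2.getRotationMatrix2D((new_w / 2, new_h / 2), angle, 1.0)
    target = cv2.warpAffine(target, M, (new_w, new_h))

    # 4. Paste the transformed target at a shifted location,
    #    simulating object displacement between frames.
    H, W = background.shape[:2]
    nx = int(np.clip(x + rng.integers(-w, w + 1), 0, max(0, W - new_w)))
    ny = int(np.clip(y + rng.integers(-h, h + 1), 0, max(0, H - new_h)))
    out = background.copy()
    out[ny:ny + new_h, nx:nx + new_w] = target

    # 5. Jointly augment the composite to mimic camera motion
    #    (a small global translation here).
    dx, dy = rng.integers(-10, 11, size=2)
    T = np.float32([[1, 0, dx], [0, 1, dy]])
    out = cv2.warpAffine(out, T, (W, H))

    return out, (nx + int(dx), ny + int(dy), new_w, new_h)
```

Because the target and its box are re-synthesized together, each annotated frame can yield many distinct training pairs, which is what lets a few-annotation dataset stand in for dense frame-by-frame labels.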