Video-to-Image Casting: A Flatting Method for Video Analysis

Publisher:
ACM
Publication Type:
Conference Proceeding
Citation:
MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4958-4966
Issue Date:
2021-10-17
Filename:
3474085.3475424.pdf
Description:
Published version
Size:
6.38 MB
Format:
Adobe PDF
Previous mainstream video analysis methods, especially 3D CNN-based models, mainly aim to transfer frameworks from the image domain to the video domain, and they follow the regime that has succeeded in image processing, i.e., large-scale benchmarks and deep networks. However, processing videos is still time-consuming due to the increased computational cost. In this paper, we propose to flatten the video and construct a Spatio-temporal Image (STI), i.e., to squeeze the temporal dimension into a spatial plane. To pursue video-level modeling and an efficient architecture, we devise a Collective Convolution (CoConv) operation to replace the 2D convolution. With the holistic sampling strategy, this novel operation can extract a video-level spatio-temporal representation. Moreover, we ensure that each CoConv operation has the same number of parameters as the original 2D filter, so a 2D network equipped with CoConv can analyze videos without additional computation. To verify the effectiveness of our method for general video analysis, we evaluate it on three typical tasks, i.e., supervised action recognition, self-supervised action recognition, and dynamic texture recognition. Extensive experimental results show that our method achieves comparable or state-of-the-art performance on these benchmarks while using far less computation than its 3D counterpart.
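As a rough illustration of what "squeezing the temporal dimension into a spatial plane" could look like, the sketch below tiles the frames of a clip into a single large image using NumPy. The row-major grid layout, the function name `flatten_video_to_sti`, and the `grid_cols` parameter are assumptions made for this example; the abstract does not specify how the STI is actually arranged, and the CoConv operation itself is not reproduced here.

```python
import numpy as np


def flatten_video_to_sti(video: np.ndarray, grid_cols: int) -> np.ndarray:
    """Tile T frames of shape (H, W, C) into one (rows*H, grid_cols*W, C) image.

    Hypothetical helper: frames are placed in row-major order, which is an
    assumed layout, not necessarily the one used in the paper.
    """
    t, h, w, c = video.shape
    if t % grid_cols != 0:
        raise ValueError("number of frames must be divisible by grid_cols")
    rows = t // grid_cols
    # (T, H, W, C) -> (rows, cols, H, W, C)
    grid = video.reshape(rows, grid_cols, h, w, c)
    # Bring each frame's height next to its grid row and its width next to
    # its grid column: (rows, H, cols, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Merge the paired axes into one spatial plane.
    return grid.reshape(rows * h, grid_cols * w, c)


# A 16-frame clip of 112x112 RGB frames becomes one 448x448 RGB image.
video = np.random.rand(16, 112, 112, 3).astype(np.float32)
sti = flatten_video_to_sti(video, grid_cols=4)
print(sti.shape)  # (448, 448, 3)
```

Under this assumed layout, frame `k` occupies the tile at grid row `k // grid_cols` and grid column `k % grid_cols`, so an ordinary 2D network can consume the whole clip as a single input image.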