Predictive video captioning for road driving scenes

Humans have a remarkable ability to anticipate the future from current events. Given a short video of a scene, a person can easily predict and describe what will most likely happen next. For example, after observing that "the red traffic light has turned on", a likely description of the next event is "car brake lights will turn on, cars will slow to a stop, and pedestrians may cross the street".

Being able to predict the future is crucial to improving road safety and paving the way for autonomous driving. The problem is challenging because the algorithm must understand the semantic content of the current scene frames, model the spatio-temporal interactions of the objects, accurately predict the most likely next events in the video, and then generate a sentence describing the next scene.

Predictive frame captioning is related to several lines of research in computer vision, including image/video captioning, future prediction, and moment/event detection in videos. There has been a significant amount of work on captioning in the video domain. The method in [1] uses LSTM units in an encoder-decoder fashion for the video captioning task: it extracts both appearance and optical flow features from the frames, then feeds them through the proposed model to generate captions. Inspired by the success of self-attention [2] and transformer networks, [3,4] propose end-to-end dense video captioning systems that establish a more explicit relationship between the visual and textual modalities.

Other approaches aim to predict the future at the pixel level. Using generative adversarial networks, [5] establishes a framework that predicts the next frame(s) directly in RGB space from a prior video sequence. Other works [6,7] predict the semantic or instance segmentation of future frames. [8] was the first to attempt to predict the future caption of a video, by first predicting the next event in the semantic space of convolutional features, then fusing contextual information into those features, and finally feeding them to a captioning module.
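The three-stage data flow described for [8] (predict future features, fuse context, caption) can be illustrated with a toy sketch. This is not the model from [8]: every function name, the two-dimensional "features", the linear extrapolation, the averaging-based fusion, and the nearest-prototype "captioner" are hypothetical stand-ins, chosen only to show how the stages connect. A real system would replace each of them with a learned network.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def predict_next_features(past: Sequence[Vector]) -> Vector:
    """Stage 1 (stand-in): linearly extrapolate from the last two feature vectors."""
    if len(past) < 2:
        raise ValueError("need at least two past feature vectors")
    prev, last = past[-2], past[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def fuse_context(predicted: Vector, past: Sequence[Vector], weight: float = 0.5) -> Vector:
    """Stage 2 (stand-in): blend the prediction with the mean of the observed features."""
    n = len(past)
    context = [sum(column) / n for column in zip(*past)]
    return [weight * p + (1.0 - weight) * c for p, c in zip(predicted, context)]


def caption_from_features(features: Vector, prototypes: Dict[str, Vector]) -> str:
    """Stage 3 (stand-in): return the caption whose prototype vector is closest."""
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda cap: squared_distance(features, prototypes[cap]))


def predict_future_caption(past: Sequence[Vector], prototypes: Dict[str, Vector],
                           weight: float = 0.5) -> str:
    """Chain the three stages: predict features, fuse context, generate a caption."""
    predicted = predict_next_features(past)
    fused = fuse_context(predicted, past, weight)
    return caption_from_features(fused, prototypes)


# Hypothetical per-frame features: [normalized speed, brake-light intensity].
past_frames = [[1.0, 0.0], [0.6, 0.4], [0.2, 0.8]]
caption_prototypes = {
    "cars slow to a stop": [0.0, 1.0],
    "cars keep driving": [1.0, 0.0],
}
print(predict_future_caption(past_frames, caption_prototypes))
```

In this toy setting the speed feature falls and the brake-light feature rises across the observed frames, so the extrapolated and fused vector lies closer to the "cars slow to a stop" prototype. The point of the sketch is the separation of concerns: the captioning stage only ever sees predicted features, never the future frames themselves.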



[1] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4534–4542. IEEE, 2015.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008. Curran Associates, Inc., 2017.

[3] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8739–8748. IEEE, 2018.

[4] Linchao Zhu and Yi Yang. ActBERT: Learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[5] Sandra Aigner and Marco Körner. FutureGAN: Anticipating the future frames of video sequences using spatio-temporal 3D convolutions in progressively growing autoencoder GANs. arXiv preprint arXiv:1810.01325, 2018.

[6] Pauline Luc, Camille Couprie, Yann LeCun, and Jakob Verbeek. Predicting future instance segmentation by forecasting convolutional features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 584–599. Springer, 2018.

[7] Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 648–657. IEEE, 2017.

[8] Mehrdad Hosseinzadeh and Yang Wang. Video captioning of future frames. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.