Google's VideoBERT Can Predict What Happens Next in Videos

Humans can easily observe events and anticipate what is likely to happen in the near future, but this predictive behavior has always been difficult for AI. Well, not anymore. Researchers at Google have proposed VideoBERT, a self-supervised system that is able to make predictions from unlabeled videos.

"Speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision," wrote Google researchers in a blog post.

VideoBERT makes use of Google's BERT to learn the details of the video. Notably, BERT (Bidirectional Encoder Representations from Transformers) is the cutting-edge model used by Google for natural language applications.

Google combined image frames with the sentence outputs of automatic speech recognition, converting the frames into visual tokens of 1.5-second duration. These visual tokens are then concatenated with the word tokens. VideoBERT is then used to fill in the missing tokens.
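The token pipeline described above can be sketched in a few lines of Python. This is only a minimal illustration, not Google's implementation: the special tokens (`[CLS]`, `[>]`, `[SEP]`, `[MASK]`), the function names, the visual token IDs such as `v_17`, and the masking probability are all assumptions chosen for the example. It shows how word tokens and visual tokens might be joined into one sequence and how some tokens could be hidden so that a model has something to fill in.

```python
import random

# Special tokens used in this sketch (assumed names, for illustration only).
CLS, SEP, JOIN, MASK = "[CLS]", "[SEP]", "[>]", "[MASK]"


def build_sequence(word_tokens, visual_tokens):
    """Concatenate word tokens and visual tokens into a single sequence."""
    return [CLS] + list(word_tokens) + [JOIN] + list(visual_tokens) + [SEP]


def mask_sequence(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of non-special tokens.

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would be asked to recover.
    """
    rng = random.Random(seed)
    special = {CLS, SEP, JOIN}
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in special and rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


# Hypothetical inputs: words from ASR output and IDs of quantized video clips.
words = ["stir", "the", "flour", "and", "cocoa"]
visual = ["v_17", "v_203", "v_42"]

sequence = build_sequence(words, visual)
masked, targets = mask_sequence(sequence, mask_prob=0.3)
print(masked)
print(targets)
```

In a real system, the masked sequence would be fed to a trained model that predicts the hidden word and visual tokens; here the `targets` dictionary simply records what those predictions would be scored against.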

The blog explains how the researchers trained VideoBERT on over one million instructional videos on cooking, gardening, and vehicle repair. The researchers also verified the outputs of VideoBERT to evaluate the accuracy of the model.

According to the researchers, VideoBERT was able to predict that a bowl of flour and cocoa powder may be baked in an oven and may turn into a brownie or cupcake. The blog post also notes that VideoBERT frequently misses fine-grained visual information such as smaller objects and subtle motions.

"Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation," concluded the researchers.

So, what are your thoughts on VideoBERT? Let us know in the comments.