Date: 2021-12-01



A team of scientists from Google Research, the Alan Turing Institute, and Cambridge University recently unveiled a new state-of-the-art (SOTA) multimodal transformer for AI.

In other words, they’re teaching AI how to hear and see at the same time.

Up front: You’ve probably heard about transformer AI systems such as GPT-3. Basically, they process and classify data from a specific type of media stream.

Under the current SOTA paradigm, analyzing video data requires several AI models running at the same time.

You need one model trained on video and another trained on audio clips. Much as the human eyes and ears are entirely different (yet connected) systems, the algorithms needed to process audio are usually different from those used to process video.

According to the team’s paper:

Despite recent advances across different domains and tasks, current state-of-the-art methods train a separate model with different model parameters for each task at hand. In this work, we present a simple yet effective method of training a single, unified model that achieves competitive or state-of-the-art results for image, video, and audio classification.

Background: What’s amazing here is that the team not only built a multimodal system that can handle its related tasks at the same time, but in doing so also outperformed current SOTA models that focus on a single task.

The researchers dubbed their system PolyViT. And, according to them, it currently has virtually no competition:

By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on five standard video and audio classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient and learns representations that generalize across multiple domains.

In addition, we show that co-training is simple and practical to implement, because we do not need to tune hyperparameters for each combination of datasets and can simply adapt those from standard single-task training.
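To make the "parameter-efficient" claim concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a design in which every task shares one transformer encoder, each modality gets its own small input tokenizer, and each task gets its own classification head; the article itself does not spell out the architecture, so treat that layout as an assumption. Every number and task name below is made up for illustration and is not taken from the paper.

```python
from dataclasses import dataclass

# All sizes below are made-up illustrative numbers, not figures from the paper.
ENCODER_PARAMS = 86_000_000   # hypothetical shared transformer encoder
TOKENIZER_PARAMS = 600_000    # hypothetical per-modality input tokenizer
HIDDEN_DIM = 768              # hypothetical encoder output width


@dataclass(frozen=True)
class Task:
    name: str
    modality: str     # "image", "video", or "audio"
    num_classes: int


def head_params(task: Task) -> int:
    """Parameters in a linear classification head (weights plus biases)."""
    return HIDDEN_DIM * task.num_classes + task.num_classes


def separate_models_params(tasks: list[Task]) -> int:
    """One full model per task: tokenizer + encoder + head, nothing shared."""
    return sum(TOKENIZER_PARAMS + ENCODER_PARAMS + head_params(t) for t in tasks)


def shared_model_params(tasks: list[Task]) -> int:
    """One encoder for everything, one tokenizer per modality, one head per task."""
    modalities = {t.modality for t in tasks}
    return (
        ENCODER_PARAMS
        + TOKENIZER_PARAMS * len(modalities)
        + sum(head_params(t) for t in tasks)
    )


# Hypothetical task list; the names are placeholders, not real datasets.
TASKS = [
    Task("image_task", "image", 1000),
    Task("video_task_a", "video", 400),
    Task("video_task_b", "video", 174),
    Task("audio_task", "audio", 527),
]

if __name__ == "__main__":
    separate = separate_models_params(TASKS)
    shared = shared_model_params(TASKS)
    print(f"separate models: {separate:,} parameters")
    print(f"shared model:    {shared:,} parameters")
    print(f"ratio:           {separate / shared:.2f}x")
```

Under these assumptions the saving comes almost entirely from not duplicating the encoder: each extra task adds only a head (and a tokenizer if it brings a new modality) instead of a whole new model.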

Quick Take: This could be a big deal for the business world. One of the biggest issues facing companies looking to implement an AI stack is compatibility. There are literally hundreds of machine learning solutions out there, and there is no guarantee that they will work together.

The result is either a single-vendor implementation, where IT leaders stick with one supplier for the sake of compatibility, or a mix-and-match approach that brings more headaches than it’s worth.

A paradigm in which multimodal systems become the norm would be a godsend for weary managers.

Of course, this is early research from a preprint paper, so there is no reason to believe we will see it widely implemented any time soon.

But it’s a great step towards a universal classification system, and it’s pretty exciting.

