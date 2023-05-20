



Last week, at Google’s annual conference on new products and technologies, the company announced a change to its flagship AI offering: The Bard chatbot, like OpenAI’s GPT-4, will soon be able to describe images. Although it may seem like a minor update, the enhancement is part of a quiet revolution in how businesses, researchers, and consumers develop and use AI. It pushes the technology not only beyond remixing the written word and into various media, but toward the loftier goal of a rich and deeper understanding of the world. ChatGPT is six months old, and it is already starting to look outdated.

That program and its relatives, known as large language models, mimic intelligence by predicting which words are statistically likely to follow one another in a sentence. Researchers have trained these models on ever more text, on the premise that force-feeding machines more words in different configurations will lead to better predictions and smarter programs. This text-maximalist approach to AI development has been dominant for years, especially among the most public-facing corporate products.

But language-only models like the original ChatGPT are now being superseded by machines that can process images, sounds, and even sensory data from robots. The new approach, which may reflect a more human understanding of intelligence, is an early attempt to approximate how children exist in the world and learn by observing it. It could also enable companies to build AI that can do more and package it into more products.

GPT-4 and Bard aren’t the only programs with these expanded capabilities. Also last week, Meta released a program called ImageBind that processes text, images, audio, and information about depth, infrared radiation, motion, and position. Google’s recent PaLM-E was trained on both language and robot sensory data, and the company has heralded new, more powerful models that move beyond text. Microsoft has its own model trained on words and images. Text-to-image generators such as DALL-E 2, which took the internet by storm last summer, are trained on captioned images.

These are known as multimodal models (text is one modality, images another), and many researchers hope they will take AI to new heights. The grandest future is one in which AI isn’t limited to writing boilerplate essays or helping people on Slack; it would be able to search the internet without making things up, animate a video, guide a robot, or create a website on its own (as GPT-4 did in a demonstration, based on a rough concept sketched by a human).

A multimodal approach could theoretically solve a central problem of language-only models: Even if they can string words together fluently, they have trouble connecting those words to concepts, ideas, objects, or events. When such models talk about a traffic jam, they have no experience of traffic jams beyond what they have associated with the term from other pieces of language, Melanie Mitchell, an AI researcher and cognitive scientist at the Santa Fe Institute, told me. But if an AI’s training data could include videos of traffic jams, there is much more information the model could glean. Learning from more types of data could enable AI models to envision and interact with physical environments, develop something approaching common sense, and even address problems with fabrication. If a model understood the world, it might be less likely to invent things about it.

The push for multimodal models is nothing new. Google, Facebook, and other companies introduced automated image-captioning systems nearly a decade ago. But a few key changes in AI research have made cross-domain approaches more feasible and promising in recent years, Jing Yu Koh, who studies multimodal AI at Carnegie Mellon University, told me. For decades, areas of computer science such as natural language processing, computer vision, and robotics used completely different techniques, but now they all use a programming technique called deep learning. As a result, their code and approaches have become more similar, making it easier to integrate their models with one another. And internet giants such as Google and Facebook have been curating increasingly large data sets of images and videos, while computers have become powerful enough to handle them.

There are also practical reasons for the change. No matter how incomprehensibly large the internet may seem, there is a finite amount of text available for training AI. And there are practical limits to how big and unwieldy these programs can get and how much computing power they can use, Daniel Fried, a computer scientist at Carnegie Mellon University, said. Researchers are beginning to move beyond text in the hope of making their models more capable with the data they can collect. Indeed, Sam Altman, OpenAI’s CEO and, thanks in part to this week’s Senate testimony, a poster boy of sorts for the industry, has said that the era of scaling text-based models is likely over, just a few months after ChatGPT was reported to be the fastest-growing consumer app in history.

How much better multimodal AI will understand the world than ChatGPT does, and how much more fluent its language will be, if at all, is up for debate. Many of these programs perform better than language-only programs, especially on tasks involving images and 3-D scenarios, such as describing pictures and envisioning the outcome of a sentence, but in other areas they have not been as impressive. In the technical report accompanying GPT-4, OpenAI researchers reported that adding vision barely improved performance on standardized tests. The model also continues to hallucinate, confidently making false statements that are absurd, subtly wrong, or overtly demeaning. Google’s PaLM-E actually performed worse on language tasks than the language-only PaLM model, perhaps because adding the robot’s sensory information came at the cost of losing some language in its training data and abilities. Still, such research is in its early stages and may improve over the next few years, Fried said.

We remain far from anything that would truly emulate how people think. These models are unlikely to reach human-level intelligence given the kinds of architectures currently in use, Mitchell told me. Even if a program such as Meta’s ImageBind can process images and sound, humans also learn by interacting with other people, have long-term memory, grow from experience, and are the product of millions of years of evolution, to name only a few of the ways in which artificial and organic intelligence do not match.

And just as throwing more textual data at AI models didn’t solve long-standing problems of bias and fabrication, throwing more kinds of data at the machines won’t necessarily solve them either. A program that ingests biased images as well as biased text will still produce harmful outputs, only across more media. For example, text-to-image models such as Stable Diffusion have been shown to perpetuate racist and sexist biases, such as associating Black faces with the word thug. Opaque infrastructure and training data sets make the software difficult to regulate and audit. And as AI needs to collect more types of data, the potential for labor and copyright violations may grow.

Multimodal AI might be even more susceptible to certain kinds of manipulation, such as altering key pixels in an image, than models proficient only in language, Mitchell said. Some form of fabrication is likely to continue, and it may be even more convincing and dangerous because the hallucinations will be visual: Imagine AI conjuring a scandal on the scale of the fake images of Donald Trump being arrested. Koh said he doesn’t think multimodality is a silver bullet for many of these problems.

Intelligence aside, multimodal AI could simply be a better business proposition. Language models are already a gold rush for Silicon Valley. Before the corporate boom in multimodality, OpenAI reportedly expected to hit $1 billion in revenue by 2024. Several recent analyses project that ChatGPT will add tens of billions of dollars to Microsoft’s annual revenue within a few years.

Going multimodal could be like searching for El Dorado. Such programs will simply offer customers more than the plain, text-only ChatGPT does, such as describing images and videos, interpreting and even producing diagrams, and serving as more useful personal assistants. Multimodal AI could help consultants and venture capitalists create better slide decks, improve the existing but unreliable software that describes images and surroundings for blind people, speed up the processing of cumbersome electronic medical records, and guide us along streets not as a map would but by observing the buildings around us.

Applications in robotics, self-driving cars, medicine, and more are easy to envision even if they are never realized, like a golden city that justifies its conquest even if it turns out to be mythical. Multimodality doesn’t need to produce clearly more intelligent machines to take hold. It just needs to produce machines that are more clearly profitable.

