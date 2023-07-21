



Google Bard accepts images in prompts

Google’s large language model (LLM) chatbot Bard recently gained the ability to accept image prompts, making it multimodal. This is comparable to a similar feature recently released in Microsoft’s Bing Chat, which uses OpenAI’s GPT-4.

A review of Bing’s multimodality concluded that Bing is excellent at recognizing image context and content, captioning, and classification, but lacks the ability to perform task-specific object localization and detection.

In this article, we’ll examine how Bard’s image input performs, how it compares to GPT-4, and how it’s supposed to work.

Testing the Bard image feature

Using the same tests that were run on Bing Chat, we asked Bard questions about images from three different datasets on Roboflow Universe to assess its performance.

> Read our review of Bing/GPT-4’s multimodal capabilities for more information on how the experiment was conducted.

Count people with Google Bard

In this task, we asked Bard to count the number of people present in an image from the Hard Hat Workers dataset to assess its performance on counting. Unfortunately, Bard was unable to count people in images.

This highlights a notable difference between Bard and Bing in how each handles images of people. Both go to great lengths to prevent human faces from being used as inputs to their models: Bing selectively blurs faces, while Bard rejects outright any input image containing a human face.

Google’s caution about responding to images of people also hinders Bard’s usability to some extent. Bard rejects not only images in which people are the primary subject, but also attempts to reject images in which people are merely present, greatly limiting the number of images that can be used.

Count objects with Bard

In this task, we used the apples dataset and asked Bard to count the number of apples present in the image. We extended this to three different prompts of increasing difficulty to assess Bard’s quantitative and qualitative reasoning skills, as well as its ability to format data in a structured way.

Bard was able to complete this task, but the results were not impressive.

Bard had great difficulty reporting the number of objects in an image, and the difficulty grew when it was asked to structure the data or sort it by qualitative features.
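The article does not describe how counting answers were scored. As a rough sketch only, assuming each reply is free text and the true count comes from the dataset annotations, a reply could be reduced to the first number it contains and compared with the annotation. The helper names `extract_count` and `count_error` are hypothetical and not part of the original experiment.

```python
import re
from typing import Optional


def extract_count(reply: str) -> Optional[int]:
    """Return the first integer written with digits in a chatbot reply, or None."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None


def count_error(reply: str, true_count: int) -> Optional[int]:
    """Absolute difference between the stated count and the annotated count."""
    predicted = extract_count(reply)
    if predicted is None:
        return None  # the model refused or answered without a number
    return abs(predicted - true_count)
```

A reply that contains no number, such as a refusal, yields `None` rather than an error of zero, so refusals can be reported separately from wrong counts.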

Can Bard understand images from ImageNet?

In this task, we presented Bard with a set of images from ImageNet, an image classification benchmark dataset, and asked it to label them.

Labels that match exactly receive a score of 100%, and labels that do not match exactly receive a semantic similarity score between 0 and 100%.
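The scoring rule can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the article does not say which semantic similarity measure was applied, so the `similarity` argument is left pluggable, and the default `token_overlap` is only a word-overlap placeholder. Treating matches as case-insensitive is also an assumption.

```python
from typing import Callable


def token_overlap(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]: Jaccard overlap of lowercase words.

    This is NOT semantic similarity; swap in an embedding-based measure.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def score_label(
    predicted: str,
    expected: str,
    similarity: Callable[[str, str], float] = token_overlap,
) -> float:
    """Score one label as a percentage: 100 for an exact match, else similarity."""
    # Assumption: matching ignores case and surrounding whitespace.
    if predicted.strip().lower() == expected.strip().lower():
        return 100.0
    # Clamp so a custom similarity function cannot leave the 0-100 range.
    return 100.0 * max(0.0, min(1.0, similarity(predicted, expected)))
```

Keeping the similarity function as a parameter means the same scoring loop can be reused with whatever embedding model is at hand.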

In this regard, Bard performed incredibly well, achieving an average of 92.8%, 5 perfect matches and low variability, demonstrating its ability to consistently and accurately detect and convey image content. We haven’t tested Bard on the full dataset, but the performance here is quite impressive when compared to the results of state-of-the-art models.
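The three summary figures mentioned here (average score, number of perfect matches, and variability) can be computed from a list of per-image scores as in the sketch below. The function name `summarize` is hypothetical, standard deviation is an assumed choice of variability measure, and the scores in the usage note are made up for illustration; they are not the article's results.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Summarize per-image scores given as percentages in the range 0-100."""
    return {
        "mean": mean(scores),
        "perfect_matches": sum(1 for s in scores if s == 100.0),
        # Population standard deviation, used here as the variability measure.
        "std_dev": pstdev(scores),
    }
```

For example, `summarize([100.0, 100.0, 80.0, 90.0])` reports a mean of 92.5 with two perfect matches.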

ImageNet test results graph

How Bard and Bing/GPT-4 compare

After previously running the same test with Bing Chat powered by GPT-4, we compiled and compared the performance of both LLMs.

One notable comparison is between Bing and Bard in the object counting task. Bard was able to complete some of the assigned tasks, but consistently performed poorly, both overall and compared to Bing. Unlike Bing, Bard struggled more when tasked with structuring data and classifying counts based on qualitative characteristics.

On the other hand, on the ImageNet classification/captioning task, Bard performed slightly better, outperforming Bing by 6.29%. Nevertheless, Bard performed worse than Bing overall, even excluding the failed people counting task.

Thoughts on how Bard works

After conducting our tests, we looked into how the feature works and formed some hypotheses.

As Google stated in its release notes, Bard’s new image input feature isn’t strictly a single multimodal model. Rather, it is based on Google Lens, which uses multiple Google features in combination and integrates with many Google services, such as Search, Translate, and Shopping.

Although unconfirmed, it is believed that the Google Cloud Vision API is used. This API offers many of the same capabilities as Google Lens, such as highly accurate OCR, the ability to identify image content and context, and the ability to extract and label text based on image content.

Example images from the apple dataset used in Google’s Cloud Vision

As can be seen in the sample images, Cloud Vision appears to recognize 1 apple, 1 fruit, 1 container, and 1 basket, which could partially explain Bard’s inaccuracies in the counting test.
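To check this kind of behavior yourself, the sketch below sends one image to Cloud Vision’s object localization and tallies the object names it returns. It assumes the `google-cloud-vision` Python package is installed and credentials are configured. Whether Bard actually calls this API remains unconfirmed, and the helper names `tally_object_names` and `localize_object_names` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def tally_object_names(names: Iterable[str]) -> Counter:
    """Count how many times each object name was reported, ignoring case."""
    return Counter(name.lower() for name in names)


def localize_object_names(image_path: str) -> List[str]:
    """Return the object names Cloud Vision reports for one image file.

    Requires the google-cloud-vision package and configured credentials.
    """
    # Imported lazily so the tally helper can be used without the package.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.object_localization(image=image)
    return [obj.name for obj in response.localized_object_annotations]
```

Calling `tally_object_names(localize_object_names("apples.jpg"))`, where `apples.jpg` is a placeholder path, shows how many separate objects of each name were detected. If an image full of apples comes back with a single "apple" entry, a count built on that output would be too low.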

Conclusion

After experimenting with Bard, we found that computer vision tasks are not yet a strong use case for it. As we concluded with Bing’s chat feature, Bard’s primary use case is likely to be direct consumer use rather than computer vision tasks. Combining contextual information from images with the general knowledge of an LLM and other Google features can make it a very useful tool for searching and finding general information.

Beyond that, like GPT-4, Bard has been shown to perform very well on image captioning and classification tasks without any training, so in an industrial or developer context it could be used for zero-shot image-to-text conversion and general image classification.

Models such as Bard hold a great deal of powerful, generalized knowledge. However, running inference can be expensive because of the computation Google must perform to return results. The best use case for developers and enterprises may be to use the knowledge and power of these large multimodal models to train smaller, leaner models, as tools like Autodistill do.

