



Posted by Sanjay Subramanian, PhD Student, UC Berkeley, and Arsha Nagrani, Research Scientist, Google Research, Perception Team

Visual question answering (VQA) is a machine learning task that requires a model to answer a question about an image or a set of images. Conventional VQA approaches need a large amount of labeled training data consisting of thousands of human-annotated question-answer pairs associated with images. In recent years, advances in large-scale pre-training have led to VQA methods that perform well with fewer than fifty training examples (few-shot) and without any human-annotated VQA training data (zero-shot). However, there is still a significant performance gap between these methods and state-of-the-art fully supervised VQA methods, such as MaMMUT and VinVL. In particular, few-shot methods struggle with spatial reasoning, counting, and multi-hop reasoning. Furthermore, few-shot methods have generally been limited to answering questions about single images.

To improve accuracy on VQA examples that involve complex reasoning, in "Modular Visual Question Answering via Code Generation", presented at ACL 2023, we introduce CodeVQA, a framework that answers visual questions using program synthesis. Specifically, when given a question about an image or set of images, CodeVQA generates a Python program (code) with simple visual functions that allow it to process images, and executes this program to determine the answer. We demonstrate that in the few-shot setting, CodeVQA outperforms prior work by roughly 3% on the COVR dataset and 2% on the GQA dataset.

CodeVQA

The CodeVQA approach uses a code-writing large language model (LLM), such as PaLM, to generate Python programs (code). We guide the LLM to use the visual functions correctly by crafting a prompt that consists of a description of these functions and fewer than fifteen "in-context" examples of visual questions paired with their associated Python code. To select these examples, we compute embeddings for the input question and for all of the questions for which we have annotated programs (a randomly chosen set of fifty). We then select the questions most similar to the input and use them as in-context examples. Given the prompt and the question we want to answer, the LLM generates a Python program representing that question.
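The example-selection step can be sketched as follows. This is only an illustration under stated assumptions: the post does not say which embedding model or similarity measure is used, so the sketch substitutes a toy bag-of-words embedding and cosine similarity. The names embed, select_examples, build_prompt, and FUNCTION_DOCS, as well as the tiny pool of annotated questions, are hypothetical.

```python
import math
from collections import Counter

# Hypothetical description of the visual functions placed at the top of the prompt.
FUNCTION_DOCS = "# query(img, question) -> str\n# get_pos(img, text) -> (x, y)"

def embed(text):
    """Toy bag-of-words embedding (a stand-in for a real sentence encoder)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_examples(question, annotated, k):
    """Return the k (question, program) pairs most similar to the input question."""
    q_vec = embed(question)
    ranked = sorted(annotated, key=lambda pair: cosine(q_vec, embed(pair[0])), reverse=True)
    return ranked[:k]

def build_prompt(function_docs, examples, question):
    """Concatenate function descriptions, in-context examples, and the new question."""
    parts = [function_docs]
    for ex_question, ex_program in examples:
        parts.append(f"# Question: {ex_question}\n{ex_program}")
    parts.append(f"# Question: {question}\n")
    return "\n\n".join(parts)

# Hypothetical annotated pool (the post describes a pool of fifty; three shown here).
annotated = [
    ("Is the cup to the left of the plate?", "answer = 'placeholder program 1'"),
    ("Is there a dog or a cat in the picture?", "answer = 'placeholder program 2'"),
    ("How many birds are in the two images?", "answer = 'placeholder program 3'"),
]

question = "Is there a salt shaker or a skateboard in the picture?"
chosen = select_examples(question, annotated, k=2)
prompt = build_prompt(FUNCTION_DOCS, chosen, question)
print(prompt)
```

With a real sentence encoder in place of embed, the rest of the selection logic would stay the same; only the vector type and the similarity computation would change.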

We instantiate the CodeVQA framework using three visual functions: (1) query, (2) get_pos, and (3) find_matching_image.

query, which answers a question about a single image, is implemented using the few-shot Plug-and-Play VQA (PnP-VQA) method. PnP-VQA generates captions using BLIP, an image-captioning transformer pre-trained on millions of image-caption pairs, and feeds them into an LLM that outputs the answer to the question. get_pos is an object localizer that takes a description of an object as input and returns its position in the image. It is implemented using GradCAM. Specifically, the description and the image are passed through the BLIP joint text-image encoder, which predicts an image-text matching score. GradCAM takes the gradient of this score with respect to the image features to find the region most relevant to the text. find_matching_image is used in multi-image questions to find the image that best matches a given input phrase. It is implemented by using the BLIP text and image encoders to compute a text embedding for the phrase and an image embedding for each image. The dot product of the text embedding with each image embedding then represents the relevance of each image to the phrase, and we pick the image that maximizes this relevance.
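The selection rule inside find_matching_image can be sketched as below, assuming the text and image embeddings have already been computed. The helper name best_matching_index and the three-dimensional toy vectors are hypothetical; in the actual system the embeddings come from the BLIP encoders and the function operates on images and a phrase rather than on raw vectors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_matching_index(text_embedding, image_embeddings):
    """Score each image by its dot product with the text embedding and
    return the index of the highest-scoring image plus all scores."""
    scores = [dot(text_embedding, emb) for emb in image_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings (hypothetical values, for illustration only).
text_embedding = [1.0, 0.0, 0.5]
image_embeddings = [
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.6],
    [0.5, 0.5, 0.0],
]

best, scores = best_matching_index(text_embedding, image_embeddings)
print(best, scores)
```

Here the second image (index 1) has the largest dot product with the phrase embedding, so it would be the one returned.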

All three functions can be implemented using models that require very little annotation, for example text and image-text pairs collected from the web and a small number of VQA examples. Moreover, the CodeVQA framework can easily be generalized beyond these functions to others that a user might implement, such as object detection, image segmentation, or knowledge base retrieval.

Illustration of the CodeVQA method. First, a large language model generates a Python program (code) that invokes visual functions representing the question. In this example, a simple VQA method (query) is used to answer one part of the question, and an object localizer (get_pos) is used to find the positions of the objects mentioned. The program then produces an answer to the original question by combining the outputs of these functions.

Results

The CodeVQA framework correctly generates and executes Python programs not only for single-image questions, but also for multi-image questions. For example, given two images that each show two pandas, one might ask, "Is it true that there are four pandas?" In this case, the LLM converts the counting question about the pair of images into a program that obtains an object count for each image (using the query function). The program then adds the counts for both images to compute a total, which is compared to the number in the original question to yield a yes or no answer.
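A program of the kind described for the panda question might look like the sketch below. It is not the program produced in the paper: open_image and query are replaced with hypothetical stubs that return canned answers so that the example runs on its own, and the file names and the to_int helper are invented for illustration.

```python
# Hypothetical canned answers standing in for the real query() VQA model.
CANNED_ANSWERS = {
    ("left.jpg", "How many pandas are there?"): "2",
    ("right.jpg", "How many pandas are there?"): "two",
}

NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def open_image(path):
    """Stub: use the file name itself as the image handle."""
    return path

def query(img, question):
    """Stub for the single-image VQA function."""
    return CANNED_ANSWERS[(img, question)]

def to_int(answer):
    """Convert a textual count such as '2' or 'two' into an integer."""
    answer = answer.strip().lower()
    return NUMBER_WORDS[answer] if answer in NUMBER_WORDS else int(answer)

# Question: "Is it true that there are four pandas?"
images = [open_image("left.jpg"), open_image("right.jpg")]
total = 0
for img in images:
    # Decompose the multi-image question into one counting query per image.
    total += to_int(query(img, "How many pandas are there?"))
answer = "yes" if total == 4 else "no"
print(answer)
```

The point of the decomposition is that each call to query only has to reason about one image, while the addition and comparison are handled by ordinary Python.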

We evaluate CodeVQA on three visual reasoning datasets: GQA (single-image), COVR (multi-image), and NLVR2 (multi-image). For GQA, we provide 12 in-context examples to each method, and for COVR and NLVR2, we provide six in-context examples to each method. The table below shows that CodeVQA improves consistently over the baseline few-shot VQA method on all three datasets.

Method              GQA     COVR    NLVR2
Few-shot PnP-VQA    46.56   49.06   63.37
CodeVQA             49.03   54.11   64.04

Results on the GQA, COVR, and NLVR2 datasets, showing that CodeVQA consistently improves over few-shot PnP-VQA. The metric is exact-match accuracy, i.e., the percentage of examples in which the predicted answer exactly matches the ground-truth answer.
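For concreteness, exact-match accuracy can be computed along the following lines. The function name exact_match_accuracy and the toy predictions are hypothetical, and the whitespace and case normalization is an assumption; the post does not specify what normalization, if any, is applied before comparing answers.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of examples whose predicted answer equals the ground-truth answer.

    Answers are compared after stripping whitespace and lowercasing
    (an assumed normalization, not taken from the source).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# Toy data for illustration only.
predictions = ["yes", "no", "2", "Yes"]
references = ["yes", "yes", "2", "yes"]
accuracy = exact_match_accuracy(predictions, references)
print(accuracy)
```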

On GQA, we find that the accuracy of CodeVQA is roughly 30% higher than the baseline on spatial reasoning questions, 4% higher on "and" questions, and 3% higher on "or" questions. The third category includes multi-hop questions such as "Are there salt shakers or skateboards in the picture?", for which the generated program is shown below.

    img = open_image("Image13.jpg")
    salt_shakers_exist = query(img, "Are there any salt shakers?")
    skateboards_exist = query(img, "Are there any skateboards?")
    if salt_shakers_exist == "yes" or skateboards_exist == "yes":
        answer = "yes"
    else:
        answer = "no"
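Running a generated program such as the one above can be sketched as follows. The post does not describe how programs are executed, so this is an assumed mechanism: the program text is run with Python's built-in exec in a namespace that exposes the visual functions, and the result is read from the answer variable. The run_program helper and the stubbed open_image and query are hypothetical, and executing model-generated code this way would need sandboxing in any real deployment.

```python
GENERATED_PROGRAM = '''
img = open_image("Image13.jpg")
salt_shakers_exist = query(img, "Are there any salt shakers?")
skateboards_exist = query(img, "Are there any skateboards?")
if salt_shakers_exist == "yes" or skateboards_exist == "yes":
    answer = "yes"
else:
    answer = "no"
'''

def open_image(path):
    """Stub: use the file name itself as the image handle."""
    return path

def query(img, question):
    """Stub VQA model: pretends the image contains skateboards but no salt shakers."""
    return "yes" if "skateboards" in question else "no"

def run_program(source, functions):
    """Execute program text with the given visual functions in scope
    and return the value it assigns to `answer`."""
    namespace = dict(functions)
    exec(source, namespace)  # unsafe for untrusted code without a sandbox
    return namespace["answer"]

result = run_program(GENERATED_PROGRAM, {"open_image": open_image, "query": query})
print(result)
```

Because the visual functions are passed in as a dictionary, swapping a stub for a real model or adding a new function only changes the namespace, not the execution logic.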

On COVR, we find that CodeVQA's gain over the baseline is larger when the number of input images is larger, as shown in the table below. This trend indicates that breaking the problem down into single-image questions is beneficial.

                    Number of images
Method              1      2      3      4      5
Few-shot PnP-VQA    91.7   51.5   48.3   47.0   46.9
CodeVQA             75.0   53.3   48.7   53.2   53.4

Conclusion

We present CodeVQA, a framework for few-shot visual question answering that relies on code generation to perform multi-step visual reasoning. Exciting directions for future work include expanding the set of modules used and creating a similar framework for visual tasks beyond VQA. We note that care should be taken when considering whether to deploy a system such as CodeVQA, since vision-language models like the ones used in our visual functions have been shown to exhibit social biases. At the same time, compared to monolithic models, CodeVQA offers additional interpretability (through the Python program) and controllability (by modifying the prompts or visual functions), which are useful in production systems.

Acknowledgments

This research was a collaboration between UC Berkeley's Artificial Intelligence Research lab (BAIR) and Google Research, and was conducted by Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein.

