Date: 2023-05-09



Imagine sitting on a park bench and watching someone stroll by. The scene may change constantly as the person walks, but the human brain can transform that dynamic visual information into a more stable representation over time. This ability, known as perceptual straightening, helps us predict the walking person's trajectory.

Unlike humans, computer vision models typically do not exhibit perceptual straightness, so they learn to represent visual information in highly unpredictable ways. If machine learning models had this ability, however, they might be able to estimate more accurately how objects and people will move.

Researchers at MIT have found that certain training methods help computer vision models learn more perceptually straight representations, just like humans do. Training involves showing a machine learning model millions of examples so that it can learn a task.

The researchers found that training a computer vision model with a technique called adversarial training, which makes the model less sensitive to small errors added to images, improves the model's perceptual straightness.

The team also found that perceptual straightness is affected by the task a model is trained to perform. Models trained on abstract tasks, such as classifying images, learn more perceptually straight representations than models trained on more fine-grained tasks, such as assigning every pixel in an image to a category.

For example, a node in the model has an internal activation representing a dog, and the model can detect a dog when it sees an image of a dog. A perceptually straight representation retains a more stable dog representation even when there are small changes in the image. This makes it more robust.

By better understanding perceptual straightness in computer vision, the researchers hope to uncover insights that could help them develop models that make more accurate predictions. For example, this property could improve the safety of self-driving cars that use computer vision models to predict the trajectories of pedestrians, cyclists, and other vehicles.

One of the key messages here is that taking inspiration from biological systems, such as human vision, can both give insight into why certain things work the way they do and inspire ideas for improving neural networks, says Vasha DuTell, an MIT postdoc and co-author of a paper investigating perceptual straightening in computer vision.

Joining DuTell on the paper are lead author Anne Harrington, a graduate student in the Department of Electrical Engineering and Computer Science (EECS); Ayush Tewari, a postdoc; Mark Hamilton, a graduate student; Simon Stent, research manager at Woven Planet; Ruth Rosenholtz, principal research scientist in the Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author William T. Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and a member of CSAIL. The research is being presented at the International Conference on Learning Representations.

Studying straightening

After reading a 2019 paper on perceptual straightening in humans by a team of researchers at New York University, DuTell, Harrington, and their colleagues wondered whether the property could be useful in computer vision models as well.

They set out to determine whether different kinds of computer vision models straighten the visual representations they learn. They fed each model frames of a video and then examined its representation at various stages of the learning process.

A model is straightening if its representation changes in a predictable way across the frames of the video. In the end, its output representation should be more stable than its input representation.

The representation can be thought of as a line that starts off very curved. A model that straightens can take that curved line from the video and straighten it out through its processing steps, DuTell explains.

Most of the models they tested did not straighten. Of the few that did, those that straightened most effectively had been trained for classification tasks using the technique known as adversarial training.
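One way to make the idea of a curved line concrete is to measure the angle between successive steps of a representation as the video advances. The sketch below is an illustration under that assumption, not the researchers' code: the function names `mean_curvature_deg` and `straightening` and the toy two-dimensional trajectories are hypothetical, and real representations would have far more dimensions.

```python
import math


def mean_curvature_deg(trajectory):
    """Average angle, in degrees, between successive displacement vectors.

    `trajectory` is a list of equal-length vectors, one per video frame.
    """
    # Displacement of the representation from each frame to the next.
    steps = [[b - a for a, b in zip(p, q)]
             for p, q in zip(trajectory, trajectory[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # the representation did not move; no angle is defined
        cosine = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
        angles.append(math.degrees(math.acos(cosine)))
    if not angles:
        raise ValueError("need at least three frames with nonzero movement")
    return sum(angles) / len(angles)


def straightening(input_trajectory, output_trajectory):
    """Curvature removed by the model; positive means the output is straighter."""
    return mean_curvature_deg(input_trajectory) - mean_curvature_deg(output_trajectory)


# Toy data (assumed): a zigzag input path and a straight output path.
zigzag = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]]
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

zigzag_curvature = mean_curvature_deg(zigzag)
straight_curvature = mean_curvature_deg(straight)
removed = straightening(zigzag, straight)
```

In this toy case the zigzag path turns by 90 degrees at every frame while the straight path never turns, so a model that mapped the first onto the second would score a straightening of about 90 degrees.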

Adversarial training involves subtly modifying images by slightly changing each pixel. Humans may not notice the difference, but these small changes can trick a machine into misclassifying the image. Adversarial training makes the model more robust, so it is not fooled by these manipulations.
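The pixel-level changes described above can be sketched with a tiny classifier. The article does not say which perturbation method the researchers used; the example below assumes the fast gradient sign method on a hand-built logistic classifier purely for illustration, and every name and number in it (`perturb`, `adversarial_training_step`, the weights, `epsilon`) is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, pixels):
    """Probability that the image belongs to class 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, pixels)) + bias)


def loss(weights, bias, pixels, label):
    """Cross-entropy loss for a single image with label 0 or 1."""
    p = predict(weights, bias, pixels)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def perturb(weights, bias, pixels, label, epsilon):
    """Move every pixel by at most epsilon in the direction that raises the loss."""
    p = predict(weights, bias, pixels)
    # Gradient of the loss with respect to each pixel.
    grad = [(p - label) * w for w in weights]
    return [x + epsilon * ((g > 0) - (g < 0)) for x, g in zip(pixels, grad)]


def adversarial_training_step(weights, bias, pixels, label, epsilon, lr):
    """One gradient step taken on the perturbed image instead of the clean one."""
    adv = perturb(weights, bias, pixels, label, epsilon)
    error = predict(weights, bias, adv) - label
    new_weights = [w - lr * error * x for w, x in zip(weights, adv)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Assumed toy values: a three-pixel "image" with true label 1.
weights, bias = [2.0, -1.0, 0.5], 0.0
image, label, epsilon = [0.6, 0.2, 0.4], 1, 0.05

adv_image = perturb(weights, bias, image, label, epsilon)
clean_loss = loss(weights, bias, image, label)
adv_loss = loss(weights, bias, adv_image, label)
new_weights, new_bias = adversarial_training_step(
    weights, bias, image, label, epsilon, lr=0.1)
```

The perturbed image differs from the original by no more than `epsilon` per pixel yet yields a higher loss, and the training step then lowers the loss on that perturbed image, which is the sense in which the model is taught to react less to such changes.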

Adversarial training teaches models to react less to small changes in images, which helps them learn more predictable representations over time, explains Harrington.

People already had the idea that adversarial training might help make models more humanlike. It was interesting to see that carry over to another property people hadn’t tested before, she says.

But the researchers found that adversarially trained models learn to straighten only when they are trained for broad tasks, such as classifying entire images into categories. Models tasked with segmentation, which labels every pixel in an image as a particular class, did not straighten, even when they were adversarially trained.

Consistent classification

The researchers tested these image classification models by showing them videos. They found that models that learned more perceptually straight representations tended to classify objects in the videos more consistently and correctly.

To me, it’s surprising that these adversarially trained models, which have never seen a video or been trained on temporal data, still show some degree of straightening, DuTell says.

The researchers don’t know exactly what it is about the adversarial training process that enables a computer vision model to straighten, but their results suggest that stronger training schemes cause models to straighten more, she explains.
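What "consistently" means for per-frame predictions can be pinned down in more than one way, and the article does not give the researchers' measure. As a minimal sketch, assuming consistency is the share of consecutive frames that receive the same label, it could be computed as follows; `prediction_consistency` and the label sequences are hypothetical.

```python
def prediction_consistency(frame_labels):
    """Fraction of consecutive frame pairs that received the same label."""
    if len(frame_labels) < 2:
        raise ValueError("need at least two frames")
    same = sum(a == b for a, b in zip(frame_labels, frame_labels[1:]))
    return same / (len(frame_labels) - 1)


# Assumed toy predictions for five frames of the same video.
stable_model = ["dog", "dog", "dog", "dog", "dog"]
flickering_model = ["dog", "cat", "dog", "dog", "wolf"]

stable_score = prediction_consistency(stable_model)
flickering_score = prediction_consistency(flickering_model)
```

A score of 1.0 means the label never changes between frames. This measure says nothing about whether the label is correct, so it would have to be read alongside accuracy.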

Building on this work, the researchers hope to use what they have learned to create new training schemes that explicitly endow the model with this property. They also want to dig deeper into adversarial training to understand why this process helps straighten the model.

From a biological point of view, adversarial training doesn’t necessarily make sense. It’s not how humans understand the world. Harrington says there are still many questions about why this training process seems to help the model behave more like a human.

Understanding the representations learned by deep neural networks is critical for improving properties such as robustness and generalization, says Bill Lotter, an assistant professor at the Dana-Farber Cancer Institute and Harvard Medical School, who was not involved in this research. Harrington et al. perform an extensive evaluation of how the representations of computer vision models change over time when processing natural videos, showing that the curvature of these trajectories varies widely with model architecture, training properties, and task. These findings can inform the development of improved models and also offer insight into biological visual processing.

The paper confirms that straightening natural videos is a fairly unique property displayed by the human visual system. This provides an interesting connection with another signature of human perception: its robustness to a variety of image transformations, whether natural or artificial, says Olivier Hénaff, a research scientist at DeepMind. That even adversarially trained scene segmentation models do not straighten their inputs raises important questions for future research: Do humans parse natural scenes in the same way as computer vision models? How can the trajectories of moving objects be represented and predicted while remaining sensitive to spatial detail? The paper lays foundations for a theory of perception.

This research is funded in part by the Toyota Research Institute, an MIT CSAIL METEOR Fellowship, the National Science Foundation, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.

