Date: 2024-05-20



Theory of mind is a characteristic of emotional and social intelligence that allows us to infer people's intentions and relate to and empathize with each other. Most children acquire this type of skill between the ages of 3 and 5.

Researchers tested two families of large language models, OpenAI's GPT-3.5 and GPT-4 and three versions of Meta's Llama, on tasks designed to test theory of mind in humans, including identifying false beliefs, recognizing social mistakes, and understanding what is implied rather than said directly. They also tested 1,907 human participants in order to compare the two sets of scores.

The team conducted five types of tests. The first, the hinting task, is designed to measure the ability to infer another person's true intentions through indirect comments. The second, the false-belief task, assesses whether someone can infer that another person might reasonably be expected to believe something that they themselves know is not true. Another test measured the ability to recognize when someone is making a social mistake. A fourth test consisted of telling strange stories, in which the main character does something unusual, in order to assess whether someone can explain the contrast between what was said and what was actually meant. The researchers also included a test of whether people can understand irony.

The AI models were given each test 15 times in separate chats, so that they would process each request independently, and their answers were scored in the same way as those of the humans. The researchers then tested the human volunteers and compared the two sets of scores.

Both versions of GPT performed at or above the human average on tasks involving indirect requests, misdirection, and false beliefs, and GPT-4 outperformed humans on the irony, hinting, and strange stories tests. Llama 2's three models performed below the human average.

However, Llama 2, the largest of the three Meta models tested, outperformed humans when it came to recognizing faux pas (social mistake) scenarios, whereas GPT consistently returned incorrect responses. The authors believe this is due to GPT's general reluctance to draw conclusions about opinions: in many cases the models answered that they did not have enough information to answer one way or the other.

