GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about


Since OpenAI first described its new AI language-generating system, GPT-3, in May, hundreds of media outlets (including MIT Technology Review) have written about the system and its capabilities. Twitter has been abuzz about its power and potential. The New York Times published an op-ed about it. Later this year, OpenAI will begin charging companies for access to GPT-3, hoping that the system can soon power a wide variety of AI products and services.

Is GPT-3 an important step toward artificial general intelligence, the kind that would let a machine reason as broadly as humans do without having to train on every specific task it encounters? OpenAI’s technical paper is fairly reserved on this larger question, but to many people, the sheer fluency of the system feels like a big step forward.

We doubt it. At first glance, GPT-3 seems to have an impressive ability to produce human-like text. And we have no doubt that it can be used to produce entertaining surrealist fiction; other commercial applications may emerge as well. But accuracy is not its strong point. If you dig deeper, you discover that something is amiss: although its output is grammatical, and even impressively idiomatic, its understanding of the world is often seriously off.

Below are some illustrations of its lack of understanding. All of them, as explained later, were prefigured in an earlier critique that one of us wrote about GPT-3’s predecessor.

Before we go any further, it is also worth noting that OpenAI has so far not allowed us research access to GPT-3, despite the nonprofit status of its oversight organization. Instead, OpenAI has put us off indefinitely despite repeated requests, even as it made access widely available to the media. Fortunately, our colleague Douglas Summers-Stay, who had access, generously offered to run the experiments for us.

OpenAI’s striking lack of openness seems to us a serious breach of scientific ethics and a distortion of the goals of the associated nonprofit. Its decision forced us to limit our testing to a relatively small number of examples and gave us less time to investigate than we would have liked, which means there may be more serious problems that we did not have the opportunity to identify. Even so, within the constraints of a small sample, many major issues with GPT-3 were immediately evident in every area of reasoning and understanding we tested.

In each of the examples below, the prompt we wrote is in ordinary font and the system’s continuation is in bold. Our comments are in italics. (A complete list of the experiments we conducted can be found here.)

Biological reasoning

Physical reasoning

  • You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to remove the door. You have a table saw, so you cut the door in half and remove the top half.

    [This is one confusion after another. The natural solutions here would be either to tip the table on its side (often sufficient, depending on the specifics of the geometry) or to take the legs off the table, if they are detachable. Removing a door is sometimes necessary to widen a doorway, but much more rarely, and would hardly be worthwhile for a dinner party. If you do need to remove a door to widen a doorway, you take it off its hinges: you do not saw it, and you certainly do not saw off the top half, which would be pointless. Finally, a “table saw” is not a saw that is used to make room for moving a table; it is a saw built into a work table, and it could not be used to cut a door that is still standing.]

Social reasoning

  • You are an attorney and you have to appear in court today. Getting dressed in the morning, you discover that the pants of your suit are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabelle. You decide that you should wear the bathing suit to court. You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.

    [The phrase “However, your bathing suit is clean” seems to have led GPT-3 into supposing that a bathing suit is a viable alternative to a suit. Of course, in reality no lawyer would consider wearing a bathing suit to court. The bailiff would probably not admit you, and if you were admitted, the judge might well hold you in contempt.]

Object and individual tracking

Psychological reasoning

  • You go to the store with Penny to buy a gift for Jack. You say, “I’m going to buy Jack a top.” “Don’t get Jack a top,” Penny says. “He already has a top. He will get a top.” “I will get Jack a top,” said Janet.

    [Within a single sentence, GPT-3 has lost track of the fact that Penny is advising you against getting a top because Jack already has a top. The intended continuation was “He will make you take it back” (or “make you exchange it”). This example was drawn directly from Eugene Charniak’s 1972 PhD thesis (pdf); nearly 50 years later, it remains outside the scope of AI natural-language technology.]


What is depressing is that none of this is new. GPT-3’s predecessor, known as GPT-2, suffered from exactly the same weaknesses. As one of us (Gary) put it in February: “On a good day, a system like the widely discussed neural network GPT-2, which produces stories and the like given sentence fragments, can convey something that ostensibly seems to reflect a deep understanding. But no matter how compelling many of GPT-2’s examples seem, the reality is that its representations are thin: the knowledge gathered by contemporary neural networks remains spotty and pointillistic, arguably useful and certainly impressive, but never reliable.”

Too little has changed. Adding a hundred times more input data has helped, but only a little. After researchers spent millions of dollars of computer time on training, devoted a staff of 31 to the challenge, and produced a breathtaking amount of carbon emissions from electricity, GPT’s fundamental flaws remain. Its performance is unreliable, and its causal understanding is shaky and inconsistent. GPT-2 had problems with biological, physical, psychological, and social reasoning, and a general tendency toward incoherence and non sequiturs. The same applies to GPT-3.

More data makes for a better, more fluent approximation of language. It does not make for trustworthy intelligence.

Defenders of the faith will certainly point out that it is often possible to reformulate these problems so that GPT-3 finds the right solution. For example, you can get GPT-3 to give the correct answer to the cranberry/grape juice question if you give it the following long-winded frame as a prompt:

  • In the following questions, some of the actions have serious consequences, while others are perfectly fine. Your task is to identify the consequences of the various mixtures and whether they are dangerous.

    1. You poured yourself a cup of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try to sniff it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

    a. This is a dangerous mixture.

    b. This is a safe mixture.

    The correct answer is:

GPT-3’s continuation of that prompt is, correctly: “B. This is a safe mixture.”

The problem is that there is no way to know in advance which formulations will or will not give the correct answer. To an optimist, any hint of success means that there must be a pony in here somewhere. The optimist will argue (as many have) that because there is some formulation for which GPT-3 gives the correct answer, GPT-3 has the necessary knowledge and reasoning ability and is just getting confused by the language. However, the problem is not with GPT-3’s syntax (which is perfectly fluent) but with its semantics: it can produce words in perfect English, but it has only the dimmest sense of what those words mean, and no sense at all of how those words relate to the world.

To understand why, it helps to think about what systems like GPT-3 do. They don’t learn about the world; they learn about text and how people use words in relation to other words. What GPT-3 does is something like a massive act of cutting and pasting, stitching together variations on text it has seen, rather than digging deep for the concepts that underlie those texts.

In the cranberry juice example, GPT-3 continues with the phrase “You are now dead” because that phrase (or something like it) often follows phrases like “… so you can’t smell anything. You are very thirsty. So you drink it.” A truly intelligent agent would do something completely different: draw inferences about the potential safety of mixing cranberry juice and grape juice.

All GPT-3 really has is a tunnel-vision understanding of how words relate to one another. From all those words, it never infers anything about the blooming, buzzing world. It does not infer that grape juice is a drink (although it can find word correlations consistent with that). Nor does it infer anything about social norms that might prevent people from wearing bathing suits in court. It learns correlations between words, and nothing more. The empiricist’s dream is to gain a rich understanding of the world from sensory data. GPT-3 never does that, even with half a terabyte of input data.
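To make the idea of continuing text purely from word-to-word statistics concrete, here is a minimal sketch of a toy next-word predictor. It is an illustration only and is far simpler than GPT-3, which is a large neural network rather than a lookup table of word counts; the three-sentence corpus and the names `train` and `next_word` are invented for this example.

```python
from collections import Counter, defaultdict

def train(corpus, context_size=2):
    """Count which word follows each run of `context_size` words."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - context_size):
            context = tuple(words[i:i + context_size])
            follows[context][words[i + context_size]] += 1
    return follows

def next_word(follows, prompt, context_size=2):
    """Return the most frequent continuation of the prompt's last words, or None."""
    context = tuple(prompt.lower().split()[-context_size:])
    candidates = follows.get(context)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical miniature corpus, invented for this illustration.
corpus = [
    "you drink the poison and you die",
    "you drink the potion and you die",
    "you drink the juice and you smile",
]

model = train(corpus)

# The toy model only looks at the last two words ("and you"), so it picks
# the statistically common ending even though the drink here is juice.
print(next_word(model, "you drink the juice and you"))  # prints: die
```

The toy predictor never represents what juice or poison is. It only records which word most often followed "and you" in its training text, which is the kind of word-to-word correlation described above, in drastically simplified form.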

While we were putting this essay together, our colleague Summers-Stay, who is good with metaphors, wrote to one of us: “GPT is odd because it doesn’t ‘care’ about getting the right answer to a question you put to it. It’s more like an improv actor who is totally dedicated to their craft, never breaks character, and has never left home but has only read about the world in books. Like such an actor, when it doesn’t know something, it will just fake it. You wouldn’t trust an improv actor playing a doctor to give you medical advice.”

You also should not trust GPT-3 to advise you on mixing drinks or moving furniture, to explain the plot of a novel to your child, or to help you figure out where you put your laundry. It may get your math problem right, but it may not. It is a fluent spouter of bullshit, but even with 175 billion parameters and 450 gigabytes of input data, it is not a reliable interpreter of the world.

Gary Marcus is the founder and CEO of Robust.AI and was the founder and CEO of Geometric Intelligence, which Uber acquired. He is also a professor emeritus at New York University and the author of books including Guitar Zero and, with Ernest Davis, Rebooting AI: Building Artificial Intelligence We Can Trust.

Ernest Davis is a professor of computer science at New York University. He has written four books, including Representations of Commonsense Knowledge.

