A close look reveals that the newest systems, including DeepMind’s much-hyped Gato, are still stymied by the same old problems
To the average person, it must seem as if the field of artificial intelligence is making immense progress. According to the press releases, and some of the more gushing media accounts, OpenAI’s DALL-E 2 can seemingly create spectacular images from any text; another OpenAI system called GPT-3 can talk about just about anything; and a system called Gato that was released in May by DeepMind, a division of Alphabet, seemingly worked well on every task the company could throw at it. One of DeepMind’s high-level executives even went so far as to brag that in the quest for artificial general intelligence (AGI), AI that has the flexibility and resourcefulness of human intelligence, “The Game is Over!” And Elon Musk said recently that he would be surprised if we didn’t have artificial general intelligence by 2029.
Don’t be fooled. Machines may someday be as smart as people, and perhaps even smarter, but the game is far from over. There is still an immense amount of work to be done in making machines that truly can comprehend and reason about the world around them. What we really need right now is less posturing and more basic research.
To be sure, there are indeed some ways in which AI truly is making progress—synthetic images look more and more realistic, and speech recognition can often work in noisy environments—but we are still light-years away from general-purpose, human-level AI that can understand the true meanings of articles and videos, or deal with unexpected obstacles and interruptions. We are still stuck on precisely the same challenges that academic scientists (including myself) have been pointing out for years: getting AI to be reliable and getting it to cope with unusual circumstances.
Take the recently celebrated Gato, an alleged jack of all trades, and consider how it captioned an image of a pitcher hurling a baseball. The system returned three different answers: “A baseball player pitching a ball on top of a baseball field,” “A man throwing a baseball at a pitcher on a baseball field” and “A baseball player at-bat and a catcher in the dirt during a baseball game.” The first response is correct, but the other two answers include hallucinations of other players that aren’t seen in the image. The system has no idea what is actually in the picture, as opposed to what is typical of roughly similar images. Any baseball fan would recognize that this was the pitcher who had just thrown the ball, and not the other way around; and although we would expect a catcher and a batter to be nearby, they obviously do not appear in the image.
Likewise, DALL-E 2 couldn’t tell the difference between a red cube on top of a blue cube and a blue cube on top of a red cube. A newer version of the system, released in May, couldn’t tell the difference between an astronaut riding a horse and a horse riding an astronaut.
When systems like DALL-E make mistakes, the result is amusing, but other AI errors create serious problems. To take another example, a Tesla on autopilot recently drove directly towards a human worker carrying a stop sign in the middle of the road, only slowing down when the human driver intervened. The system could recognize humans on their own (as they appeared in the training data) and stop signs in their usual locations (again as they appeared in the training images), but failed to slow down when confronted by the unusual combination of the two, which put the stop sign in a new and unusual position.
Unfortunately, the fact that these systems still fail to be reliable and struggle with novel circumstances is usually buried in the fine print. Gato worked well on all the tasks DeepMind reported, but rarely as well as other contemporary systems. GPT-3 often creates fluent prose but still struggles with basic arithmetic, and it has so little grip on reality that it is prone to creating sentences like “Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation,” when no expert ever said any such thing. A cursory look at recent headlines wouldn’t tell you about any of these problems.
The subplot here is that the biggest teams of researchers in AI are no longer to be found in the academy, where peer review used to be the coin of the realm, but in corporations. And corporations, unlike universities, have no incentive to play fair. Rather than submitting their splashy new papers to academic scrutiny, they have taken to publication by press release, seducing journalists and sidestepping the peer review process. We know only what the companies want us to know.
In the software industry, there’s a word for this kind of strategy: demoware, software designed to look good for a demo, but not necessarily good enough for the real world. Often, demoware becomes vaporware, announced for shock and awe in order to discourage competitors, but never released at all.
Chickens do tend to come home to roost eventually, though. Cold fusion may have sounded great, but you still can’t get it at the mall. The cost to AI is likely to be a winter of deflated expectations. Too many products, like driverless cars, automated radiologists, and all-purpose digital agents, have been demoed and publicized, but never delivered. For now, the investment dollars keep coming in on the strength of promise (who wouldn’t like a self-driving car?), but if the core problems of reliability and coping with outliers are not resolved, the investment will dry up. We will be left with powerful deepfakes, enormous networks whose training emits immense amounts of carbon, and solid advances in machine translation, speech recognition, and object recognition, but too little else to show for all the premature hype.
Deep learning has advanced the ability of machines to recognize patterns in data, but it has three major flaws. The patterns that it learns are, ironically, superficial, not conceptual; the results it produces are difficult to interpret; and those results are difficult to use in the context of other processes, such as memory and reasoning. As Harvard computer scientist Leslie Valiant noted, “The central challenge [going forward] is to unify the formulation of … learning and reasoning.” You can’t deal with a person carrying a stop sign if you don’t really understand what a stop sign even is.
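To make the "superficial, not conceptual" point concrete, here is a minimal toy sketch (not from this article, and not how Gato or GPT-3 actually work): a simple logistic-regression classifier trained on synthetic data in which a spurious shortcut feature happens to track the label, then tested on data where that correlation is reversed. The feature names, data sizes, and hyperparameters are all illustrative assumptions; only NumPy is assumed.

```python
# Toy illustration of shortcut learning: a model that latches onto a spurious
# correlation looks excellent in training and falls apart out of distribution.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shortcut_tracks_label):
    y = rng.integers(0, 2, n)                       # binary labels
    real = y + rng.normal(0, 1.5, n)                # weak, noisy "real" signal
    # A nearly noise-free shortcut that agrees with the label only if asked to.
    shortcut = np.where(shortcut_tracks_label, y, 1 - y) + rng.normal(0, 0.1, n)
    return np.column_stack([real, shortcut]), y

# Train where the shortcut agrees with the label...
X_train, y_train = make_data(5000, True)
# ...test where it does not (the "unusual circumstance").
X_test, y_test = make_data(5000, False)

# Minimal logistic regression fit by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(1000):
    z = np.clip(X_train @ w + b, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.1 * (X_train.T @ (p - y_train)) / len(y_train)
    b -= 0.1 * np.mean(p - y_train)

def accuracy(X, y):
    return np.mean(((X @ w + b) > 0).astype(int) == y)

print("train accuracy:", accuracy(X_train, y_train))  # high: the shortcut works here
print("test accuracy:", accuracy(X_test, y_test))     # collapses once it doesn't
```

The toy mirrors the stop-sign anecdote in spirit: the model keys on whatever co-occurred with the label in training rather than on what the label means, so a shift in circumstances that a human would shrug off breaks it completely.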
For now, we are trapped in a “local minimum” in which companies pursue benchmarks rather than foundational ideas, eking out small improvements with the technologies they already have rather than pausing to ask more fundamental questions. Instead of pursuing flashy straight-to-the-media demos, we need more people asking basic questions about how to build systems that can learn and reason at the same time. Engineering practice, meanwhile, has run far ahead of scientific understanding, with researchers working harder to exploit tools that aren’t fully understood than to develop new tools and a clearer theoretical foundation. This is why basic research remains crucial.
That a large part of the AI research community (like those who shout “Game Over”) doesn’t even see this is, well, heartbreaking.
Imagine if some extraterrestrial studied all human interaction only by looking down at shadows on the ground, noticing, to its credit, that some shadows are bigger than others, and that all shadows disappear at night, and maybe even noticing that the shadows regularly grew and shrank at certain periodic intervals—without ever looking up to see the sun or recognizing the three-dimensional world above.