Blowing up on Twitter: a Google Brain engineer's case against deep reinforcement learning (1)


Deep reinforcement learning is arguably the hottest direction in artificial intelligence right now, and its reputation is inseparable from the success of DeepMind's AlphaGo and AlphaZero.

Reinforcement learning itself is a very general artificial intelligence paradigm, and it intuitively feels well suited to modeling all kinds of decision-making tasks. Combine it with a deep neural network, which given enough layers and neurons can approximate virtually any non-linear function, and it feels like a match made in heaven.

However, this article argues that deep reinforcement learning is a big pit, so don't rush to jump in! Its success stories are actually few, but each is so famous that outsiders form a grand illusion about the field, overestimating its capabilities and underestimating its difficulty. Alex Irpan is currently a software engineer on Google Brain's robotics team. He received a bachelor's degree in computer science from Berkeley, where he did research at the Berkeley AI Research (BAIR) Lab under Pieter Abbeel, a leading figure in deep reinforcement learning.

The article was very popular on Twitter. After reading the English original, Zhihu user Frankenstein wrote: "It felt like finally seeing daylight after a long night. I was so excited I could hardly contain myself. This is the most profound stage-by-stage summary of deep reinforcement learning I have seen since I entered the pit. I strongly recommend it as the first lesson in deep reinforcement learning; after reading it, everyone should think carefully about whether to enter the pit at all."

The AI Frontline team felt it was necessary to introduce this article to more readers. Although the title reads as an attempt to talk people out of the field, the point is not that everyone should give up, but that everyone should take a calmer look at the current state of progress in deep reinforcement learning.

The full text is as follows:

Foreword

Once on Facebook, I made this statement:
Whenever someone asks me whether reinforcement learning can solve their problem, I tell them it can't. And about 70% of the time, I'm right.

Deep reinforcement learning is surrounded by a lot of hype, and not without reason! Reinforcement learning is a wonderful paradigm: in principle, a robust, high-performance RL system should be able to perform any task. Combining this paradigm with the empirical power of deep learning is an obvious match. Deep RL looks like one of the systems closest to Artificial General Intelligence (AGI), and the dream of building AGI has consumed billions of dollars of funding.

Unfortunately, deep reinforcement learning still does not work.

Of course I believe it is useful. If I did not believe in reinforcement learning, I would not be working on it. However, the research process is full of problems, and most of them are hard to solve. Behind the glamorous demos of trained agents lie countless hours of blood, sweat, and tears that we never see.

I have seen many people get drawn into research in this field. They try deep reinforcement learning for the first time, happen to hit no problems, and so underestimate how hard deep RL really is. Until they experience failure, they won't know how hard it is; deep RL will keep knocking them down until they learn to set realistic research expectations.

This is not anyone's fault; it is a systemic issue. It is easy to write a wonderful story around a positive result, and much harder to write an equally compelling one around a negative result. The problem is that negative results are what researchers encounter most often. In some ways, the negative cases matter more than the positive ones.

In what follows, I will explain why deep RL often doesn't work, the cases where it does work, and how I see it becoming more reliable in the future. I am not writing this to stop people from researching deep RL. I am writing it because I believe that if everyone identifies the problems together, it will be easier to make progress on them, and if everyone discusses these issues instead of each person falling into the same pit over and over again, it will be easier to reach consensus.

I hope to see deep RL research develop further and to see more people enter the field. But I also hope people understand what they are getting into.

In this article, "reinforcement learning" and "deep reinforcement learning" are used interchangeably. My criticism is of the empirical behavior of deep reinforcement learning, not of reinforcement learning as a whole. The papers I cite usually use deep neural networks as the agent. Although the empirical criticisms may apply to linear RL or tabular RL, I am not sure whether they extend to those smaller problems. In huge, complex, high-dimensional environments, good function approximation is essential, and the promise of RL in such environments is precisely what has driven the hype around deep RL. That hype is one of the problems that needs to be addressed.

The article moves from pessimism to optimism. It is long, but if you read it patiently, I believe you will get something out of it.

The following sections first describe the failure cases of deep RL.

The sample efficiency of deep reinforcement learning may be extremely low

The most classic benchmark for deep reinforcement learning is Atari. In the well-known Deep Q-Networks (DQN) paper, combining Q-learning, a reasonably sized neural network, and some optimization tricks lets the network reach or even exceed human-level performance in several Atari games.

Atari games run at 60 frames per second. Can you estimate how many frames a state-of-the-art DQN needs to reach human-level performance?

The answer depends on the game. Let's look at DeepMind's recent paper, Rainbow DQN (Hessel et al, 2017). This paper runs an ablation study over several improvements to the original DQN architecture and shows that combining all of them gives the best results. The model surpassed human performance in 40 of the 57 Atari games tested. The figure below shows the results.

The y-axis is the "median human-normalized score". It is obtained by training 57 DQNs, one for each Atari game, normalizing each agent's score so that human performance is 100%, and then plotting the median performance across the 57 games. Rainbow DQN crosses the 100% threshold at about 18 million frames. That corresponds to roughly 83 hours of play experience, plus however long it takes to train the model; most people pick up an Atari game within a few minutes.
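As a quick sketch of the arithmetic behind this metric (my own illustration with made-up per-game numbers; I believe the Atari papers also subtract a random-play baseline when normalizing):

```python
import numpy as np

def human_normalized_score(agent, random, human):
    # Common Atari convention: 0% = random play, 100% = human play.
    return 100.0 * (agent - random) / (human - random)

# Hypothetical raw scores for three games, just to show the median step.
agent  = np.array([3200.0, 15.0, 9000.0])
random = np.array([ 200.0,  2.0,  500.0])
human  = np.array([7000.0, 30.0, 8000.0])
median = np.median(human_normalized_score(agent, random, human))

# Frames-to-hours arithmetic cited above: Atari runs at 60 frames per second.
hours = 18e6 / 60 / 3600   # ~83 hours of play at 18 million frames
print(median, hours)
```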

Note that 18 million frames is actually quite good by these standards: the previous best method (Distributional DQN, Bellemare et al, 2017) needed 70 million frames to reach human-level performance, about four times as many as Rainbow DQN, while the Nature DQN (Mnih et al, 2015) never reaches 100% performance even after 200 million frames.

The planning fallacy says that finishing something usually takes longer than you expect. Reinforcement learning has its own planning fallacy: learning a policy usually requires far more samples than you expect.

This problem is not unique to Atari games. The second most common benchmark is MuJoCo, a set of tasks run in the MuJoCo physics simulator. In these tasks, the input state is usually the position and velocity of each joint of a simulated robot. Even without having to solve vision, learning these benchmarks takes 10^5 to 10^7 steps. For such simple environments, that is an astonishing amount of experience.
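To make the "positions and velocities" point concrete, here is a minimal sketch using the OpenAI Gym interface (assuming the MuJoCo environments are installed; the environment id, API, and observation size vary by version):

```python
import gym

# HalfCheetah's observation is just a low-dimensional vector of joint angles
# and joint velocities -- no pixels, no vision problem to solve.
env = gym.make("HalfCheetah-v2")
obs = env.reset()
print(obs.shape)  # a small vector, e.g. (17,) in this version

# One environment step: sample a random action, get (obs, reward, done, info).
obs, reward, done, info = env.step(env.action_space.sample())
```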

DeepMind's parkour paper (Heess et al, 2017) trained its policies using 64 workers for over 100 hours. The paper does not say what a "worker" is, but I assume it means one CPU.

Video 1: https://v.qq.com/x/page/b0566s976mm.html

The results are very cool. When I first saw this paper, I did not expect deep RL to be able to learn these running gaits.

At the same time, the fact that it needed 6400 CPU hours is a bit discouraging. Not because I expected it to need less, but because deep RL's sample efficiency is still several orders of magnitude away from a practical level.

There is an obvious counterargument here: what if we simply ignore sample efficiency? There are settings where experience is easy to generate, games being the prime example. But in any setting where that is not true, RL faces an uphill battle, and unfortunately most real-world settings fall into this category.

If you only care about final performance, many problems are better solved by other methods

When looking for solutions to any research problem, there are usually trade-offs between different objectives. You can optimize for a genuinely good solution to the problem, or you can optimize for making a good research contribution. The best case is when getting a good solution also requires making a good research contribution, but it is hard to find feasible problems that meet that standard.

If all you care about is final performance, deep RL's track record is not that impressive, because it keeps getting beaten by other methods. Here is a video of a MuJoCo robot controlled by online trajectory optimization. The correct actions are computed in near real time, online, with no offline training, and it runs on 2012 hardware (Tassa et al, IROS 2012).

Video 2: https://v.qq.com/x/page/c0566e0m0vp.html

I think this paper makes a fair comparison with the parkour paper. What is the difference between the two?

The difference is that Tassa et al. use model predictive control, which plans against a ground-truth model of the world (the physics simulator). Model-free RL has no such model, which makes the problem much harder. On the other hand, if planning against a model helps this much, why complicate the problem by training an RL policy at all?

Similarly, off-the-shelf Monte Carlo Tree Search (MCTS) easily beats DQN's performance on Atari games. Below are the benchmark numbers from Guo et al, NIPS 2014. They compare the scores of a trained DQN with the scores of a UCT agent (UCT being the standard version of MCTS used today).

This is not an entirely fair comparison, because DQN performs no search, while MCTS gets to search against a ground-truth model (the Atari emulator). But sometimes you don't care whether a comparison is fair; you only care about what works. If you are interested in a full evaluation of UCT, see the appendix of the original Arcade Learning Environment paper (Bellemare et al, JAIR 2013).
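For readers unfamiliar with UCT, its core is just the UCB1 selection rule applied at each tree node; a minimal sketch (my own, not the paper's code):

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.41):
    # Pick the child maximizing estimated value plus an exploration bonus.
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```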

In theory, reinforcement learning can be used for anything, including environments where no model is known. However, this generality comes at a price: it is hard for the algorithm to exploit problem-specific information that could help learning, which forces it to use a frightening number of samples to learn regularities that could have been hard-coded.

Countless experiences show that, with few exceptions, domain-specific algorithms are faster and better than reinforcement learning. This is not a problem if you are researching deep RL for its own sake, but I find it frustrating every time I compare RL against, well, anything else, without exception. One reason I like AlphaGo so much is that it represents an unambiguous win for deep RL, and such wins are not common.

This makes it harder and harder to explain to laypeople why my problems are cool, hard, and interesting, because they often lack the context or experience to appreciate why these problems are so difficult. There is a huge "explanation gap" between what people think deep RL can do and what it can really do. I currently work in robotics, and when robots come up, the company most people think of is Boston Dynamics.

Video 3: https://v.qq.com/x/page/t05665fb4lk.html

They do not use reinforcement learning. If you look at the company's papers, you will find keywords like time-varying LQR, QP solvers, and convex optimization. In other words, they mostly use classical robotics techniques, and in practice those classical techniques work very well when applied carefully.

Reinforcement learning usually requires a reward function

Reinforcement learning generally assumes the existence of a reward function, which is either given directly or tuned by hand offline and then kept fixed during learning. I say "generally" because there are exceptions, such as imitation learning and inverse RL, but most RL approaches treat the reward as given.

More importantly, for RL to do the right thing, the reward function must capture exactly what you want. RL has an annoying tendency to overfit your reward, leading to unexpected results. This is part of why Atari is such a good benchmark: not only is it easy to get lots of samples, but the goal in every game is to maximize the score, so you never have to worry about defining a reward function, and every system uses the same one.

This is also why the MuJoCo tasks are so popular. Because they run in a simulator, you have perfect knowledge of all object states, which makes reward function design much easier.

In the Reacher task, you control a two-segment arm anchored at a central point, and the goal is to move the end of the arm to a target location. Below is a video of a successfully learned policy.

Video 4: https://v.qq.com/x/page/z0566wbrs89.html

Since all positions are known, the reward can be defined as the distance from the end of the arm to the target, plus a small control cost. In principle you can do the same in the real world, if you have enough sensors to measure positions in your environment accurately.
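A sketch of the kind of reward just described (illustrative only, not the exact MuJoCo Reacher code; the control-cost weight is my own choice):

```python
import numpy as np

def reacher_style_reward(fingertip_pos, target_pos, action, ctrl_weight=0.01):
    # Negative distance from the arm's fingertip to the target...
    dist_cost = np.linalg.norm(fingertip_pos - target_pos)
    # ...plus a small penalty on action magnitude to discourage wild motions.
    ctrl_cost = ctrl_weight * np.square(action).sum()
    return -(dist_cost + ctrl_cost)
```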

On its own, requiring a reward function would be no big deal, except that...

Reward function design is difficult

Making a reward function is not the hard part. The hard part is designing one that encourages the behavior you want while still being learnable.

In the HalfCheetah environment, a two-legged robot is constrained to a vertical plane, so it can only run forward or backward.

Video 5: https://v.qq.com/x/page/y05669lu3sf.html

The goal is to learn a running gait. The reward is HalfCheetah's forward velocity.

This is a shaped reward, meaning the reward increases as the state gets closer to the goal. The alternative is a sparse reward, where reward is given only when the goal is reached and is zero in every other state. Shaped rewards are usually easier to learn, because they provide positive feedback even when the policy has not found a complete solution to the problem.
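A toy illustration of the two reward styles for a generic reach-the-goal task (my own sketch, not from any paper):

```python
import numpy as np

def sparse_reward(pos, goal, tol=0.05):
    # +1 only when the goal is reached, 0 everywhere else.
    return 1.0 if np.linalg.norm(pos - goal) < tol else 0.0

def shaped_reward(pos, goal):
    # Reward grows as the state gets closer to the goal, so the agent
    # receives feedback even from partial progress.
    return -float(np.linalg.norm(pos - goal))
```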

Unfortunately, shaped rewards can bias learning, leading to final behavior that doesn't match what you want. A good example is the boat-racing game covered on the OpenAI blog. The intended goal is to finish the race. You can imagine a sparse reward that gives +1 for finishing within a given time and 0 otherwise.

However, the reward the game actually provides gives points for hitting checkpoints and also for collecting items, and the points from collecting items can exceed those from finishing the race. Under such a reward function, the RL agent can get the highest score without finishing the race at all, which leads to plenty of unintended behavior: the agent crashes into boats, catches fire, and even drives in the wrong direction, yet scores higher than it would by simply completing the race.

Video 6: https://v.qq.com/x/page/f05666di8ke.html

RL algorithms sit on a continuum according to how much they are allowed to know about their environment. The most common category, model-free RL, is almost the same as black-box optimization: these methods only assume they are in an MDP (Markov decision process), and the agent is simply told that this gets +1 reward and that doesn't; it must learn everything else on its own. And as with black-box optimization, the problem is that any +1 looks good to the agent, even if the +1 was earned for the wrong reasons.
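To illustrate the "black box" point, here is a sketch of the simplest possible policy search, which uses nothing but the scalar episode return (it assumes the classic Gym step API for a continuous-control task; everything here is illustrative):

```python
import numpy as np

def random_search(env, episodes=100):
    # Randomly sample linear policies and keep the one with the best return.
    # The only learning signal is the total reward -- no model, no structure.
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.shape[0]
    best_w, best_ret = None, -np.inf
    for _ in range(episodes):
        w = np.random.randn(act_dim, obs_dim)
        obs, ret, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done, _ = env.step(w @ obs)
            ret += reward
        if ret > best_ret:
            best_w, best_ret = w, ret
    return best_w
```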

A classic non-RL example is the time a genetic algorithm was used to design a circuit, and the resulting circuit contained a disconnected logic gate.

All the gray cells are required for correct behavior, including the one in the upper left corner, even though it is not connected to anything. (From "An Evolved Circuit, Intrinsic in Silicon, Entwined with Physics")

More examples can be found in a 2017 Salesforce blog post on text summarization. Their baseline model was trained with supervised learning and evaluated with the automatic metric ROUGE. ROUGE is non-differentiable, but RL can handle non-differentiable rewards, so they tried optimizing ROUGE directly with RL (a sketch of this policy-gradient setup follows the example below). This got a very high ROUGE score, but the actual summaries were not very good. Here is an example:
Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And.
So although the RL model achieved the highest ROUGE score, they ended up using a different model.
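For context, the reason RL can optimize a non-differentiable metric like ROUGE at all is the policy-gradient trick, which only needs the scalar reward; a minimal PyTorch-style sketch (my own, not Salesforce's code; the ROUGE scorer and model are assumed to exist elsewhere):

```python
import torch

def reinforce_loss(log_probs, reward, baseline):
    # log_probs: per-token log-probabilities of a sampled summary (requires grad)
    # reward:    scalar ROUGE score of that summary (no gradient needed)
    # baseline:  e.g. ROUGE of a greedy decode, used to reduce variance
    advantage = reward - baseline
    return -(advantage * log_probs.sum())

# Usage sketch: loss = reinforce_loss(log_probs, rouge(sampled), rouge(greedy)),
# then loss.backward() to update the summarization model's parameters.
```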

Another interesting example is Popov et al, 2017, sometimes known as "the Lego stacking paper". The authors use a distributed version of DDPG to learn a grasping policy. The goal is to grasp the red block and stack it on top of the blue block.

They got it to work, but ran into a complete failure mode along the way. For the initial lifting motion, the reward is based on how high the red block is, defined by the z-coordinate of the block's bottom face. One failure mode was a policy that learned to flip the red block over rather than pick it up.
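A simplified sketch of that failure mode (not the paper's actual code): if the lifting reward is just the height of the red block's bottom face, flipping the block raises that face without the block ever being grasped.

```python
def lift_reward(red_block_bottom_z, table_z=0.0):
    # Reward grows with the height of the block's bottom face above the table.
    # Flipping the block over raises its bottom face too, so flipping pays off.
    return max(0.0, red_block_bottom_z - table_z)
```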

Video 7: https://v.qq.com/x/page/r0566y150ya.html

Clearly this is not the solution the researchers intended, but RL doesn't care. From the reinforcement learning point of view, it got rewarded for flipping the block, so it will keep flipping the block.

One way to fix this is to make the reward sparse, giving reward only after the robot actually stacks the block. Sometimes this works, because the sparse reward is still learnable. But often it doesn't, because the lack of positive reinforcement makes everything harder.

The other way is careful reward shaping: adding new reward terms and tweaking the coefficients of existing ones until the RL algorithm learns the behavior you want. It is possible to win at RL research this way, but it is a deeply unsatisfying fight, and I never feel like I learn anything from doing it.

For reference, here is one of the reward functions from the Lego stacking paper.

I don't know how much time they spent designing this reward function, but based on the number of terms and the number of distinct coefficients, my guess is "a lot."
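Purely for illustration of what such hand-tuned shaping tends to look like (this is NOT the paper's actual reward; every term and coefficient here is invented):

```python
import numpy as np

def stacking_reward(grasped, red_height, dist_red_to_blue, action, stacked):
    return (
        1.0  * float(grasped)                    # bonus for grasping the red block
        + 2.0  * red_height                      # encourage lifting it off the table
        - 0.5  * dist_red_to_blue                # encourage moving it toward the blue block
        - 0.01 * float(np.square(action).sum())  # small control penalty
        + 10.0 * float(stacked)                  # large terminal bonus for a successful stack
    )
```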

In conversations with other RL researchers, I have heard several anecdotes about novel behaviors caused by improperly defined reward functions.

A colleague was teaching an agent to navigate a room. The episode terminated if the agent walked out of bounds, and he added no penalty for ending the episode this way. The final learned policy was suicidal: negative reward was plentiful and positive reward was hard to get, so from the agent's point of view a quick death ending in 0 reward was preferable to a long life that risked accumulating negative reward.

A friend was training a simulated robot arm to reach a point above a table. The point was defined relative to the table, and the table was not anchored to anything. The policy learned to slam the table hard enough to tip it over, which moved the target point as well; the target ended up falling right next to the end of the arm.

A researcher used RL to train a simulated robot arm to pick up a hammer and hammer in a nail. Initially the reward was defined by how far the nail was driven into the hole, so instead of picking up the hammer, the robot hammered the nail in with its own limbs. The researcher then added a reward term to encourage picking up the hammer and retrained the policy. The policy did pick up the hammer, but then dropped it on the nail rather than actually using it.
Admittedly these are all secondhand stories, but the behaviors sound entirely plausible. I have been burned by RL too many times to doubt them.

I know people like to tell alarmist stories about the paperclip optimizer, always imagining some exotic AGI to go with them. There is no need to invent such stories out of thin air when real failures like these happen every day.

Even with a well-designed reward function, it is hard to escape local optima

The RL examples above are sometimes called "reward hacking". Normally, the AI maximizes the reward function by legitimate means, but in some cases it cheats, finding a way to drive the reward straight to its maximum; sometimes a clever, out-of-the-box solution earns more reward than the answer the reward designer intended.

Reward hacking is the exception; the more common case is a poor local optimum arising from the exploration-exploitation trade-off.

The following video shows a policy learned with Normalized Advantage Functions in the HalfCheetah environment.

Video 8: https://v.qq.com/x/page/g05663mtu9m.html

To an outside observer, this robot looks silly. But we can only say that because we have a third-person view and a pile of prior knowledge telling us that running on your feet is better. RL doesn't know that! It sees a state, it takes an action, and it sees that it got positive reward. That's it.
