Deep Reinforcement Learning part 1 — Hello World !!!!


Duy Anh Nguyen
17 min read · Jul 2, 2020


Let’s talk about the nature of learning. We are not born knowing much. Over the course of our lifetimes, we slowly gain an understanding of the world through interaction. We learn about cause and effect, or how the world responds to our actions. Once we have an understanding of how the world works, we can use our knowledge to accomplish specific goals. In this post, we’ll take a stab at attaining a scientific understanding of how this learning from interaction happens. Specifically, we’ll take a computational approach called reinforcement learning, or RL for short. Since the world is quite complicated, we’ll simplify it and study environments with well-defined rules and dynamics. We’ll then construct algorithms that teach an individual in this simple world to learn from interaction. We’ll study many of these algorithms to understand the strengths and limitations of each. We’ll begin with the basics to build a solid understanding of the structure behind these learning algorithms.

The applications of reinforcement learning are numerous and diverse, ranging from self-driving cars to board games. It’s pretty amazing that it was possible to teach an artificially intelligent agent how to play board games at all. More recently, progress was made on a game that is much more complicated. Maybe you’ve heard of AlphaGo, an AI agent trained to beat professional Go players. It’s said that there are more possible configurations in Go than there are atoms in the universe.


RL is also used to play video games such as Atari Breakout. The AI agent is given no prior knowledge of what a ball is or what the controls do. It only sees the screen and its score. Then, by interacting with the game and testing out the various controls, it’s able to devise a strategy to maximize its score.


Jumping to a completely different domain, RL is also used in robotics. For instance, it’s been used to teach robots to walk. The idea is that we can give the robot time to test out its new legs to see what works and what doesn’t work for staying upright. Then we can create an algorithm to help it learn from that gained experience, so it’s able to walk like a pro. But why teach a robot to walk when you can teach it to drive? RL is used successfully in self-driving cars, ships, and airplanes. It’s even been used in finance, biology, telecommunication, and inventory management among other things.


Imagine a cute puppy that learns from trial and error how to behave in an environment to maximize reward. The puppy sets the stage for what we mean by an agent. But what do we mean when we talk about reinforcement learning in general? Well, you might be surprised to hear that not much changes when we trade this puppy for a self-driving car, a robot, or, more generally, any reinforcement learning agent. In particular, the RL framework is characterized by an agent learning to interact with its environment.

We assume that time evolves in discrete timesteps. At the initial timestep, the agent observes the environment. You can think of this observation as a situation that the environment presents to the agent. Then, it must select an appropriate action in response. At the next timestep, in response to the agent’s action, the environment presents a new situation to the agent. At the same time, the environment gives the agent a reward, which provides some indication of whether the agent has responded appropriately to the environment. Then the process continues: at each timestep the environment sends the agent an observation and a reward, and in response, the agent must choose an action. In general, we don’t need to assume that the environment shows the agent everything it needs to make well-informed decisions.

Let’s add some notation, starting again from the very beginning at timestep zero. The agent first receives the environment state, which we denote by S0, where zero stands for timestep zero. Then, based on that observation, the agent chooses an action, A0. At the next timestep, in this case timestep one, as a direct consequence of the agent’s choice of action, A0, and the environment’s previous state, S0, the environment transitions to a new state, S1, and gives some reward, R1, to the agent. The agent then chooses an action, A1. At timestep two, the process continues: the environment passes the reward R2 and the state S2, the agent responds with an action A2, and so on. As the agent interacts with the environment, this interaction manifests as a sequence of states, actions, and rewards. That said, the reward will always be the most relevant quantity to the agent. To be specific, any agent has the goal of maximizing expected cumulative reward, or the sum of the rewards attained over all timesteps. In other words, it seeks a strategy for choosing actions under which the cumulative reward is likely to be quite high.
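The interaction loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `CorridorEnv` environment, its reward of 1 for reaching the goal, and the `run_episode` helper are all hypothetical names invented for this example, not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk along a corridor to reach the far end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start of the interaction: the environment presents the initial state S0.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # assumed reward: 1 only when the goal is reached
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the agent-environment loop and record (S_t, A_t, R_{t+1}, S_{t+1}) tuples."""
    state = env.reset()
    trajectory = []
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                       # agent chooses A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        total_reward += reward                       # cumulative reward so far
        state = next_state
        if done:
            break
    return trajectory, total_reward


def random_policy(state):
    # An agent that has learned nothing yet: it picks left or right at random.
    return random.choice([-1, 1])


trajectory, total_reward = run_episode(CorridorEnv(), random_policy)
print(len(trajectory), total_reward)
```

A learning algorithm would replace `random_policy` with a strategy that improves as the agent gathers experience.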

Episodic vs. Continuing Tasks

Say we’re teaching an agent to play a game. Then, the interaction ends when the agent wins or loses. Or we might be running a simulation to teach a car to drive. Then, the interaction ends if the car crashes. Of course, not all reinforcement learning tasks have a well-defined ending point, but those that do are called episodic tasks.

In this case, we’ll refer to a complete sequence of interaction from start to finish as an episode. When the episode ends, the agent looks at the total amount of reward it received to figure out how well it did. It’s then able to start from scratch, as if it had been completely reborn into the same environment, but now with the added knowledge of what happened in its past life. In this way, as time passes over its many lives, the agent makes better and better decisions, and you’ll see this for yourself in your coding implementations. Once your agents have spent enough time getting to know the environment, they should be able to pick a strategy where the cumulative reward is quite high. In other words, in the context of a game-playing agent, it should be able to achieve a higher score. So episodic tasks are tasks with a well-defined ending point. We’ll also look at tasks that go on forever, without end. Those are called continuing tasks. For instance, an algorithm that buys and sells stocks in response to the financial market would be best modeled as an agent in a continuing task.

The Reward Hypothesis

We’ve discussed the diverse applications of reinforcement learning, ranging from a car learning to drive itself to an agent learning to play Atari games. Each has a defining agent and environment, and each agent has a goal. It’s truly amazing that all of these very different goals can be addressed with the same theoretical framework. So far, we’ve made sense of the idea of reward from the perspective of a puppy that interacts with its owner. In this case, the state at each timestep was the command that the owner communicated to the puppy, the action was the puppy’s response, and the reward was just the number of treats. And like a good reinforcement learning agent, the puppy seeks to maximize that reward. In this case, the idea of reward comes naturally.

And it lines up well with the way we think about teaching a puppy. But in fact, the Reinforcement Learning Framework has any and all agents formulate their goals in terms of maximizing expected cumulative reward.

It’s important to note that the word “reinforcement” in “reinforcement learning” is a term originally from behavioral science. It refers to a stimulus that’s delivered immediately after a behavior to make the behavior more likely to occur in the future. The fact that this name is borrowed is no coincidence. In fact, it’s an important defining hypothesis in reinforcement learning that we can always formulate an agent’s goal along the lines of maximizing expected cumulative reward. We call this hypothesis the “Reward Hypothesis”. If this still seems weird or uncomfortable to you, you are not alone.

Goals and Rewards

So, I’d like to talk to you about some research that I find particularly interesting. I think it’s a great example to illustrate the reward hypothesis introduced above. Google DeepMind recently addressed the problem of teaching a robot to walk. Among other problem domains, they worked with a physical simulation of a humanoid robot, and they managed to apply some nice reinforcement learning to get great results. In order to frame this as a reinforcement learning problem, we’ll have to specify the states, actions, and rewards. We’ll begin by detailing the actions. These are the decisions that need to be made in order for the robot to walk. Now, the humanoid has several joints, and the actions are just the forces that the robot applies to its joints in order to move. If the robot has an intelligent method for deciding these forces at every point in time, that will be sufficient to get it walking. And what about the states?

The states are the context provided to the agent for choosing intelligent actions. In this example, the state at any point in time contains the current positions and velocities of all of the joints, along with some measurements of the surface that the robot is standing on. These measurements capture how flat or inclined the ground is, whether there is a large step along the path, and so on. The researchers at Google DeepMind also added contact sensor data, so that it was possible to determine whether the robot was still walking or had fallen over.

So far, we’ve been trying to frame the idea of a humanoid learning to walk in the context of reinforcement learning. We’ve detailed the states and actions, and we still need to specify the rewards. The reward structure from the DeepMind paper is surprisingly intuitive. The reward function, given in the appendix of the DeepMind paper, describes how the reward is decided at every time step. Each term communicates to the agent some part of what we’d like it to accomplish. So let’s look at each term individually.

To begin, at every time step, the agent receives a reward proportional to its forward velocity. So if it moves faster, it gets more reward, but only up to a limit, here denoted Vmax. It’s also penalized by an amount proportional to the force applied to each joint. So if the agent applies more force to the joints, then more reward is taken away as punishment. Since the researchers also wanted the humanoid to focus on moving forward, the agent is also penalized for moving left, right, or vertically. It is also penalized if the humanoid moves its body away from the center of the track. So the agent will try to keep the humanoid as close to the center as possible. At every time step, the agent also receives some positive reward if the humanoid has not yet fallen. The researchers frame the problem as an episodic task where, if the humanoid falls, the episode is terminated. At this point, whatever cumulative reward the agent had at that time is all it’s ever going to get.

In this way, the reward signal is designed so that if the robot focuses entirely on maximizing this reward, it will also learn to walk as a by-product. To see this, first note that if the robot falls, the episode terminates, and that’s a missed opportunity to collect more of this positive reward. In general, if the robot walks for ten time steps, that’s only 10 opportunities to get reward. If it stays walking for 100, that’s a lot more time to collect more reward. So if we design the reward in this way, the agent will try to keep from falling for as long as possible. Next, since the reward is proportional to the forward velocity, the robot also feels pressured to walk as quickly as possible in the direction of the walking track. But it also makes sense to penalize the agent for applying too much force to the joints. Otherwise, we could end up with a situation where the humanoid walks too erratically. By penalizing large forces, we can try to keep the movements smooth and elegant.
Likewise, we want to keep the agent on the track and moving forward. Otherwise, who knows where it could end up walking off to. Of course, the robot can’t focus just on walking fast, or just on moving forward, or only on walking smoothly, or just on walking for as long as possible. It has to balance all of these competing requirements at once.
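To make the structure of such a reward concrete, here is a rough sketch of a per-time-step reward with the four kinds of terms described above. The function name, the argument names, the use of squared penalties, and every coefficient value below are assumptions chosen purely for illustration; they should not be read as the values from the DeepMind paper.

```python
def walking_reward(forward_velocity, lateral_velocity, vertical_velocity,
                   lateral_offset, joint_forces,
                   v_max=4.0, velocity_penalty=0.005, offset_penalty=0.05,
                   force_penalty=0.02, alive_bonus=0.02):
    """Illustrative per-step reward; all default coefficients are placeholders."""
    # Reward for moving forward, capped at v_max.
    forward_term = min(forward_velocity, v_max)
    # Penalty for moving sideways or vertically instead of forward.
    sideways_term = velocity_penalty * (lateral_velocity ** 2 + vertical_velocity ** 2)
    # Penalty for drifting away from the center of the track.
    center_term = offset_penalty * lateral_offset ** 2
    # Penalty for the effort applied to the joints.
    effort_term = force_penalty * sum(f ** 2 for f in joint_forces)
    # Small constant bonus for every step on which the humanoid has not fallen.
    return forward_term - sideways_term - center_term - effort_term + alive_bonus


# Same forward speed, but the second call uses much larger joint forces.
print(walking_reward(1.0, 0.0, 0.0, 0.0, [0.1, 0.1]))
print(walking_reward(1.0, 0.0, 0.0, 0.0, [2.0, 2.0]))
```

With these placeholder numbers, the second call returns a lower reward than the first, which is exactly the pressure toward smooth, low-effort movement discussed above.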

Cumulative Reward

We’ve seen that the reinforcement learning framework gives us a way to study how an agent can learn to accomplish a goal from interacting with its environment. This framework works for many real-world applications and simplifies the interaction into three signals that are passed between agent and environment. The state signal is the environment’s way of presenting a situation to the agent. The agent then responds with an action, which influences the environment. And the environment responds with the reward, which gives some indication of whether the agent has responded appropriately to the environment. Also built into the framework is the agent’s goal, which is to maximize cumulative reward. But what exactly does this mean, and how does the agent accomplish it? Could the agent just maximize the reward at each time step? The short answer is no. But a longer answer is a lot more satisfying, so let’s try to understand this with the walking robot example. Remember that in this case, the goal of the robot was to stay walking forward for as long and as quickly as possible while also exerting minimal effort. If the robot tried to maximize the reward it received at a single time step, that would look like trying to move as quickly as possible with as little effort as possible, without falling immediately. That could work well in the short term, but it’s possible, for instance, that the agent’s first movement gets it moving quickly without falling initially, yet is destabilizing enough that it dooms the agent to fall a short time later. In this way, if the agent focused on individual time steps, it could learn actions that maximize initial rewards. But then the episode terminates quite quickly, and so the cumulative reward is quite small. Worse still, in this case, the agent will not have learned to walk.
In this example, then, it’s clear that the agent cannot focus on individual time steps and instead needs to keep all time steps in mind. This also holds true for reinforcement learning agents in general. Actions have short- and long-term consequences, and the agent needs to gain some understanding of the complex effects its actions have on the environment. Along these lines, in the walking robot example, if the agent always has the reward at all time steps in mind, it will learn to choose movements designed for long-term stability. So the robot may move a bit more slowly, sacrificing a little reward, but this pays off because it avoids falling for longer and collects a higher cumulative reward.

We’ve discussed how an agent might choose actions with the goal of maximizing expected return, that is, expected cumulative reward, but we need to dig a bit deeper. For instance, consider our puppy agent. How does he predict how much reward he could get at any point in the future? Dogs can live for many years. Can he really be expected to have just as good an idea of how much reward he’ll get five years from now as he does of the reward he’ll get now? It makes more sense to acknowledge that it’s not entirely clear what the future holds, especially if the puppy is still learning, proposing and testing hypotheses, and changing his strategy. It’s unlikely that he’ll know one thousand time steps in advance what his reward potential is likely to be. In general, the puppy is likely to have a much better idea of what’s going to happen in the near future than at distant time points. Along these lines, then, should present reward carry the same weight as future reward? Maybe it makes more sense to value rewards that come sooner more highly, since those rewards are more predictable. The idea is that we’ll maximize a different sum, in which rewards that are farther along in time are multiplied by smaller values. We refer to this sum as the discounted return. By discounted, we mean that we’ll change the goal to care more about immediate rewards than about rewards that are received further in the future. But how do we choose what values to use here? Well, in practice, we’ll define what’s called a discount rate, which is usually denoted by the Greek letter gamma, and is always a number between zero and one.
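As a small worked example, one common way to build the discounted return is to multiply the reward that arrives k steps later by gamma raised to the power k. The sketch below assumes that convention; the function name and the sample reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    g = 0.0
    # Work backwards through the rewards: each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards at three consecutive time steps
print(discounted_return(rewards, 1.0))  # 3.0: no discounting, the plain sum
print(discounted_return(rewards, 0.5))  # 1.75 = 1 + 0.5 + 0.25
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
```

A gamma close to one makes the agent far-sighted, while a gamma close to zero makes it care almost only about the most immediate reward.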

Markov Decision Process (MDP)

So far, we’ve just started a conversation to set the stage for what we’d like to accomplish. We’ll use the remainder of this lesson to specify a rigorous definition of the reinforcement learning problem. For context, we’ll work with the example of a recycling robot from the Sutton and Barto textbook. So consider a robot that’s designed for picking up empty soda cans. The robot is equipped with arms to grab the cans and runs on a rechargeable battery. There’s a docking station set up in one corner of the room, and the robot has to sit at the station if it needs to recharge its battery. Say you’re trying to program this robot to collect empty soda cans without human intervention. In particular, you want the robot to be able to decide for itself when it needs to recharge its battery. And whenever it doesn’t need to recharge, you want it to focus on collecting as many soda cans as possible. So let’s see if we can frame this as a reinforcement learning problem. We’ll begin with the actions. We’ll say the robot is capable of executing three potential actions. It can search the room for cans, it can head to the docking station to recharge its battery, or it can stay put in the hope that someone brings it a can.

We refer to the set of possible actions as the action space, and it’s common to denote it with a script A. What about the states? Remember, the states are just the context provided to the agent for choosing intelligent actions. So the state, in this case, could be the charge left on the robot’s battery. For simplicity, we’ll assume that the battery has one of two states: one corresponding to a high amount of charge left, and the other corresponding to a low amount of charge. We refer to the set of possible states as the state space, and it’s common to denote it with a script S.

So we’re working with an example of a recycling robot, and we’ve already detailed the states and actions. In this example, remember that the state corresponds to the charge left on the robot’s battery, and there are two potential states, high and low. As a first step, consider the case where the charge on the battery is high. Then, the robot could choose to search, wait, or recharge. But actually, recharging doesn’t make much sense if the battery is already high, so we’ll say that the only options are to search or wait. All right, so if the agent chooses to search, then at the next time step, the state could be high or low. Let’s say that with 70 percent probability, it stays high, so there’s a 30 percent chance the battery switches to low. In both cases, we’ll say that this decision to search led to the robot collecting exactly four cans. In line with this, the environment gives the agent a reward of four. The other option is to wait. If the robot has a high battery and then decides to wait, well, waiting doesn’t use any battery at all, so we’ll say that it’s guaranteed that the battery will again be high at the next time step. In this case, we’ll suppose that since the robot wasn’t out actively searching, it’s able to collect fewer cans, and say it’s delivered just one can. Again in line with this, the environment gives the agent a reward of one.

On to the case where the battery is low. Again, the robot has three options. If the battery is low and it chooses to wait for people to bring cans, that doesn’t use any battery, so the state at the next time step is going to be low. And just like when the robot decided to wait when the battery was high, the agent gets a reward of one. If the robot recharges, then it goes back to the docking station and the state at the next time step is guaranteed to be high. Say it collects no cans along the way and gets a reward of zero. And if it searches, well, that’s risky. It’s possible that it gets away with this, and then at the next time step the battery is still low but not entirely depleted. But it’s more likely that the robot depletes its battery, has to be rescued, and is carried to a docking station to be charged, so the charge on its battery at the next time step is high. Say the robot depletes its battery with 80 percent probability and otherwise gets away with that risky action with 20 percent probability. As for the reward, if the robot needs to be rescued, we want to make sure we’re punishing the robot, so say we don’t look at all at the number of cans it was able to collect and we just give the robot a reward of negative three. But if the robot gets away with it, it collects four cans and gets a reward of four. This completely characterizes one method that the environment could use to decide the next state and reward at any point in time.
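The one-step dynamics just described can be written down as a small lookup table. The sketch below uses the probabilities and rewards from the text; the names `DYNAMICS`, `available_actions`, `expected_reward`, and `step` are hypothetical helpers introduced only for this example.

```python
import random

# (state, action) -> list of (probability, next_state, reward), as described in the text.
DYNAMICS = {
    ("high", "search"):  [(0.7, "high", 4), (0.3, "low", 4)],
    ("high", "wait"):    [(1.0, "high", 1)],
    ("low", "wait"):     [(1.0, "low", 1)],
    ("low", "recharge"): [(1.0, "high", 0)],
    ("low", "search"):   [(0.8, "high", -3), (0.2, "low", 4)],
}


def available_actions(state):
    """Actions the robot may take in a given state (no recharging when high)."""
    return [a for (s, a) in DYNAMICS if s == state]


def expected_reward(state, action):
    """Average immediate reward for taking an action in a state."""
    return sum(p * r for p, _, r in DYNAMICS[(state, action)])


def step(state, action, rng=random):
    """Sample the next state and reward from the one-step dynamics."""
    outcomes = DYNAMICS[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


print(available_actions("high"))
print(expected_reward("low", "search"))
print(step("low", "search"))
```

Under these numbers, searching with a low battery has an expected immediate reward of 0.8 × (−3) + 0.2 × 4 = −1.6, which is worse than waiting (reward 1) or recharging (reward 0).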

Now that we’ve looked at an example, you should have the necessary intuition to understand the formal definition of the reinforcement learning framework. So, formally, a Markov decision process, or MDP, is defined by the set of states, the set of actions, and the set of rewards, along with the one-step dynamics of the environment and the discount rate. We’ve detailed the states, actions, rewards, and one-step dynamics of the environment, but we will also need to talk about the discount rate.
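For readers who like notation, the one-step dynamics are commonly written as a conditional probability over the next state and reward. The form below follows the convention used in the Sutton and Barto textbook and is offered as a sketch; the two numerical instances are read off the recycling robot example above, and `\text` assumes the amsmath package.

```latex
% One-step dynamics: probability of next state s' and reward r,
% given the current state s and action a.
\[
p(s', r \mid s, a) \;=\; \Pr\left(S_{t+1} = s',\; R_{t+1} = r \;\middle|\; S_t = s,\; A_t = a\right)
\]

% Two instances from the recycling robot example.
\[
p(\text{high}, 4 \mid \text{high}, \text{search}) = 0.7,
\qquad
p(\text{high}, -3 \mid \text{low}, \text{search}) = 0.8
\]
```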