We said reinforcement learning was the interaction between an agent and the environment. The agent is a decision maker, it has a set of actions, a ∈ A(s), that depend of the state, s ∈ S, of the environment. The environment produce a reward, r ∈ R, and the goal of the agent is to maximize the total reward by interacting with the environment.
A markov decision problem MDP is a specific problem of reinforcement learning. In essence MDP define the probability distribution of the environment states. It do so by stated that the probability of state s' only depends on the action taken and the previous state of the environment
Pr(s'|s, a) = Pr{St = s'| St-1 = s, At-1 = a}
So the function Pr(s'|s, a) return a number between 0 and 1, Pr(s'|s, a): ----------> (0, 1) and
∑s'∈ SPr(s'|s,a) = 1. We need a number between 0 and 1 for every possible combination of previous states, action and the state in the next period.
Small MDP
Here we do a small example similarly to the Udacity course Lets assume the environment is a matrix of 2x2 each of the positions are one state of the environment. Depending on the position the agent is in, it will get the following reward :
[ 1..........0.8 ]
[-0.4.........3 ]
The agent has the option to move up (U) right (R) down (D) or left (L). Every time the agent has to move it executes correctly his action with probability 0.8. There is a probability of .1 of moving in a right angle a probability of 0.1 of staying in the same state. For example if the agent is in the top left corner and moves up, then with probability .1 ends in the top right corner and with probability .1 the agents end in the bottom left corner.
With this MDP we can give a specific number to the probabilities of transition to one state to another given that we take some action. For example if we are in the first state, s = 1 (which refer to the row 1 column 1 of the state matrix) then we can compute the 4 x 4 tables of probabilities Pr(s' | a, s t = 1):
P(st+1 = 1 |U, s t = 1) P(st+1 = 1 |D, s t = 1) P(st+1 = 1 |L, s t = 1) P(st+1 = 1 |R, s t = 1)
P(st+1 = 2|U, s t = 1) P(st+1 = 2 |D, s t = 1) P(st+1 = 2 |L, s t = 1) P(st+1 = 2 |R, s t = 1)
P(st+1 = 3 |U, s t = 1) P(st+1 = 3 |D, s t = 1) P(st+1 = 3 |L, s t = 1) P(st+1 = 3 |R, s t = 1)
P(st+1 = 4 |U, s t = 1) P(st+1 = 4 |D, s t = 1) P(st+1 = 4 |L, s t = 1) P(st+1 = 4 |R, s t = 1)
If the agent is in state 1 and move up, then the agent has a .8 probability of been in state 1 at time t+1. The agent can move up in state 1 but ending in state 2 (row 1 column 2 of the matrix) with probability 0.1 and also the agent can end in state 3 (row 2 column 1) with probability .1. Doing this analysis for all possible action result in the matrix:
ps1 = [[.8, .1, .8, .1],
[.1, .1, .1, .8],
[.1, .8, .1, .1],
[0., 0., 0., 0.]]
But also we have to do this analysis for the 4 states! So we end up with the following 4 4x4 matrix with transition probabilities:
ps2 = [[.1, .1, .8, .1],
[.8, .1, .1, .8],
[0., 0., 0., 0.],
[.1, .8, .1, .1]]
ps3 = [[.8, 0., .1, .1],
[0., .1, 0., 0.],
[.1, .8, .8, .1],
[.1, .1, .1, .8]]
ps4 = [[0., 0., 0., 0.],
[.8, .1, .1, .1],
[.1, .1, .8, .1],
[.1, .8, .1, .8]]
Remember that our goal is to maximize the reward by choosing the best action given that we end up in a particular state. So if the agent is in state 3 what should he do? The answer of this question is associated with the value of each state. In this case is easy to see that our best action are the ones that can get us to state 4 and never leave from there because we get a reward of 3 each time we are in state 4. If we start at state one then our best is to go right and then down and then either down or right forever so we can stay in state 3 (with the highest probability of being in state 3).
The optimal action is connected with the value of one state. If one state has greater value than other, the agent should move to that direction. Remember value of state v(s) is different from the reward of one state. The value is a expected discounted sum of returns, when following an optimal policy.
Lets recall the bellman equation:
V(s) = R(s) + arg maxa γΣs' Pr (s'|a, s) V(s')
With γ = .8
To find the values V(s) for every state we can do it using a recursive algorith. We start with an assumption V0(s) = 0 ∀ S. Then we compute the bellman equation and we get a new V1(s) we do this until some convergence factor, for example when max{ |Vt+1 (s) - Vt (s) |< δ} with δ = 0.00001.
Vt+1 = arg maxa γΣs' Pr (s'|a, s) Vt(s').
Here is the code in python to find the values of the states:
##U(s_1) = R(s_1) + gamma * max{a}(Suma(p(s'|a, s_1)*U(s')))
gamma = .8
Prsas = [np.array(ps1), np.array(ps2), np.array(ps3), np.array(ps4)]
u0 = [0, 0, 0, 0]
Rs = np.ravel(returns)
delta = 0.00001
convergence = 1
while(abs(convergence) > delta):
rewA = []
for i in range(4):
usN = Rs[i] + gamma*np.max(np.sum(Prsas[i]*u0, 0))
rewA.append(usN)
convergence = np.max(np.array(rewA) - np.array(u0))
u0 = rewA
V(s) for each of the states are: [10.9, 12.8, 11.6, 14.9] which are very different from the immediate return. You can play with the value of gamma to find different V(s). For example if we do γ = 0.4 then V(s) = [0.9, 2.8, 1.5, 4.9]. This happens because gamma control how we evaluate future rewards. A gamma near to 1 means that we don't care if we receive some reward 2 periods from now than some reward 100 periods from now. A gamma close to 0 is that we only value the rewards receive in the immediate period.
Thanks for reading
Hola @anfego, upv0t3
Este es un servicio gratuito para nuevos usuarios de steemit, para apoyarlos y motivarlos a seguir generando contenido de valor para la comunidad.
<3 Este es un corazón, o un helado, tu eliges .
: )
N0. R4ND0M:
6592 8846 8145 2588
3093 5629 2323 8709
3336 6280 1227 9702
4348 9144 1825 9966
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Congratulations @anfego! You received a personal award!
Click here to view your Board of Honor
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Congratulations @anfego! You received a personal award!
You can view your badges on your Steem Board and compare to others on the Steem Ranking
Vote for @Steemitboard as a witness to get one more award and increased upvotes!
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit