Some thoughts on Reinforcement Learning - Q Learning

Q learning

I just completed a Reinforcement Learning assignment - in particular on Q-learning. According to Wikipedia here, it’s a model-free RL algorithm. The goal of the algorithm is to learn a policy, which tells an agent what action to take under different circumstances.

Here’s my confession: what I’m doing in this post is summarising what I’ve just learnt so that I can come back to it at any point in the future. Hence it may or may not make sense to you.

  • RL is rather different from a conventional ML process: it contains a built-in iterative process to refresh the parameters
  • You can train the algorithm until the ‘total reward’ (in ML lingo, roughly an error-type measure) converges. I will refer to each training run as a simulation
  • To set up the learning process, I first initialize a Q-table of zeros with dimensions number of states * number of actions (in Python lingo, self.q = np.zeros((num_states, num_actions), dtype=np.float64))
  • In every step of each simulation, the algorithm draws a random float between 0 and 1. If it is less than the exploration threshold, it picks a random action; otherwise it picks the action with the best known outcome. According to the literature, this exploration plays a role in improving the results (see the sketch after this list)
  • From the second step onward, the algorithm updates the Q-table as follows: self.q[self.s, self.a] = (1 - self.alpha) * self.q[self.s, self.a] + self.alpha * (r + self.gamma * np.max(self.q[s_prime, :]))
  • In words, the formula computes a weighted average of the current Q-value for that state and action and a new estimate: the immediate reward plus the discounted value of the best action in the next state
  • The updated value is then used whenever the algorithm chooses greedily: in later steps, if the choice is non-random, it picks the action with the highest Q-value for the current state
  • When the current simulation ends, the agent returns to the starting point and trains again with the refreshed Q-table
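
To make the bullets above concrete, here is a minimal sketch of a tabular Q-learner in Python. It is a sketch under my own assumptions: the class name, the method names (query_set_state, query) and the default values for alpha, gamma and epsilon are illustrative choices, not the assignment’s exact code.

    import numpy as np

    class QLearner:
        def __init__(self, num_states, num_actions, alpha=0.2, gamma=0.9, epsilon=0.1):
            self.num_actions = num_actions
            self.alpha = alpha      # learning rate: weight given to the new estimate
            self.gamma = gamma      # discount factor for future rewards
            self.epsilon = epsilon  # exploration threshold
            # Q-table of zeros: one row per state, one column per action
            self.q = np.zeros((num_states, num_actions), dtype=np.float64)
            self.s = 0
            self.a = 0

        def _choose_action(self, s):
            # Explore with probability epsilon, otherwise exploit the best known action
            if np.random.random() < self.epsilon:
                return np.random.randint(self.num_actions)
            return int(np.argmax(self.q[s, :]))

        def query_set_state(self, s):
            # Pick the first action of a simulation without updating the table
            self.s, self.a = s, self._choose_action(s)
            return self.a

        def query(self, s_prime, r):
            # Weighted update for the previous (state, action), then pick the next action
            best_future = np.max(self.q[s_prime, :])
            self.q[self.s, self.a] = ((1 - self.alpha) * self.q[self.s, self.a]
                                      + self.alpha * (r + self.gamma * best_future))
            self.s, self.a = s_prime, self._choose_action(s_prime)
            return self.a

The query method is where the two ideas meet: it first applies the weighted update for the previous step, then chooses the next action epsilon-greedily for the new state.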

Dyna-Q

  • The additional Dyna component is deemed cheaper because it doesn’t require additional interaction with the external environment to refresh the Q-table
  • It will instead pick random states and actions in each planning loop
  • And sample a new state based on a probability mass function in each loop (each state is assigned a discrete chance. Say there are 100 states where one state has a 2% chance and another has a 1% chance; the latter might still be selected, albeit less often)
  • What’s different here is that it refreshes the reward information too, using the learned model rather than a fresh interaction (see the sketch after this list)
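
Below is a rough sketch of that planning loop, again under my own assumptions: the names t_counts (visit counts for the learned transition model) and r_model (learned expected reward), and their shapes, are illustrative rather than the assignment’s actual variables.

    import numpy as np

    def dyna_planning(q, t_counts, r_model, alpha, gamma, n_planning=200):
        # q:        Q-table, shape (num_states, num_actions)
        # t_counts: visit counts, shape (num_states, num_actions, num_states)
        # r_model:  learned expected reward, shape (num_states, num_actions)
        num_states, num_actions = q.shape
        for _ in range(n_planning):
            # Pick a random state and action; no interaction with the real environment
            s = np.random.randint(num_states)
            a = np.random.randint(num_actions)
            counts = t_counts[s, a]
            if counts.sum() == 0:
                continue  # this pair was never visited, so there is nothing to simulate
            # Sample the next state from the empirical probability mass function
            probs = counts / counts.sum()
            s_prime = np.random.choice(num_states, p=probs)
            r = r_model[s, a]  # the reward also comes from the learned model
            # Same weighted Q-update as before, driven by the model instead of the world
            q[s, a] = (1 - alpha) * q[s, a] + alpha * (r + gamma * np.max(q[s_prime, :]))
        return q

Typically a loop like this runs after each real step, so a single real interaction is stretched into many cheap simulated updates.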

Insights from this exercise

The strength of this algorithm lies in the fact that it doesn’t require a ton of data. I’m excited to apply it if the chance comes up.
