Reinforcement Learning

4 important questions on Reinforcement Learning


We are going to apply reinforcement learning to support a user in becoming more active. We measure the activity level and activity type of a person and want to provide suggestions to that person based on their measured state (examples of advice could be: do activity x, stop activity y, etc.).
(3 pt) Explain what the Markov Property means (you can relate your explanation to this specific example or you can also explain it in general if you want).

The Markov property states that the probability of ending up in a next state with a certain reward, given the current state and action, is determined solely by that current state and action. There is no need to consider the history before the current state and action; equivalently, the probability is the same as when the entire history is taken into account.
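In standard MDP notation (not part of the original answer; s denotes states, a actions, r rewards), the property can be written as:

$$P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)$$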

(4 pt) Explain how the one step Q-learning algorithm works.

Q-learning maintains so-called Q-values for state-action pairs, which indicate the expected (cumulative) reward for selecting an action in a given state. Actions are selected according to some selection approach (e.g. ε-greedy). After taking the selected action, the algorithm updates the Q-value of that state-action pair using the reward obtained plus the discounted expected future reward, where the latter is taken as the maximum Q-value over the actions available in the next state.
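A minimal sketch of this one-step update in a tabular setting (the action names, learning rate, and discount factor are illustrative assumptions, not part of the original question):

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.9   # discount factor (illustrative value)
ACTIONS = ["suggest_walk", "suggest_rest", "do_nothing"]  # hypothetical actions

# Q-values for state-action pairs, defaulting to 0.0
Q = defaultdict(float)

def q_learning_update(state, action, reward, next_state):
    """One-step Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
```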

(4 pt) Some of the measurements we perform are continuous (specifically, the activity level is), would this be a problem for SARSA or Q-learning? Argue why (not).

Yes, this would be a problem: in their most rudimentary (tabular) form, SARSA and Q-learning maintain an expected reward value for every state-action pair, and the number of states is infinite when the state is defined on a continuous measurement. Hence, we cannot store all of these values. A solution (which does not have to be part of your answer) is to use a dedicated approach that discretizes the continuous state space.
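A minimal sketch of such a discretization, assuming the activity level is measured on a 0–1 scale (the bin count and state labels are illustrative):

```python
def discretize_activity_level(level, n_bins=5):
    """Map a continuous activity level in [0, 1] to one of n_bins discrete
    states, so that tabular SARSA/Q-learning can be applied."""
    level = min(max(level, 0.0), 1.0)              # clip to the assumed [0, 1] range
    bin_index = min(int(level * n_bins), n_bins - 1)
    return f"activity_bin_{bin_index}"

# Example: 0.73 falls into the fourth of five equally sized bins
print(discretize_activity_level(0.73))  # -> activity_bin_3
```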

(4 pt) We have the choice to either apply an ε-greedy approach or a softmax approach to select the actions. We know that the person we are supporting does not change at all in terms of responses to messages. Which one of the two approaches would be most suitable to use? Argue your choice.

Given that the user does not change their preferences, exploration is only needed in the beginning to find the most suitable messages; once these are found, we can simply exploit without any further exploration. The ε-greedy approach keeps exploring with a fixed probability ε regardless of what has been learned, whereas softmax increasingly concentrates its selection probability on the actions with the highest Q-values, so the amount of exploration effectively decreases over time. Hence, the latter (softmax) would be most suitable.
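A minimal sketch of both selection rules applied to the Q-values of a single state (the ε and temperature values, and the hypothetical Q-values, are illustrative assumptions):

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.
    The amount of exploration stays fixed at epsilon, no matter how good the
    Q-value estimates already are."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def softmax_selection(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature).
    As the Q-value of the best message grows relative to the others, selection
    concentrates on that action, so exploration effectively decreases."""
    prefs = {a: math.exp(q / temperature) for a, q in q_values.items()}
    total = sum(prefs.values())
    r = random.random() * total
    cumulative = 0.0
    for action, p in prefs.items():
        cumulative += p
        if r <= cumulative:
            return action
    return action  # fallback for floating-point edge cases

# Example usage with hypothetical Q-values for one state
q = {"suggest_walk": 2.0, "suggest_rest": 0.5, "do_nothing": -1.0}
print(epsilon_greedy(q), softmax_selection(q))
```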
