With MDPs we have a set of states, a set of actions to choose from, an immediate reward function, and a probabilistic state transition function. Our goal is to derive a mapping from states to actions, which represents the best action to take in each state for a given horizon length. This is easy if we know the value of each state for this horizon length. The value iteration algorithm computes this value function by finding a sequence of value functions, each one derived from the previous one.
The value iteration algorithm starts by trying to find the value function for a horizon length of 1. This will be the value of each state given that we only need to make a single decision. There isn't much to do to find this in an MDP. Recall that we have the immediate rewards, which specify how good each action is in each state. Since our horizon length is 1, we do not need to consider any future effects (there is no future). Thus, we can simply look at the immediate rewards and choose the action with the highest immediate reward for each state.
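To make this concrete, here is a minimal sketch in Python (the discussion above gives no code, so the representation is an assumption: a table R[s][a] holding the immediate reward for taking action a in state s):

```python
# A minimal sketch, assuming R[s][a] is an immediate reward table
# indexed by state and action (this representation is hypothetical).

def horizon_1_value(states, actions, R):
    """Return the horizon 1 value and best action for each state."""
    V1, policy1 = {}, {}
    for s in states:
        # With one decision left, pick the action with the highest immediate reward.
        best_a = max(actions, key=lambda a: R[s][a])
        V1[s] = R[s][best_a]
        policy1[s] = best_a
    return V1, policy1
```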
The next step, which is the second iteration of the algorithm, is to determine the value function for a horizon length of 2. The value of acting when there are two steps to go is the immediate reward for the action you take now, plus the value of the action you will take next. Conveniently, we have already computed the value of each state for a horizon length of 1. So to find the value for horizon 2, we can just add the immediate effects of each possible action to the already computed value function, and pick the action with the best value given that there will be two decisions to be made.
Note that we will have to factor in the probabilistic effects of the actions as we go from having 2 actions left to having 1 action left. Thus, we really need to do a probabilistic weighting, or expected value computation, when using the horizon 1 value function to help find the horizon 2 value function.
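One way this backup step might look in code, under the same assumed R table as before plus an assumed transition table T[s][a][s2] giving the probability of landing in state s2 after taking action a in state s:

```python
# A hedged sketch of the horizon 2 backup: immediate reward plus the
# expected horizon 1 value of wherever the action takes us.
# T[s][a][s2] and V1 (from the previous sketch) are assumed inputs.

def horizon_2_value(states, actions, R, T, V1):
    """Return the horizon 2 value and best action for each state."""
    V2, policy2 = {}, {}
    for s in states:
        best_a, best_val = None, float("-inf")
        for a in actions:
            # Probabilistic weighting: expected value of the next state.
            expected_future = sum(T[s][a][s2] * V1[s2] for s2 in states)
            val = R[s][a] + expected_future
            if val > best_val:
                best_a, best_val = a, val
        V2[s], policy2[s] = best_val, best_a
    return V2, policy2
```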
Now the algorithm iterates again: it finds the horizon 3 value function from the horizon 2 value function. This continues until we have found the value function for the desired horizon.
We could give the formula for how to calculate one value function from another, but we promised there would be no formulas. All we will say is that it involves iterating over all the states and using the transition probabilities to weight the values. The formulas can be found almost anywhere people talk about MDPs, so just be content that it is easy to derive one value function from the previous one by iterating over the states.
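For readers who would rather see code than formulas, here is a sketch of how the whole iteration might be written, repeating the same backup step once per horizon; it reuses the assumed R and T tables from the earlier sketches and is not the only way to organize it:

```python
# A sketch of value iteration for a fixed horizon, under the assumed
# R[s][a] reward table and T[s][a][s2] transition table.

def value_iteration(states, actions, R, T, horizon):
    """Return the value function after the given number of decisions."""
    V = {s: 0.0 for s in states}  # horizon 0: no decisions left, no value
    for _ in range(horizon):
        V_next = {}
        for s in states:
            # Best action = immediate reward plus expected value of the next state.
            V_next[s] = max(
                R[s][a] + sum(T[s][a][s2] * V[s2] for s2 in states)
                for a in actions
            )
        V = V_next
    return V
```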