Successor features and generalised policy improvement

July 22, 2021

For this week’s personal Learning Day (I don’t work at OpenAI), I decided to explore Orbit – a platform that lets you intersperse Anki-style spaced repetition prompts in prose. Read Quantum Country to see it in action.

Together, successor features and generalised policy improvement (Barreto et al., 2016) form a powerful framework for transfer in reinforcement learning. In particular, they facilitate transfer between tasks that have different reward functions but otherwise identical environments. Formally, define an MDP as a tuple \(M = (S,A,p,r,\gamma)\) where:

- \(S\) is the set of states,
- \(A\) is the set of actions,
- \(p(s'|s,a)\) gives the transition dynamics,
- \(r(s,a,s')\) is the reward function, and
- \(\gamma \in [0,1)\) is the discount factor.

Successor features and generalised policy improvement (SFs & GPI) then facilitate transfer between tasks where everything is the same except for the reward function \(r(s,a,s')\).


Successor features

To understand how SFs & GPI facilitate such transfer, let’s first discuss what successor features are. Suppose we have features \(\phi(s,a,s')\) and a weight vector \(w\) such that \(r(s,a,s') = w \cdot \phi(s,a,s')\). That is, the reward function is a linear function of our features \(\phi\). For a discrete state space, one such feature map is a one-hot encoding of the state (see successor representations). Then \(w_i\) is simply the reward received in state \(s_i\).
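As a quick illustration (a minimal sketch; the four-state setup and variable names are my own, not from the paper), the one-hot case looks like this in code – a task is fully specified by its \(w\), and its reward is just a dot product with the features:

```python
import numpy as np

n_states = 4

def phi(s):
    """Illustrative one-hot feature: phi(s) indexes the current state."""
    e = np.zeros(n_states)
    e[s] = 1.0
    return e

# Task: reward of 1 in state 2, zero elsewhere.
w = np.array([0.0, 0.0, 1.0, 0.0])

r = w @ phi(2)   # 1.0 -- the reward is linear in phi
```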

Since the reward function is a linear function of our features \(\phi\), we can factor \(w\) out when computing discounted returns.

\[\begin{align*} R_t &= \sum_{k=0}^{\infty} \gamma ^k r_{t+k} \\ &= \sum_{k=0}^{\infty} \gamma ^k (w\cdot \phi_{t+k}) \\ &= w \cdot \Big( \sum_{k=0}^{\infty} \gamma ^k \phi_{t+k} \Big) \end{align*}\]

In particular, this means that you can write:

\[\begin{align*} Q^{\pi}(s,a) &= E^{\pi} [R_t \mid s_t = s, a_t=a ] \\ &= E^{\pi} \Big[ w \cdot \Big( \sum_{k=0}^{\infty} \gamma ^k \phi_{t+k} \Big) \Big| s_t = s, a_t=a \Big] \\ &= w \cdot E^{\pi} \Big[ \sum_{k=0}^{\infty} \gamma ^k \phi_{t+k} \Big| s_t = s, a_t=a \Big] \\ &= w \cdot \psi^{\pi}(s,a) \end{align*}\]

where \(\psi^{\pi}(s,a)\) is what we refer to as a successor feature. This is powerful because it separates reward prediction from future state prediction. Note that you can estimate \(\psi\) with any value-function estimation algorithm, just as you would normally estimate \(Q(s,a)\) directly. If a new task has a reward function that is linear in \(\phi\), and you can estimate its \(w\), then you can instantly compute \(Q^{\pi}(s,a)\) for the new task – you just need to swap out the old \(w\) for the new \(w'\).
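Here is a minimal tabular sketch of that idea (the function names, array shapes, SARSA-style update, and hyperparameters are illustrative assumptions, not the paper’s algorithm): \(\psi\) is updated exactly like an action-value table, except the “reward” is the feature vector \(\phi\), and \(Q\) for any task is then a single dot product.

```python
import numpy as np

# psi has shape (n_states, n_actions, d): one d-dimensional successor-feature
# vector per state-action pair, learned just like a Q-table.

def td_update_psi(psi, phi_t, s, a, s_next, a_next, gamma=0.95, lr=0.1):
    """One SARSA-style TD(0) update for successor features:
    psi(s,a) <- psi(s,a) + lr * (phi_t + gamma * psi(s',a') - psi(s,a))."""
    target = phi_t + gamma * psi[s_next, a_next]
    psi[s, a] += lr * (target - psi[s, a])

def q_from_psi(psi, w):
    """Recover Q^pi(s,a) = w . psi^pi(s,a) for any task with r = w . phi."""
    return psi @ w   # shape (n_states, n_actions)
```

Switching tasks then amounts to calling `q_from_psi(psi, w_new)` with the new task’s weights; \(\psi\) itself is untouched.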

To make this concrete, let’s consider a navigation task in a grid world. Suppose that \(\pi\) is a random-walk policy. Then, taking \(\phi(s)\) to be one-hot as described above, \((\psi(s))_i\) encodes the expected discounted future occupancy of square \(s_i\) starting from \(s\), which falls off with the distance \(d(s,s_i)\) between the two squares. Suppose there’s a single goal location, so that \(w\) is one-hot. If the goal moves to a new location, you simply change \(w\) accordingly and recompute \(Q^{\pi}(s,a)\) – you can transfer your “map” \(\psi\) between the two tasks.
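For intuition, here is a small sketch of that “map” at the state level, mirroring the \(\psi(s)\) in the paragraph above (a hypothetical 5×5 open grid with no walls, a uniform random-walk policy, and \(\gamma = 0.95\); the function and variable names are my own): the successor representation can be computed in closed form as \((I - \gamma P)^{-1}\), and moving the goal only changes \(w\).

```python
import numpy as np

def random_walk_sr(grid_w, grid_h, gamma=0.95):
    """State-level successor representation M = (I - gamma * P)^{-1}
    for a uniform random-walk policy on an open grid (no walls)."""
    n = grid_w * grid_h
    P = np.zeros((n, n))
    for x in range(grid_w):
        for y in range(grid_h):
            s = y * grid_w + x
            # Up/down/left/right moves; bumping into the boundary stays put.
            moves = [(min(x + 1, grid_w - 1), y), (max(x - 1, 0), y),
                     (x, min(y + 1, grid_h - 1)), (x, max(y - 1, 0))]
            for nx, ny in moves:
                P[s, ny * grid_w + nx] += 0.25
    return np.linalg.inv(np.eye(n) - gamma * P)

M = random_walk_sr(5, 5)           # the "map": discounted future occupancies
w = np.zeros(25); w[24] = 1.0      # one-hot reward at the goal square
V_old = M @ w                      # value of every square under the random walk
w_new = np.zeros(25); w_new[0] = 1.0   # move the goal: just swap w
V_new = M @ w_new                  # same map M, new values, no re-learning
```

Only the reward weights change between the two tasks; the map `M` is reused as-is.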


Generalised policy improvement

Successor features let you instantly recompute \(Q^{\pi}(s,a)\) for a new task once you know the new \(w\). But on their own, SFs don’t let you transfer behavior. That’s where generalised policy improvement (GPI) comes in. In essence, GPI says that you act greedily over the value functions of all your previous policies \(\pi_i\). Precisely, GPI says that you act according to:

\[\pi(s)= \text{argmax}_a \text{max}_i Q^{\pi_i}(s,a)\]

This guarantees that you do no worse than any of your previous policies (which may have been learned on different tasks). GPI is the part that lets you transfer useful behavior between tasks. Putting SFs & GPI together: given the new task’s \(w\), SFs let you instantly recompute \(Q^{\pi_i}(s,a)\) for every previous policy \(\pi_i\), and GPI then uses those value functions to transfer useful behavior to the new task.
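Putting the two pieces together in code (a minimal sketch under the same illustrative tabular setup as above; `psis` stacks the successor features of every previously learned policy):

```python
import numpy as np

def gpi_action(psis, w_new, s):
    """Generalised policy improvement with successor features.

    psis:  array of shape (n_policies, n_states, n_actions, d), the successor
           features psi^{pi_i}(s, a) of previously learned policies.
    w_new: reward weights of the new task, shape (d,).

    Evaluates every old policy on the new task (Q^{pi_i} = psi^{pi_i} . w_new)
    and returns argmax_a max_i Q^{pi_i}(s, a).
    """
    q_all = psis[:, s] @ w_new              # shape (n_policies, n_actions)
    return int(np.argmax(q_all.max(axis=0)))
```

Nothing about the new task needs to be learned from scratch here: each \(Q^{\pi_i}\) is just a dot product, and GPI guarantees the resulting policy does at least as well on the new task as any of the \(\pi_i\).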

