Is temporal discounting all you need?

April 3, 2022

Half-baked thoughts on efficient credit assignment, the (un)surprising effectiveness of temporal discounting and the advantage function, and how we might improve in the low-data limit with explanations and inverse planning…


Consider the episodic reinforcement learning setting, where you want to maximize the undiscounted return of a finite episode, \(R_t = \sum_{t'=t}^{T} r_{t'}\). Since the return is stochastic, we want to maximize its expectation, which we can do with a policy gradient:

\[\begin{aligned} \nabla_{\theta} V_{0} &=\nabla_{\theta} \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} r_{t}\right] \\ &=\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi\left(\mathbf{a}_{t} \mid \mathbf{h}_{t} ; \theta\right) R_{t}\right] \end{aligned}\]
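To make the estimator concrete, here is a minimal sketch in PyTorch (not from the post): a categorical policy, one fake pre-collected episode, and the score-function gradient with reward-to-go. The network shape and the toy data are placeholders for illustration.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, T = 4, 2, 10
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

# One fake episode of (h_t, a_t, r_t) tuples "collected" by the policy.
states = torch.randn(T, obs_dim)
actions = torch.randint(0, n_actions, (T,))
rewards = torch.randn(T)

# Reward-to-go R_t = sum_{t'=t}^{T} r_{t'}: reverse, cumulative-sum, reverse.
returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])

# Score-function estimator: grad V_0 = E[ sum_t grad log pi(a_t | h_t) R_t ].
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * returns).sum()  # minimizing this ascends the expected return
loss.backward()                      # gradients now sit in policy.parameters()
```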

Here \(R_t\) acts as the credit assigned to action \(a_t\) given state \(h_t\). This return may include causally irrelevant rewards which corrupt the credit assignment signal, creating unwanted variance. We can understand this intuitively with an example. Suppose that your goal is to have a maximally rewarding week. You always have a quiz on Friday, and your success on that quiz depends on how much you study for the quiz. So the causally relevant reward for the action “study” is your score on Friday’s quiz. But if we were to sum all your subsequent rewards for the week, we’d include irrelevant rewards like how much you enjoyed dinner on Thursday. These irrelevant rewards corrupt your learning signal for the action “study”, potentially hindering your ability to learn that you really should study for Friday’s quiz.

To learn more efficiently, we want to credit each action with only the rewards it actually caused. So how do we figure out which rewards are caused by which actions?

The “model-free” way to do it uses two methods to cancel out irrelevant rewards. First, temporal discounting uses temporal proximity as a proxy for causal relevance. This assumption rings true for reflex-driven games like Space Invaders, where rewards follow closely after the actions that earn them. So even when temporal discounting is not inherent to your RL environment, you may want to use a temporally discounted return \(R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\). Second, the advantage function \(Q(s,a)-V(s)\) uses the baseline value \(V(s)\) as a counterfactual of sorts: what is the marginal benefit of choosing action \(a\) in response to state \(s\), relative to the policy’s typical behavior? Here the marginal benefit serves as another proxy for causal relevance.
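As a hedged illustration of these two fixes (the reward values and critic estimates below are made up), here is a short NumPy sketch that computes the discounted reward-to-go and subtracts a stand-in value baseline to get advantages; in practice \(V(s)\) would come from a learned critic.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, computed right to left."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = np.array([0.0, 0.0, 0.0, 1.0])   # the only reward arrives at the end
values  = np.array([0.2, 0.4, 0.7, 0.9])   # made-up critic estimates of V(s_t)

returns    = discounted_returns(rewards)
advantages = returns - values              # A_t ~ Q(s_t, a_t) - V(s_t)
print(returns, advantages)
```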

These two methods are remarkably effective, especially when aided by a planning component (see: many successes of deep RL). But they do seem to fail under two conditions:

1. The causally relevant reward arrives long after the action that earned it, so temporal discounting shrinks it toward zero.
2. The steps in between are filled with causally irrelevant rewards, so the credit signal is buried in variance that the baseline cannot fully remove.

The first failure mode may be survivable, but the two combined seem to be a death trap. This is precisely the failure mode addressed by these two papers, which propose using episodic memory as a “portal” for temporally distant but causally important rewards.

Can we use Transformer attention for credit assignment? Suppose you have a Transformer that predicts \((s,a,r)\) sequences. Then you could use its soft attention to assign credit for a reward to the relevant \((s,a)\) tuples, as in the sketch below.
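Purely as a speculative sketch of that idea, and nobody’s published method: embed the \((s,a,r)\) tokens, apply causally masked self-attention, and read the attention weights flowing out of a reward-bearing position as a soft credit signal over earlier steps. The single attention layer and random stand-in embeddings are assumptions for illustration.

```python
import torch
import torch.nn as nn

T, d_model = 8, 32
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)

# Stand-ins for learned embeddings of the (s_t, a_t, r_t) tokens.
tokens = torch.randn(1, T, d_model)

# Causal mask: position t may only attend to positions <= t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

_, weights = attn(tokens, tokens, tokens, attn_mask=causal_mask)

# Row T-1: how strongly the final (reward-bearing) step attends to each
# earlier step -- a candidate per-step credit signal.
credit = weights[0, T - 1]
print(credit)
```

The reward-prediction loss that would actually train such a model is omitted; the point is only that attention offers a content-based, rather than purely temporal, route from a reward back to the state-action pairs that preceded it.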

So “model-free” credit assignment is pretty good. Can we do better? Maybe by directly trying to explain events in terms of object interactions? Certainly, the ability of humans to explain events seems related to our ability to learn from them. Imagine that you’ve recently adopted a puppy. You come home one day and find your favorite running shoes in tatters. You sigh, buy your puppy a chew toy, and it never happens again. This kind of inference is taken for granted when done by humans, but it’s actually amazing. Imagine what would happen if you trained a naive model-free RL agent to solve the same scenario. It would probably try vacuuming your carpets and a million other silly things before even considering your puppy.

Of course, this is an unfair comparison. You did not come home that day totally naive. Instead, you arrived with a rich causal model of the world (e.g. puppies like to bite things, and biting things damages them). Our comparative advantage is our ability to learn general causal models of the world (e.g. puppies bite things), use them as priors to infer actual causal models for novel situations (e.g. my puppy bit my shoe), and use this model to rapidly learn (e.g. I should buy a chew toy). In a way, this ability is like a form of inverse planning…
