Distilling Uber's Causal Learning for User Retention

An attempt to understand and reproduce a paper from Uber Research.


At first thought, these moments at the intersection of digital and physical life may seem unimportant. Yet most of us have experienced the following sequence of events.

It’s the weekend, and you are at home, occupied with your hobby. A notification pops up on your phone, making a sound that by now your brain associates with either an important email or a calendar reminder.

Unless you have notifications completely switched off, you will discover that it is an app reminding you about itself. The kind of app that you have not used in a while.

What do you do next? Irritated (or not) by the notification, you have a few options:

It was indeed Duolingo, and you choose the third option. It managed to convince you to put aside your weekend activity and spend another hour on your phone. Somewhere, data scientists are raving - one more customer retained!

How did they do that? There are many ways to convince someone to start using a product or to keep using it. We are not going to discuss the specifics of marketing strategies, though. Instead, we are going to look at how retention can be formulated as a data science problem. One possible formulation is described in Improve User Retention with Causal Learning (2019) by Du et al. from Uber.

Warning! We do not have access to the data used in their study. That is fine; it is not an issue at all. In fact, we tend to overemphasise the importance of large datasets for research success. I often hear that we cannot compete with FAANG because of their access to GPU clusters and large datasets. We don’t have to! Let’s not play a game where they set the rules. There are plenty of examples of well-designed studies that lead to insights and need far less computational power, with the added perk of having fewer degrees of freedom to control for: fewer hyperparameters, no parallel computation quirks, and so on.