You run experiments? Do you know Thomas Bayes?

Approaching experiments scientifically with Abacus

--

Article written by Paul Metzner, Senior Data Scientist at Wooga.

Since the earliest days of Wooga we’ve been running experiments to improve our games and create the best possible experience for our players. After an initial phase using Excel, we even built our own internal tool to allow product managers and analysts to set up and monitor their own tests.

The tool served us well for a long time but eventually we came to the realisation that, while the tool was easy for everyone to use, it was prone to delivering the occasional questionable result. My colleagues and I in the data science team are big fans of sound scientific methodology, so we set out to right that wrong and improve the tool.

Here’s how we did it.

How did the old tool work?

First, let’s look at our old tool. We were using frequentist methods to find out which version of a specific feature in one of our games performed better. Even though we assigned players randomly to groups, we always observed some differences between them. Of course, those could have been due to chance, so we needed a way to tell random fluctuations apart from real effects. The old tool allowed us to do that, which is good.

Yet, in practice, choosing a frequentist approach had some important implications. For instance, if the Jelly Splash team wanted to see whether a newly designed feature affected how many people played the game, they would need to specify the expected outcome (e.g., the number of players should increase by 2%) and then expose random players to the feature and monitor the results. The experiment would conclude as soon as the team were sufficiently confident about how the new feature was performing, making a judgment soon afterwards about whether it was better or worse than before.

Unfortunately, the team would have had to commit to a specific effect size before the experiment started, without knowing anything about the feature’s effects; this was necessary to calculate how long the experiment needed to run, given acceptable error rates. Only after that was done could they start the experiment and let it run for the predetermined amount of time. They could then evaluate the experiment and ascertain whether the difference between the groups was unlikely to have come from chance. If that was the case, they could declare victory and adopt the feature. Otherwise, all the experiment told us was that we did not know whether the feature was any better; but we also couldn’t conclude that it wasn’t. Extending the experiment to increase its statistical power would have compromised the entire procedure, and we would have had to set up a new experiment.

What should the new tool improve?

In a competitive business like mobile free-to-play games, it’s crucial to also know how much a feature is worth. We need that information to make important decisions. For example, we have to figure out whether it’s worth investing the manpower to maintain the feature. We may also consider building the feature for another game in our portfolio. Whether or not we do that hinges on the impact. After all, the team should focus on something with a high return instead of something that only leads to a marginal change.

In essence we wanted the new tool to tell us how likely it was that one version was better than the other and by how much.

Enter Abacus

The most important change in the new tool happened under the hood: we abandoned frequentist methods and decided to approach testing from a probabilistic angle. Leveraging Bayes’ theorem (see below), we can get an answer to the fundamental question of any user experiment: is B better than A?
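In its general form, the theorem describes how to update a prior belief about a quantity A after observing data B:

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$

Here, P(A) is what we believe before seeing the data (the prior), P(B | A) is how likely the data are under that belief (the likelihood), and P(A | B) is the updated belief (the posterior).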

In practice, we can translate the approach into English roughly like this: we quantify what we believe about our groups before the experiment, add what we learn about them during the experiment, and combine the two into what we believe about them after the experiment. This final piece is what we call the posterior. For each metric, we obtain one posterior for the control group and one for each test group, and analyse the difference between them.
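To make that concrete, here is a minimal sketch of such an update for a single conversion-style metric, using a Beta prior and binomial data. The flat Beta(1, 1) prior, the group sizes and the conversion counts are made up for illustration and are not taken from Abacus.

```python
import numpy as np

rng = np.random.default_rng(42)

# What we believe before the experiment: a flat Beta(1, 1) prior over each
# group's conversion rate (an illustrative choice, not Abacus's actual prior).
prior_a, prior_b = 1, 1

# What we learn during the experiment: players exposed and players who
# converted, per group (made-up numbers).
control_n, control_conv = 10_000, 520
test_n, test_conv = 10_000, 565

# What we believe after the experiment: the Beta posterior combines prior and
# data (conjugate update), represented here by random samples from it.
control_posterior = rng.beta(prior_a + control_conv,
                             prior_b + control_n - control_conv, size=200_000)
test_posterior = rng.beta(prior_a + test_conv,
                          prior_b + test_n - test_conv, size=200_000)

# The quantity we analyse: the difference between the two posteriors.
difference = test_posterior - control_posterior
```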

Since we obtain probabilities directly, we can evaluate experiments continuously. As soon as we are confident enough about what we’re seeing, we can decide to stop the experiment.

The first thing we look at is the probability itself: how likely it is that one version is better than the other. If it is very low or high and not fluctuating, we take it as evidence in favour of a negative or positive effect. To establish reliability, we use the 95% credible interval (or “highest posterior density interval”). This tells us where we can expect the difference between the two groups to lie with a probability of 95%. As a rule of thumb, we accept intervals that lie completely above or below zero.
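Sticking with the simulated posterior samples from the sketch above, both numbers fall straight out of the samples of the difference. The percentile interval below is a simpler stand-in for the highest posterior density interval; for a unimodal, roughly symmetric posterior the two are very close.

```python
import numpy as np

rng = np.random.default_rng(42)

# Posterior samples of the difference between test and control, simulated
# with the same made-up counts as in the previous sketch.
control = rng.beta(1 + 520, 1 + 10_000 - 520, size=200_000)
test = rng.beta(1 + 565, 1 + 10_000 - 565, size=200_000)
difference = test - control

# How likely is it that the test version is better than the control?
p_test_better = (difference > 0).mean()

# A central 95% credible interval for the difference between the groups.
lower, upper = np.quantile(difference, [0.025, 0.975])

print(f"P(test > control) = {p_test_better:.3f}")
print(f"95% credible interval for the difference: [{lower:.4f}, {upper:.4f}]")
# Rule of thumb: act once the interval lies entirely above or below zero.
```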

Finally, and most importantly for judging the business value of a feature, we can infer how much we stand to lose or gain with each version if we enabled it for all of our players. The area under the posterior distribution of the difference between the two groups that lies below zero is what we call “expected loss”. Consequently, the area on the positive side is our “expected gain”. If both are small but the loss is even smaller, it’s probably safe to adopt the feature. At the same time, it would not be worth tweaking the feature or even adopting it in another game.
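From the same posterior samples, one common way to estimate these two quantities is to average the negative and positive parts of the simulated difference. This is a sketch of that idea; the exact definition used inside Abacus may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Posterior samples of the difference, as in the previous sketches.
control = rng.beta(1 + 520, 1 + 10_000 - 520, size=200_000)
test = rng.beta(1 + 565, 1 + 10_000 - 565, size=200_000)
difference = test - control

# Expected loss: what we stand to lose on average by enabling the test
# version for everyone, counting only the scenarios where it is worse.
expected_loss = np.mean(np.maximum(-difference, 0))

# Expected gain: the mirror image on the positive side.
expected_gain = np.mean(np.maximum(difference, 0))

print(f"expected loss: {expected_loss:.5f}")
print(f"expected gain: {expected_gain:.5f}")
```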

Another benefit of the Bayesian approach is that we can also express our confidence that there is no effect and see when an experiment is not going anywhere. To do that, we monitor the credible interval over time. If it’s not converging (i.e., becoming tighter), we can infer that nothing (or at least not much) is going on. If the interval is tightening but remains symmetrically wrapped around zero, that is also an indication that there is no difference between the groups. These are conclusions we could not have drawn with our old tool.
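As an illustration of that kind of monitoring, the hypothetical experiment below has no real effect at all; recomputing the credible interval day by day shows it tightening while staying wrapped around zero. The daily traffic and the 5% conversion rate are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical A/A-like experiment: both groups share the same true 5%
# conversion rate, so there is genuinely no effect to find.
days, daily_players, true_rate = 14, 2_000, 0.05
control_n = test_n = control_conv = test_conv = 0

for day in range(1, days + 1):
    control_conv += rng.binomial(daily_players, true_rate)
    test_conv += rng.binomial(daily_players, true_rate)
    control_n += daily_players
    test_n += daily_players

    # Posterior samples of the difference given the data seen so far.
    control = rng.beta(1 + control_conv, 1 + control_n - control_conv, size=100_000)
    test = rng.beta(1 + test_conv, 1 + test_n - test_conv, size=100_000)
    lower, upper = np.quantile(test - control, [0.025, 0.975])

    # A tightening interval that stays symmetric around zero is the signal
    # described above: probably no difference between the groups.
    print(f"day {day:2d}: interval [{lower:+.4f}, {upper:+.4f}], "
          f"width {upper - lower:.4f}")
```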

All this flexibility comes at a “price”. With the old tool, we provided the game teams with a black-and-white (or rather, red-and-green) decision. Abacus requires and encourages us to weigh the evidence in favour of or against whatever we want to test. We firmly believe that this will help us learn a lot about our games and how people play them.

After a year of work by data engineers and data scientists, Abacus has just been rolled out internally to all our game teams. We’re already thinking about how to improve it and build upon it because now that we have an awesome tool to generate insights, we need a platform to share them with the rest of the company.

Special shout-out to Thomas Ducrot, Olaf Menzel, Cedrik Neumann, Jan Omar and Markus Steinkamp for their tremendous work on bringing Abacus to life as well as to the other members of the data science team for their valuable input.
