# Exercise - Bundesliga Game Prediction

## Introduction

In this exercise we will define a simple model for predicting soccer games in the German "Bundesliga".

Remark: To detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. These statements raise exceptions as long as the calculated result is not yet correct.

## Requirements

### Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

### Python Modules

import numpy as np
import pandas as pd
import pymc3 as pm
import scipy.stats
import theano

from theano import tensor as T
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize

%matplotlib inline

## Exercises

We build a simple model for predicting soccer games: how many goals does a team score?

Only the results of prior games are used as data.

# some simple preprocessing of the data
url_vereine_csv = "https://github.com/hsro-wif-prg2/hsro-wif-prg2.github.io/raw/master/examples/src/main/resources/bundesliga_Verein.csv"
clubs = pd.read_csv(url_vereine_csv, sep=';')
# for convenience the club id should start with 0
clubs.V_ID = clubs.V_ID - 1
clubs = clubs.set_index("V_ID")
# just 1. liga
club_ids = clubs[clubs.Liga==1].index
club_ids
url_spiele_csv = "https://github.com/hsro-wif-prg2/hsro-wif-prg2.github.io/raw/master/examples/src/main/resources/bundesliga_Spiel.csv"
games = pd.read_csv(url_spiele_csv, sep=';')
#del(games["Unnamed: 8"]) ### not existent anymore?
# for convenience the club id should start with 0
games.Heim = games.Heim-1
games.Gast = games.Gast-1
relevant_games = games[games.Heim.isin(club_ids)]
relevant_games
actual_date = "2018-01-01"
relevant_games = relevant_games[relevant_games.Datum < actual_date]
len(relevant_games)
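def get_goal_results(gh="Tore_Gast"):
    # returns (home id, away id, goals) per relevant game; gh selects the goal column
    result = list()
    # iterrows() yields (index, row) pairs; only the row is needed
    for _, r in relevant_games.iterrows():
        result.append((r.Heim, r.Gast, r[gh]))
    return result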

away_goals_ = get_goal_results("Tore_Gast")
home_goals_ = get_goal_results("Tore_Heim")
low = 1e-10

Idea: The number of goals a team scores can be modeled with a Poisson distribution.

#### Poisson distribution

Probability for outcome $k \in \{0, 1, 2, \dots\}$

$P_\lambda (k) = \frac{\lambda^k}{k!}\, \mathrm{e}^{-\lambda}$

with parameter $\lambda>0$; $\lambda$ is both the expectation and the variance of the distribution.

import scipy.stats
k=np.arange(0,10)
lambda_= 3.1

plt.figure(figsize=(8,6))
plt.plot(k, scipy.stats.poisson.pmf(k, lambda_), 'bo', ms=6, label='poisson pmf')
plt.xlabel("k")
plt.ylabel("probability mass")
scipy.stats.poisson.pmf(k, lambda_)
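
The statement that $\lambda$ is both the expectation and the variance can be checked numerically. The following is a minimal sketch that uses only the standard library instead of scipy.stats; the helper name poisson_pmf and the truncation at $k=59$ are assumptions made for this illustration.

```python
import math

def poisson_pmf(k, lam):
    """P_lambda(k) = lam**k / k! * exp(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

lam = 3.1
ks = range(60)  # truncation point; the mass beyond k=59 is negligible for lam=3.1
pmf = [poisson_pmf(k, lam) for k in ks]

total = sum(pmf)                                              # should be close to 1
mean = sum(k * p for k, p in zip(ks, pmf))                    # should be close to lam
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, pmf))  # should be close to lam

print(total, mean, variance)
```

Up to the truncation error, all three printed values should agree with the theory: the probabilities sum to 1, and mean and variance both equal $\lambda$.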

#### Probabilistic Model

Each team $i$ has an offence strength and a defence strength, each described by a distribution. (Note that a game has about 3 goals on average, i.e. about 1.5 per team, so the prior means differ by $\Delta \mu=1.5$.)

$offence_i \sim \mathcal N(\mu=1.5, \tau=1)$

$defence_i \sim \mathcal N(\mu=0, \tau=1)$

$\mathcal N$ is the Gaussian distribution with parameters

• mean: $\mu$
• precision: $\tau=1/\sigma^2$ (variance: $\sigma^2$ )
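
The relation between precision and standard deviation can be made concrete with a few lines of code. This is a small sketch using only the standard library; the helper sigma_from_tau and the value $\tau=4$ are illustrative assumptions and not part of the notebook.

```python
import math
import random

def sigma_from_tau(tau):
    """Standard deviation of a Gaussian with precision tau = 1 / sigma**2."""
    return 1.0 / math.sqrt(tau)

random.seed(0)               # fixed seed so the sketch is reproducible
tau = 4.0                    # illustrative value, not one used by the model
sigma = sigma_from_tau(tau)  # precision 4 corresponds to sigma = 0.5

# prior samples for one team's offence strength, N(mu=1.5, tau=1)
offence_prior = [random.gauss(1.5, sigma_from_tau(1.0)) for _ in range(20000)]
sample_mean = sum(offence_prior) / len(offence_prior)

print(sigma, sample_mean)
```

With $\tau=1$, as in the priors above, precision, variance and standard deviation all equal 1.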

Model: The number of goals that team $i$ scores against team $j$ is Poisson distributed with

$goals_{ij} \sim \mathrm{Poisson} \left(\lambda = offence_i-defence_j \right)$
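
To see what this model implies for a single pairing, the sketch below computes the rate and draws goal counts from it. It assumes, as the implementation further down does, that a negative difference is replaced by the tiny constant low. The function names goal_rate and sample_poisson and the strength values are hypothetical, and the sampler is Knuth's simple multiplication method, not what pymc uses internally.

```python
import math
import random

LOW = 1e-10  # same tiny positive constant as `low` in the preprocessing cell

def goal_rate(offence_i, defence_j, low=LOW):
    """Poisson rate of team i against team j; negative differences become `low`."""
    value = offence_i - defence_j
    return low if value < 0.0 else value

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(1)
rate = goal_rate(1.8, 0.3)  # illustrative strengths, not fitted values
goals = [sample_poisson(rate, rng) for _ in range(20000)]
average_goals = sum(goals) / len(goals)

print(rate, average_goals)
```

The average of the simulated goals should be close to the rate, since the expectation of a Poisson distribution is $\lambda$.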

### Graphical representation of the model

plot_model()

### Implementation with pymc

nb_clubs = len(club_ids)
nb_clubs
model = pm.Model()

with model:
    offence = pm.Normal("offence", tau=1., mu=1.5, shape=nb_clubs)
    defence = pm.Normal("defence", tau=1., mu=0., shape=nb_clubs)

    home_goals = []
    away_goals = []
    hv = []

    for i,(heim, gast, goals) in enumerate(home_goals_):
        home_value = offence[heim]-defence[gast]
        home_value = T.switch(T.lt(home_value, 0.), low, home_value)
        hv.append(home_value)
        home_goals.append(goals)
    hv_ = T.stack(hv)
    mu_h = pm.Deterministic("home_rate", hv_)
    pm.Poisson("home_goals", observed=home_goals, mu=mu_h)

    av = []
    for i,(heim, gast, goals) in enumerate(away_goals_):
        away_value = offence[gast]-defence[heim]
        away_value = T.switch(T.lt(away_value, 0.), low, away_value)
        av.append(away_value)
        away_goals.append(goals)
    av_ = T.stack(av)
    mu_a = pm.Deterministic("away_rate", av_)
    pm.Poisson("away_goals", observed=away_goals, mu=mu_a)

offence
# start the sampling procedure
#map_estimate = pm.find_MAP(model=model)

#### Sampling with pymc

# sampling parameters
nb_samples=10000
with model:
    trace = pm.sample(draws=nb_samples) #20000 5000
# don't use the first samples
burn = 1000
trace = trace[burn:]

#### Sampling histograms

nb_clubs = club_ids.max() + 1
bins=40
fig, axes = plt.subplots(nrows=nb_clubs, ncols=2, figsize=(10, 50))

for i in club_ids:
    title = "Offence of " + clubs.loc[i, "Name"]
    axes[i, 0].set_title(title)
    axes[i, 0].hist(trace.get_values("offence")[:,i], bins=bins, range=(0,4.2))

    axes[i, 1].hist(trace.get_values("defence")[:,i], bins=bins, range=(-2.,2.2))
    title = "Defence of " + clubs.loc[i, "Name"]
    axes[i, 1].set_title(title)

#fig.suptitle("Offence and defence distribution of the clubs.")
fig.tight_layout()

#### Exercise: Distribution of expected goals

Use the model and the sampling trace to predict how many goals a team scores against another team.

What is the expected number of goals?

Implement the corresponding Python (plot) functions, e.g.

# Expectation of number of goals scored by team 0, mean of strength
print((np.arange(len(p_goals_1)) * p_goals_1).sum(), d1.mean())

# Expectation of number of goals scored by team 1, mean of strength
print((np.arange(len(p_goals_2)) * p_goals_2).sum(), d2.mean())
# probability that team 0 scores 0,1,2, ... goals against team 17
plot_goal_diffs(0, 17)
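
One possible way to approach this exercise is to average the Poisson pmf over the posterior samples of the rate. The sketch below is not the reference solution: it uses only the standard library, the function goal_distribution is a hypothetical name, and the Gaussian samples are synthetic stand-ins for trace.get_values("offence")[:, i] and trace.get_values("defence")[:, j].

```python
import math
import random

def poisson_pmf(k, lam):
    return lam ** k / math.factorial(k) * math.exp(-lam)

def goal_distribution(offence_samples, defence_samples, max_goals=15, low=1e-10):
    """Average the Poisson pmf over paired samples of (offence_i, defence_j)."""
    rates = []
    for o, d in zip(offence_samples, defence_samples):
        value = o - d
        rates.append(low if value < 0.0 else value)  # same clipping as in the model
    p_goals = [sum(poisson_pmf(k, lam) for lam in rates) / len(rates)
               for k in range(max_goals + 1)]
    return p_goals, rates

rng = random.Random(42)
# synthetic stand-ins for posterior samples of two teams (assumed values)
offence_i = [rng.gauss(2.0, 0.2) for _ in range(2000)]
defence_j = [rng.gauss(0.4, 0.2) for _ in range(2000)]

p_goals, rates = goal_distribution(offence_i, defence_j)
expected_goals = sum(k * p for k, p in enumerate(p_goals))
mean_rate = sum(rates) / len(rates)

print(expected_goals, mean_rate)
```

The two printed numbers should nearly coincide: the expected number of goals under the averaged distribution equals the mean of the sampled rates, up to the truncation at max_goals.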

#### Exercise: Extension of the model

Extend the model with "home advantage":

In general, a team is a little stronger at home than away. Modify the model to take this into account.

How strong is the home advantage in your model?
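
There are several ways to parameterize a home advantage. The sketch below shows one simple option under stated assumptions: a single additive parameter home_advantage, shared by all teams, that is added to the home team's rate only. The function names and strength values are hypothetical, and in the pymc model the parameter would additionally need a prior of your choice.

```python
LOW = 1e-10  # same tiny positive constant as `low` in the preprocessing cell

def clip_rate(value, low=LOW):
    """A Poisson rate must be positive; negative values are replaced by `low`."""
    return low if value < 0.0 else value

def game_rates(offence, defence, home, away, home_advantage):
    """Return (home rate, away rate) with an additive, shared home advantage."""
    home_rate = clip_rate(offence[home] - defence[away] + home_advantage)
    away_rate = clip_rate(offence[away] - defence[home])
    return home_rate, away_rate

# illustrative strengths for two teams, not fitted values
offence = [1.9, 1.4]
defence = [0.3, -0.1]

with_ha = game_rates(offence, defence, home=0, away=1, home_advantage=0.25)
without_ha = game_rates(offence, defence, home=0, away=1, home_advantage=0.0)

print(with_ha, without_ha)
```

With this parameterization the home advantage is measured directly in expected goals per game, which makes the posterior of home_advantage easy to interpret.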

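nb_samples = 10000
# model_ha is a placeholder name for your extended model; pm.sample needs a model context
with model_ha:
    trace_ha = pm.sample(draws=nb_samples, tune=1000)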
# don't use the first samples
burn = 1000
trace_ha = trace_ha[burn:]
# This depends on your model!
trace_ha.get_values("home_advantage").mean()
plt.hist(trace_ha.get_values("home_advantage"), bins=20)
plt.title("Home advantage distribution")

### Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g., images).

Exercise - Bundesliga Game Prediction by Christian Herta, Klaus Strohmenger