Exercise: Variance dependence on sample size
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats
You read in the newspaper from a study about the difference of "infidelity" in the German federal states (Bayern, Berlin, ...). For the study 100 persons were asked in each federal state (Bayern, Berlin, ...) a yes/noquestion about "infidelity during their partnership".
The result of the study was:
 Percentage of infidelity in BadenWürttemberg is 20%
 Percentage of infidelity in Bayern is 10%
 Percentage of infidelity in Berlin is 14%
 Percentage of infidelity in Brandenburg is 24%
 Percentage of infidelity in Bremen is 16%
 Percentage of infidelity in Hamburg is 19%
 Percentage of infidelity in Hessen is 20%
 Percentage of infidelity in MecklenburgVorpommern is 16%
 Percentage of infidelity in Niedersachsen is 21%
 Percentage of infidelity in NordrheinWestfalen is 25%
 Percentage of infidelity in RheinlandPfalz is 21%
 Percentage of infidelity in Saarland is 15%
 Percentage of infidelity in Sachsen is 22%
 Percentage of infidelity in SachsenAnhalt is 21%
 Percentage of infidelity in SchleswigHolstein is 27%
 Percentage of infidelity in Thüringen is 21%
a) Do you trust the study? Why or why not?
b) To understand the problem better simulate i.i.d. binary trials:
Terminology (more explanation during the practical session):

i.i.d.: independent and identically distributed
 independent: the outcome of one observation does not effect the outcome of another observation
 identically distributed: for all events

Binary trial (Bernoulli trial):
 random variable X with two outcomes {0,1}
 p \in ]0,1[ is the parameter with P(X=1) = p and P(X=0) = 1p = q
Here try p=0.2
A Binomial(n,p) random variable is the sum of n i.i.d. Bernoulli(p) random variables. e.g. 200 Binary trials with p = 0.2 in numpy can result in: IN >> np.random.binomial(n=100, p=0.2) OUT>> 41 So we have 41 positive (X=1) outcomes and 159 negative (X=0) outcomes.
Make 100 Binominal experiments (with fixed p=0.2 and n) and estimate p from the outcomes of each experiment (by maximum likelihood) with n=10,100,1000,10000,100000,1000000.
What is the (empirical) variance and the standard deviation of p for the different n?
P% (Pvalue) confidence intervall: intervall in which P% of the estimates are expected to fall. In which intervall are 90% of the estimated p?
Plot histograms of the distribution of p?
Variance dependence on sample size
Plot the variance versus the sample size in a loglog plot.
Explain the graph.