### Exercise: Variance dependence on sample size

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats

You read in the newspaper from a study about the difference of "infidelity" in the German federal states (Bayern, Berlin, ...). For the study 100 persons were asked in each federal state (Bayern, Berlin, ...) a yes/no-question about "infidelity during their partnership".

The result of the study was:

• Percentage of infidelity in Baden-Württemberg is 20%
• Percentage of infidelity in Bayern is 10%
• Percentage of infidelity in Berlin is 14%
• Percentage of infidelity in Brandenburg is 24%
• Percentage of infidelity in Bremen is 16%
• Percentage of infidelity in Hamburg is 19%
• Percentage of infidelity in Hessen is 20%
• Percentage of infidelity in Mecklenburg-Vorpommern is 16%
• Percentage of infidelity in Niedersachsen is 21%
• Percentage of infidelity in Nordrhein-Westfalen is 25%
• Percentage of infidelity in Rheinland-Pfalz is 21%
• Percentage of infidelity in Saarland is 15%
• Percentage of infidelity in Sachsen is 22%
• Percentage of infidelity in Sachsen-Anhalt is 21%
• Percentage of infidelity in Schleswig-Holstein is 27%
• Percentage of infidelity in Thüringen is 21%

a) Do you trust the study? Why or why not?


np.random.seed(41)
# Solution:

# a) The study was done by random sampling:
sr = np.random.binomial(n=100, p=0.2, size=16)
"Mecklenburg-Vorpommern","Niedersachsen","Nordrhein-Westfalen",
"Rheinland-Pfalz","Saarland","Sachsen","Sachsen-Anhalt","Schleswig-Holstein",
u"Thüringen")
for s,f in zip(sr,fs):
print ("Percentage of infidelity in %s is %d%%"%(f,s))



b) To understand the problem better simulate i.i.d. binary trials:

Terminology (more explanation during the practical session):

• i.i.d.: independent and identically distributed

• independent: the outcome of one observation does not effect the outcome of another observation
• identically distributed: for all events
• Binary trial (Bernoulli trial):

• random variable X with two outcomes {0,1}
• p \in ]0,1[ is the parameter with P(X=1) = p and P(X=0) = 1-p = q

Here try p=0.2

A Binomial(n,p) random variable is the sum of n i.i.d. Bernoulli(p) random variables. e.g. 200 Binary trials with p = 0.2 in numpy can result in: IN >> np.random.binomial(n=100, p=0.2) OUT>> 41 So we have 41 positive (X=1) outcomes and 159 negative (X=0) outcomes.

Make 100 Binominal experiments (with fixed p=0.2 and n) and estimate p from the outcomes of each experiment (by maximum likelihood) with n=10,100,1000,10000,100000,1000000.

What is the (empirical) variance and the standard deviation of p for the different n?

P% (P-value) confidence intervall: intervall in which P% of the estimates are expected to fall. In which intervall are 90% of the estimated p?

Plot histograms of the distribution of p?

# Solution
def plot_hist(i, n, ps, fig):
ax.set_title('n=%d'%n)
ax.hist(ps, bins=np.arange(0.,0.5,0.01))

#hist, bins = np.histogram(ps, bins=np.arange(0.,0.5,0.01))
#print hist
#width = 0.7 * (bins[1] - bins[0])
#center = (bins[:-1] + bins[1:]) / 2
#print center
#print width
#ax.bar(center, hist, align='center', width=width)

vars = []
fig = plt.figure(1)
size=100
ns = 10 ** np.arange(1,7, dtype=int)
for i,n in enumerate(ns):
outcomes = np.random.binomial(n=n, p=0.2, size=size)
ps = outcomes/float(n)
print ("n: ", n)
var = np.var(ps, ddof=1)
vars.append(var)
print ("variance: ", var)
print ("standard deviation: " , np.std(ps, ddof=1))
print ("range for 68.2%% of the p-values (%f, %f)"%( np.percentile(ps, 15.9), np.percentile(ps, 84.1)))
print ("range for 95%% of the p-values: (%f, %f)\n"%( np.percentile(ps, 02.5), np.percentile(ps, 97.5)))
#print scipy.stats.mvsdist(ps, alpha=0.9)
if i<4:
plot_hist(i,n,ps,fig)



### Variance dependence on sample size

Plot the variance versus the sample size in a log-log plot.

Explain the graph.