Exercise: Variance dependence on sample size
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats
You read in the newspaper from a study about the difference of "infidelity" in the German federal states (Bayern, Berlin, ...). For the study 100 persons were asked in each federal state (Bayern, Berlin, ...) a yes/noquestion about "infidelity during their partnership".
The result of the study was:
 Percentage of infidelity in BadenWürttemberg is 20%
 Percentage of infidelity in Bayern is 10%
 Percentage of infidelity in Berlin is 14%
 Percentage of infidelity in Brandenburg is 24%
 Percentage of infidelity in Bremen is 16%
 Percentage of infidelity in Hamburg is 19%
 Percentage of infidelity in Hessen is 20%
 Percentage of infidelity in MecklenburgVorpommern is 16%
 Percentage of infidelity in Niedersachsen is 21%
 Percentage of infidelity in NordrheinWestfalen is 25%
 Percentage of infidelity in RheinlandPfalz is 21%
 Percentage of infidelity in Saarland is 15%
 Percentage of infidelity in Sachsen is 22%
 Percentage of infidelity in SachsenAnhalt is 21%
 Percentage of infidelity in SchleswigHolstein is 27%
 Percentage of infidelity in Thüringen is 21%
a) Do you trust the study? Why or why not?
np.random.seed(41)
# Solution:
# a) The study was done by random sampling:
sr = np.random.binomial(n=100, p=0.2, size=16)
fs = (u"BadenWürttemberg","Bayern","Berlin","Brandenburg","Bremen","Hamburg","Hessen"
"MecklenburgVorpommern","Niedersachsen","NordrheinWestfalen",
"RheinlandPfalz","Saarland","Sachsen","SachsenAnhalt","SchleswigHolstein",
u"Thüringen")
for s,f in zip(sr,fs):
print ("Percentage of infidelity in %s is %d%%"%(f,s))
b) To understand the problem better simulate i.i.d. binary trials:
Terminology (more explanation during the practical session):

i.i.d.: independent and identically distributed
 independent: the outcome of one observation does not effect the outcome of another observation
 identically distributed: for all events

Binary trial (Bernoulli trial):
 random variable X with two outcomes {0,1}
 p \in ]0,1[ is the parameter with P(X=1) = p and P(X=0) = 1p = q
Here try p=0.2
A Binomial(n,p) random variable is the sum of n i.i.d. Bernoulli(p) random variables. e.g. 200 Binary trials with p = 0.2 in numpy can result in: IN >> np.random.binomial(n=100, p=0.2) OUT>> 41 So we have 41 positive (X=1) outcomes and 159 negative (X=0) outcomes.
Make 100 Binominal experiments (with fixed p=0.2 and n) and estimate p from the outcomes of each experiment (by maximum likelihood) with n=10,100,1000,10000,100000,1000000.
What is the (empirical) variance and the standard deviation of p for the different n?
P% (Pvalue) confidence intervall: intervall in which P% of the estimates are expected to fall. In which intervall are 90% of the estimated p?
Plot histograms of the distribution of p?
# Solution
def plot_hist(i, n, ps, fig):
ax = fig.add_subplot(2,2,i+1)
ax.set_title('n=%d'%n)
ax.hist(ps, bins=np.arange(0.,0.5,0.01))
#hist, bins = np.histogram(ps, bins=np.arange(0.,0.5,0.01))
#print hist
#width = 0.7 * (bins[1]  bins[0])
#center = (bins[:1] + bins[1:]) / 2
#print center
#print width
#ax.bar(center, hist, align='center', width=width)
vars = []
fig = plt.figure(1)
size=100
ns = 10 ** np.arange(1,7, dtype=int)
for i,n in enumerate(ns):
outcomes = np.random.binomial(n=n, p=0.2, size=size)
ps = outcomes/float(n)
print ("n: ", n)
var = np.var(ps, ddof=1)
vars.append(var)
print ("variance: ", var)
print ("standard deviation: " , np.std(ps, ddof=1))
print ("range for 68.2%% of the pvalues (%f, %f)"%( np.percentile(ps, 15.9), np.percentile(ps, 84.1)))
print ("range for 95%% of the pvalues: (%f, %f)\n"%( np.percentile(ps, 02.5), np.percentile(ps, 97.5)))
#print scipy.stats.mvsdist(ps, alpha=0.9)
if i<4:
plot_hist(i,n,ps,fig)
Variance dependence on sample size
Plot the variance versus the sample size in a loglog plot.
Explain the graph.