### Exercise: Variance dependence on sample size

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats

You read in the newspaper from a study about the difference of "infidelity" in the German federal states (Bayern, Berlin, ...). For the study 100 persons were asked in each federal state (Bayern, Berlin, ...) a yes/no-question about "infidelity during their partnership".

The result of the study was:

• Percentage of infidelity in Baden-Württemberg is 20%
• Percentage of infidelity in Bayern is 10%
• Percentage of infidelity in Berlin is 14%
• Percentage of infidelity in Brandenburg is 24%
• Percentage of infidelity in Bremen is 16%
• Percentage of infidelity in Hamburg is 19%
• Percentage of infidelity in Hessen is 20%
• Percentage of infidelity in Mecklenburg-Vorpommern is 16%
• Percentage of infidelity in Niedersachsen is 21%
• Percentage of infidelity in Nordrhein-Westfalen is 25%
• Percentage of infidelity in Rheinland-Pfalz is 21%
• Percentage of infidelity in Saarland is 15%
• Percentage of infidelity in Sachsen is 22%
• Percentage of infidelity in Sachsen-Anhalt is 21%
• Percentage of infidelity in Schleswig-Holstein is 27%
• Percentage of infidelity in Thüringen is 21%

a) Do you trust the study? Why or why not?

b) To understand the problem better simulate i.i.d. binary trials:

Terminology (more explanation during the practical session):

• i.i.d.: independent and identically distributed

• independent: the outcome of one observation does not effect the outcome of another observation
• identically distributed: for all events
• Binary trial (Bernoulli trial):

• random variable X with two outcomes {0,1}
• p \in ]0,1[ is the parameter with P(X=1) = p and P(X=0) = 1-p = q

Here try p=0.2

A Binomial(n,p) random variable is the sum of n i.i.d. Bernoulli(p) random variables. e.g. 200 Binary trials with p = 0.2 in numpy can result in: IN >> np.random.binomial(n=100, p=0.2) OUT>> 41 So we have 41 positive (X=1) outcomes and 159 negative (X=0) outcomes.

Make 100 Binominal experiments (with fixed p=0.2 and n) and estimate p from the outcomes of each experiment (by maximum likelihood) with n=10,100,1000,10000,100000,1000000.

What is the (empirical) variance and the standard deviation of p for the different n?

P% (P-value) confidence intervall: intervall in which P% of the estimates are expected to fall. In which intervall are 90% of the estimated p?

Plot histograms of the distribution of p?

### Variance dependence on sample size

Plot the variance versus the sample size in a log-log plot.

Explain the graph.