ML-Fundamentals - Simple Neural Network
Introduction
In this exercise you are given a classification problem with two classes and two features. The classes are not linearly separable. First you will implement logistic regression, which yields a very poor decision boundary. Then you will extend your model with a hidden layer consisting of only two hidden neurons. The plots will show that these two hidden neurons are already almost enough to find a decision boundary that separates the data much better.
Finally, you will implement a neural network with multiple hidden layers to solve the problem without any misclassifications.
Requirements
Knowledge
You should have a basic knowledge of:
 Logistic regression
 Logistic function
 Tanh as activation function
 ReLU as activation function
 Mean squared error
 Cross-entropy loss
 Gradient descent
 Backpropagation
 numpy
 matplotlib
Suitable sources for acquiring this knowledge are:
 Logistic Regression Notebook by Christian Herta and corresponding lecture slides (German)
 Deep Learning Book by Ian Goodfellow
 numpy quickstart
 Matplotlib tutorials
Python Modules
By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.
# External Modules
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Data Generation
For convenience and visualization, we will only use two features in this notebook, so we are still able to plot them together with the target class and the decision boundary.
First we will create some artificial data:
 $ m_1 = 10 $ examples for class 0
 $ m_2 = 15 $ examples for class 1
 $ n = 2 $ features for each example
No exercise yet, just execute the cells.
m1 = 10
m2 = 15
m = m1 + m2
n = 2
X = np.ndarray((m,n))
X.shape
y = np.zeros((m))
y[m1:] = y[m1:] + 1.0
y
### Execute this to generate linearly separable data
def x2_function_class_0(x):
    return x*2 + 2

def x2_function_class_1(x):
    return x*2 + 4

### Execute this to generate NOT linearly separable data
def x2_function_class_0(x):
    return np.sin(x)

def x2_function_class_1(x):
    return np.sin(x) + 1
x1_min = -5
x1_max = +5
X[:m1,0] = np.linspace(x1_min, x1_max, m1)
X[m1:,0] = np.linspace(x1_min+0.5, x1_max-0.2, m2)
X[:m1,1] = x2_function_class_0(X[:m1,0])
X[m1:,1] = x2_function_class_1(X[m1:,0])
def plot_data():
    plt.scatter(X[:m1,0], X[:m1,1], alpha=0.5, label='class 0 train data')
    plt.scatter(X[m1:,0], X[m1:,1], alpha=0.5, label='class 1 train data')
    plt.plot(x1_line, x2_line_class_0, alpha=0.2, label='class 0 true target func')
    plt.plot(x1_line, x2_line_class_1, alpha=0.2, label='class 1 true target func')
    plt.legend(loc=1)
x1_line = np.linspace(x1_min, x1_max, 100)
x2_line_class_0 = x2_function_class_0(x1_line)
x2_line_class_1 = x2_function_class_1(x1_line)
plot_data()
Exercises
Activation and Cost Functions
To implement logistic regression and a neural net with a hidden layer, we need at least:
 An activation function like tanh
 A cost function like cross-entropy
 The sigmoid (or logistic) function
Task:
Implement at least the following functions and their derivatives:
 Tanh: $ tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $
 Tanh derivative: $ tanh(x)' = 1 - tanh(x)^2 $
 Logistic: $ \sigma(x) = \frac{1}{1 + e^{-x}} $
 Logistic derivative: $ \sigma(x)' = \sigma(x) \cdot (1 - \sigma(x)) $
 Cross-entropy: $ -\frac{1}{m}\sum_i^m \left( y^i \cdot log(\hat y^i) + (1-y^i) \cdot log(1-\hat y^i) \right) $
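 Cross-entropy derivative with respect to $ \hat y $ (per example): $ \frac{\hat y - y}{\hat y \cdot (1 - \hat y)} $; chained with the logistic derivative, this simplifies to $ \hat y - y $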
Optionally (to play around with) also implement:
 ReLU: $ max(0,x) $
 ReLU derivative: $ 0 \text{ if } x \le 0 \text{ else } 1 $
 Mean squared error: $ \frac{1}{m}\sum_i^m (y^i - \hat y^i)^2 $
 Mean squared error derivative: $ 2 \cdot (\hat y - y) $
If your implementations are correct, the plots of the activation functions and their derivatives (produced by executing the last cell of this section) should look like the following:
def logistic(x, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()

def tanh(x, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()

def relu(x, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()

def cross_entropy(y_preds, y, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()

def mean_squared_error(y_preds, y, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()
### Just execute to plot your implementation
plt.figure(figsize=(16,4))
x_tmp = np.linspace(-5,5,100)
ax = plt.subplot(1,3,1)
ax.plot(x_tmp, logistic(x_tmp), label='logistic function')
ax.plot(x_tmp, logistic(x_tmp, True), label='logistic function derivative')
ax.legend()
ax = plt.subplot(1,3,2)
ax.plot(x_tmp, tanh(x_tmp), label='tanh')
ax.plot(x_tmp, tanh(x_tmp, True), label='tanh derivative')
ax.legend()
ax = plt.subplot(1,3,3)
ax.plot(x_tmp, relu(x_tmp), label='relu')
ax.plot(x_tmp, relu(x_tmp, True), label='relu derivative')
ax.legend()
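Independently of the plots, the derivative formulas from the task list can be sanity-checked numerically. The following is a minimal sketch, not the reference solution: the helper names `logistic_ref`, `tanh_ref` and `numerical_derivative` are hypothetical and not part of the notebook. It compares the closed-form derivatives against central finite differences.

```python
import numpy as np

def logistic_ref(x):
    # logistic function 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh_ref(x):
    # tanh written out as (e^x - e^{-x}) / (e^x + e^{-x})
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def numerical_derivative(f, x, eps=1e-6):
    # central finite difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x_check = np.linspace(-5, 5, 11)

# closed-form derivatives as given in the task list
logistic_deriv = logistic_ref(x_check) * (1.0 - logistic_ref(x_check))
tanh_deriv = 1.0 - tanh_ref(x_check) ** 2

print(np.allclose(logistic_deriv, numerical_derivative(logistic_ref, x_check), atol=1e-6))
print(np.allclose(tanh_deriv, numerical_derivative(tanh_ref, x_check), atol=1e-6))
```

The same `numerical_derivative` helper can be pointed at your own `logistic(x)` and `tanh(x)` implementations to compare against what they return for `deriv=True`.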
Logistic Regression
Task:
Implement the iterative gradient_descent function for logistic regression:
 Forward pass: $ Z = \vec x w + b $, $ A = \sigma (Z) $
 Print the cost (you can try cross-entropy and mean squared error): $ C = crossentropy(A, y) $
 Gradient descent update rule:
$ w_{i_{new}} \leftarrow w_{i_{old}} - \alpha \cdot \frac{\partial C(A(Z(x,w,b)),y)}{\partial w_{i}} $, $ b_{new} \leftarrow b_{old} - \alpha \cdot \frac{\partial C(A(Z(x,w,b)),y)}{\partial b} $
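The update rule can be tried out in isolation before applying it to logistic regression. The following minimal sketch uses a made-up one-parameter cost $ C(w) = (w - 3)^2 $, which is not the cost of this exercise; the names `cost_1d`, `grad_1d` and `alpha` are illustrative assumptions.

```python
def cost_1d(w):
    # toy cost with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_1d(w):
    # derivative dC/dw of the toy cost
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate
w = 0.0      # initial parameter value

for epoch in range(100):
    # gradient descent update: w_new <- w_old - alpha * dC/dw
    w = w - alpha * grad_1d(w)

print(w, cost_1d(w))
```

Each step moves `w` against the gradient, so it approaches the minimum at 3. In the exercise the same update is applied to every weight and to the bias, with the gradient taken from the chain rule.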
Recap:
The data flow in the forward pass should be: $ x \rightarrow linear \rightarrow sigmoid \rightarrow \hat y $
 mathematically: $ \hat y = sigmoid(linear(\vec x)) $
 with $ \hat y $ the prediction of your model
The following picture visualizes the data flow:
Hint:
One way to calculate the partial derivatives with respect to the weights and the bias is to write out the forward pass with pen and paper, calculate the partial derivatives for $ w_1, w_2 $ and $ b $ by hand, and just use the resulting formulas.
Another way (and the suggested one) is to calculate the derivatives of the individual functions and chain them:
$ \frac{\partial C(A(Z(x,w,b)),y)}{\partial w_{i}} = \frac{\partial C}{\partial A} \cdot \frac{\partial A}{\partial Z} \cdot \frac{\partial Z}{\partial w_i} $
$ \frac{\partial C(A(Z(x,w,b)),y)}{\partial b} = \frac{\partial C}{\partial A} \cdot \frac{\partial A}{\partial Z} \cdot \frac{\partial Z}{\partial b} $
With:
$ \frac{\partial C}{\partial A}\rightarrow $ cross_entropy(A, y, deriv=True)
$ \frac{\partial A}{\partial Z}\rightarrow $ logistic(Z, deriv=True)
$ \frac{\partial Z}{\partial w}\rightarrow $ ...
$ \frac{\partial Z}{\partial b}\rightarrow $ ...
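Whichever way you derive the partial derivatives, they can be verified against a numerical approximation. The sketch below is an assumption-laden helper, not part of the notebook: `cost_toy`, `numerical_gradients`, `x_toy` and `y_toy` are hypothetical names, the three examples are made up, and the cost is the mean cross-entropy of the logistic regression forward pass.

```python
import numpy as np

def cost_toy(x, y, ws, b):
    # forward pass: linear -> logistic -> mean cross-entropy
    z = x @ ws + b
    a = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def numerical_gradients(x, y, ws, b, eps=1e-6):
    # central finite differences, perturbing one parameter at a time
    grad_ws = np.zeros_like(ws)
    for i in range(len(ws)):
        step = np.zeros_like(ws)
        step[i] = eps
        grad_ws[i] = (cost_toy(x, y, ws + step, b)
                      - cost_toy(x, y, ws - step, b)) / (2.0 * eps)
    grad_b = (cost_toy(x, y, ws, b + eps)
              - cost_toy(x, y, ws, b - eps)) / (2.0 * eps)
    return grad_ws, grad_b

# made-up toy data: 3 examples, 2 features
x_toy = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0]])
y_toy = np.array([0.0, 1.0, 1.0])

grad_ws, grad_b = numerical_gradients(x_toy, y_toy, np.array([1.0, 2.0]), 0.0)
print(grad_ws, grad_b)
```

If the gradients from your chained derivatives (averaged over the examples) agree with these numbers to several decimal places, the backward step is very likely correct.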
def gradient_descent(x, y, ws, b, lrate, epochs):
    for i in range(epochs):
        # forward
        # calculate and print costs
        # backward (calculation of partial derivatives for ws and b)
        # update ws, b
        pass
    # return new ws, b and prediction of last iteration
    raise NotImplementedError()
If your implementation is correct, running the training cell and the plot cell below should result in one of the following plots (depending on which data generation process you chose):
As expected, plain logistic regression cannot separate the dataset very well if you used the dataset shown on the right, which is not linearly separable.
### TRAINING HERE, just execute
ws = np.array([1.,2.])
b = 0.
ws, b, y_pred = gradient_descent(X, y, ws, b, lrate=0.1, epochs=500)
y_pred[y_pred > 0.5] = 1.
y_pred[y_pred < 0.5] = 0.
print(y)
print(y_pred)
### Plot the data and decision boundary, just execute this cell
x2_boundary = (-b - ws[0]*x1_line)/ws[1]
plt.plot(x1_line, x2_boundary, c='g', label='boundary')
plot_data()
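The boundary plotted by the cell above is the set of points where the model is undecided, i.e. $ A = 0.5 $, which is the case exactly when $ Z = w_1 x_1 + w_2 x_2 + b = 0 $. Solving for $ x_2 $ gives $ x_2 = (-b - w_1 x_1) / w_2 $. A minimal sketch that checks this with made-up parameters (all `_demo` names and values are illustrative, not learned weights):

```python
import numpy as np

ws_demo = np.array([1.0, 2.0])  # hypothetical weights w_1, w_2
b_demo = 0.5                    # hypothetical bias

x1_demo = np.linspace(-5, 5, 5)
# points on the boundary: solve w_1*x1 + w_2*x2 + b = 0 for x2
x2_demo = (-b_demo - ws_demo[0] * x1_demo) / ws_demo[1]

# pass the boundary points through the model
z_demo = ws_demo[0] * x1_demo + ws_demo[1] * x2_demo + b_demo
a_demo = 1.0 / (1.0 + np.exp(-z_demo))
print(a_demo)  # the model outputs 0.5 everywhere on the boundary
```

Note that the formula divides by $ w_2 $, so it only works when the second weight is not zero.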
Adding Hidden Layer
Now we are going to add a hidden layer consisting of two neurons. For the hidden layer neurons use the activation function $ tanh $ instead of $ logistic $.
Task:
 Implement the function for the forward_pass. It should return:
  Z1s (the results of $ \vec x \cdot w_{11} + b_1 $ and $ \vec x \cdot w_{12} + b_1 $)
  A1s (the results of passing Z1 into the activation function)
  Z2s (the results of $ A1 \cdot w_{21} + b_2 $ and $ A1 \cdot w_{22} + b_2 $)
  A2 (also known as y_predicted, the result of passing Z2s into the $ logistic $ function)
 Then implement backprop, extending the chain rule:
$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1 )),w_2,b_2)),y)}{\partial \text{} w_{2}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial w_2} $
$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1)),w_2,b_2)),y)}{\partial \text{} b_{2}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_2} $
$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1 )),w_2,b_2)),y)}{\partial \text{} w_{1}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial w_1} $
$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1 )),w_2,b_2)),y)}{\partial \text{} b_{1}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial b_1} $
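Before writing the backward pass it helps to be clear about the array shapes each factor in these chains must have. The following minimal sketch runs one forward pass through a 2-2-1 network on random data and prints the shapes. It is an illustration under assumptions, not the expected solution: all `_demo` names are hypothetical, the data is random rather than the notebook's `X`, and `np.tanh` stands in for your own `tanh`.

```python
import numpy as np

rng = np.random.default_rng(0)

x_demo = rng.normal(size=(25, 2))        # m = 25 examples, n = 2 features
w1_demo = rng.normal(size=(2, 2)) * 0.1  # input -> 2 hidden neurons
b1_demo = np.zeros((1, 2))
w2_demo = rng.normal(size=(2, 1)) * 0.1  # 2 hidden neurons -> 1 output
b2_demo = np.zeros((1, 1))

z1_demo = x_demo @ w1_demo + b1_demo     # linear part of the hidden layer
a1_demo = np.tanh(z1_demo)               # hidden activations
z2_demo = a1_demo @ w2_demo + b2_demo    # linear part of the output layer
a2_demo = 1.0 / (1.0 + np.exp(-z2_demo)) # prediction in (0, 1)

for name, arr in [("Z1", z1_demo), ("A1", a1_demo), ("Z2", z2_demo), ("A2", a2_demo)]:
    print(name, arr.shape)
```

Each gradient must end up with the shape of the parameter it updates: the gradient for the first weight matrix is `(2, 2)`, for the second `(2, 1)`, and the bias gradients match `(1, 2)` and `(1, 1)`.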
def forward_pass(x, ws, bs, act_fs):
    raise NotImplementedError()

def backprop(x, y, ws, bs, act_fs, cost_f, lrate, epochs):
    for i in range(epochs):
        ### forward
        Z1, A1, Z2, A2 = forward_pass(x, ws, bs, act_fs)
        ### cost
        ### backward
        ### updates
    return ws, bs, Z1, A1, Z2, A2
### Training, just execute this cell
# weights can be initialized such that training does not succeed. The seed guarantees working weights
# ATTENTION: if changing parameters like number of neurons, activation function, cost function
# you might need to try other seeds
np.random.seed(4242)
ws = [
    np.random.randn(2, 2)*0.1,
    np.random.randn(2, 1)*0.1
]
print(ws)
# one bias row vector per layer, initialized with zeros
bs = [
    np.full((1,2),0.),
    np.full((1,1),0.),
]
act_fs = [tanh, logistic] ### possible: tanh / logistic / relu
cost_f = cross_entropy ### possible: mean_squared_error / cross_entropy
y = y.reshape((len(X),1))
ws, bs, Z1, A1, Z2, A2 = backprop(X, y, ws, bs, act_fs, cost_f, .1, 10000)
### Applying a threshold to our predictions
A2[A2<0.5] = 0.
A2[A2>0.5] = 1.
### Then we can compare our predictions with the true labels
print(A2.flatten())
print(y.flatten())
Now we are going to plot two things:

1st: The two neurons in the hidden layer represent the original data, transformed into another 2D space in which it is more likely to be linearly separable. However, for our data, which follows two different $ sin $ functions, these two hidden neurons are just not enough to separate all data correctly.

2nd: We can also plot the decision boundary in our original space.
If your implementation is correct and training succeeded, your plots could look like the following:
### plot hidden transformation of X with learned w1s (transformation) and learned w2s (boundary)
###
### ATTENTION: ONLY WORKS IF HIDDEN LAYER has 2 neurons only
###
### Plot transformations
Z1, A1, Z2, A2 = forward_pass(X, ws, bs, act_fs)
plt.scatter(A1[:m1,0], A1[:m1,1], alpha=0.5, label='class 0')
plt.scatter(A1[m1:,0], A1[m1:,1], alpha=0.5, label='class 1')
### Plot true target functions
data_tmp = np.ndarray((len(x1_line), 2))
data_tmp[:,0] = x1_line
data_tmp[:,1] = x2_line_class_0
Z1, A1, Z2, A2 = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(A1[:,0], A1[:,1])
data_tmp[:,1] = x2_line_class_1
Z1, A1, Z2, A2 = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(A1[:,0], A1[:,1])
### Plot boundary
Z1, A1, Z2, A2 = forward_pass(X, ws, bs, act_fs)
x1_boundary_mlp = np.linspace(-1, +1, 10)
x2_boundary_mlp = (-bs[1][0,0] - ws[1][0,0]*x1_boundary_mlp)/ws[1][1,0]
plt.plot(x1_boundary_mlp, x2_boundary_mlp, c='g')
plt.legend()
plt.title('Data and boundary in hidden space')
### plot boundary in original space
grid_density = 100
x1 = np.linspace(X[:,0].min()-1,X[:,0].max()+1,grid_density)
x2 = np.linspace(X[:,1].min()-1,X[:,1].max()+1,grid_density)
mash = np.meshgrid(x1,x2)
data_tmp = np.ndarray((grid_density**2, n))
data_tmp[:,0] = mash[0].flatten()
data_tmp[:,1] = mash[1].flatten()
Z1, A1, Z2, A2 = forward_pass(data_tmp, ws, bs, act_fs)
c0 = data_tmp[A2[:,0] < 0.5]
c1 = data_tmp[A2[:,0] >= 0.5]
plt.scatter(c0[:,0],c0[:,1], alpha=1.0, marker='s', color="#aaccee")
plt.scatter(c1[:,0],c1[:,1], alpha=1.0, marker='s', color="#eeccaa")
plot_data()
plt.title('Data and boundary in original space')
Adding more Layers and Parametrization
Task:
Now write the forward_pass and backprop functions again, but this time fully parametrize them, so you can use them with a different number of layers, a different activation function for each layer, and so on.
def forward_pass(x, ws, bs, act_fs):
    raise NotImplementedError()
    return Zs, As

def backprop(x, y, ws, bs, act_fs, cost_f, lrate, epochs):
    for i in range(epochs):
        ### forward
        Zs, As = forward_pass(x, ws, bs, act_fs)
        ### cost
        ### backward
        ### update
    return ws, bs, Zs, As
np.random.seed(42)
ws = [
    np.random.randn(2, 20)*0.1,
    np.random.randn(20, 20)*0.1,
    np.random.randn(20, 2)*0.1,
    np.random.randn(2, 1)*0.1
]
bs = [np.full((1,len(w[1])),0.) for w in ws]
act_fs = [tanh, tanh, tanh, logistic] ### tanh / logistic / relu
cost_f = cross_entropy ### mean_squared_error / cross_entropy
y = y.reshape((len(X),1))
ws, bs, Zs, As = backprop(X, y, ws, bs, act_fs, cost_f, 0.1, 2000)
results = As[-1].flatten()
results[results < .5] = 0.
results[results >= .5] = 1.
print(results, results.shape)
print(y.flatten())
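Writing out every weight matrix by hand becomes tedious once you start experimenting with other architectures. As a hedged sketch, the lists could instead be built from a list of layer sizes. The helper name `init_params_demo` and its arguments are hypothetical and not part of the notebook. It also uses NumPy's `default_rng` instead of `np.random.seed`, so the drawn values differ from the ones in the cell above.

```python
import numpy as np

def init_params_demo(layer_sizes, scale=0.1, seed=42):
    # layer_sizes lists the number of units per layer, input first, e.g. [2, 20, 20, 2, 1]
    rng = np.random.default_rng(seed)
    # one weight matrix of shape (n_in, n_out) per pair of neighbouring layers
    ws_demo = [rng.standard_normal((n_in, n_out)) * scale
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    # one zero-initialized bias row vector per weight matrix
    bs_demo = [np.zeros((1, w.shape[1])) for w in ws_demo]
    return ws_demo, bs_demo

ws_demo, bs_demo = init_params_demo([2, 20, 20, 2, 1])
print([w.shape for w in ws_demo])
print([b.shape for b in bs_demo])
```

The list of activation functions then needs one entry per weight matrix, with the logistic function last.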
Now we are going to plot again:
 1st: The data for the two neurons in the LAST hidden layer.
 2nd: The decision boundary in our original space.
If your implementation is correct and training succeeded, your plots could look like the following:
# plot hidden transformation of X with learned w1s (transformation) and learned w2s (boundary)
#
# ATTENTION: ONLY WORKS IF LAST LAYER-1 has 2 neurons only
#
Zs, As = forward_pass(X, ws, bs, act_fs)
plt.scatter(As[-2][:m1,0], As[-2][:m1,1], alpha=0.5, label='class 0')
plt.scatter(As[-2][m1:,0], As[-2][m1:,1], alpha=0.5, label='class 1')
data_tmp = np.ndarray((len(x1_line), 2))
data_tmp[:,0] = x1_line
print(data_tmp.shape)
data_tmp[:,1] = x2_line_class_0
Zs, As = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(As[-2][:,0], As[-2][:,1])
data_tmp[:,1] = x2_line_class_1
Zs, As = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(As[-2][:,0], As[-2][:,1])
#x1_boundary_mlp = np.linspace(As[-2][:,0].min(),As[-2][:,0].max(), 10)
x1_boundary_mlp = np.linspace(-1, +1, 10)
x2_boundary_mlp = (-bs[-1][0,0] - ws[-1][0,0]*x1_boundary_mlp)/ws[-1][1,0]
plt.plot(x1_boundary_mlp, x2_boundary_mlp, c='g')
plt.title('Data and boundary in hidden space')
grid_density = 100
x1 = np.linspace(X[:,0].min()-1,X[:,0].max()+1,grid_density)
x2 = np.linspace(X[:,1].min()-1,X[:,1].max()+1,grid_density)
mash = np.meshgrid(x1,x2)
data_tmp = np.ndarray((grid_density**2, n))
data_tmp[:,0] = mash[0].flatten()
data_tmp[:,1] = mash[1].flatten()
Zs, As = forward_pass(data_tmp, ws, bs, act_fs)
print(data_tmp.shape)
print(As[-1].shape)
c0 = data_tmp[As[-1][:,0] < 0.5]
c1 = data_tmp[As[-1][:,0] >= 0.5]
plt.scatter(c0[:,0],c0[:,1], alpha=1.0, marker='s', color="#aaccee")
plt.scatter(c1[:,0],c1[:,1], alpha=1.0, marker='s', color="#eeccaa")
plot_data()
plt.title('Data and boundary in original space')
Freestyle Exercise
When you are finished, you can try different activation functions ($ tanh $, $ logistic $, $ relu $) and / or different cost functions when calling the backprop function for the neural network. You can also try to add more layers.
Things you should note when trying different things:
 In order to plot the transformed hidden space, the last hidden layer may only consist of 2 neurons.
 When using $ relu $ as activation function, avoid using it for a layer with only 2 neurons. Chances are high that the weights are negative and you end up with 2 dead neurons.
 If you do use $ relu $ in a layer with only 2 neurons, make sure not to initialize its weights with negative values.
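The "dead neuron" effect mentioned above can be reproduced in a few lines. This is a minimal sketch under simplifying assumptions: the inputs are made up and all positive (unlike the notebook's data), the weights are all negative, there is no bias, and the `_demo` names are hypothetical.

```python
import numpy as np

def relu_demo(x):
    # ReLU: max(0, x)
    return np.maximum(0.0, x)

def relu_deriv_demo(x):
    # ReLU derivative: 0 if x <= 0 else 1
    return (x > 0).astype(float)

x_pos = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 0.1]])  # all inputs positive
w_neg = np.array([[-0.3, -0.1], [-0.2, -0.4]])          # all weights negative

z_dead = x_pos @ w_neg          # every pre-activation is negative
print(relu_demo(z_dead))        # all zeros: both neurons output nothing
print(relu_deriv_demo(z_dead))  # all zeros: no gradient flows back
```

Because the derivative is zero for every example, the weights of these two neurons receive no update and the neurons stay inactive. With more neurons per layer it is far less likely that all of them start out in this state.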
Summary and Outlook
[TODO]
Licenses
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
Exercise: Simple Neural Network
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2019 Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.