ML-Fundamentals - Simple Neural Network

Introduction

In this exercise you will be presented with a classification problem with two classes and two features. The classes are not linearly separable. First you will implement logistic regression, which will yield a very poor decision boundary. Then you will extend your model with a hidden layer consisting of only two hidden neurons. By executing the plot cells you will see that these two hidden neurons are almost enough to find a decision boundary that separates the data much better.

Finally, you will implement a neural network with multiple hidden layers to solve the problem without any misclassifications.

Requirements

Knowledge

You should have a basic knowledge of:

  • Logistic regression
  • Logistic function
  • Tanh as activation function
  • Relu as activation function
  • Mean squared error
  • Cross-entropy loss
  • Gradient descent
  • Backpropagation
  • numpy
  • matplotlib

Suitable sources for acquiring this knowledge are:

Python Modules

By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.

# External Modules
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Data Generation

For convenience and visualization, we will use only two features in this notebook, so that we can still plot them together with the target class and the decision boundary.

First we will create some artificial data:

  • $ m_1 = 10 $ examples for class 0
  • $ m_2 = 15 $ examples for class 1
  • $ n = 2 $ features for each example

No exercise yet, just execute the cells.

m1 = 10
m2 = 15
m = m1 + m2
n = 2
X = np.ndarray((m,n))
X.shape
y = np.zeros((m))
y[m1:] = y[m1:] + 1.0
y
### Execute this to generate linearly separable data
def x2_function_class_0(x):
    return -x*2 + 2

def x2_function_class_1(x):
    return -x*2 + 4
### Execute this to generate NOT linearly separable data
def x2_function_class_0(x):
    return np.sin(x)

def x2_function_class_1(x):
    return np.sin(x) + 1
x1_min = -5
x1_max = +5

X[:m1,0] = np.linspace(x1_min, x1_max, m1)
X[m1:,0] = np.linspace(x1_min+0.5, x1_max-0.2, m2)
X[:m1,1] = x2_function_class_0(X[:m1,0])
X[m1:,1] = x2_function_class_1(X[m1:,0])
def plot_data():
    plt.scatter(X[:m1,0], X[:m1,1], alpha=0.5, label='class 0 train data')
    plt.scatter(X[m1:,0], X[m1:,1], alpha=0.5, label='class 1 train data')

    plt.plot(x1_line, x2_line_class_0, alpha=0.2, label='class 0 true target func')
    plt.plot(x1_line, x2_line_class_1, alpha=0.2, label='class 1 true target func')
    plt.legend(loc=1)
x1_line = np.linspace(x1_min, x1_max, 100)
x2_line_class_0 = x2_function_class_0(x1_line)
x2_line_class_1 = x2_function_class_1(x1_line)    

plot_data()

Exercises

Activation and Cost Functions

In order to implement logistic regression and a neural net with a hidden layer, we need at least:

  • An activation function like tanh
  • A cost function like cross-entropy
  • The sigmoid (or logistic) function

Task:

Implement at least the following functions and their derivatives:

  • Tanh: $ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $
  • Tanh derivative: $ \tanh'(x) = 1 - \tanh(x)^2 $
  • Logistic: $ \sigma(x) = \frac{1}{1 + e^{-x}} $
  • Logistic derivative: $ \sigma'(x) = \sigma(x) \cdot (1-\sigma(x)) $
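  • Cross-entropy: $ \frac{1}{m}\sum_{i=1}^m \left( -y^i \cdot \log(\hat y^i) - (1-y^i) \cdot \log(1-\hat y^i) \right) $
  • Cross-entropy derivative (per example, with respect to $ \hat y $): $ \frac{\hat y - y}{\hat y \cdot (1 - \hat y)} $. Multiplied by the logistic derivative $ \hat y \cdot (1 - \hat y) $, this simplifies to $ \hat y - y $.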
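As a short worked step, the logistic derivative listed above can be obtained directly from the definition of $ \sigma $. The following derivation is only a sketch of the standard calculation and uses no symbols beyond those in the list:

```latex
\begin{align*}
\sigma(x)  &= \left(1 + e^{-x}\right)^{-1} \\
\sigma'(x) &= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
            = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
           &= \sigma(x) \cdot \left(1 - \sigma(x)\right),
\qquad \text{since } 1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}
\end{align*}
```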

Optionally (to play around with) also implement:

  • ReLU: $ \max(0,x) $
  • ReLU derivative: $ 0 \text{ if } x \le 0 \text{ else } 1 $
  • Mean squared error: $ \frac{1}{m}\sum_{i=1}^m (y^i - \hat y^i)^2 $
  • Mean squared error derivative: $ 2 \cdot (\hat y - y) $

If your implementations are correct, the plot of the activation functions and their derivatives (produced by executing the last cell of this section) should look like the following:

[Figure: expected plots of the activation functions and their derivatives; internet connection needed]

def logistic(x, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()
    
def tanh(x, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()

def relu(x, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()
def cross_entropy(y_preds, y, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()

def mean_squared_error(y_preds, y, deriv=False):
    if deriv:
        raise NotImplementedError()
    raise NotImplementedError()
### Just execute to plot your implementation

plt.figure(figsize=(16,4))
x_tmp = np.linspace(-5,5,100)

ax = plt.subplot(1,3,1)
ax.plot(x_tmp, logistic(x_tmp), label='logistic function')
ax.plot(x_tmp, logistic(x_tmp, True), label='logistic function derivative')
ax.legend()

ax = plt.subplot(1,3,2)
ax.plot(x_tmp, tanh(x_tmp), label='tanh')
ax.plot(x_tmp, tanh(x_tmp, True), label='tanh derivative')
ax.legend()

ax = plt.subplot(1,3,3)
ax.plot(x_tmp, relu(x_tmp), label='relu')
ax.plot(x_tmp, relu(x_tmp, True), label='relu derivative')
ax.legend()

Logistic Regression

Task:

Implement the iterative gradient_descent function for logistic regression:

  • Forward pass:
    • $ Z = \vec x w + b $
    • $ A = \sigma (Z) $
  • Print the cost (you can try cross-entropy and mean squared error):
    • $ C = crossentropy(A, y) $
  • Gradient descent update rule:

$ w_{i_{new}} \leftarrow w_{i_{old}} - \alpha \cdot \frac{\partial C (A( Z(x,w,b)),y)}{\partial w_{i}} $

$ b_{new} \leftarrow b_{old} - \alpha \cdot \frac{\partial C (A( Z(x,w,b)),y)}{\partial b} $

Recap:

The data flow in the forward pass should look like this:

  • $ x \rightarrow linear \rightarrow sigmoid \rightarrow \hat y $
  • mathematically: $ \hat y = sigmoid(linear(\vec x)) $
  • with $ \hat y $ the prediction of your model

The following picture visualizes the data flow:

logistic_regression_2_features_2_classes.svg
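To make the data flow concrete, here is a minimal sketch of a single forward pass for one example. The names `sketch_logistic`, `x_example`, `w_example` and `b_example`, as well as the numbers, are made up for illustration and are not part of the notebook.

```python
import numpy as np

def sketch_logistic(z):
    # Logistic function as defined in the task list of the previous section.
    return 1.0 / (1.0 + np.exp(-z))

x_example = np.array([[1.0, 2.0]])   # one example with two features
w_example = np.array([0.5, -0.25])   # hypothetical weights
b_example = 0.0                      # hypothetical bias

z_example = x_example @ w_example + b_example   # linear step: 0.5 - 0.5 = 0
y_hat_example = sketch_logistic(z_example)      # sigmoid step: sigma(0) = 0.5

print(z_example, y_hat_example)   # [0.] [0.5]
```

With these values the example lies exactly on the decision boundary, which is why the prediction is 0.5.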

Hint:

One way to calculate the partial derivatives of the weights and the bias is to write out the forward pass with pen and paper, calculate the partial derivatives for $ w_1, w_2 $ and $ b $ by hand, and use the resulting formulas.

Another (and the suggested) way is to calculate the derivatives of the individual functions and chain them:

$ \frac{\partial C (A( Z(x,w,b)),y)}{\partial w_{i}} = \frac{\partial C}{\partial A} \cdot \frac{\partial A}{\partial Z} \cdot \frac{\partial Z}{\partial w_i} $

$ \frac{\partial C (A( Z(x,w,b)),y)}{\partial b} = \frac{\partial C}{\partial A} \cdot \frac{\partial A}{\partial Z} \cdot \frac{\partial Z}{\partial b} $

With:

  • $ \frac{\partial C}{\partial A}\rightarrow $ cross_entropy(A, y, deriv=True)
  • $ \frac{\partial A}{\partial Z}\rightarrow $ logistic(Z, deriv=True)
  • $ \frac{\partial Z}{\partial w}\rightarrow $ ...
  • $ \frac{\partial Z}{\partial b}\rightarrow $ ...
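Whichever way you derive the partial derivatives, it can help to compare them against a numerical approximation. The following is a minimal sketch of a central finite-difference gradient check. The helper `numerical_gradient` and the toy cost are hypothetical and not part of the notebook. The check is demonstrated on a cost whose gradient is known, so it does not anticipate the exercise solution.

```python
import numpy as np

def numerical_gradient(cost_fn, params, eps=1e-6):
    """Central finite-difference gradient of a scalar cost_fn at params."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for j in range(params.size):
        step = np.zeros_like(params)
        step.flat[j] = eps
        # Perturb only parameter j and measure the change of the cost.
        grad.flat[j] = (cost_fn(params + step) - cost_fn(params - step)) / (2 * eps)
    return grad

# Toy check on a cost with a known gradient: C(p) = sum(p^2), dC/dp = 2p.
p_toy = np.array([1.0, -2.0, 0.5])
num_grad = numerical_gradient(lambda p: np.sum(p ** 2), p_toy)
print(num_grad)
```

To apply this to logistic regression, one option is to pack the weights and the bias into a single vector and let `cost_fn` run your forward pass and return the cost. The result should then agree with your chained derivatives up to a small numerical error.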

def gradient_descent(x, y, ws, b, lrate, epochs):
    
    for i in range(epochs):
        
        # forward
        
        # calculate and print costs

        # backward (calculation of partial derivatives for ws and b)
        
        # update ws, b
        
        pass
    
    # return new ws, b and prediction of last iteration
    raise NotImplementedError()

If your implementation is correct, running the training cell and plot cell below should result in either one of the following plots (depending on what data generation process you have chosen):

[Figure: expected plots for both data generation processes; internet connection needed]

As expected, plain logistic regression cannot separate the data set very well if you used the data that is not linearly separable (the plot on the right).

### TRAINING HERE, just execute

ws = np.array([1.,2.])
b = 0.
ws, b, y_pred = gradient_descent(X, y, ws, b, lrate=0.1, epochs=500)
y_pred[y_pred >= 0.5] = 1.  # predictions of exactly 0.5 count as class 1
y_pred[y_pred < 0.5] = 0.
print(y)
print(y_pred)
### Plot the data and decision boundary, just execute this cell

x2_boundary = (-b -ws[0]*x1_line)/ws[1]
plt.plot(x1_line, x2_boundary, c='g', label='boundary')

plot_data()
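The boundary line plotted in the cell above follows from the fact that the logistic function returns 0.5 exactly when its input is zero. A short sketch of the rearrangement, with $ w_1, w_2 $ standing for ws[0], ws[1] and assuming $ w_2 \neq 0 $:

```latex
\begin{align*}
\sigma(z) = 0.5 \;&\Leftrightarrow\; z = 0 \\
w_1 x_1 + w_2 x_2 + b = 0 \;&\Leftrightarrow\; x_2 = \frac{-b - w_1 x_1}{w_2}
\end{align*}
```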

Adding Hidden Layer

Now we are going to add a hidden layer consisting of two neurons. For the hidden layer neurons, use $ tanh $ as the activation function instead of $ logistic $.

Task:

  • Implement the function for the forward_pass. It should return:

    • Z1s (the results of $ \vec x \cdot w_{11} + b_1 $ and $ \vec x \cdot w_{12} + b_1 $)

    • A1s (the results of passing Z1s into the activation function)

    • Z2s (the results of $ A1 \cdot w_{21} + b_2 $ and $ A1 \cdot w_{22} + b_2 $)

    • A2 (also known as y_predicted, the result of passing Z2s into the $ logistic $ function)

  • Then backprop, extending the chain rule:

$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1 )),w_2,b_2)),y)}{\partial \text{} w_{2}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial w_2} $

$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1)),w_2,b_2)),y)}{\partial \text{} b_{2}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_2} $

$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1 )),w_2,b_2)),y)}{\partial \text{} w_{1}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial w_1} $

$ \frac{\partial C (A_2( Z_2(A_1( Z_1(x,w_1,b_1 )),w_2,b_2)),y)}{\partial \text{} b_{1}} = \frac{\partial C}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial b_1} $
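Before writing the backward pass, it may help to track the array shapes of every intermediate result, because each partial derivative has the same shape as the quantity it belongs to. The following sketch assumes the 2-2-1 layout used by the training cell of this section (25 examples, two hidden neurons, one output). All names and the random values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(25, 2))     # 25 examples, 2 features

w1 = rng.normal(size=(2, 2)) * 0.1   # hidden layer weights
b1 = np.zeros((1, 2))                # hidden layer biases
w2 = rng.normal(size=(2, 1)) * 0.1   # output layer weights
b2 = np.zeros((1, 1))                # output layer bias

z1 = x_toy @ w1 + b1                 # (25, 2); b1 is broadcast over all rows
a1 = np.tanh(z1)                     # (25, 2)
z2 = a1 @ w2 + b2                    # (25, 1)
a2 = 1.0 / (1.0 + np.exp(-z2))       # (25, 1); logistic output

for name, arr in [("Z1", z1), ("A1", a1), ("Z2", z2), ("A2", a2)]:
    print(name, arr.shape)
```

A gradient for `w1` therefore has to end up with shape (2, 2) and a gradient for `w2` with shape (2, 1), which is a quick sanity check for the order of the matrix products in your chain.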

def forward_pass(x, ws, bs, act_fs):
        raise NotImplementedError()
def backprop(x, y, ws, bs, act_fs, cost_f, lrate, epochs):
    for i in range(epochs):
      
        ### forward
        Z1, A1, Z2, A2 = forward_pass(x, ws, bs, act_fs)
        
        ### cost
        
        ### backward
        
        ### updates
        
    return ws, bs, Z1, A1, Z2, A2
### Training, just execute this cell

# weights can be initialized such that training does not succeed. The seed guarantees working weights
# ATTENTION: if you change parameters like the number of neurons, activation function or cost function,
# you might need to try other seeds
np.random.seed(4242)
ws = [
    np.random.randn(2, 2)*0.1,
    np.random.randn(2, 1)*0.1
    ]
print(ws)
bs = [
    np.full((1,2),0.),
    np.full((1,1),0.),
]
act_fs = [tanh, logistic] ### possible: tanh / logistic / relu
cost_f = cross_entropy ### possible: mean_squared_error / cross_entropy

y = y.reshape((len(X),1))

ws, bs, Z1, A1, Z2, A2 = backprop(X, y, ws, bs, act_fs, cost_f, .1, 10000)

### Applying a threshold to our predictions
A2[A2<0.5] = 0.
A2[A2>0.5] = 1.
### Then we can compare our predictions with the true labels
print(A2.flatten())
print(y.flatten())
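Instead of comparing the two printed arrays by eye, the agreement can be summarized as a single number. This is a minimal sketch. The helper `accuracy` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def accuracy(y_pred_labels, y_true):
    """Fraction of thresholded predictions that match the true labels."""
    y_pred_labels = np.asarray(y_pred_labels).flatten()
    y_true = np.asarray(y_true).flatten()
    return np.mean(y_pred_labels == y_true)

toy_pred = np.array([0., 1., 1., 0.])
toy_true = np.array([0., 1., 0., 0.])
print(accuracy(toy_pred, toy_true))   # 0.75 (3 of 4 labels match)
```

In the notebook, a call like `accuracy(A2, y)` after the thresholding step would give the fraction of correctly classified training examples.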

Now we are going to plot two things:

  • 1st: The two neurons in the hidden layer represent the original data, transformed into another 2D space in which the data is more likely to be linearly separable. However, for our data, which follows two different $ sin $ functions, these two hidden neurons are just not enough to separate all data points correctly.

  • 2nd: We can also plot the decision boundary in our original space.

If your implementation is correct and training did succeed, your plots could look like the following:

[Figure: expected plots of the hidden space and the original space; internet connection needed]

### plot hidden transformation of X with learned w1s (transformation) and learned w2s (boundary)
###
### ATTENTION: ONLY WORKS IF HIDDEN LAYER has 2 neurons only
###

### Plot transformations
Z1, A1, Z2, A2 = forward_pass(X, ws, bs, act_fs)
plt.scatter(A1[:m1,0], A1[:m1,1], alpha=0.5, label='class 0')
plt.scatter(A1[m1:,0], A1[m1:,1], alpha=0.5, label='class 1')

### Plot true target functions
data_tmp = np.ndarray((len(x1_line), 2))
data_tmp[:,0] = x1_line

data_tmp[:,1] = x2_line_class_0
Z1, A1, Z2, A2 = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(A1[:,0], A1[:,1])

data_tmp[:,1] = x2_line_class_1
Z1, A1, Z2, A2 = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(A1[:,0], A1[:,1])

### Plot boundary
Z1, A1, Z2, A2 = forward_pass(X, ws, bs, act_fs)
x1_boundary_mlp = np.linspace(-1, +1, 10)
x2_boundary_mlp = (-bs[-1][-1] -ws[-1][0,0]*x1_boundary_mlp)/ws[-1][1,0]
plt.plot(x1_boundary_mlp, x2_boundary_mlp, c='g')
plt.legend()
plt.title('Data and boundary in hidden space')
### plot boundary in original space

grid_density = 100
x1 = np.linspace(X[:,0].min()-1,X[:,0].max()+1,grid_density)
x2 = np.linspace(X[:,1].min()-1,X[:,1].max()+1,grid_density)
mash = np.meshgrid(x1,x2)

data_tmp = np.ndarray((grid_density**2, n))
data_tmp[:,0] = mash[0].flatten()
data_tmp[:,1] = mash[1].flatten()

Z1, A1, Z2, A2 = forward_pass(data_tmp, ws, bs, act_fs)
c0 = data_tmp[A2[:,0] < 0.5]
c1 = data_tmp[A2[:,0] >= 0.5]
plt.scatter(c0[:,0],c0[:,1], alpha=1.0, marker='s', color="#aaccee")
plt.scatter(c1[:,0],c1[:,1], alpha=1.0, marker='s', color="#eeccaa")
plot_data()
plt.title('Data and boundary in original space')

Adding more Layers and Parametrization

Task:

Now write the forward_pass and backprop functions again, but this time fully parametrize them, so you can use them with a different number of layers, a different activation function for each layer, and so on.
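With several layers it is easy to build weight and bias lists that do not fit together. The following sketch checks that the number of outputs of each layer equals the number of inputs of the next one. The helper `check_layer_shapes` is hypothetical. It assumes weights of shape (inputs, outputs) and biases of shape (1, outputs), as in the training cells of this notebook.

```python
import numpy as np

def check_layer_shapes(ws, bs, n_features):
    """Raise AssertionError if consecutive layers do not fit together."""
    n_in = n_features
    for k, (w, b) in enumerate(zip(ws, bs)):
        assert w.shape[0] == n_in, f"layer {k}: expected {n_in} inputs, got {w.shape[0]}"
        assert b.shape == (1, w.shape[1]), f"layer {k}: unexpected bias shape {b.shape}"
        n_in = w.shape[1]   # outputs of this layer feed the next layer
    return n_in             # number of outputs of the last layer

# Same layer sizes as the training cell of this section: 2 -> 20 -> 20 -> 2 -> 1
ws_toy = [np.zeros((2, 20)), np.zeros((20, 20)), np.zeros((20, 2)), np.zeros((2, 1))]
bs_toy = [np.zeros((1, w.shape[1])) for w in ws_toy]
n_out = check_layer_shapes(ws_toy, bs_toy, n_features=2)
print(n_out)   # 1
```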

def forward_pass(x, ws, bs, act_fs):
        
        raise NotImplementedError()
        return Zs, As
def backprop(x, y, ws, bs, act_fs, cost_f, lrate, epochs):

    for i in range(epochs):
        
        ### forward
        Zs, As = forward_pass(x, ws, bs, act_fs)
        
        ### cost
        
        ### backward    
        
        ### update
        
    return ws, bs, Zs, As
np.random.seed(42)
ws = [
    np.random.randn(2, 20)*0.1,
    np.random.randn(20, 20)*0.1,
    np.random.randn(20, 2)*0.1,
    np.random.randn(2, 1)*0.1
    ]
bs = [np.zeros((1, w.shape[1])) for w in ws]  # one zero bias per neuron of each layer
act_fs = [tanh, tanh, tanh, logistic] ### tanh / logistic / relu
cost_f = cross_entropy ### mean_squared_error / cross_entropy

y = y.reshape((len(X),1))
ws, bs, Zs, As = backprop(X, y, ws, bs, act_fs, cost_f, 0.1, 2000)

results = As[-1].flatten()
results[results < .5] = 0.
results[results >= .5] = 1.
print(results, results.shape)
print(y.flatten())

Now we are going to plot again:

  • 1st: The data for the two neurons in the LAST hidden layer.
  • 2nd: The decision boundary in our original space.

If your implementation is correct and training did succeed, your plots could look like the following:

[Figure: expected plots of the last hidden space and the original space; internet connection needed]

# plot hidden transformation of X with learned w1s (transformation) and learned w2s (boundary)
#
# ATTENTION: ONLY WORKS IF LAST LAYER-1 has 2 neurons only
#
Zs, As = forward_pass(X, ws, bs, act_fs)
plt.scatter(As[-2][:m1,0], As[-2][:m1,1], alpha=0.5, label='class 0')
plt.scatter(As[-2][m1:,0], As[-2][m1:,1], alpha=0.5, label='class 1')

data_tmp = np.ndarray((len(x1_line), 2))
data_tmp[:,0] = x1_line
print(data_tmp.shape)
data_tmp[:,1] = x2_line_class_0
Zs, As = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(As[-2][:,0], As[-2][:,1])

data_tmp[:,1] = x2_line_class_1
Zs, As = forward_pass(data_tmp, ws, bs, act_fs)
plt.plot(As[-2][:,0], As[-2][:,1])

#x1_boundary_mlp = np.linspace(As[-2][:,0].min(),As[-2][:,0].max(), 10)
x1_boundary_mlp = np.linspace(-1, +1, 10)
x2_boundary_mlp = (-bs[-1][-1] -ws[-1][0,0]*x1_boundary_mlp)/ws[-1][1,0]
plt.plot(x1_boundary_mlp, x2_boundary_mlp, c='g')
plt.title('Data and boundary in hidden space')
grid_density = 100
x1 = np.linspace(X[:,0].min()-1,X[:,0].max()+1,grid_density)
x2 = np.linspace(X[:,1].min()-1,X[:,1].max()+1,grid_density)
mash = np.meshgrid(x1,x2)

data_tmp = np.ndarray((grid_density**2, n))
data_tmp[:,0] = mash[0].flatten()
data_tmp[:,1] = mash[1].flatten()

Zs, As = forward_pass(data_tmp, ws, bs, act_fs)
print(data_tmp.shape)
print(As[-1].shape)
c0 = data_tmp[As[-1][:,0] < 0.5]
c1 = data_tmp[As[-1][:,0] >= 0.5]
plt.scatter(c0[:,0],c0[:,1], alpha=1.0, marker='s', color="#aaccee")
plt.scatter(c1[:,0],c1[:,1], alpha=1.0, marker='s', color="#eeccaa")
plot_data()
plt.title('Data and boundary in original space')

Freestyle Exercise

When you are finished, you can try different activation functions ($ tanh $, $ logistic $, $ relu $) and/or different cost functions when calling the backprop function for the neural network. You can also try to add more layers.

Things you should note when trying different things:

  • In order to plot the transformed hidden space, the last hidden layer may consist of only 2 neurons.
  • When using $ relu $ as activation function, do not use it for a layer with only 2 neurons. Chances are high that the weights are negative and you end up with 2 dead neurons.
  • If you do use $ relu $ in a layer with only 2 neurons, make sure not to initialize its weights negative.
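The dead-neuron problem mentioned above can be illustrated with a small sketch. It assumes non-negative inputs (for example the outputs of a preceding ReLU layer) and a layer whose weights are all negative. The names `sketch_relu`, `a_prev`, `w_neg` and `b_zero` are made up for illustration.

```python
import numpy as np

def sketch_relu(z, deriv=False):
    # ReLU and its derivative as defined in the optional task list.
    if deriv:
        return (z > 0).astype(float)
    return np.maximum(0.0, z)

a_prev = np.array([[0.2, 0.9], [0.5, 0.1], [0.7, 0.3]])   # non-negative inputs
w_neg = np.array([[-0.4, -0.1], [-0.3, -0.6]])            # all weights negative
b_zero = np.zeros((1, 2))

z = a_prev @ w_neg + b_zero          # every entry is negative
print(sketch_relu(z))                # all zeros: both neurons output nothing
print(sketch_relu(z, deriv=True))    # all zeros: no gradient flows back
```

Because the derivative is zero for every example, gradient descent cannot change the weights of these two neurons, so they stay inactive for the rest of the training.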

Summary and Outlook

[TODO]

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise: Simple Neural Network
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2019 Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.