Neural Network Architectures

CSI 4106 - Fall 2024

Marcel Turcotte

Version: Nov 7, 2024 17:56

Preamble

Quote of the Day

Learning objectives

  • Explain the Hierarchy of Concepts in Deep Learning
  • Compare Deep and Shallow Neural Networks
  • Describe the Structure and Function of Convolutional Neural Networks (CNNs)
  • Understand Convolution Operations Using Kernels
  • Explain Receptive Fields, Padding, and Stride in CNNs
  • Discuss the Role and Benefits of Pooling Layers

Introduction

Hierarchy of concepts

  • Each layer detects patterns from the output of the layer preceding it.
    • In other words, proceeding from the input to the output of the network, the network uncovers “patterns of patterns”.
      • Analyzing an image, the network first detects simple patterns, such as vertical, horizontal, and diagonal lines, arcs, etc.
      • These are then combined to form corners, crosses, etc.
  • (This explains why transfer learning works, and why one typically reuses only the bottom layers of a pretrained network.)

But also …

“An MLP with just one hidden layer can theoretically model even the most complex functions, provided it has enough neurons. But for complex problems, deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data.”

How many layers?

  • Start with one layer, then increase the number of layers until the model starts overfitting the training data.
  • Fine-tune the model by adding regularization (dropout layers, regularization terms, etc.), as sketched below.
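A minimal sketch of the second step, assuming a tf.keras setup as in the example at the end of this lecture; the layer size and regularization strengths are hypothetical:

import tensorflow as tf

# Hypothetical hidden block: a Dense layer with two common forms of
# regularization added, an L2 penalty on the weights and dropout.
regularized_block = tf.keras.Sequential([
    tf.keras.layers.Dense(
        100, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.2),
])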

Observation

Consider a feed-forward network (FFN) and its model:

\[ h_{W,b}(X) = \phi_k(\ldots \phi_2(\phi_1(X)) \ldots) \]

where

\[ \phi_l(Z) = \sigma(W_l Z + b_l) \]

for \(l = 1 \ldots k\).

  • The number of parameters grows rapidly: each layer \(l\) requires

\[ (\text{size of layer}_{l-1} + 1) \times \text{size of layer}_{l} \]

Connecting two 1,000-unit layers already implies \((1000 + 1) \times 1000 \approx 1{,}000{,}000\) parameters!
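This growth is easy to verify in a couple of lines of Python (the layer sizes below are illustrative):

# Parameters of a dense layer: one weight per input unit,
# plus one bias, for each of its units.
def dense_params(size_prev, size_curr):
    return (size_prev + 1) * size_curr

print(dense_params(1000, 1000))  # 1001000, i.e., about one million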

Convolutional Neural Network

Convolutional Neural Network (CNN)

  • Crucial pattern information is often local.

    • e.g., edges, corners, crosses.
  • Convolutional layers reduce parameters significantly.

    • Unlike dense layers, neurons in a convolutional layer are not fully connected to the preceding layer.

    • Neurons connect only within their receptive fields (rectangular regions).

Kernel

Kernel Placements

Blurring

import numpy as np

# Define the 3x3 averaging (blurring) kernel
kernel = np.array([
    [1/9, 1/9, 1/9],
    [1/9, 1/9, 1/9],
    [1/9, 1/9, 1/9]
])

Vertical Edge detection

# Define the 3x3 vertical edge-detection kernel

kernel = np.array([
    [-0.25, 0, 0.25],
    [-0.25, 0, 0.25],
    [-0.25, 0, 0.25]
])

Horizontal Edge detection

# Define the 3x3 horizontal edge-detection kernel

kernel = np.array([
    [-0.25, -0.25, -0.25],
    [0, 0, 0],
    [0.25, 0.25, 0.25]
])
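The sketch below applies the horizontal edge-detection kernel above to a small random image; it assumes SciPy is available. Note that scipy.signal.convolve2d performs a true convolution (the kernel is flipped), whereas deep learning libraries typically implement cross-correlation.

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # hypothetical 8x8 grayscale image

kernel = np.array([
    [-0.25, -0.25, -0.25],
    [0, 0, 0],
    [0.25, 0.25, 0.25]
])

# 'valid' keeps only the placements where the kernel fits entirely
# inside the image: (8 - 3 + 1) x (8 - 3 + 1) = 6 x 6 outputs.
edges = convolve2d(image, kernel, mode="valid")
print(edges.shape)  # (6, 6)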

Convolutions in Image Processing

But what is a convolution?

Kernels

In contrast to image processing, where kernels are manually defined by the user, in convolutional networks, the kernels are automatically learned by the network.

Receptive field

  • Each unit is connected to the neurons in its receptive field (see the sketch below).
    • Unit \(i,j\) in layer \(l\) is connected to the units \(i\) to \(i+f_h-1\), \(j\) to \(j+f_w-1\) of the layer \(l-1\), where \(f_h\) and \(f_w\) are respectively the height and width of the receptive field.
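In NumPy terms, this connectivity is just a slice of the previous layer's output; a minimal sketch with an illustrative 6 × 6 layer:

import numpy as np

f_h, f_w = 3, 3                           # receptive field height and width
prev_layer = np.arange(36).reshape(6, 6)  # hypothetical output of layer l-1

i, j = 2, 1
# Rows i to i+f_h-1, columns j to j+f_w-1, exactly as above.
patch = prev_layer[i:i + f_h, j:j + f_w]
print(patch.shape)  # (3, 3)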

Padding

Zero padding. In order to have layers of the same size, the grid can be padded with zeros.
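With NumPy, zero padding is a one-liner; here, one row/column of zeros on each side, enough for a 3 × 3 kernel at stride 1 to preserve a 4 × 4 input:

import numpy as np

x = np.arange(16).reshape(4, 4)
x_padded = np.pad(x, pad_width=1)  # pads with zeros by default
print(x_padded.shape)  # (6, 6)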

Padding

No padding

Half padding

Full padding

Stride

Stride. It is possible to connect a larger layer \((l-1)\) to a smaller one \((l)\) by sliding the receptive field in steps larger than one. The step size, i.e., the distance between two consecutive receptive fields, is called the stride: \(s_h\) vertically and \(s_w\) horizontally.

  • Unit \(i,j\) in layer \(l\) is connected to the units \(i \times s_h\) to \(i \times s_h + f_h - 1\), \(j \times s_w\) to \(j \times s_w + f_w - 1\) of the layer \(l-1\), where \(f_h\) and \(f_w\) are respectively the height and width of the receptive field, \(s_h\) and \(s_w\) are respectively the height and width strides.
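Combining receptive field size, padding, and stride gives the standard output-size formula, sketched below (each spatial dimension is treated independently):

import math

# Output size along one dimension: input size n, field size f,
# padding p on each side, stride s.
def conv_output_size(n, f, p=0, s=1):
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(28, 7, p=3, s=1))  # 28: "same" padding, stride 1
print(conv_output_size(28, 2, p=0, s=2))  # 14: e.g., a 2x2 pooling window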

Stride

No padding, strides

Padding, strides

Filters

Filters

  • A window of size \(f_h \times f_w\) is moved over the output of layer \(l-1\), referred to as the input feature map, position by position.

  • For each location, the element-wise product is computed between the extracted patch and a matrix of the same size, known as a convolution kernel or filter. The sum of the values of the resulting matrix constitutes the output for that location.

Model

Model

\[ z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k} \]

where \(i' = i \times s_h + u\) and \(j' = j \times s_w + v\); here, \(x_{i',j',k'}\) denotes the input value at row \(i'\), column \(j'\), feature map \(k'\) of layer \(l-1\), \(w_{u,v,k',k}\) is the corresponding weight of filter \(k\), and \(b_k\) is the bias of feature map \(k\).
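A direct, unoptimized NumPy transcription of this equation (a sketch for clarity; real frameworks use vectorized implementations):

import numpy as np

def conv2d_forward(x, w, b, s_h=1, s_w=1):
    # x: input, shape (height, width, f_n')
    # w: kernels, shape (f_h, f_w, f_n', f_n); b: biases, shape (f_n,)
    f_h, f_w, _, f_n = w.shape
    out_h = (x.shape[0] - f_h) // s_h + 1
    out_w = (x.shape[1] - f_w) // s_w + 1
    z = np.zeros((out_h, out_w, f_n))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * s_h:i * s_h + f_h, j * s_w:j * s_w + f_w, :]
            for k in range(f_n):
                z[i, j, k] = b[k] + np.sum(patch * w[:, :, :, k])
    return z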

Convolutional Layer

  • “Thus, a layer full of neurons using the same filter outputs a feature map.”

  • “Of course, you do not have to define the filters manually: instead, during training the convolutional layer will automatically learn the most useful filters for its task.”

Convolutional Layer

  • “(…) and the layers above will learn to combine them into more complex patterns.”

  • “The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model.”

Summary

  1. Feature Map: In convolutional neural networks (CNNs), the output of a convolution operation is known as a feature map. It captures the features of the input data as processed by a specific kernel.

  2. Kernel Parameters: The parameters of the kernel are learned through the backpropagation process, allowing the network to optimize its feature extraction capabilities based on the training data.

  3. Bias Term: A single bias term is added uniformly to all entries of the feature map. This bias helps adjust the activation level, providing additional flexibility for the network to better fit the data.

  4. Activation Function: Following the addition of the bias, the feature map values are typically passed through an activation function, such as ReLU (Rectified Linear Unit). The ReLU function introduces non-linearity by setting negative values to zero while retaining positive values, enabling the network to learn more complex patterns.
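Steps 3 and 4 amount to two lines of NumPy, shown here on a hypothetical 2 × 2 feature map:

import numpy as np

feature_map = np.array([[-1.0, 2.0],
                        [0.5, -3.0]])
bias = 0.1  # one bias shared by the whole feature map

# ReLU: negative entries become zero, positive entries pass through.
activated = np.maximum(0.0, feature_map + bias)
print(activated)  # [[0.  2.1] [0.6 0. ]]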

Pooling

  • A pooling layer exhibits similarities to a convolutional layer.

    • Each neuron in a pooling layer is connected to a set of neurons within a receptive field.
  • However, unlike convolutional layers, pooling layers do not possess weights.

    • Instead, they produce an output by applying an aggregating function, commonly max or mean.

Pooling

  • This subsampling process leads to a reduction in network size; each window of dimensions \(f_h \times f_w\) is condensed to a single value, typically the maximum or mean of that window.

  • According to Géron (2019), a max pooling layer provides a degree of invariance to small translations (§ 14).
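A minimal 2 × 2 max-pooling sketch with stride 2 and no padding, assuming the input height and width are even:

import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # Group the grid into non-overlapping 2x2 windows; keep each maximum.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])
print(max_pool_2x2(x))  # [[4 2]
                        #  [2 8]]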

Pooling

  1. Dimensionality Reduction: Pooling layers reduce the spatial dimensions (width and height) of the input feature maps. This reduction decreases the number of parameters and computational load in the network, which can help prevent overfitting.

  2. Feature Extraction: By summarizing the presence of features in a region, pooling layers help retain the most critical information while discarding less important details. This process enables the network to focus on the most salient features.

  3. Translation Invariance: Pooling introduces a degree of invariance to translations and distortions in the input. For instance, max pooling captures the most prominent feature in a local region, making the network less sensitive to small shifts or variations in the input.

  4. Noise Reduction: Pooling can help smooth out noise in the input by aggregating information over a region, thus emphasizing consistent features over random variations.

  5. Hierarchical Feature Learning: By reducing the spatial dimensions progressively through the layers, pooling layers allow the network to build a hierarchical representation of the input data, capturing increasingly abstract and complex features at deeper layers.

Keras

import tensorflow as tf
from functools import partial

# Convenience wrapper: 3x3 "same" convolutions with ReLU activation and
# He initialization, unless a keyword argument is overridden.
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3,
                        padding="same", activation="relu",
                        kernel_initializer="he_normal")

model = tf.keras.Sequential([
    # 28x28 single-channel input; a larger 7x7 kernel for the first layer.
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),  # 28x28 -> 14x14
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),  # 14x14 -> 7x7
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    tf.keras.layers.MaxPool2D(),  # 7x7 -> 3x3
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax"),
])

model.summary()

Keras

Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d_5 (Conv2D)               │ (None, 28, 28, 64)     │         3,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_3 (MaxPooling2D)  │ (None, 14, 14, 64)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_6 (Conv2D)               │ (None, 14, 14, 128)    │        73,856 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_7 (Conv2D)               │ (None, 14, 14, 128)    │       147,584 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_4 (MaxPooling2D)  │ (None, 7, 7, 128)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_8 (Conv2D)               │ (None, 7, 7, 256)      │       295,168 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_9 (Conv2D)               │ (None, 7, 7, 256)      │       590,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_5 (MaxPooling2D)  │ (None, 3, 3, 256)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten_1 (Flatten)             │ (None, 2304)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 128)            │       295,040 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 64)             │         8,256 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,413,834 (5.39 MB)
 Trainable params: 1,413,834 (5.39 MB)
 Non-trainable params: 0 (0.00 B)
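The parameter counts above can be verified by hand; for instance, for the first two convolutional layers:

# conv2d_5: 7x7 kernels, 1 input channel, 64 filters (+ 64 biases).
print(7 * 7 * 1 * 64 + 64)     # 3200
# conv2d_6: 3x3 kernels, 64 input maps, 128 filters (+ 128 biases).
print(3 * 3 * 64 * 128 + 128)  # 73856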

Convolutional Neural Networks

AlexNet

Krizhevsky, Sutskever, and Hinton (2012)

VGG

Simonyan and Zisserman (2015)

ConvNets Performance

StatQuest

Final Word

As you might expect, the number of layers and the number of filters per layer are hyperparameters, optimized through the process of hyperparameter tuning.

Prologue

Summary

  • Hierarchy of Concepts in Deep Learning
  • Kernels and Convolution Operations
  • Receptive Field, Padding, and Stride
  • Filters and Feature Maps
  • Convolutional Layers
  • Pooling Layers

Future Directions

When integrating CNNs into your projects, consider exploring the following topics:

  • Feature Attribution: Various techniques are available to visualize what the network has learned. For example, in the context of self-driving cars, it is crucial to ensure that the network focuses on relevant features, avoiding distractions.

  • Transfer Learning: This approach enables the reuse of weights from pre-trained networks, which accelerates the learning process, reduces computational demands, and facilitates network training even with a limited number of examples.

Further Reading

  • Understanding Deep Learning (Prince 2023) is a recently published textbook focused on the foundational concepts of deep learning.

  • It begins with fundamental principles and extends to contemporary topics such as transformers, diffusion models, graph neural networks, autoencoders, adversarial networks, and reinforcement learning.

  • The textbook aims to help readers comprehend these concepts without delving excessively into theoretical details.

  • It includes sixty-eight Python notebook exercises.

  • The book follows a “read-first, pay-later” model.

Resources

Next lecture

  • We will introduce solution spaces.

References

Géron, Aurélien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 2nd ed. O’Reilly Media.
———. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Adaptive Computation and Machine Learning. MIT Press. https://dblp.org/rec/books/daglib/0040158.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, edited by F. Pereira, C. J. Burges, L. Bottou, and K. Q. Weinberger. Vol. 25. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44. https://doi.org/10.1038/nature14539.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324. https://doi.org/10.1109/5.726791.
Prince, Simon J. D. 2023. Understanding Deep Learning. The MIT Press. http://udlbook.com.
Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.
Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” In International Conference on Learning Representations.

Marcel Turcotte

[email protected]

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa