CSI 4106 - Fall 2024
Version: Nov 7, 2024 17:56
“An MLP with just one hidden layer can theoretically model even the most complex functions, provided it has enough neurons. But for complex problems, deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data.”
Consider a feed-forward network (FFN) and its model:
\[ h_{W,b}(X) = \phi_k(\ldots \phi_2(\phi_1(X)) \ldots) \]
where
\[ \phi_l(Z) = \sigma(W_l Z + b_l) \]
for \(l=1 \ldots k\). The number of parameters in such a network grows rapidly; each layer contributes:
\[ (\text{size of layer}_{l-1} + 1) \times \text{size of layer}_{l} \]
Connecting two consecutive 1,000-unit layers alone implies \((1{,}000 + 1) \times 1{,}000 = 1{,}001{,}000\) parameters!
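As a quick sanity check, here is a minimal sketch of this arithmetic (plain Python, illustrative only):

import tensorflow as tf

# Parameters of a dense layer: one weight per input, plus one bias, per output unit.
def dense_params(n_in, n_out):
    return (n_in + 1) * n_out

print(dense_params(1000, 1000))  # 1001000: over a million parameters for one layer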
Crucial pattern information is often local.
Convolutional layers reduce parameters significantly.
Unlike dense layers, neurons in a convolutional layer are not fully connected to the preceding layer.
Neurons connect only within their receptive fields (rectangular regions).
In contrast to classical image processing, where kernels are defined manually by the user, convolutional networks learn their kernels automatically during training.
Zero padding. In order for a layer to have the same height and width as the preceding one, the input grid can be padded with zeros.
Stride. It is possible to connect a larger layer \((l-1)\) to a smaller one \((l)\) by sliding the window more than one unit at a time. The step size is called the stride: \(s_h\) vertically and \(s_w\) horizontally.
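In Keras, padding and stride are arguments of the convolutional layer. A minimal sketch of their effect on output shape (the input tensor below is illustrative):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 1])  # one 28x28 single-channel image

# padding="same": zero padding preserves the 28x28 spatial size (stride 1).
same = tf.keras.layers.Conv2D(32, kernel_size=3, padding="same")(x)
print(same.shape)    # (1, 28, 28, 32)

# padding="valid": no padding; the output shrinks to 26x26.
valid = tf.keras.layers.Conv2D(32, kernel_size=3, padding="valid")(x)
print(valid.shape)   # (1, 26, 26, 32)

# strides=2: the window moves two units at a time, halving height and width.
strided = tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", strides=2)(x)
print(strided.shape) # (1, 14, 14, 32)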
A window of size \(f_h \times f_w\) is moved over the output of layer \(l-1\), referred to as the input feature map, position by position.
For each location, the element-wise product of the extracted patch and a matrix of the same size, known as a convolution kernel or filter, is computed. The sum of the values in the resulting matrix constitutes the output for that location.
\[ z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k} \]
where \(i' = i \times s_h + u\) and \(j' = j \times s_w + v\).
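To make the indexing concrete, here is a minimal NumPy sketch of this equation (a naive loop for illustration only; real frameworks use highly optimized implementations):

import numpy as np

def conv_output(x, w, b, s_h=1, s_w=1):
    """Naive 2D convolution following the equation above.
    x: input feature maps, shape (height, width, f_n')
    w: kernels, shape (f_h, f_w, f_n', f_n)
    b: biases, shape (f_n,)
    """
    f_h, f_w, _, f_n = w.shape
    out_h = (x.shape[0] - f_h) // s_h + 1
    out_w = (x.shape[1] - f_w) // s_w + 1
    z = np.zeros((out_h, out_w, f_n))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*s_h:i*s_h+f_h, j*s_w:j*s_w+f_w, :]
            for k in range(f_n):
                # Triple sum over u, v, k' collapsed into one element-wise product.
                z[i, j, k] = b[k] + np.sum(patch * w[:, :, :, k])
    return z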
“Thus, a layer full of neurons using the same filter outputs a feature map.”
“Of course, you do not have to define the filters manually: instead, during training the convolutional layer will automatically learn the most useful filters for its task.”
“(…) and the layers above will learn to combine them into more complex patterns.”
“The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model.”
A pooling layer exhibits similarities to a convolutional layer.
Unlike convolutional layers, however, pooling layers have no weights.
This subsampling process leads to a reduction in network size; each window of dimensions \(f_h \times f_w\) is condensed to a single value, typically the maximum or mean of that window.
According to Géron (2019), a max pooling layer provides a degree of invariance to small translations (§ 14).
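A minimal sketch of max pooling in Keras (the input tensor is illustrative; shapes shown as comments):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 64])

# Default MaxPool2D: 2x2 window, stride 2, nothing to learn.
pooled = tf.keras.layers.MaxPool2D()(x)
print(pooled.shape)  # (1, 14, 14, 64): height and width are halved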
import tensorflow as tf
from functools import partial

# Convenience wrapper: 3x3 kernels, zero padding ("same"), ReLU, He initialization.
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same",
                        activation="relu", kernel_initializer="he_normal")

model = tf.keras.Sequential([
    # 28x28 grayscale input; a larger 7x7 kernel for the first layer.
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),   # 28x28 -> 14x14
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),   # 14x14 -> 7x7
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    tf.keras.layers.MaxPool2D(),   # 7x7 -> 3x3
    tf.keras.layers.Flatten(),     # 3x3x256 -> 2304
    tf.keras.layers.Dense(units=128, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),  # regularization
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax"),  # 10 classes
])
model.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d_5 (Conv2D)               │ (None, 28, 28, 64)     │         3,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_3 (MaxPooling2D)  │ (None, 14, 14, 64)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_6 (Conv2D)               │ (None, 14, 14, 128)    │        73,856 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_7 (Conv2D)               │ (None, 14, 14, 128)    │       147,584 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_4 (MaxPooling2D)  │ (None, 7, 7, 128)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_8 (Conv2D)               │ (None, 7, 7, 256)      │       295,168 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_9 (Conv2D)               │ (None, 7, 7, 256)      │       590,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_5 (MaxPooling2D)  │ (None, 3, 3, 256)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten_1 (Flatten)             │ (None, 2304)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 128)            │       295,040 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 64)             │         8,256 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,413,834 (5.39 MB)
Trainable params: 1,413,834 (5.39 MB)
Non-trainable params: 0 (0.00 B)
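To put the model to use, one would compile and train it. A minimal sketch, assuming Fashion-MNIST as the dataset (its 28 x 28 x 1 inputs and 10 classes match the architecture above; the optimizer and epoch count are illustrative choices):

# Load Fashion-MNIST (28x28 grayscale, 10 classes); add a channel axis and rescale.
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train = X_train[..., None] / 255.0
X_test = X_test[..., None] / 255.0

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.1)
model.evaluate(X_test, y_test)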
AlexNet: Krizhevsky, Sutskever, and Hinton (2012)
VGGNet: Simonyan and Zisserman (2015)
As you might expect, the number of layers and the number of filters per layer are hyperparameters, optimized through hyperparameter tuning.
When integrating CNNs into your projects, consider exploring the following topics:
Feature Attribution: Various techniques are available to visualize what the network has learned. For example, in the context of self-driving cars, it is crucial to ensure that the network focuses on relevant features, avoiding distractions.
Transfer Learning: This approach enables the reuse of weights from pre-trained networks, which accelerates the learning process, reduces computational demands, and facilitates network training even with a limited number of examples.
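For instance, a minimal transfer-learning sketch with Keras (the choice of MobileNetV2, the input size, and the head layers are illustrative assumptions):

import tensorflow as tf

# Reuse convolutional weights pre-trained on ImageNet; drop the classifier head.
base = tf.keras.applications.MobileNetV2(weights="imagenet",
                                         include_top=False,
                                         input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained weights

# Add a small task-specific head on top (sizes are illustrative).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])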
Understanding Deep Learning (Prince 2023) is a recently published textbook focused on the foundational concepts of deep learning.
It begins with fundamental principles and extends to contemporary topics such as transformers, diffusion models, graph neural networks, autoencoders, adversarial networks, and reinforcement learning.
The textbook aims to help readers comprehend these concepts without delving excessively into theoretical details.
It includes sixty-eight Python notebook exercises.
The book follows a “read-first, pay-later” model.
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa