CSI 4106 - Fall 2024
Version: Nov 7, 2024 17:56
“An MLP with just one hidden layer can theoretically model even the most complex functions, provided it has enough neurons. But for complex problems, deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data.”
Consider a feed-forward network (FFN) and its model:
\[ h_{W,b}(X) = \phi_k(\ldots \phi_2(\phi_1(X)) \ldots) \]
where
\[ \phi_l(Z) = \sigma(W_l Z + b_l) \]
for \(l=1 \ldots k\). The number of parameters in such a network grows rapidly; each layer contributes:
\[ (\text{size of layer}_{l-1} + 1) \times \text{size of layer}_{l} \]
Connecting two consecutive 1,000-unit layers alone implies \((1{,}000 + 1) \times 1{,}000 = 1{,}001{,}000\) parameters!
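As a quick sanity check, here is a minimal sketch of this arithmetic (plain Python, illustrative only):

import tensorflow as tf

# Parameters of a dense layer: one weight per input, plus one bias, per output unit.
def dense_params(n_in, n_out):
    return (n_in + 1) * n_out

print(dense_params(1000, 1000))  # 1001000: over a million parameters for one layer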
Crucial pattern information is often local.
Convolutional layers reduce parameters significantly.
Unlike dense layers, neurons in a convolutional layer are not fully connected to the preceding layer.
Neurons connect only within their receptive fields (rectangular regions).
In contrast to classical image processing, where kernels are defined manually by the user, convolutional networks learn their kernels automatically during training.
Zero padding. In order for a layer to have the same height and width as the preceding one, the input grid can be padded with zeros.
Stride. It is possible to connect a larger layer \((l-1)\) to a smaller one \((l)\) by sliding the window more than one unit at a time. The step size is called the stride: \(s_h\) vertically and \(s_w\) horizontally.
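In Keras, padding and stride are arguments of the convolutional layer. A minimal sketch of their effect on output shape (the input tensor below is illustrative):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 1])  # one 28x28 single-channel image

# padding="same": zero padding preserves the 28x28 spatial size (stride 1).
same = tf.keras.layers.Conv2D(32, kernel_size=3, padding="same")(x)
print(same.shape)    # (1, 28, 28, 32)

# padding="valid": no padding; the output shrinks to 26x26.
valid = tf.keras.layers.Conv2D(32, kernel_size=3, padding="valid")(x)
print(valid.shape)   # (1, 26, 26, 32)

# strides=2: the window moves two units at a time, halving height and width.
strided = tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", strides=2)(x)
print(strided.shape) # (1, 14, 14, 32)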
A window of size \(f_h \times f_w\) is moved over the output of layer \(l-1\), referred to as the input feature map, position by position.
For each location, the element-wise product of the extracted patch and a matrix of the same size, known as a convolution kernel or filter, is computed. The sum of the values in the resulting matrix constitutes the output for that location.
\[ z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k} \]
where \(i' = i \times s_h + u\) and \(j' = j \times s_w + v\).
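To make the indexing concrete, here is a minimal NumPy sketch of this equation (a naive loop for illustration only; real frameworks use highly optimized implementations):

import numpy as np

def conv_output(x, w, b, s_h=1, s_w=1):
    """Naive 2D convolution following the equation above.
    x: input feature maps, shape (height, width, f_n')
    w: kernels, shape (f_h, f_w, f_n', f_n)
    b: biases, shape (f_n,)
    """
    f_h, f_w, _, f_n = w.shape
    out_h = (x.shape[0] - f_h) // s_h + 1
    out_w = (x.shape[1] - f_w) // s_w + 1
    z = np.zeros((out_h, out_w, f_n))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*s_h:i*s_h+f_h, j*s_w:j*s_w+f_w, :]
            for k in range(f_n):
                # Triple sum over u, v, k' collapsed into one element-wise product.
                z[i, j, k] = b[k] + np.sum(patch * w[:, :, :, k])
    return z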
“Thus, a layer full of neurons using the same filter outputs a feature map.”
“Of course, you do not have to define the filters manually: instead, during training the convolutional layer will automatically learn the most useful filters for its task.”
“(…) and the layers above will learn to combine them into more complex patterns.”
“The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model.”
A pooling layer exhibits similarities to a convolutional layer.
Unlike convolutional layers, however, pooling layers have no weights.
This subsampling process leads to a reduction in network size; each window of dimensions \(f_h \times f_w\) is condensed to a single value, typically the maximum or mean of that window.
According to Géron (2019), a max pooling layer provides a degree of invariance to small translations (§ 14).
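A minimal sketch of max pooling in Keras (the input tensor is illustrative; shapes shown as comments):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 64])

# Default MaxPool2D: 2x2 window, stride 2, nothing to learn.
pooled = tf.keras.layers.MaxPool2D()(x)
print(pooled.shape)  # (1, 14, 14, 64): height and width are halved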
import tensorflow as tf
from functools import partial

# Convenience wrapper: 3x3 kernels, zero padding ("same"), ReLU, He initialization.
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same",
                        activation="relu", kernel_initializer="he_normal")

model = tf.keras.Sequential([
    # 28x28 grayscale input; a larger 7x7 kernel for the first layer.
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),   # 28x28 -> 14x14
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),   # 14x14 -> 7x7
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    tf.keras.layers.MaxPool2D(),   # 7x7 -> 3x3
    tf.keras.layers.Flatten(),     # 3x3x256 -> 2304
    tf.keras.layers.Dense(units=128, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),  # regularization
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax"),  # 10 classes
])
model.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d_5 (Conv2D)               │ (None, 28, 28, 64)     │         3,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_3 (MaxPooling2D)  │ (None, 14, 14, 64)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_6 (Conv2D)               │ (None, 14, 14, 128)    │        73,856 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_7 (Conv2D)               │ (None, 14, 14, 128)    │       147,584 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_4 (MaxPooling2D)  │ (None, 7, 7, 128)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_8 (Conv2D)               │ (None, 7, 7, 256)      │       295,168 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_9 (Conv2D)               │ (None, 7, 7, 256)      │       590,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_5 (MaxPooling2D)  │ (None, 3, 3, 256)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten_1 (Flatten)             │ (None, 2304)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 128)            │       295,040 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 64)             │         8,256 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,413,834 (5.39 MB)
Trainable params: 1,413,834 (5.39 MB)
Non-trainable params: 0 (0.00 B)
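To put the model to use, one would compile and train it. A minimal sketch, assuming Fashion-MNIST as the dataset (its 28 x 28 x 1 inputs and 10 classes match the architecture above; the optimizer and epoch count are illustrative choices):

# Load Fashion-MNIST (28x28 grayscale, 10 classes); add a channel axis and rescale.
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train = X_train[..., None] / 255.0
X_test = X_test[..., None] / 255.0

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.1)
model.evaluate(X_test, y_test)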
AlexNet: Krizhevsky, Sutskever, and Hinton (2012)
VGGNet: Simonyan and Zisserman (2015)
As you might expect, the number of layers and the number of filters per layer are hyperparameters, optimized through hyperparameter tuning.
When integrating CNNs into your projects, consider exploring the following topics:
Feature Attribution: Various techniques are available to visualize what the network has learned. For example, in the context of self-driving cars, it is crucial to ensure that the network focuses on relevant features, avoiding distractions.
Transfer Learning: This approach enables the reuse of weights from pre-trained networks, which accelerates the learning process, reduces computational demands, and facilitates network training even with a limited number of examples.
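For instance, a minimal transfer-learning sketch with Keras (the choice of MobileNetV2, the input size, and the head layers are illustrative assumptions):

import tensorflow as tf

# Reuse convolutional weights pre-trained on ImageNet; drop the classifier head.
base = tf.keras.applications.MobileNetV2(weights="imagenet",
                                         include_top=False,
                                         input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained weights

# Add a small task-specific head on top (sizes are illustrative).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])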
Understanding Deep Learning (Prince 2023) is a recently published textbook focused on the foundational concepts of deep learning.
It begins with fundamental principles and extends to contemporary topics such as transformers, diffusion models, graph neural networks, autoencoders, adversarial networks, and reinforcement learning.
The textbook aims to help readers comprehend these concepts without delving excessively into theoretical details.
It includes sixty-eight Python notebook exercises.
The book follows a “read-first, pay-later” model.
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa