CSI 4106 - Fall 2024
Version: Sep 25, 2024 19:09
In this lecture, we will cover the foundational concepts of linear regression and gradient descent.
You will gain a deeper understanding of these essential machine learning techniques, enabling you to apply them effectively in your work.
Linear regression is introduced to conveniently present a well-known training algorithm, gradient descent. Additionally, it serves as a foundation for introducing logistic regression, a classification algorithm, which in turn facilitates the discussion of artificial neural networks.
Can you think of examples of regression tasks?
Focus on applications that could run on a mobile device.
Problem: find values for all the model parameters so that the model “best fits” the training data.
\[ \sqrt{\frac{1}{N}\sum_{i=1}^{N} [h(x_i) - y_i]^2} \]
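As a minimal illustration (a sketch; the arrays y_true and y_pred below are made-up values, not from the slides), this root mean square error can be computed with NumPy:

import numpy as np

def rmse(y_pred, y_true):
    # Square the residuals h(x_i) - y_i, average them, then take the square root
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up predictions and targets, for illustration only
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(rmse(y_pred, y_true))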
User: Explain the code on the previous slide line by line.
AI: Certainly! Here is a line-by-line explanation of the provided code:
1. The first line creates X, an array containing 100 rows and 1 column of random numbers. The np.random.rand(100, 1) function generates uniformly distributed random numbers between 0 and 1. Multiplying by 6 scales these numbers to the range [0, 6), and subtracting 4 shifts the range to [-4, 2).
2. The next line creates y based on a quadratic equation with some added noise. Specifically: X ** 2 computes the square of each element in X; - 4 * X subtracts four times each element in X; + 5 adds a constant term of 5; and + np.random.randn(100, 1) adds Gaussian noise with a mean of 0 and standard deviation of 1 to each element, making the data more realistic by introducing some randomness.
3. The import statement loads the LinearRegression class from the sklearn.linear_model module, which is part of the Scikit-Learn library. Scikit-Learn is widely used for machine learning in Python.
4. The next line creates an instance of the LinearRegression class and assigns it to the variable lin_reg. This object will be used to fit the linear regression model to the data.
5. The model is then fitted by calling fit(X, y). The fit method trains the model using the provided features X and target variable y.
6. X_new is then created with two values: -4 and 2. These values represent new data points for which we want to predict the target variable using the trained model.
7. Finally, the predictions y_pred are computed for the new data points in X_new. The predict method returns the predicted values based on the fitted model.
In summary, this script generates synthetic data, fits a linear regression model to it, and then uses the model to make predictions on new data points.
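For reference, here is a sketch of the code being described, reconstructed from the explanation above (the previous slide itself is not reproduced in these notes):

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 examples, one feature, uniformly drawn from [-4, 2)
X = 6 * np.random.rand(100, 1) - 4

# Quadratic relationship with additive Gaussian noise
y = X ** 2 - 4 * X + 5 + np.random.randn(100, 1)

# Fit a linear regression model to the data
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Predict the target for two new data points
X_new = np.array([[-4], [2]])
y_pred = lin_reg.predict(X_new)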
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
def save_fig(fig_id, tight_layout=True, fig_extension="pdf", resolution=300):
    path = os.path.join(fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
A typical learning algorithm comprises the following components:
Until some termination criterion is met\(^1\):
1: E.g., the value of the loss function no longer decreases, or the maximum number of iterations has been reached.
The graph of the derivative, \(f^{'}(t)\), is depicted in red.
The derivative indicates how changes in the input affect the output, \(f(t)\).
The magnitude of the derivative at \(t = -2\) is \(0\).
This point corresponds to the minimum of our function.
A positive derivative indicates that increasing the input variable will increase the output value.
Additionally, the magnitude of the derivative quantifies how rapidly the output changes.
A negative derivative indicates that increasing the input variable will decrease the output value.
Additionally, the magnitude of the derivative quantifies how rapidly the output changes.
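As a quick numerical check of this interpretation, here is a finite-difference approximation of the derivative of f(t) = t^2 + 4t + 7, the function used in the code below (the evaluation points and step size are arbitrary choices):

def f(t):
    return t**2 + 4*t + 7

h = 1e-6  # small step size for the finite-difference approximation

for t in [-4.0, -2.0, 1.0]:
    slope = (f(t + h) - f(t)) / h
    # Approximately 2t + 4: negative at t = -4, zero at t = -2, positive at t = 1
    print(t, round(slope, 3))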
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt
# Define the variable and function
t = sp.symbols('t')
f = t**2 + 4*t + 7
# Compute the derivative
f_prime = sp.diff(f, t)
# Lambdify the functions for numerical plotting
f_func = sp.lambdify(t, f, "numpy")
f_prime_func = sp.lambdify(t, f_prime, "numpy")
# Generate t values for plotting
t_vals = np.linspace(-5, 2, 400)
# Get y values for the function and its derivative
f_vals = f_func(t_vals)
f_prime_vals = f_prime_func(t_vals)
# Plot the function and its derivative
plt.plot(t_vals, f_vals, label=r'$f(t) = t^2 + 4t + 7$', color='blue')
plt.plot(t_vals, f_prime_vals, label=r"$f'(t) = 2t + 4$", color='red')
# Shade the area under the derivative curve where the derivative is positive
plt.fill_between(t_vals, f_prime_vals, where=(f_prime_vals > 0), color='red', alpha=0.3)
# Add labels and legend
plt.axhline(0, color='black',linewidth=1)
plt.axvline(0, color='black',linewidth=1)
plt.title('Function and Derivative')
plt.xlabel('t')
plt.ylabel('y')
plt.legend()
# Show the plot
plt.grid(True)
plt.show()
Given
\[ J(\theta_0, \theta_1) = \frac{1}{N}\sum\limits_{i=1}^{N} [h(x_i) - y_i]^2 = \frac{1}{N}\sum\limits_{i=1}^{N} [\theta_0 + \theta_1 x_i - y_i]^2 \]
We have
\[ \frac {\partial}{\partial \theta_0}J(\theta_0, \theta_1) = \frac{2}{N} \sum\limits_{i=1}^{N} (\theta_0 + \theta_1 x_i - y_{i}) \]
and
\[ \frac {\partial}{\partial \theta_1}J(\theta_0, \theta_1) = \frac{2}{N} \sum\limits_{i=1}^{N} x_{i} \left(\theta_0 + \theta_1 x_i - y_{i}\right) \]
from IPython.display import Math, display
from sympy import *
# Define the symbols
theta_0, theta_1, x_i, y_i = symbols('theta_0 theta_1 x_i y_i')
# Define the hypothesis function:
h = theta_0 + theta_1 * x_i
print("Hypothesis function:")
display(Math('h(x) = ' + latex(h)))
Hypothesis function:
\(\displaystyle h(x) = \theta_{0} + \theta_{1} x_{i}\)
# Define the loss function J (sum of squared errors divided by N);
# x_i is used as the summation index, consistent with the displayed results below
N = symbols('N')
J = Sum((h - y_i)**2, (x_i, 1, N)) / N

# Calculate the partial derivative with respect to theta_0
partial_derivative_theta_0 = diff(J, theta_0)
print("Partial derivative with respect to theta_0:")
display(Math(latex(partial_derivative_theta_0)))
Partial derivative with respect to theta_0:
\(\displaystyle \frac{\sum_{x_{i}=1}^{N} \left(2 \theta_{0} + 2 \theta_{1} x_{i} - 2 y_{i}\right)}{N}\)
# Calculate the partial derivative with respect to theta_1
partial_derivative_theta_1 = diff(J, theta_1)
print("\nPartial derivative with respect to theta_1:")
display(Math(latex(partial_derivative_theta_1)))
Partial derivative with respect to theta_1:
\(\displaystyle \frac{\sum_{x_{i}=1}^{N} 2 x_{i} \left(\theta_{0} + \theta_{1} x_{i} - y_{i}\right)}{N}\)
\[ h (x_i) = \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \theta_3 x_i^{(3)} + \cdots + \theta_D x_i^{(D)} \]
\[ \begin{align*} x_i^{(j)} &= \text{value of the feature } j \text{ in the } i \text{th example} \\ D &= \text{the number of features} \end{align*} \]
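A minimal sketch of this hypothesis in NumPy, under the assumption that the examples are stored in a matrix X with one row per example and that a column of ones is prepended so that \(\theta_0\) multiplies \(x_i^{(0)} = 1\) (the variable names and random data are illustrative only):

import numpy as np

N, D = 100, 3                    # number of examples and number of features (arbitrary)
X = np.random.rand(N, D)         # one row per example, one column per feature
theta = np.random.rand(D + 1)    # parameters theta_0, theta_1, ..., theta_D

X_b = np.c_[np.ones((N, 1)), X]  # prepend x_i^(0) = 1 for the intercept theta_0
y_hat = X_b @ theta              # h(x_i) for every example at once, shape (N,)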
The new loss function is
\[ J(\theta_0, \theta_1,\ldots,\theta_D) = \dfrac {1}{N} \displaystyle \sum _{i=1}^N \left (h(x_{i}) - y_i \right)^2 \]
Its partial derivative:
\[ \frac {\partial}{\partial \theta_j}J(\theta) = \frac{2}{N} \sum\limits_{i=1}^N x_i^{(j)} \left( \theta \cdot x_i - y_i \right) \]
where \(\theta\) and \(x_i\) are vectors, with \(x_i^{(0)} = 1\), and \(\theta \cdot x_i\) is a dot product!
The vector containing the partial derivatives of \(J\) (with respect to \(\theta_j\), for \(j \in \{0, 1, \ldots, D\}\)) is called the gradient vector.
\[ \nabla_\theta J(\theta) = \begin{pmatrix} \frac {\partial}{\partial \theta_0}J(\theta) \\ \frac {\partial}{\partial \theta_1}J(\theta) \\ \vdots \\ \frac {\partial}{\partial \theta_D}J(\theta)\\ \end{pmatrix} \]
\[ \theta' = \theta - \alpha \nabla_\theta J(\theta) \]
The gradient descent algorithm becomes:
Repeat until convergence:
\[ \begin{aligned} \{ & \\ \theta_j := & \theta_j - \alpha \frac {\partial}{\partial \theta_j}J(\theta_0, \theta_1, \ldots, \theta_D) \\ &\text{for } j \in [0, \ldots, D] \textbf{ (update simultaneously)} \\ \} & \end{aligned} \]
Repeat until convergence:
\[ \begin{aligned} \; \{ & \\ \; & \theta_0 := \theta_0 - \alpha \frac{2}{N} \sum\limits_{i=1}^{N} x_i^{(0)}(h(x_i) - y_i) \\ \; & \theta_1 := \theta_1 - \alpha \frac{2}{N} \sum\limits_{i=1}^{N} x_i^{(1)}(h(x_i) - y_i) \\ \; & \theta_2 := \theta_2 - \alpha \frac{2}{N} \sum\limits_{i=1}^{N} x_i^{(2)}(h(x_i) - y_i) \\ & \cdots \\ \} & \end{aligned} \]
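Putting these update equations together, here is a minimal NumPy sketch of batch gradient descent for linear regression (the synthetic data, learning rate, and number of iterations are assumptions for illustration, not values from the slides):

import numpy as np

# Synthetic data with a known linear relationship (illustrative only)
N = 100
X = 2 * np.random.rand(N, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(N)

X_b = np.c_[np.ones((N, 1)), X]  # prepend x_i^(0) = 1
theta = np.zeros(X_b.shape[1])   # initial parameter values

alpha = 0.1                      # learning rate
n_iterations = 1000              # termination criterion: fixed number of iterations

for iteration in range(n_iterations):
    # Gradient vector (2/N) X_b^T (X_b theta - y); all theta_j are updated simultaneously
    gradients = (2 / N) * X_b.T @ (X_b @ theta - y)
    theta = theta - alpha * gradients

print(theta)  # should end up close to [4, 3]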
What were our assumptions?
A function is convex if for any pair of points on the graph of the function, the line connecting these two points lies above or on the graph.
For functions that are not convex, gradient descent may converge to a local minimum rather than the global minimum.
The loss functions generally used with linear regression, logistic regression, and support vector machines (SVMs) are convex, but those used with artificial neural networks are not.
The stochastic gradient descent algorithm randomly selects one training instance to calculate its gradient.
epochs = 10
for epoch in range(epochs):
    for i in range(N):
        # Randomly select one training example (N is the number of examples)
        selection = np.random.randint(N)
        # Calculate the gradient using selection
        # Update the weights
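A runnable sketch of this idea for linear regression, filling in the two placeholder comments above (the synthetic data and the fixed learning rate are assumptions; in practice the learning rate is often decreased over time):

import numpy as np

# Synthetic data with a known linear relationship (illustrative only)
N = 100
X = 2 * np.random.rand(N, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(N)

X_b = np.c_[np.ones((N, 1)), X]  # prepend x_i^(0) = 1
theta = np.zeros(X_b.shape[1])
alpha = 0.05                     # learning rate

epochs = 10
for epoch in range(epochs):
    for i in range(N):
        selection = np.random.randint(N)
        xi, yi = X_b[selection], y[selection]
        # Calculate the gradient using the single selected example
        gradient = 2 * xi * (xi @ theta - yi)
        # Update the weights
        theta = theta - alpha * gradient

print(theta)  # should end up roughly close to [4, 3]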
Batch gradient descent is inherently slow and impractical for large datasets requiring out-of-core support, though it is capable of handling a substantial number of features.
Stochastic gradient descent is fast and well-suited for processing a large volume of examples efficiently.
Mini-batch gradient descent combines the benefits of both batch and stochastic methods; it is fast, capable of managing large datasets, and leverages hardware acceleration, particularly with GPUs.
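For completeness, a sketch of mini-batch gradient descent under the same illustrative assumptions as the earlier sketches (the batch size of 16 is arbitrary):

import numpy as np

# Synthetic data with a known linear relationship (illustrative only)
N = 100
X = 2 * np.random.rand(N, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(N)

X_b = np.c_[np.ones((N, 1)), X]
theta = np.zeros(X_b.shape[1])
alpha = 0.1
batch_size = 16
epochs = 50

for epoch in range(epochs):
    indices = np.random.permutation(N)              # reshuffle the examples each epoch
    for start in range(0, N, batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # Gradient estimated from a small batch of examples
        gradients = (2 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - alpha * gradients

print(theta)  # should end up close to [4, 3]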
We will briefly revisit the subject when discussing deep artificial neural networks, for which specialized optimization algorithms exist.
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa