Linear models: logistic regression

CSI 4106 - Fall 2024

Marcel Turcotte

Version: Sep 29, 2024 13:24

Preamble

Quote of the Day

Learning Objectives

  • Differentiate between binary classification and multi-class classification paradigms.
  • Describe a methodology for converting multi-class classification problems into binary classification tasks.
  • Articulate the concept of decision boundary and its significance in classification tasks.
  • Implement a logistic regression algorithm, focusing on its application in classification problems.

Classification tasks

Definitions

  • Binary classification is a supervised learning task where the objective is to categorize instances (examples) into one of two discrete classes.

  • A multi-class classification task is a type of supervised learning problem where the objective is to categorize instances into one of three or more discrete classes.

Binary classification

  • Some machine learning algorithms are specifically designed to solve binary classification problems.
    • Logistic regression and support vector machines (SVMs) are such examples.

Multi-class classification

  • Any multi-class classification problem can be transformed into a binary classification problem.
  • One-vs-All (OvA)
    • A separate binary classifier is trained for each class.
    • For each classifier, one class is treated as the positive class, and all other classes are treated as the negative class.
    • The final assignment of a class label is made based on the classifier that outputs the highest confidence score for a given input.
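
The sketch below summarizes this decision rule; it assumes a list of already-trained binary classifiers, one per class, each exposing a confidence score such as scikit-learn's decision_function. A complete, runnable version on the Iris dataset appears later in this lecture.

import numpy as np

def ova_predict(classifiers, X):
    # One column of confidence scores per class
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    # The predicted label is the index of the classifier with the highest score
    return np.argmax(scores, axis=1)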

Discussion

To introduce the concept of decision boundaries, let’s reexamine the Iris dataset.

Loading the Iris dataset

from sklearn.datasets import load_iris

# Load the Iris dataset

iris = load_iris()

import pandas as pd

# Create a DataFrame

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

Pairwise Scatter Plots of Iris Features

import seaborn as sns
import matplotlib.pyplot as plt

# Using string labels to ease visualization

df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Display all pairwise scatter plots

sns.pairplot(df, hue='species', markers=["o", "s", "D"])
plt.suptitle("Pairwise Scatter Plots of Iris Features", y=1.02)
plt.show()

Pairwise Scatter Plots of Iris Features

One-vs-All (OvA) on the Iris dataset

import numpy as np

# Transform the target variable into binary classification
# 'setosa' (class 0) vs. 'not setosa' (classes 1 and 2)

y_binary = np.where(iris.target == 0, 0, 1)

# Create a DataFrame for easier plotting with Seaborn

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_setosa'] = y_binary

print(y_binary)

One-vs-All (OvA) on the Iris dataset

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]

One-vs-All (OvA) on the Iris dataset

# Using string labels for visualization

df['is_setosa'] = df['is_setosa'].map({0: 'setosa', 1: 'not_setosa'})

# Pairwise scatter plots

sns.pairplot(df, hue='is_setosa', markers=["o", "s"])
plt.suptitle('Pairwise Scatter Plots of Iris Attributes \n(Setosa vs. Not Setosa)', y=1.02)
plt.show()

One-vs-All (OvA) on the Iris dataset

Setosa vs not setosa

  • Clearly, we have simplified the problem.
  • In the majority of the scatter plots, the setosa examples are clustered together.

Decision boundaries

Definition

A decision boundary is a boundary that partitions the underlying feature space into regions corresponding to different class labels.

Decision boundary

Considering two attributes, say petal length and sepal width, the decision boundary can be a line.

Decision boundary

Considering two attributes, say petal length and sepal width, the decision boundary can be a line.

Definition

We say that the data is linearly separable when two classes of data can be perfectly separated by a single linear boundary, such as a line in two-dimensional space or a hyperplane in higher dimensions.
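
As a quick sanity check of this definition, the sketch below fits a logistic regression on two Iris attributes for the setosa vs. not-setosa task; the choice of attributes and the default model settings are illustrative assumptions. A training accuracy of 1.0 would indicate that a single line separates the two classes in this plane.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Two attributes: petal length (column 2) and sepal width (column 1)
X_two = iris.data[:, [2, 1]]

# Binary target: setosa (1) vs. not setosa (0)
y_two = (iris.target == 0).astype(int)

clf = LogisticRegression()
clf.fit(X_two, y_two)

print("Training accuracy:", clf.score(X_two, y_two))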

Simple decision boundary

(a) training data, (b) quadratic curve, and (c) linear function.

Attribution: Geurts, P., Irrthum, A. & Wehenkel, L. Supervised learning with decision tree-based methods in computational and systems biology. Mol Biosyst 5, 1593–1605 (2009).

Complex decision boundary

Decision trees are capable of generating irregular and non-linear decision boundaries.

Attribution: ibidem.

Decision boundary

Digression

Decision boundary

  • With 2 attributes, a linear decision boundary is a line.
  • With 2 attributes, a non-linear decision boundary is a curve.
  • With 3 attributes, a linear decision boundary is a plane.
  • With 3 attributes, a non-linear decision boundary is a curved surface.
  • With \(\gt\) 3 attributes, a linear decision boundary is a hyperplane.
  • With \(\gt\) 3 attributes, a non-linear decision boundary is a hypersurface.
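
In each of these cases, the linear boundary is the set of points where the model's weighted sum equals zero, using the same notation as in the logistic regression slides below:

\[ \theta_0 + \theta_1 x^{(1)} + \theta_2 x^{(2)} + \ldots + \theta_D x^{(D)} = 0 \]

With \(D = 2\) this equation describes a line, with \(D = 3\) a plane, and with \(D \gt 3\) a hyperplane.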

Definition (revised)

A decision boundary is a hypersurface that partitions the underlying feature space into regions corresponding to different class labels.

Logistic regression

Logistic (Logit) Regression

  • Despite its name, logistic regression serves as a classification algorithm rather than a regression technique.

  • The labels in logistic regression are binary values, denoted as \(y_i \in \{0,1\}\), making it a binary classification task.

  • The primary objective of logistic regression is to determine the probability that a given instance \(x_i\) belongs to the positive class, i.e., \(y_i = 1\).

Logistic regression

Consider one attribute, say petal length, and the value of the target (0 or 1).

Fitting a linear regression

\(\ldots\) is not the answer

The resulting line extends infinitely in both directions, but our goal is to constrain values between 0 and 1. Here, 1 indicates a high probability that \(x_i\) belongs to the positive class, while a value near 0 indicates a low probability.

Logistic function

In mathematics, the standard logistic function maps a real-valued input from \(\mathbb{R}\) to the open interval \((0,1)\). The function is defined as:

\[ \sigma(t) = \frac{1}{1+e^{-t}} \]
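
A few numerical values make this behaviour concrete; the short sketch below (assuming NumPy) evaluates \(\sigma\) at 0 and at increasingly large positive and negative inputs.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# sigma(0) = 0.5; large positive inputs approach 1, large negative inputs approach 0
print(sigmoid(np.array([-6, -2, 0, 2, 6])).round(3))
# Approximately: [0.002 0.119 0.5 0.881 0.998]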

Logistic regression (intuition)

  • When the distance to the decision boundary is zero, uncertainty is high, making a probability of 0.5 appropriate.
  • As we move away from the decision boundary, confidence increases, warranting higher or lower probabilities accordingly.

Logistic function

An S-shaped curve, such as the standard logistic function (aka sigmoid), is termed a squashing function because it maps a wide input domain to a constrained output range.

\[ \sigma(t) = \frac{1}{1+e^{-t}} \]

Logistic (Logit) Regression

  • Analogous to linear regression, logistic regression computes a weighted sum of the input features, expressed as: \[ \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)} \]

  • However, using the sigmoid function limits its output to the range \((0,1)\): \[ \sigma(\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

Logistic regression

The Logistic Regression model, in its vectorized form, is defined as:

\[ h_\theta(x_i) = \sigma(\theta^T x_i) = \frac{1}{1+e^{- \theta^T x_i}} \]
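
As a minimal illustration of this vectorized form (assuming the bias term is handled by prepending a 1 to each input vector, and using made-up parameter values), the sketch below computes \(h_\theta(x_i)\) with NumPy.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical parameters: theta_0 (intercept), theta_1, theta_2
theta = np.array([-4.0, 1.5, 0.5])

# One example, with a leading 1 so that theta_0 acts as the intercept
x_i = np.array([1.0, 2.0, 1.0])

# h_theta(x_i) = sigma(theta^T x_i), a probability in (0, 1)
print(sigmoid(theta @ x_i))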

Logistic regression (two attributes)

\[ h_\theta(x_i) = \sigma(\theta^T x_i) \]

  • In logistic regression, the model's confidence in its prediction increases as an example's distance from the decision boundary increases.
  • This principle holds for both positive and negative classes.
  • An example lying on the decision boundary has a 50% probability of belonging to either class.

Logistic regression

  • The Logistic Regression model, in its vectorized form, is defined as:

    \[ h_\theta(x_i) = \sigma(\theta^T x_i) = \frac{1}{1+e^{- \theta^T x_i}} \]

  • Predictions are made as follows:

    • \(y_i = 0\), if \(h_\theta(x_i) < 0.5\)
    • \(y_i = 1\), if \(h_\theta(x_i) \geq 0.5\)
  • The values of \(\theta\) are learned using gradient descent.
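
The slides do not spell out the update rule, but a minimal gradient-descent sketch for binary logistic regression, assuming the usual cross-entropy loss, a fixed learning rate, and a bias column prepended to X, looks as follows.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # X: (N, D) attribute matrix, y: (N,) vector of 0/1 labels
    N, D = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend a bias column
    theta = np.zeros(D + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ theta)            # predicted probabilities
        gradient = Xb.T @ (p - y) / N      # gradient of the cross-entropy loss
        theta -= lr * gradient             # gradient-descent update
    return theta

# Hypothetical usage on a tiny, made-up dataset
X_toy = np.array([[0.5], [1.0], [3.0], [3.5]])
y_toy = np.array([0, 0, 1, 1])
print(fit_logistic_regression(X_toy, y_toy))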

One-vs-All

One-vs-All classifier (complete)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size=0.2, random_state=42)

One-vs-All classifier (complete)

# Train a One-vs-All classifier for each class

from sklearn.linear_model import LogisticRegression

classifiers = []
for i in range(3):
    clf = LogisticRegression()
    clf.fit(X_train, y_train[:, i])
    classifiers.append(clf)

One-vs-All classifier (complete)

# Predict on a new sample
new_sample = X_test[0].reshape(1, -1)
confidences = [clf.decision_function(new_sample) for clf in classifiers]

# Final assignment
final_class = np.argmax(confidences)

# Printing the result
print(f"Final class assigned: {iris.target_names[final_class]}")
print(f"True class: {iris.target_names[np.argmax(y_test[0])]}")
Final class assigned: versicolor
True class: versicolor

label_binarize

from sklearn.preprocessing import label_binarize

# Original class labels
y_train = np.array([0, 1, 2, 0, 1, 2, 1, 0])

# Binarize the labels
y_train_binarized = label_binarize(y_train, classes=[0, 1, 2])

# Display the binarized labels
print("Binarized labels:\n", y_train_binarized)

# Convert binarized labels back to the original numerical values
original_labels = [np.argmax(b) for b in y_train_binarized]
print("Original labels:\n", original_labels)
Binarized labels:
 [[1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]
Original labels:
 [0, 1, 2, 0, 1, 2, 1, 0]

Digits example

UCI ML hand-written digits datasets

Loading the dataset

from sklearn.datasets import load_digits

digits = load_digits()

What is the type of digits.data?

type(digits.data)
numpy.ndarray

UCI ML hand-written digits datasets

How many examples (N) and how many attributes (D)?

digits.data.shape
(1797, 64)

Assigning N and D

N, D = digits.data.shape

Does target have the same number of entries (examples) as data?

digits.target.shape
(1797,)

UCI ML hand-written digits datasets

What are the width and height of those images?

digits.images.shape
(1797, 8, 8)

Assigning width and height

_, width, height = digits.images.shape

UCI ML hand-written digits datasets

Assigning X and y

X = digits.data
y = digits.target

UCI ML hand-written digits datasets

X[0] is a vector of size width * height = D (\(8 \times 8 = 64\)).

X[0]
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

It corresponds to an \(8 \times 8\) image.

X[0].reshape(width, height)
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

UCI ML hand-written digits datasets

Plot the first n=5 examples

import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10,2))
n = 5

for index, (image, label) in enumerate(zip(X[0:n], y[0:n])):
    plt.subplot(1, n, index + 1)
    plt.imshow(np.reshape(image, (width, height)), cmap=plt.cm.gray)
    plt.title(f'y = {label}')

UCI ML hand-written digits datasets

  • In our dataset, each \(x_i\) is an attribute vector of size \(D = 64\).

  • This vector is formed by concatenating the rows of an \(8 \times 8\) image.

  • The reshape function is employed to convert this 64-dimensional vector back into its original \(8 \times 8\) image format.

UCI ML hand-written digits datasets

  • We will train 10 classifiers, each corresponding to a specific digit in a One-vs-All (OvA) approach.

  • Each classifier will determine the optimal values of \(\theta_j\) (associated with the pixel features), allowing it to distinguish one digit from all other digits.

UCI ML hand-written digits datasets

Preparing for our machine learning experiment

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

UCI ML hand-written digits datasets

Optimization algorithms generally work best when the attributes have similar ranges.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

UCI ML hand-written digits datasets

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(multi_class='ovr')
clf = clf.fit(X_train, y_train)

UCI ML hand-written digits datasets

Applying the classifier to our test set

from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.95      0.95      0.95        19
           2       1.00      1.00      1.00        18
           3       0.95      0.88      0.91        24
           4       1.00      0.94      0.97        17
           5       0.95      0.95      0.95        21
           6       0.93      1.00      0.97        14
           7       1.00      1.00      1.00        15
           8       0.94      1.00      0.97        16
           9       0.94      1.00      0.97        15

    accuracy                           0.97       180
   macro avg       0.97      0.97      0.97       180
weighted avg       0.97      0.97      0.97       180

Visualization

How many classes?

clf.classes_
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The coefficients and intercepts are in distinct arrays.

(clf.coef_.shape, clf.intercept_.shape)
((10, 64), (10,))

Intercepts are \(\theta_0\), whereas coefficients are \(\theta_j\), \(j \in [1,64]\).
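
To connect these arrays back to the model, the short sketch below recomputes the per-class scores by hand and checks that taking the argmax reproduces clf.predict; it assumes the clf, X_test, and imports defined on the previous slides.

# Weighted sums theta_0 + theta_1 x^(1) + ... for every test example and every class
scores = X_test @ clf.coef_.T + clf.intercept_   # shape: (number of test examples, 10)

# The predicted digit is the class with the highest score
manual_pred = np.argmax(scores, axis=1)

# This should agree with the library's own predictions
print(np.array_equal(manual_pred, clf.predict(X_test)))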

Visualization

clf.coef_[0].round(2).reshape(width, height)
array([[ 0.  , -0.14, -0.02,  0.29, -0.02, -0.73, -0.47, -0.05],
       [-0.  , -0.26, -0.08,  0.48,  0.54,  0.89,  0.03, -0.18],
       [-0.01,  0.31,  0.37, -0.14, -0.96,  0.84,  0.02, -0.14],
       [-0.05,  0.22,  0.09, -0.6 , -1.74,  0.08,  0.16, -0.05],
       [ 0.  ,  0.4 ,  0.5 , -0.65, -1.67,  0.  ,  0.13,  0.  ],
       [-0.13, -0.1 ,  0.85, -0.98, -0.74, -0.  ,  0.25,  0.02],
       [-0.08, -0.27,  0.45,  0.12,  0.28, -0.  , -0.42, -0.51],
       [ 0.02, -0.26, -0.43,  0.51, -0.58, -0.1 , -0.32, -0.28]])

Visualization

coef = clf.coef_
plt.imshow(coef[0].reshape(width,height))

Visualization

plt.figure(figsize=(10,5))

for index in range(len(clf.classes_)):
    plt.subplot(2, 5, index + 1)
    plt.title(f'y = {clf.classes_[index]}')
    plt.imshow(clf.coef_[index].reshape(width, height), 
               cmap=plt.cm.RdBu,
               interpolation='bilinear')

Prologue

References

Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.

Resources

Next lecture

  • Cross evaluation and performance measures

3D plot with points below and above the plane

from mpl_toolkits.mplot3d import Axes3D

# Function to generate points
def generate_points_above_below_plane(num_points=100):
    # Define the plane z = ax + by + c
    a, b, c = 1, 1, 0  # Plane coefficients

    # Generate random points
    x1 = np.random.uniform(-10, 10, num_points)
    x2 = np.random.uniform(-10, 10, num_points)

    y1 = np.random.uniform(-10, 10, num_points)
    y2 = np.random.uniform(-10, 10, num_points)

    # Points above the plane
    z_above = a * x1 + b * y1 + c + np.random.normal(20, 2, num_points)

    # Points below the plane
    z_below = a * x2 + b * y2 + c - np.random.normal(20, 2, num_points)

    # Stack the points into arrays
    points_above = np.vstack((x1, y1, z_above)).T
    points_below = np.vstack((x2, y2, z_below)).T

    return points_above, points_below

# Generate points
points_above, points_below = generate_points_above_below_plane()

# Visualization
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plot points above the plane
ax.scatter(points_above[:, 0], points_above[:, 1], points_above[:, 2], c='r', label='Above the plane')

# Plot points below the plane
ax.scatter(points_below[:, 0], points_below[:, 1], points_below[:, 2], c='b', label='Below the plane')

# Plot the plane itself for reference
xx, yy = np.meshgrid(range(-10, 11), range(-10, 11))
zz = 1 * xx + 1 * yy + 0
ax.plot_surface(xx, yy, zz, alpha=0.2, color='gray')

# Set labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

# Set title and legend
ax.set_title('3D Points Above and Below a Plane')
ax.legend()

# Show plot
plt.show()

Marcel Turcotte

[email protected]

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa