CSI 4106 - Fall 2024
Version: Sep 29, 2024 13:24
Binary classification is a supervised learning task where the objective is to categorize instances (examples) into one of two discrete classes.
A multi-class classification task is a type of supervised learning problem where the objective is to categorize instances into one of three or more discrete classes.
To introduce the concept of decision boundaries, let’s reexamine the Iris dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset into a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Using string labels to ease visualization
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
# Display all pairwise scatter plots
sns.pairplot(df, hue='species', markers=["o", "s", "D"])
plt.suptitle("Pairwise Scatter Plots of Iris Features", y=1.02)
plt.show()
import numpy as np
# Transform the target variable into binary classification
# 'setosa' (class 0) vs. 'not setosa' (classes 1 and 2)
y_binary = np.where(iris.target == 0, 0, 1)
# Create a DataFrame for easier plotting with Seaborn
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_setosa'] = y_binary
print(y_binary)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1]
A decision boundary is a boundary that partitions the underlying feature space into regions corresponding to different class labels.
Considering two attributes, say petal length and sepal width, the decision boundary can be a line.
We say that the data is linearly separable when two classes of data can be perfectly separated by a single linear boundary, such as a line in two-dimensional space or a hyperplane in higher dimensions.
(a) training data, (b) quadratic curve, and (c) linear function.
Attribution: Geurts, P., Irrthum, A. & Wehenkel, L. Supervised learning with decision tree-based methods in computational and systems biology. Mol Biosyst 5, 1593–1605 (2009).
Decision trees are capable of generating irregular and non-linear decision boundaries.
Attribution: ibidem.
A decision boundary is a hypersurface that partitions the underlying feature space into regions corresponding to different class labels.
Despite its name, logistic regression serves as a classification algorithm rather than a regression technique.
The labels in logistic regression are binary values, denoted as \(y_i \in \{0,1\}\), making it a binary classification task.
The primary objective of logistic regression is to determine the probability that a given instance \(x_i\) belongs to the positive class, i.e., \(y_i = 1\).
Consider one attribute, say petal length, and the value of the target (0 or 1).
The resulting line extends infinitely in both directions, but our goal is to constrain values between 0 and 1. Here, a value near 1 indicates a high probability that \(x_i\) belongs to the positive class, while a value near 0 indicates a low probability.
In mathematics, the standard logistic function maps a real-valued input from \(\mathbb{R}\) to the open interval \((0,1)\). The function is defined as:
\[ \sigma(t) = \frac{1}{1+e^{-t}} \]
An S-shaped curve, such as the standard logistic function (aka sigmoid), is termed a squashing function because it maps a wide input domain to a constrained output range.
\[ \sigma(t) = \frac{1}{1+e^{-t}} \]
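For intuition, a minimal NumPy sketch evaluating the sigmoid on a few inputs:
import numpy as np

def sigmoid(t):
    # Standard logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-t))

# Large negative inputs map close to 0, large positive inputs close to 1
print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# approximately [0.00005 0.269 0.5 0.731 0.99995]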
Analogous to linear regression, logistic regression computes a weighted sum of the input features, expressed as: \[ \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)} \]
However, using the sigmoid function limits its output to the range \((0,1)\): \[ \sigma(\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]
The Logistic Regression model, in its vectorized form, is defined as:
\[ h_\theta(x_i) = \sigma(\theta x_i) = \frac{1}{1+e^{- \theta x_i}} \]
\[ h_\theta(x_i) = \sigma(\theta x_i) \]
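As a toy illustration, with made-up (not learned) parameters \(\theta\), the model outputs the estimated probability that \(y_i = 1\):
import numpy as np

# Hypothetical parameters: theta_0 (intercept) followed by theta_1, theta_2
theta = np.array([-1.0, 0.5, 0.5])

# One instance, with a leading 1 for the intercept term
x_i = np.array([1.0, 2.0, 3.0])

# Weighted sum, then sigmoid: the estimated probability that y_i = 1
z = theta @ x_i
h = 1.0 / (1.0 + np.exp(-z))
print(h)  # approximately 0.82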
Predictions are made as follows:
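Using the conventional threshold of 0.5 on the estimated probability:
\[ \hat{y}_i = \begin{cases} 0 & \text{if } h_\theta(x_i) < 0.5 \\ 1 & \text{if } h_\theta(x_i) \geq 0.5 \end{cases} \]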
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Binarize the output (one indicator column per class)
y_bin = label_binarize(y, classes=[0, 1, 2])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size=0.2, random_state=42)

# Train one binary logistic regression classifier per class (One-vs-All)
classifiers = [LogisticRegression(max_iter=1000).fit(X_train, y_train[:, k]) for k in range(3)]

# Predict on a new sample
new_sample = X_test[0].reshape(1, -1)
confidences = [clf.decision_function(new_sample) for clf in classifiers]

# Final assignment: the class whose classifier is most confident
final_class = np.argmax(confidences)

# Printing the result
print(f"Final class assigned: {iris.target_names[final_class]}")
print(f"True class: {iris.target_names[np.argmax(y_test[0])]}")
Final class assigned: versicolor
True class: versicolor
label_binarize
import numpy as np
from sklearn.preprocessing import label_binarize
# Original class labels
y_train = np.array([0, 1, 2, 0, 1, 2, 1, 0])
# Binarize the labels
y_train_binarized = label_binarize(y_train, classes=[0, 1, 2])
# Assume y_train_binarized contains the binarized labels
print("Binarized labels:\n", y_train_binarized)
# Convert binarized labels back to the original numerical values
original_labels = [np.argmax(b) for b in y_train_binarized]
print("Original labels:\n", original_labels)
Binarized labels:
[[1 0 0]
[0 1 0]
[0 0 1]
[1 0 0]
[0 1 0]
[0 0 1]
[0 1 0]
[1 0 0]]
Original labels:
[0, 1, 2, 0, 1, 2, 1, 0]
Loading the dataset
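A minimal sketch using scikit-learn's built-in loader:
from sklearn.datasets import load_digits

# Load the 8 x 8 handwritten digits dataset
digits = load_digits()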
What is the type of digits.data?
How many examples (\(N\)) and how many attributes (\(D\))?
Assigning \(N\) and \(D\)
Does target have the same number of entries (examples) as data?
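One way to answer these questions, assuming digits was loaded as above:
# digits.data is a 2-dimensional NumPy array of shape (N, D)
print(type(digits.data))            # <class 'numpy.ndarray'>

# Assigning N (number of examples) and D (number of attributes)
N, D = digits.data.shape
print(N, D)                         # 1797 64

# target holds one label per example
print(len(digits.target) == N)      # True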
What are the width and height of those images?
Assigning width and height
Assigning X and y
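A possible way to make these assignments (digits.images stores each example as a 2D array):
# Width and height of the images
_, height, width = digits.images.shape
print(width, height)                # 8 8

# Attribute matrix and target vector
X, y = digits.data, digits.target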
X[0] is a vector of size width * height = D (\(8 \times 8 = 64\)).
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
It corresponds to an \(8 \times 8\) image (64 pixels).
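The array below can be obtained by reshaping the vector, for example:
X[0].reshape((8, 8))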
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
Plot the first n=5 examples
In our dataset, each \(x_i\) is an attribute vector of size \(D = 64\).
This vector is formed by concatenating the rows of an \(8 \times 8\) image.
The reshape function is employed to convert this 64-dimensional vector back into its original \(8 \times 8\) image format.
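Putting this together, one possible sketch for plotting the first examples (assuming X and y as assigned earlier):
import matplotlib.pyplot as plt

n = 5
fig, axes = plt.subplots(1, n, figsize=(10, 3))
for i, ax in enumerate(axes):
    # Convert the 64-dimensional vector back into an 8 x 8 image
    ax.imshow(X[i].reshape((8, 8)), cmap='gray_r')
    ax.set_title(f"Label: {y[i]}")
    ax.axis('off')
plt.show()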
We will train 10 classifiers, each corresponding to a specific digit in a One-vs-All (OvA) approach.
Each classifier will determine the optimal values of \(\theta_j\) (associated with the pixel features), allowing it to distinguish one digit from all other digits.
Preparing for our machine learning experiment
Optimization algorithms generally work best when the attributes have similar ranges.
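The training code itself is not shown here; a plausible sketch, in which the use of StandardScaler and these LogisticRegression settings are assumptions, is:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X, y = digits.data, digits.target

# Hold out 10% of the examples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Standardize the attributes so they have similar ranges
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a logistic regression model over the 10 digit classes
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)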
Applying the classifier to our test set
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 1.00 1.00 1.00 21
1 0.95 0.95 0.95 19
2 1.00 1.00 1.00 18
3 0.95 0.88 0.91 24
4 1.00 0.94 0.97 17
5 0.95 0.95 0.95 21
6 0.93 1.00 0.97 14
7 1.00 1.00 1.00 15
8 0.94 1.00 0.97 16
9 0.94 1.00 0.97 15
accuracy 0.97 180
macro avg 0.97 0.97 0.97 180
weighted avg 0.97 0.97 0.97 180
How many classes?
The coefficients and intercepts are stored in distinct arrays.
Intercepts are \(\theta_0\), whereas coefficients are \(\theta_j\), \(j \in [1, 64]\).
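A sketch of how these arrays can be inspected; reshaping one class's coefficients to the 8 x 8 pixel layout, as in the output below, is an assumption:
import numpy as np

print(clf.classes_)          # 10 classes: digits 0 through 9
print(clf.coef_.shape)       # (10, 64): one 64-dimensional coefficient vector per class
print(clf.intercept_.shape)  # (10,): one intercept (theta_0) per class

# One class's coefficients, arranged as an 8 x 8 image
print(np.round(clf.coef_[0].reshape(8, 8), 2))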
array([[ 0. , -0.14, -0.02, 0.29, -0.02, -0.73, -0.47, -0.05],
[-0. , -0.26, -0.08, 0.48, 0.54, 0.89, 0.03, -0.18],
[-0.01, 0.31, 0.37, -0.14, -0.96, 0.84, 0.02, -0.14],
[-0.05, 0.22, 0.09, -0.6 , -1.74, 0.08, 0.16, -0.05],
[ 0. , 0.4 , 0.5 , -0.65, -1.67, 0. , 0.13, 0. ],
[-0.13, -0.1 , 0.85, -0.98, -0.74, -0. , 0.25, 0.02],
[-0.08, -0.27, 0.45, 0.12, 0.28, -0. , -0.42, -0.51],
[ 0.02, -0.26, -0.43, 0.51, -0.58, -0.1 , -0.32, -0.28]])
sklearn
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Function to generate points
def generate_points_above_below_plane(num_points=100):
# Define the plane z = ax + by + c
a, b, c = 1, 1, 0 # Plane coefficients
# Generate random points
x1 = np.random.uniform(-10, 10, num_points)
x2 = np.random.uniform(-10, 10, num_points)
y1 = np.random.uniform(-10, 10, num_points)
y2 = np.random.uniform(-10, 10, num_points)
# Points above the plane
z_above = a * x1 + b * y1 + c + np.random.normal(20, 2, num_points)
# Points below the plane
z_below = a * x2 + b * y2 + c - np.random.normal(20, 2, num_points)
# Stack the points into arrays
points_above = np.vstack((x1, y1, z_above)).T
points_below = np.vstack((x2, y2, z_below)).T
return points_above, points_below
# Generate points
points_above, points_below = generate_points_above_below_plane()
# Visualization
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot points above the plane
ax.scatter(points_above[:, 0], points_above[:, 1], points_above[:, 2], c='r', label='Above the plane')
# Plot points below the plane
ax.scatter(points_below[:, 0], points_below[:, 1], points_below[:, 2], c='b', label='Below the plane')
# Plot the plane itself for reference
xx, yy = np.meshgrid(range(-10, 11), range(-10, 11))
zz = 1 * xx + 1 * yy + 0
ax.plot_surface(xx, yy, zz, alpha=0.2, color='gray')
# Set labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
# Set title and legend
ax.set_title('3D Points Above and Below a Plane')
ax.legend()
# Show plot
plt.show()
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa