The perceptron is the simplest algorithm for finding a linear classifier, and it's usable when the data is linearly separable.

At a high level, the algorithm looks something like:

  1. Initialize $w = 0$ and $b = 0$
  2. Perform a forward pass through the training data. For each training point $(x_i, y_i)$:
    • Compute the functional margin via $y_i(w \cdot x_i + b)$
    • If $y_i(w \cdot x_i + b) \le 0$, the point is misclassified and we perform an update: $w \leftarrow w + y_i x_i$, $b \leftarrow b + y_i$
    • If $y_i(w \cdot x_i + b) > 0$, the point is correctly classified. Nothing further is done
  3. We repeat until no mistakes are made on a full pass; if that never happens, the algorithm does not converge (a code sketch of this loop follows below)
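
Here's the loop above as a minimal NumPy sketch (the function name, the `max_epochs` cap, and the array conventions are illustrative choices, not part of the algorithm's statement):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Binary perceptron. X: (n, d) array, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)  # step 1: initialize w = 0
    b = 0.0          # and b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # step 2: functional margin y_i (w . x_i + b)
            if y_i * (np.dot(w, x_i) + b) <= 0:
                # misclassified: w <- w + y_i x_i, b <- b + y_i
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:  # step 3: a clean full pass means we're done
            return w, b
    return w, b  # hit the cap: the data may not be separable
```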

Convergence Theorem

This theorem tells us how quickly the perceptron converges when the data is separable. It's stated in terms of two key quantities:

  • $R$, the radius of the data: the norm of the farthest training point from the origin, $R = \max_i \|x_i\|$
  • $\gamma$, the margin, which measures how easily separable the data is

Assuming there exists a unit vector $w^*$ with $\|w^*\| = 1$ such that every training point satisfies $y_i (w^* \cdot x_i) \ge \gamma$ for some $\gamma > 0$, the number of mistakes that the perceptron makes is at most $(R/\gamma)^2$. Uniform scaling of the data doesn't change this bound, since $R$ and $\gamma$ scale by the same factor. Adding a single new point with a very large norm, however, can increase $R$ without changing $\gamma$, which does increase the bound.
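
To see the bound in action, the sketch below runs a bias-free perceptron (matching the theorem, which is stated without a bias term) on synthetic separable data and compares the mistake count to $(R/\gamma)^2$; the data-generating setup is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: label by the sign of the first coordinate,
# keeping every point at least `gap` away from the boundary x[0] = 0.
gap = 0.5
X = rng.uniform(-3, 3, size=(200, 2))
X = X[np.abs(X[:, 0]) >= gap]
y = np.sign(X[:, 0])

R = np.max(np.linalg.norm(X, axis=1))  # radius: farthest point from origin
# gamma is the margin achieved by the unit vector w* = (1, 0),
# which is at least `gap` by construction.
gamma = np.min(y * X[:, 0])
print(f"bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")

# Bias-free perceptron; guaranteed to terminate since the data is separable.
w = np.zeros(2)
mistakes = 0
while True:
    clean_pass = True
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:
            w += y_i * x_i
            mistakes += 1
            clean_pass = False
    if clean_pass:
        break
print(f"actual mistakes = {mistakes}")
```

The actual mistake count typically lands well under the bound, which is worst-case.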

Multiclass

Going beyond binary labels $y \in \{-1, +1\}$, multiclass perceptrons deal with classes $y \in \{1, \dots, K\}$. Instead of just one weight vector, we have one weight vector $w_k$ per class, along with biases $b_1, \dots, b_K$. Each class $k$ has a score for a point $x$, calculated via $s_k(x) = w_k \cdot x + b_k$.

To predict, we just pick the class with the highest score, formalized as:

$\hat{y} = \arg\max_k \left( w_k \cdot x + b_k \right)$

When a point $x$ with true label $y$ is misclassified as $\hat{y}$, we boost the correct class via:

  • $w_y \leftarrow w_y + x$
  • $b_y \leftarrow b_y + 1$

And penalize the wrong prediction:

  • $w_{\hat{y}} \leftarrow w_{\hat{y}} - x$
  • $b_{\hat{y}} \leftarrow b_{\hat{y}} - 1$

All other weight vectors and biases are left unchanged.
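
Putting the prediction rule and both updates together, here's a minimal multiclass sketch (encoding labels as integers $0, \dots, K-1$ and capping the epochs are conventions I've chosen for the example):

```python
import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """Multiclass perceptron. X: (n, d) array, y: integer labels in {0, ..., K-1}."""
    n, d = X.shape
    W = np.zeros((K, d))  # one weight vector w_k per class
    b = np.zeros(K)       # one bias b_k per class
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            scores = W @ x_i + b            # s_k(x) = w_k . x + b_k
            y_hat = int(np.argmax(scores))  # predict the highest-scoring class
            if y_hat != y_i:
                W[y_i] += x_i    # boost the correct class
                b[y_i] += 1
                W[y_hat] -= x_i  # penalize the wrong prediction
                b[y_hat] -= 1
                mistakes += 1
        if mistakes == 0:  # clean full pass: done
            break
    return W, b
```

For $K = 2$ this reduces to the binary rule up to a reparameterization, since only the difference $w_1 - w_0$ (and $b_1 - b_0$) matters for prediction.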