The perceptron is the simplest algorithm for finding a linear classifier, and it's usable when the data is linearly separable.
At a high level, the algorithm looks something like:
- Initialize $w = 0$ and $b = 0$
- Perform a forward pass through the training data. For each training point $(x_i, y_i)$:
  - Compute the functional margin via $y_i(w^\top x_i + b)$
  - If $y_i(w^\top x_i + b) \le 0$, the point is misclassified and we perform an update: $w \leftarrow w + y_i x_i$, $b \leftarrow b + y_i$
  - If $y_i(w^\top x_i + b) > 0$, the point is correctly classified. Nothing further is done
- We repeat until no mistakes are made on a full pass; if the data is not linearly separable, that never happens and the algorithm never converges.
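As a concrete sketch of the loop above, here is a minimal NumPy implementation. The function name, array conventions, and the `max_epochs` safeguard are illustrative assumptions, not part of the notes:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Binary perceptron. X: (n, d) array; y: labels in {-1, +1}.
    Returns (w, b); stops early after a mistake-free pass."""
    n, d = X.shape
    w = np.zeros(d)          # initialize w = 0
    b = 0.0                  # initialize b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # functional margin y_i (w^T x_i + b)
            margin = y[i] * (X[i] @ w + b)
            if margin <= 0:  # misclassified: update
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:    # full clean pass: converged
            break
    return w, b
```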
Convergence Theorem
This theorem tells us how quickly the perceptron converges when the data is separable. It's stated in terms of two key quantities:
- $R = \max_i \|x_i\|$, the radius of the data (the norm of the farthest point from the origin)
- $\gamma$, the margin, which measures how easily separable the data is

Assuming there exists a unit vector $w^*$ with $\|w^*\| = 1$ such that every training point satisfies $y_i \langle w^*, x_i \rangle \ge \gamma$ for some $\gamma > 0$, then the number of mistakes that the perceptron makes is at most $(R/\gamma)^2$. Uniform scaling doesn't change this bound. Adding a single new point with a very large norm can increase $R$ without changing $\gamma$, which does increase the bound.
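To see why uniform scaling leaves the bound unchanged, note that scaling every point by a constant $c > 0$ rescales both quantities by the same factor, which cancels:

$$
x_i \mapsto c\,x_i \;\Longrightarrow\; R \mapsto cR,\quad \gamma \mapsto c\gamma,
\qquad \left(\frac{cR}{c\gamma}\right)^{2} = \left(\frac{R}{\gamma}\right)^{2}.
$$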
Multiclass
Going beyond binary labels $y \in \{-1, +1\}$, multiclass perceptrons deal with classes $y \in \{1, \dots, K\}$. Instead of just one weight vector, we have one weight vector $w_k$ per class and biases $b_k$. Each class $k$ has a score for a point $x$, calculated via $s_k(x) = w_k^\top x + b_k$.
To predict, we just pick the class with the highest score, formalized as:

$$\hat{y} = \arg\max_{k} \left( w_k^\top x + b_k \right)$$
When a point $x$ with true label $y$ is misclassified as $\hat{y} \neq y$, we boost the correct class via:
- $w_y \leftarrow w_y + x$
- $b_y \leftarrow b_y + 1$
And penalize the wrong prediction:
- $w_{\hat{y}} \leftarrow w_{\hat{y}} - x$
- $b_{\hat{y}} \leftarrow b_{\hat{y}} - 1$
All other weight vectors and biases are left unchanged.
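Putting the prediction and update rules together, here is a minimal NumPy sketch. Zero-indexed labels $0, \dots, K-1$, the function name, and the `max_epochs` cap are my assumptions (the notes index classes from 1):

```python
import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """Multiclass perceptron. X: (n, d) array; y: integer labels in {0, ..., K-1}."""
    n, d = X.shape
    W = np.zeros((K, d))   # one weight vector w_k per class
    b = np.zeros(K)        # one bias b_k per class
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            scores = W @ X[i] + b            # s_k(x) = w_k^T x + b_k for every k
            y_hat = int(np.argmax(scores))   # predict the highest-scoring class
            if y_hat != y[i]:                # misclassified:
                W[y[i]] += X[i]              # boost the correct class
                b[y[i]] += 1
                W[y_hat] -= X[i]             # penalize the wrong prediction
                b[y_hat] -= 1
                mistakes += 1
        if mistakes == 0:                    # full clean pass: converged
            break
    return W, b
```

Note that each mistake touches only two rows of `W`, matching the rule above: every other class's weights and bias stay untouched.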