The perceptron is the simplest algorithm for finding a linear classifier, and it's usable when the data is linearly separable.

At a high level, the algorithm looks something like:

  1. Initialize $w = 0$ and $b = 0$
  2. Perform a forward pass through the training data. For each training point $(x_i, y_i)$:
    • Compute the functional margin via $y_i(w \cdot x_i + b)$
    • If $y_i(w \cdot x_i + b) \le 0$, the point is misclassified and we perform an update: $w \leftarrow w + y_i x_i$, $b \leftarrow b + y_i$
    • If $y_i(w \cdot x_i + b) > 0$, the point is correctly classified. Nothing further is done
  3. We repeat until no mistakes are made on a full pass; if that never happens, the algorithm does not converge (a code sketch of this loop follows below)
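
Here's the loop above as a minimal NumPy sketch (the function name, the `max_epochs` cap, and the array conventions are illustrative choices, not part of the algorithm's statement):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Binary perceptron. X: (n, d) array, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)  # step 1: initialize w = 0
    b = 0.0          # and b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # step 2: functional margin y_i (w . x_i + b)
            if y_i * (np.dot(w, x_i) + b) <= 0:
                # misclassified: w <- w + y_i x_i, b <- b + y_i
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:  # step 3: a clean full pass means we're done
            return w, b
    return w, b  # hit the cap: the data may not be separable
```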

Convergence Theorem

This theorem tells us how quickly the perceptron converges when the data is separable. It's stated in terms of two key quantities:

  • $R$, the radius of the data: the norm of the farthest training point from the origin, $R = \max_i \|x_i\|$
  • $\gamma$, the margin, which measures how easily separable the data is

Assuming there exists a unit vector $w^*$ with $\|w^*\| = 1$ such that every training point satisfies $y_i (w^* \cdot x_i) \ge \gamma$ for some $\gamma > 0$, the number of mistakes that the perceptron makes is at most $(R/\gamma)^2$. Uniform scaling of the data doesn't change this bound, since $R$ and $\gamma$ scale by the same factor. Adding a single new point with a very large norm, however, can increase $R$ without changing $\gamma$, which does increase the bound.
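
To see the bound in action, the sketch below runs a bias-free perceptron (matching the theorem, which is stated without a bias term) on synthetic separable data and compares the mistake count to $(R/\gamma)^2$; the data-generating setup is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: label by the sign of the first coordinate,
# keeping every point at least `gap` away from the boundary x[0] = 0.
gap = 0.5
X = rng.uniform(-3, 3, size=(200, 2))
X = X[np.abs(X[:, 0]) >= gap]
y = np.sign(X[:, 0])

R = np.max(np.linalg.norm(X, axis=1))  # radius: farthest point from origin
# gamma is the margin achieved by the unit vector w* = (1, 0),
# which is at least `gap` by construction.
gamma = np.min(y * X[:, 0])
print(f"bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")

# Bias-free perceptron; guaranteed to terminate since the data is separable.
w = np.zeros(2)
mistakes = 0
while True:
    clean_pass = True
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:
            w += y_i * x_i
            mistakes += 1
            clean_pass = False
    if clean_pass:
        break
print(f"actual mistakes = {mistakes}")
```

The actual mistake count typically lands well under the bound, which is worst-case.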

Multiclass

Going beyond binary labels $y \in \{-1, +1\}$, multiclass perceptrons deal with classes $y \in \{1, \dots, K\}$. Instead of just one weight vector, we have one weight vector $w_k$ per class, along with biases $b_1, \dots, b_K$. Each class $k$ has a score for a point $x$, calculated via $s_k(x) = w_k \cdot x + b_k$.

To predict, we just pick the class with the highest score, formalized as:

$\hat{y} = \arg\max_k \left( w_k \cdot x + b_k \right)$

When a point $x$ with true label $y$ is misclassified as $\hat{y}$, we boost the correct class via:

  • $w_y \leftarrow w_y + x$
  • $b_y \leftarrow b_y + 1$

And penalize the wrong prediction:

  • $w_{\hat{y}} \leftarrow w_{\hat{y}} - x$
  • $b_{\hat{y}} \leftarrow b_{\hat{y}} - 1$

All other weight vectors and biases are left unchanged.
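
Putting the prediction rule and both updates together, here's a minimal multiclass sketch (encoding labels as integers $0, \dots, K-1$ and capping the epochs are conventions I've chosen for the example):

```python
import numpy as np

def multiclass_perceptron(X, y, K, max_epochs=100):
    """Multiclass perceptron. X: (n, d) array, y: integer labels in {0, ..., K-1}."""
    n, d = X.shape
    W = np.zeros((K, d))  # one weight vector w_k per class
    b = np.zeros(K)       # one bias b_k per class
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            scores = W @ x_i + b            # s_k(x) = w_k . x + b_k
            y_hat = int(np.argmax(scores))  # predict the highest-scoring class
            if y_hat != y_i:
                W[y_i] += x_i    # boost the correct class
                b[y_i] += 1
                W[y_hat] -= x_i  # penalize the wrong prediction
                b[y_hat] -= 1
                mistakes += 1
        if mistakes == 0:  # clean full pass: done
            break
    return W, b
```

For $K = 2$ this reduces to the binary rule up to a reparameterization, since only the difference $w_1 - w_0$ (and $b_1 - b_0$) matters for prediction.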