If the data is linearly separable, there are infinitely many valid separators. The issue is that a Perceptron just stumbles onto whichever one its update path leads to, which isn't necessarily optimal.
Support Vectors
The training points that sit right on the margin boundary, i.e. the points where $y_i(w^\top x_i + b) = 1$ exactly. These are the only points that matter for defining the boundary. The solution takes the form:

$$w = \sum_i \alpha_i y_i x_i$$
where $\alpha_i > 0$ only for support vectors and $\alpha_i = 0$ for all other points (they contribute nothing).
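As a sketch of this, we can fit a linear SVM with scikit-learn on hypothetical toy data and verify that $w$ is recovered as a weighted sum of the support vectors alone (`dual_coef_` stores $\alpha_i y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C approximates hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i uses just those points.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(clf.support_)                          # indices of the support vectors
print(np.allclose(w_from_dual, clf.coef_))   # w matches the dual expansion
```

Only a handful of the six points end up in `support_vectors_`; removing any non-support point would leave the boundary unchanged.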
Hard-margin SVM
Among all possible separating hyperplanes, hard-margin SVM picks the one with maximum margin, i.e. the one farthest from the nearest training point on either side.
For a hyperplane $w^\top x + b = 0$, the distance from any point $x$ to the hyperplane is $\frac{|w^\top x + b|}{\|w\|}$. For any separating hyperplane we can rescale $w$ and $b$ such that the closest points satisfy $y_i(w^\top x_i + b) = 1$ exactly (the support vectors). Under this scaling, the margin becomes $\frac{1}{\|w\|}$. Thus, maximizing the margin means maximizing $\frac{1}{\|w\|}$, or minimizing $\|w\|$ (or equivalently, $\frac{1}{2}\|w\|^2$).
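The distance formula is easy to sanity-check numerically; here is a minimal sketch with a hypothetical $w$ and $b$ chosen so the arithmetic is clean:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and a test point x.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -10.0
x = np.array([2.0, 3.0])

# Distance from x to the hyperplane: |w^T x + b| / ||w||.
dist = abs(w @ x + b) / np.linalg.norm(w)
print(dist)  # |3*2 + 4*3 - 10| / 5 = 8/5 = 1.6
```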
Formally, the optimization problem becomes

$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 \ \text{ for all } i,$$

i.e. every single training point is on the correct side of the boundary with a margin of at least 1.
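Since there is no hard-margin mode in scikit-learn, a very large $C$ approximates it; this sketch (with hypothetical data) checks that the constraints $y_i(w^\top x_i + b) \ge 1$ hold and that the closest points sit exactly at 1:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data on the line x1 = x2.
X = np.array([[0.0, 0.0], [1.0, 1.0],
              [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e8).fit(X, y)  # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Constraint values y_i (w^T x_i + b); the minimum should be ~1,
# attained by the support vectors, and the geometric margin is 1/||w||.
margins = y * (X @ w + b)
print(margins.min())
print(1.0 / np.linalg.norm(w))
```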
Soft-margin SVM
Unfortunately, hard-margin SVM works only on linearly separable data, and in practice data is almost never perfectly separable, whether due to noise, outliers, or overlapping classes. Soft-margin SVM allows some tolerance for mistakes via a slack variable $\xi_i \ge 0$ for each training point.
The core optimization problem can be formalized as:

$$\min_{w,\, b,\, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \text{ for all } i.$$
The slack relaxes the constraint, but we still pay for it since the objective function includes the penalty $C \sum_i \xi_i$, where $C > 0$ is the slack penalty weight. Tuning $C$ lets us be more or less lenient about data points crossing the margin or decision boundary, and it's generally chosen using cross-validation.
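One way to see the effect of $C$ is to fit the same overlapping data (hypothetical Gaussians here) at several values and count support vectors: a small $C$ tolerates many margin violations, so more points typically end up as support vectors, while a large $C$ penalizes slack harder:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping classes: two Gaussians with nearby means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),
               rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

counts = []
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    counts.append(int(clf.n_support_.sum()))
print(counts)  # support-vector count tends to shrink as C grows
```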
| Value of $\xi_i$ | Implication for point $i$ |
|---|---|
| $\xi_i = 0$ | Point is on/beyond the correct side of the margin |
| $0 < \xi_i < 1$ | Point has crossed into the margin, but is still on the correct side of the decision boundary |
| $\xi_i = 1$ | Point is exactly on the decision boundary ($w^\top x_i + b = 0$) |
| $\xi_i > 1$ | Point is on the wrong side of the decision boundary, i.e. misclassified |
A point is a support vector in a soft-margin SVM if it's either exactly on the margin ($y_i(w^\top x_i + b) = 1$ with $\xi_i = 0$) or it has any positive slack (any point with $\xi_i > 0$ is a support vector).
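At the optimum the slack satisfies $\xi_i = \max(0,\, 1 - y_i(w^\top x_i + b))$, so it can be computed from a fitted model's decision function; a sketch on hypothetical data, reading off the cases from the table above:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data with the two classes close together in the middle.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [2.5, 2.5], [4.0, 4.0], [5.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Slack of each training point: xi_i = max(0, 1 - y_i * f(x_i)).
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
misclassified = xi > 1.0   # wrong side of the decision boundary
print(np.round(xi, 3))
print(misclassified)
```

Points far from the boundary get $\xi_i = 0$, while the two middle points end up with positive slack (inside the margin) at this $C$.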
Multiclass
Similar to the multiclass perceptron, we have one weight vector $w_y$ per class (with biases $b_y$), and prediction is $\hat{y} = \arg\max_y \left( w_y^\top x + b_y \right)$.
For each training point $i$, the score of the correct class must beat the score of every wrong class by at least $1 - \xi_i$:

$$w_{y_i}^\top x_i + b_{y_i} \ge w_y^\top x_i + b_y + 1 - \xi_i \quad \text{for all } y \ne y_i.$$
There are $K - 1$ constraints per training point, one for each class that isn't the true label. With $n$ training points, the main constraints total $n(K - 1)$, plus $n$ nonnegativity constraints $\xi_i \ge 0$.
In total, the problem has $Kd$ weight components + $K$ biases + $n$ slack variables.
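The multiclass prediction rule and the per-point slack can be sketched directly in NumPy; the weights and point below are hypothetical stand-ins, not a trained model:

```python
import numpy as np

K, d = 3, 4                    # hypothetical: 3 classes, 4 features
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))    # one weight vector per class (K*d components)
b = rng.normal(size=K)         # one bias per class
x = rng.normal(size=d)         # a single training point
y_true = 1                     # its (hypothetical) true label

scores = W @ x + b
y_hat = int(np.argmax(scores))           # prediction: argmax of class scores

# Slack this point needs: how far the best wrong class comes from being
# beaten by at least 1, i.e. max(0, 1 + max_{y != y_true} s_y - s_{y_true}).
wrong = np.delete(scores, y_true)
xi = max(0.0, 1.0 + wrong.max() - scores[y_true])
print(y_hat, xi)
```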