If the data is linearly separable, there are infinitely many valid separators. The issue is that a ๐Ÿ‘“ Perceptron just stumbles onto whichever one its update path leads to, which isnโ€™t optimal.

Support Vectors

Support vectors are the training points that sit right on the margin boundary, i.e. the points where $y_i(w^\top x_i + b) = 1$ holds exactly. These are the points that define the boundary. The solution takes the form:

$$w = \sum_i \alpha_i y_i x_i$$

where $\alpha_i > 0$ only for support vectors and $\alpha_i = 0$ for all other points (they contribute nothing).
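
As a quick sanity check, here's a minimal sketch (using scikit-learn and a made-up toy dataset, both my assumptions) showing that $w$ can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two points per class
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C approximates hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_sv, clf.coef_))  # True: non-support points contribute nothing
```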

Hard-margin SVM

Among all possible separating hyperplanes, hard-margin SVM picks the one with the maximum margin, i.e. the one farthest from the nearest training point on either side.

For a hyperplane $w^\top x + b = 0$, the distance from any point $x$ to this hyperplane is $\frac{|w^\top x + b|}{\|w\|}$. For any separating hyperplane we can rescale $w$ and $b$ such that the closest points satisfy $|w^\top x_i + b| = 1$ exactly (the support vectors). Under this scaling, the margin becomes $\frac{1}{\|w\|}$. Thus, maximizing the margin means maximizing $\frac{1}{\|w\|}$, or minimizing $\|w\|$ (or equivalently, $\frac{1}{2}\|w\|^2$).
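
A tiny numeric check of this geometry, with made-up values for $w$ and $b$:

```python
import numpy as np

w = np.array([3.0, 4.0])  # hypothetical weights, ||w|| = 5
b = -2.0
x = np.array([1.0, 1.0])

distance = abs(w @ x + b) / np.linalg.norm(w)  # |w^T x + b| / ||w||
margin = 1.0 / np.linalg.norm(w)               # 1 / ||w|| under the canonical scaling
print(distance, margin)                        # 1.0, 0.2
```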

Formally, the optimization problem becomes

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 \ \text{ for all } i,$$

or in other words, every single training point is on the correct side of the boundary with a margin of at least 1.
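
As a sketch, this can be handed directly to a quadratic-program solver; here's one way using cvxpy (the solver choice and toy data are my assumptions, and the data must be linearly separable):

```python
import cvxpy as cp
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))  # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]    # y_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```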

Soft-margin SVM

Unfortunately, hard-margin SVM works only on linearly separable data, and real data is almost never perfectly separable, whether due to noise, outliers, or overlapping classes. Soft-margin SVM allows some tolerance for mistakes via a slack variable $\xi_i \ge 0$ for each training point.

The core optimization problem can be formalized as:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \text{ for all } i.$$

The slack relaxes the constraint, but we still pay for it since the objective function includes the penalty $C\sum_i \xi_i$, where $C > 0$ is the slack penalty weight. The choice of $C$ lets us be more or less lenient about data points crossing the margin or decision boundary, and it's generally chosen using cross-validation.
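
A minimal sketch of choosing $C$ by cross-validation with scikit-learn (the dataset and the grid of $C$ values are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# 5-fold cross-validation over a small grid of slack penalty weights
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # e.g. {'C': 1}
```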

| Value of $\xi_i$ | Implication for $x_i$ |
| --- | --- |
| $\xi_i = 0$ | Point is on/beyond the correct side of the margin |
| $0 < \xi_i < 1$ | Point has crossed into the margin, but is still on the correct side of the decision boundary |
| $\xi_i = 1$ | Point is exactly on the decision boundary ($y_i(w^\top x_i + b) = 0$) |
| $\xi_i > 1$ | Point is on the wrong side of the decision boundary, i.e. misclassified |

A point is a support vector in a soft-margin SVM if it's either exactly on the margin with $\xi_i = 0$ or has any positive slack (i.e. any point with $\xi_i > 0$ is a support vector).
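
As an illustration, here's a sketch that recovers each point's slack as $\xi_i = \max(0,\, 1 - y_i(w^\top x_i + b))$ from a fitted scikit-learn model (the dataset is a placeholder):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y01 = make_classification(n_samples=100, n_features=2, n_informative=2,
                             n_redundant=0, random_state=0)
y = 2 * y01 - 1  # relabel classes to {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# xi_i = max(0, 1 - y_i (w^T x_i + b))
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
slack = np.maximum(0.0, 1.0 - y * scores)

print(np.count_nonzero(slack > 0))  # points with positive slack
print(clf.support_)                 # indices of all support vectors
```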

Multiclass

Similar to the multiclass perceptron, we have one weight vector $w_c$ per class $c = 1, \dots, K$ with bias $b_c$, and prediction is $\hat{y} = \arg\max_c\,(w_c^\top x + b_c)$.

For each training point $x_i$, the score of the correct class must beat the score of every wrong class by at least $1 - \xi_i$:

$$w_{y_i}^\top x_i + b_{y_i} \ge w_c^\top x_i + b_c + 1 - \xi_i \quad \text{for all } c \ne y_i.$$

Per training point, there are $K - 1$ constraints, one for each class that isn't the true label. With $n$ training points, the total number of main constraints is $n(K - 1)$, plus $n$ nonnegativity constraints $\xi_i \ge 0$.

In total, the optimization has $Kd$ weight components ($K$ classes, $d$ features) + $K$ biases + $n$ slack variables.
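
As a sketch, scikit-learn's LinearSVC exposes this one-weight-vector-per-class formulation via the Crammer-Singer solver (the dataset here is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # K = 3 classes, d = 4 features

clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.coef_.shape)       # (3, 4): one weight vector w_c per class
print(clf.intercept_.shape)  # (3,): one bias b_c per class
print(clf.predict(X[:5]))    # argmax_c (w_c^T x + b_c)
```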