Support Vector Machine

By CHAN, CHUNG

TA: Andrew M7

 

Introduction

What is support vector machine?

        Support Vector machine is a training algorithm {we have learned nearest neighbor, identification tree, genetic algorithm and neural net} for learning classification rules (discrete classes for each input instance) and regression functions (continuous output for every input).

 

Brief steps of attaining the learning rules

        Let’s start with linearly separable training data with two distinct classes, P and N (+ and -).

 

STEP 1. The definition for best separator (Maximal Margin Hyperplane, MMH)

        It is the line (curvy or not) that is farthest after from the data that it separates. (i.e. widest street).

 

1

        The decision boundary is:

2

For all x in P

3

For all x in N

4

For all x, with yi chosen to be +/- 1 such that 2 and 3 are satisfied

5

Expression to maximize

 

w, u, x are column vectors, with dimension equals the total number of features.

w is the weight vector. It is normal to the decision boundary and its magnitude is chosen such that the distance between the decision boundary and the two classes of point are normalized to at least one.

b is the bias (-b is the threshold). It is chosen because of the extra degree of freedom.

 

STEP 2. Solve the maximization problem

We could maximize 5 with constraints 4 by maximizing the Lagrangian with respect to the Lagrangian multiplier and minimizing it with respect to w and b. In other words, the Lagrangian multiplier have to be non-negative to ensure that the gradient of the constraint function and the gradient of 5 are collinear and in the same direction. For colinearity to be possible, some Lagrangian multiplier have to be zero. Support vectors refer to the vectors with non-zero Lagrangian multiplier (vectors that contribute to the decision boundary).

 

STEP 3. Classification using dot product

The solution requires the use of dot product between input vectors, according to the definition of dot product in the Euclidean space. To extend the idea to non-linearly separable data, we could use a kernel function that evaluates the dot product of input vectors in a transformed space in which the data are linearly separable.

 

K is the kernel function

No Transformation

Polynomial

Radial-basis

 

Polynomial kernel function allows curvy decision boundary. Radial-basis allows even curvier boundaries for small sigma.

 

Examples

The following demonstrate how different kernel function lead to different shapes of the decision boundaries. Note that the parameters can also be adjusted to determine how curvy the decision boundaries can be in fitting the data.

Dot Product

Polynomial Function

Radial-Basis Function

 

Conclusion

        Support vector machine is learns the widest streets that separate data into different groups. This streets, if straight, can be calculated using dot product and the classification also require dot product of input vectors. If the streets are curvy (non-linear in shape), they may be straight (or slightly curvy) after transformation into a different space. Fortunately, no knowledge of the transformation function is needed because kernel functions evaluates the dot product in the transformed space and what we need is only dot product. Different kernel function determines the general shapes of the streets and the parameters determine how curvy the streets could be. However, some vectors do not contribute to the street because the constraints cannot be satisfied but including them. Vectors that contribute to the street are called support vectors, and hence the algorithm is called support vector machines (finding the support vectors and using them to determine how unknown data could be classified).

        Compared with other learning algorithms, support vector machine is flexible (because different kernel functions can be used), reliable (because the result must be the global maximum), fast (only require evaluation of the kernel function when classifying unknown data) and robust.