library(knitr)
opts_chunk$set(echo = TRUE, cache = TRUE)
library(dplyr)

Lecture 1 Linear Support Vector Machine

Course Introduction

Course Design

  • From foundations to techniques

Large-Margin Separating Hyperplane

Linear Classification Revisited

  • PLA/pocket: \(h(x) = \text{sign}(s)\)
    • plausible err = 0/1 (small flipping noise); minimized specially by PLA/pocket

linear (hyperplane) classifiers:

\[ h(x) = \text{sign} (w^T x) \]

Which Line Is Best

  • PLA? depends on randomness
  • VC bound? whichever you like!
    • \(E_{out}(w) \le \underbrace{E_{in}(w)}_{0} + \underbrace{\Omega(\mathcal{H})}_{d_{VC} = d + 1}\)
  • You? rightmost? Possibly?

Why Rightmost Hyperplane?

  • informal argument
    • if (Gaussian-like) noise on a future \(x \approx x_n\) (e.g. because of measurement error):
    • leftmost: a tiny amount of noise will make \(x\) cross the decision boundary.
    • difference between the left and right figures: how much measurement error can be tolerated.
  • \(x_n\) further from hyperplane \(\iff\) tolerate more noise
    • In machine learning foundations, we know noise is the cause of overfitting.
    • Then if the line can tolerate more noise, the line is better.
  • \(\iff\) more robust to overfitting.
  • distance to closest \(x_n\) \(\iff\) amount of noise tolerance \(\iff\) robustness of hyperplane

  • rightmost one: more robust because of larger distance to closest \(x_n\).

Fat Hyperplane

  • robust separating hyperplane: fat
    • far from both sides of examples
  • robustness \(\equiv\) fatness: distance to closest \(x_n\)

goal: find fattest separating hyperplane.

Large-Margin Separating Hyperplane

  • \(\underset{w}{\text{max}}\) fatness(w)
  • subject to:
    • \(w\) classifies every \((x_n, y_n)\) correctly
    • fatness(w) = \(\underset{n = 1, \dots, N}{\text{min}} \text{distance}(x_n, w)\)
  • fatness: formally called margin
  • correctness: \(y_{n} = \text{sign} (w^T x_{n})\)

So we have the following:

  • \(\underset{w}{\text{max}}\) fatness(w)
  • subject to:
    • every \(y_{n} w^T x_{n} > 0\)
    • margin(w) = \(\underset{n = 1, \dots, N}{\text{min}} \text{distance}(x_n, w)\)

goal: find Large-Margin separating hyperplane.

Standard Large-Margin Problem

Distance to Hyperplane: Preliminary

‘shorten’ \(x\) and \(w\)

  • distance treats \(w_{0}\) and \((w_1, \dots, w_d)\) differently (to be derived)
  • \(b = w_0\); \(w = [w_1, \dots, w_d]\); \(x = [x_1, \dots, x_d]\)

Distance to Hyperplane

want: distance \((x, b, w)\), with hyperplane \(w^T x' + b = 0\)

consider \(x', x''\) on hyperplane:

  1. \(w^Tx' = -b\), \(w^Tx'' = -b\)
  2. \(w \perp\) hyperplane \[ w^T (x'' - x') = 0 \]
  3. distance = length of the projection of \((x - x')\) onto the direction \(\perp\) to the hyperplane (i.e. onto \(w\)).

\[ \text{distance}(x, b, w) = \begin{vmatrix} \frac{w^T}{ \begin{Vmatrix} w \end{Vmatrix} } (x - x') \end{vmatrix} = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } \begin{vmatrix} w^T x + b \end{vmatrix} \]
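
As a quick sanity check (my own snippet, not from the lecture), the distance formula is a one-liner in R:

# point-to-hyperplane distance |w^T x + b| / ||w||, as derived above
dist_to_hyperplane <- function(x, b, w) abs(sum(w * x) + b) / sqrt(sum(w^2))
dist_to_hyperplane(x = c(2, 0), b = -1, w = c(1, -1))   # 1/sqrt(2), the margin of the toy problem below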

Distance to Separating Hyperplane

\[ \text{distance}(x, b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } \begin{vmatrix} w^T x + b \end{vmatrix} \]

  • separating hyperplane: for every \(n\)

\[ y_{n}(w^T x_n + b) > 0 \]

  • distance to separating hyperplane:

\[ \text{distance}(x_n, b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } y_n(w^T x_n + b) \]

  • In summary:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \text{margin}(b, w) \\ \text{subject to} &\quad \text{every } y_{n}(w^T x_n + b) > 0 \\ &\quad \text{margin}(b, w) = \underset{n=1, \dots, N}{\text{min}}\frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } y_n (w^T x_n + b) \end{split} \]

Margin of Special Separating Hyperplane

Starting from the problem above, notice the following two points:

  • \(w^Tx + b = 0\) same as \(3w^Tx + 3b = 0\): scaling does not matter
  • special scaling: only consider separating \((b, w)\) such that

\[ \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \implies \text{margin}(b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \]

  • We then have a new formulation:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad \text{every } y_{n}(w^T x_n + b) > 0 \\ &\quad \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \end{split} \]

  • We can simplify further, since the second condition is stronger than the first:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \end{split} \]

Standard Large-Margin Hyperplane Problem

  • original constraints: \(\underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1\)
    • want: optimal \((b, w)\) here (inside)
  • necessary constraints: \(y_n (w^T x_n + b) \ge 1\) for all \(n\)

  • if optimal \((b, w)\) outside, e.g. \(y_n (w^T x_n + b) > 1.126\) for all \(n\)
    • can scale \((b, w)\) to “more optimal” \((\frac{b}{1.126}, \frac{w}{1.126})\) (contradiction!)

So we can work with the relaxed constraints:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

final change: \(\text{max} \implies \text{min}\), remove the square root (\(\begin{Vmatrix} w \end{Vmatrix} = \sqrt{w^Tw}\)), and add \(\frac{1}{2}\) for convenience

\[ \begin{split} \underset{b,w}{\text{min}} &\quad \frac{1}{2} w^T w \\ \text{subject to} &\quad y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

Support Vector Machine

Solving a Particular Standard Problem

  • \(\underset{b,w}{\text{min}} \frac{1}{2} w^T w\)
  • subject to:
    • \(y_n (w^T x_n + b) \ge 1\) for all \(n\)
# toy data: two negative examples (crosses) and two positive examples (circles)
data <- data.frame(x0 = c(0, 2, 2, 3), x1 = c(0, 2, 0, 0), y = c(-1, -1, 1, 1))
plot(data$x0, data$x1, type = "n", xlim = c(-1, 4), ylim = c(-1, 4), asp = 1, xlab = "x_1", ylab = "x_2")
points(data$x0[1:2], data$x1[1:2], pch = 4, col = "red")   # y = -1
points(data$x0[3:4], data$x1[3:4], pch = 1, col = "blue")  # y = +1
abline(a = -1, b = 1)            # optimal hyperplane x_1 - x_2 - 1 = 0
abline(a = 0, b = 1, lty = 2)    # fat boundary through the negative support vectors
abline(a = -2, b = 1, lty = 2)   # fat boundary through the positive support vectors

  • \(g_{SVM} = \text{sign}(x_1 - x_2 - 1)\)

Support Vector Machine (SVM)

  • optimal solution: \((w_1 = 1, w_2 = -1, b = -1)\)
  • \(\text{margin}(b, w) = \frac{1}{\begin{Vmatrix}w\end{Vmatrix}} = \frac{1}{\sqrt{2}}\)
  • examples on the boundary: ‘locate’ the fattest hyperplane; other examples: not needed.
  • call a boundary example a support vector (candidate)

Support vector machine (SVM): learn fattest hyperplanes (with help of support vectors)

Solving General SVM

\[ \begin{split} \underset{b,w}{\text{min}} &\quad \frac{1}{2} w^T w \\ \text{subject to } &\quad y_n (w^T x_n + b) \ge 1 \end{split} \]

  • not easy manually, of course.
  • gradient descent? not easy with constraints
  • luckily:
    • (convex) quadratic objective function of \((b, w)\)
    • linear constraints of \((b, w)\)
  • quadratic programming (QP): ‘easy’ optimization problem.

Quadratic Programming

\[ \begin{split} \text{optimal}(b, w) &=? \\ \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \\ & \text{for } n = 1, 2, \dots, N \\ \end{split} \]

  • Generic form of quadratic programming

\[ \begin{split} \text{optimal } u \leftarrow & QP(Q, p, A, c) \\ \underset{u}{\text{min}} \quad & \frac{1}{2} u^T Q u + p^T u \\ \text{subject to } \quad & a_m^T u \ge c_m, \\ & \text{for } m = 1, 2, \dots, M \\ \end{split} \]

  • objective function:
    • \(u = \begin{bmatrix} b \\ w \end{bmatrix}\)
    • \(Q = \begin{bmatrix} 0 & 0^T \\ 0 & I \end{bmatrix}\)
    • \(p = 0_{d+1}\)
  • constraints:
    • \(a_n^T = y_n[1, x_n^T]\)
    • \(c_n = 1\)
    • \(M = N\)
  • SVM with general QP solver: easy if you’ve read the manual.
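
To make the mapping concrete, here is a minimal sketch of my own (assuming the quadprog package is available) that feeds the toy problem above into a generic QP solver; the tiny value in the first diagonal entry is only there because solve.QP insists on a positive-definite \(Q\), while the bias \(b\) carries no quadratic penalty.

library(quadprog)
X <- as.matrix(data[, c("x0", "x1")])    # toy data from the plot above
y <- data$y
Dmat <- diag(c(1e-8, 1, 1))              # Q: (almost) no penalty on b, identity on w
dvec <- rep(0, 3)                        # p = 0_{d+1}
Amat <- t(y * cbind(1, X))               # column n holds a_n = y_n [1, x_n^T]
bvec <- rep(1, nrow(X))                  # c_n = 1
u <- solve.QP(Dmat, dvec, Amat, bvec)$solution
round(u, 3)                              # expect (b, w_1, w_2) close to (-1, 1, -1)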

SVM with QP Solver

  • Linear Hard-Margin SVM Algorithm:
    1. \(Q, p, a^T_n, c_n\) as above
    2. \(\begin{bmatrix} b \\ w \end{bmatrix} \leftarrow \text{QP}(Q, p, A, c)\)
    3. return \((b, w)\) as \(g_{SVM}\)
  • Hard-Margin: nothing violates the ‘fat boundary’
  • linear: hyperplane in the original \(x_n\) space
  • want non-linear? \(z_n = \Phi(x_n)\) - remember?

Reasons Behind Large-Margin Hyperplane

Why Large-Margin Hyperplane

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

  • regularization: minimize \(E_{in}\), subject to the constraint \(w^Tw \le C\)
  • SVM: minimize \(w^Tw\), subject to the constraint \(E_{in} = 0\) [and more: \(y_n(w^Tx_n+b) \ge 1\)]

SVM (Large-Margin hyperplane): ‘weight-decay regularization’ within \(E_{in} = 0\)

Large-Margin Restricts Dichotomies

consider a ‘Large-Margin algorithm’ \(\mathcal{A}_{\rho}\): it either returns \(g\) with margin(\(g\)) \(\ge \rho\) (if such \(g\) exists), or returns nothing otherwise.

  • \(\mathcal{A}_0\): like PLA \(\implies\) shatter ‘general’ 3 inputs
  • \(\mathcal{A}_{1.126}\): more strict than SVM \(\implies\) cannot shatter any 3 inputs

fewer dichotomies \(\implies\) smaller VC dim \(\implies\) better generalization.

VC Dimension of Large-Margin Algorithm

fewer dichotomies \(\implies\) smaller ‘VC dim.’

  • considers \(d_{VC}(\mathcal{A}_{\rho})\) [data dependent, need more than VC]
  • instead of \(d_{VC}(\mathcal{H})\) [data independent, covered VC]

  • \(d_{VC}(\mathcal{A}_{\rho})\) when \(\mathcal{X}\) = unit circle in \(\mathbb{R}^2\), where \(\mathcal{X}\) is the input space.
  • \(\rho = 0\): just perceptrons (\(d_{VC} = 3\))
  • \(\rho \ge \frac{\sqrt{3}}{2}\): cannot shatter any 3 inputs (\(d_{VC} < 3\))
    • some inputs must be of distance \(\le \sqrt{3}\)

generally, when \(\mathcal{X}\) in radius-R hyperball:

\[ d_{VC}(\mathcal{A}_{\rho}) \le \text{min} (\frac{R^2}{\rho^2}, d) + 1 \le \underbrace{d+1}_{d_{VC}(\text{perceptron})} \]

Benefits of Large-Margin Hyperplanes

  • Large-Margin Hyperplanes: even fewer hypotheses, simple boundary
  • Hyperplanes: not many hypotheses, simple boundary
  • Hyperplanes + feature transform \(\Phi\): many hypotheses, sophisticated boundary
  • not many: good for \(d_{VC}\) and generalization
  • sophisticated: good for possibly better \(E_{in}\)

  • a new possibility: non-linear SVM

Large-Margin Hyperplanes + numerous feature transforms \(\Phi\): not many hypotheses, sophisticated boundary

Conclusions

  • Large-Margin Separating Hyperplane
    • intuitively more robust against noise
  • Standard Large-Margin Problem
    • minimize ‘length of w’ at special separating scale
  • Support Vector Machine
    • ‘easy’ via quadratic programming
  • Reasons Behind Large-Margin Hyperplane
    • fewer dichotomies and better generalization

Lecture 2 Dual Support Vector Machine

Motivation of Dual SVM

Non-Linear Support Vector Machine Revisited

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T \underbrace{z_n}_{\Phi(x_n)} + b) \ge 1 \text{ for all } n \end{split} \]

Non-Linear Hard-Margin SVM

  1. \(Q = \begin{bmatrix} 0 & 0^T_{\tilde{d}} \\ 0_{\tilde{d}} & I_{\tilde{d}} \end{bmatrix}\); \(p = 0_{\tilde{d}+1}\); \(a_n = y_n[1, z_n^T]\); \(c_n = 1\)
  2. \(\begin{bmatrix} b \\ w \end{bmatrix} \leftarrow \text{QP}(Q, p, A, c)\)
  3. return \(b \in \mathbb{R}\) & \(w \in \mathbb{R}^{\tilde{d}}\) with \(g_{SVM}(x) = \text{sign}(w^T \Phi(x) + b)\)
  • demanded: not many (Large-Margin), but sophisticated boundary (feature transform)
  • QP with \(\tilde{d}+1\) variables and \(N\) constraints - challenging if \(\tilde{d}\) is large, or even infinite?!

Goal: SVM without dependence on \(\tilde{d}\)

TODO: SVM without \(\tilde{d}\)

  • Original SVM
    • \(\tilde{d} + 1\) variables
    • \(N\) constraints
  • ‘Equivalent’ SVM
    • \(N\) variables
    • \(N+1\) constraints

Warning: Heavy Math!!!

  • introduce some necessary math without rigor to help understand SVM deeper
  • ‘claim’ some results if details unnecessary - like how we ‘claimed’ Hoeffding

‘Equivalent’ SVM: based on some dual problem of Original SVM

Key Tool: Lagrange Multipliers

  • regularization by constrained minimization of \(E_{in}\): \(\underset{w}{\text{min}}\ E_{in}(w) \text{ s.t. } w^Tw \le C\)
  • regularization by minimizing \(E_{aug}\): \(\underset{w}{\text{min}}\ E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^T w\)
  • \(C\) equivalent to some \(\lambda \ge 0\) by checking optimality condition

\[ \nabla E_{in}(w) + \frac{2\lambda}{N} w = 0 \]

  • regularization: view \(\lambda\) as given parameter instead of \(C\), and solve ‘easily’
  • dual SVM: view the \(\lambda\)'s as unknowns given the constraints, and solve for them as variables instead.

how many \(\lambda\)’s as variables? \(N\) - one per constraint.

Starting Point: Constrained to ‘Unconstrained’

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T \underbrace{z_n}_{\Phi(x_n)} + b) \ge 1 \text{ for all } n \end{split} \]

Lagrange Function: with Lagrange Multipliers \(\alpha_n\)

\[ \mathcal{L}(b, w, \alpha) = \underbrace{\frac{1}{2} w^T w}_{\text{objective}} + \sum_{n=1}^{N} \alpha_n \underbrace{(1 - y_n(w^T z_n + b))}_{\text{constraint}} \]

Claim

\[ \text{SVM } \equiv \underset{b,w}{\text{min}} (\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)) = \underset{b,w}{\text{min}} ( \infty \text{ if violate }; \frac{1}{2} w^T w \text{ if feasible}) \]

  • any violating \((b, w)\): \(\underset{\text{all } \alpha_n \ge 0}{\text{max}} (\square + \sum_n \alpha_n (\text{some positive})) \rightarrow \infty\)
  • any feasible \((b, w)\): \(\underset{\text{all } \alpha_n \ge 0}{\text{max}} (\square + \sum_n \alpha_n (\text{all non-positive})) = \square\)

The constraints are now hidden in the \(\text{max}\).

Lagrange Dual SVM

Lagrange Dual Problem

for any fixed \(\alpha'\) with all \(\alpha_n' \ge 0\),

\[ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) \ge \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha') \]

because \(\text{max} \ge \text{any}\).

for best \(\alpha' \ge 0\) on RHS,

\[ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) \ge \underbrace{ \underset{\text{all } \alpha_n' \ge 0}{\text{max}}\ \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha') }_{\text{Lagrange dual problem}} \]

because best is one of any.

If we solve the dual problem, we get a lower bound of the primal problem.

  • Lagrange dual problem:
    • outer maximization of \(\alpha\) on lower bound of original problem

Strong Duality of Quadratic Programming

\[ \underbrace{ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) }_{\text{equiv. to original (primal) SVM}} = \underbrace{ \underset{\text{all } \alpha_n \ge 0}{\text{max}}\ \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha) }_{\text{Lagrange dual problem}} \]

  • \(\ge\): weak duality
  • \(=\): strong duality, true for QP if
    • convex primal
    • feasible primal (true if \(\Phi\)-separable)
    • linear constraints
  • called constraint qualification

there exists a primal-dual optimal solution \((b, w, \alpha)\) that is optimal for both sides.

Solving Lagrange Dual: Simplification (1/2)

\[ \underset{\text{all } \alpha_n \ge 0}{\text{max}} \left( \underset{b,w}{\text{min}} \underbrace{ \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n + b)) }_{\mathcal{L}(b, w, \alpha)} \right) \]

  • inner problem ‘Unconstrained’, at optimal:
    • \(\frac{\partial \mathcal{L}(b, w, \alpha)}{\partial b} = 0 = -\sum_{n=1}^{N} \alpha_n y_{n}\)
  • no loss of optimality if solving with constraint \(\sum_{n=1}^{N} \alpha_n y_{n} = 0\)

Then we can remove \(b\).

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0}{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n)) \right) \]

Solving Lagrange Dual: Simplification (2/2)

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0}{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n)) \right) \]

  • inner problem ‘Unconstrained’, at optimal:
    • \(\frac{\partial \mathcal{L}(b, w, \alpha)}{\partial w_i} = 0 = w_i -\sum_{n=1}^{N} \alpha_n y_{n} z_{n,i}\)
  • no loss of optimality if solving with constraint \(w = \sum_{n=1}^{N} \alpha_n y_{n} z_n\)

\[ \begin{split} & \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n - w^T w \right) \\ \iff & \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \end{split} \]

KKT Optimality Conditions

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \]

  • If primal-dual optimal \((b, w, \alpha)\)
    • primal feasible: \(y_{n} (w^T z_n + b) \ge 1\)
    • dual feasible: \(\alpha_n \ge 0\)
    • dual-inner optimal: \(\sum y_n \alpha_n = 0; w = \sum \alpha_n y_n z_n\)
    • primal-inner optimal (at the optimum, all ‘Lagrange terms’ disappear): \(\alpha_n (1 - y_n (w^T z_n + b)) = 0\), which is called complementary slackness

called Karush-Kuhn-Tucker (KKT) conditions, necessary for optimality [& sufficient here].

Will use KKT to ‘solve’ \((b, w)\) from optimal \(\alpha\).

Solving Dual SVM

Dual Formulation of Support Vector Machine

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \]

Standard Hard-Margin SVM dual

\[ \begin{split} \underset{\alpha}{\text{min}} & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n \\ \text{subject to } & \sum_{n=1}^{N} y_n \alpha_n = 0 \\ & \alpha_n \ge 0, \text{ for } n = 1, 2, \dots, N \end{split} \]

  • (convex) QP of \(N\) variables & \(N+1\) constraints, as promised
  • How to solve? yeah! we know QP! :)

Dual SVM with QP Solver

\[ \begin{split} \text{optimal } \alpha \leftarrow & QP(Q_D, p, A, c) \\ \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha + p^T \alpha \\ \text{subject to } \quad & a_m^T \alpha \ge c_m, \\ & \text{for } m = 1, 2, \dots, M \\ \end{split} \]

\[ \begin{split} q_{n,m} \quad & = y_n y_m z_n^T z_m \\ p \quad & = -1_N \\ a_{\ge} = y \quad &; c_{\ge} = 0; \\ a_{\le} = -y \quad &; c_{\le} = 0; \\ a_n^T = n \text{-th unit direction} \quad &; c_n = 0 \end{split} \]

Note: many solvers treat equality (\(a_{\ge}, a_{\le}\)) & bound constraints (\(a_n\)) specially for numerical stability.
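
For concreteness, a sketch of my own (again assuming the quadprog package): the hard-margin dual on the toy data from Lecture 1. A small ridge is added to \(Q_D\) because it is only positive semi-definite, while solve.QP requires positive definiteness.

library(quadprog)
X <- matrix(c(0, 0,  2, 2,  2, 0,  3, 0), ncol = 2, byrow = TRUE)  # same toy data as Lecture 1
y <- c(-1, -1, 1, 1); N <- nrow(X)
QD   <- (y %o% y) * (X %*% t(X))         # q_{n,m} = y_n y_m x_n^T x_m
Dmat <- QD + diag(1e-8, N)               # small ridge: Q_D is only positive semi-definite
dvec <- rep(1, N)                        # corresponds to p = -1_N
Amat <- cbind(y, diag(N))                # 1st column: y^T alpha = 0 (equality); rest: alpha_n >= 0
bvec <- rep(0, N + 1)
alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
sv    <- which(alpha > 1e-6)             # support vectors: alpha_n > 0
w     <- colSums(alpha * y * X)          # w = sum_n alpha_n y_n x_n
b     <- y[sv[1]] - sum(w * X[sv[1], ])  # b from any support vector
c(b = b, w = w)                          # expect roughly b = -1, w = (1, -1)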

Dual SVM with Special QP Solver

  • \(q_{n,m} = y_n y_m z_n^T z_m\), often non-zero, dense.
  • if \(N = 30000\), the dense \(Q_D\) (\(N\) by \(N\), symmetric) takes > 3G RAM
  • to scale up to large \(N\), need a special solver that
    • does not store the whole \(Q_D\)
    • utilizes the special constraints properly
  • usually better to use a special solver in practice.

Optimal \((b, w)\)

KKT conditions

  • if primal-dual optimal \((b, w, \alpha)\)
    • primal feasible: \(y_{n} (w^T z_n + b) \ge 1\)
    • dual feasible: \(\alpha_n \ge 0\)
    • dual-inner optimal: \(\sum y_n \alpha_n = 0; w = \sum \alpha_n y_n z_n\)
    • primal-inner optimal (at the optimum, all ‘Lagrange terms’ disappear): \(\alpha_n (1 - y_n (w^T z_n + b)) = 0\) (complementary slackness)
  • optimal \(\alpha \implies\) optimal \(w\)? easy above!
  • optimal \(\alpha \implies\) optimal \(b\)? primal feasibility only gives a range; complementary slackness gives an equality: if some \(\alpha_n > 0\), then \(b = y_n - w^T z_n\)

complementary slackness: \(\alpha_n > 0 \implies (z_n, y_n)\) is on the fat boundary (an SV!)

Messages behind Dual SVM

Support Vectors Revisited

  • on boundary: ‘locates’ fattest hyperplane; others: not needed.
  • examples with \(\alpha_n > 0\): on boundary
  • call \(\alpha_n > 0\) examples \((z_n, y_n)\) support vectors
  • SV (positive \(\alpha_n\)) \(\subseteq\) SV candidates (on boundary)

  • only SV needed to compute \(w: w = \sum_{n=1}^{N} \alpha_n y_n z_n = \sum_{SV} \alpha_n y_n z_n\)
  • only SV needed to compute \(b: b = y_n - w^T z_n\) with any SV \((z_n, y_n)\)

SVM: learn fattest hyperplane by identifying support vectors with dual optimal solution.

Representation of Fattest Hyperplane

  • SVM: \(w_{SVM}= \sum_{n=1}^{N} \alpha_n y_n z_n\), with \(\alpha_n\) from the dual solution
  • PLA: \(w_{PLA} = \sum_{n=1}^{N} \beta_n y_n z_n\), with \(\beta_n\) = # of mistake corrections on \((z_n, y_n)\)

\(w =\) linear combination of \(y_n z_n\)

  • also true for GD/SGD-based LogReg/LinReg when \(w_0 = 0\)
  • call \(w\) ‘represented’ by data

Philosophically speaking: how do we want to represent \(w\)?

  • SVM: represent \(w\) by SVs only.

Summary: Two Forms of Hard-Margin SVM

Primal Hard-Margin SVM:

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

  • \(\tilde{d} + 1\) variables, \(N\) constraints: suitable when \(\tilde{d} + 1\) small
  • physical meaning: locate specially-scaled \((b, w)\)

Dual Hard-Margin SVM:

\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0; \\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]

  • \(N\) variables, \(N+1\) simple constraints: suitable when \(N\) small
  • physical meaning: locate SVs \((z_n, y_n)\) & their \(\alpha_n\)

Both eventually result in optimal \((b, w)\) for fattest hyperplane

\[ g_{SVM}(x) = \text{sign} (w^T \Phi(x) + b) \]

Are We Done Yet?

\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]

  • \(N\) variables, \(N+1\) simple constraints: no dependence on \(\tilde{d}\)?
  • \(q_{n,m} = y_n y_m z_n^T z_m\): inner product in \(\mathbb{R}^{\tilde{d}}\)
    • \(O(\tilde{d})\) via naive computation!

no dependence only if avoiding naive computation! (next lecture :))

Conclusions

  • Motivation of Dual SVM
    • want to remove dependence \(\tilde{d}\)
  • Lagrange Dual SVM
    • KKT conditions link primal/dual
  • Solving Dual SVM
    • another QP, better solved with special solver
  • Messages behind Dual SVM
    • SVs represent fattest hyperplane

Lecture 3 Kernel Support Vector Machine

Kernel Trick

Dual SVM Revisited

goal: SVM without dependence on \(\tilde{d}\)

half way done:

\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]

  • \(q_{n,m} = y_n y_m z_n^T z_m\): inner product in \(\mathbb{R}^{\tilde{d}}\)
  • need: \(z_n^T z_m = \Phi(x_n)^T \Phi(x_m)\) calculated faster than \(O(\tilde{d})\)
    • naive way: transform first, then take the inner product. With, say, 10000 dimensions, that costs about 10000 operations for the transform plus 10000 for the inner product.
    • can the two steps be combined into a computational shortcut?

Fast Inner Product for \(\Phi_2\)

  • 2nd order polynomial transform
    • \(\Phi_2(x) = (1, x_1, x_2, \dots, x_d, x_1^2, x_1x_2, \dots, x_1x_d, x_2x_1, x_2^2, \dots, x_2x_d, \dots, x_d^2)\)
    • Include both \(x_1x_2\) & \(x_2x_1\) for simplicity.

\[ \begin{split} \Phi_2(x)^T \Phi_2(x') \quad & = 1 + \sum_{i=1}^{d} x_ix_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_ix_jx_i'x_j' \\ \quad & = 1 + \sum_{i=1}^{d} x_ix_i' + \sum_{i=1}^{d} x_ix_i' \sum_{j=1}^{d} x_jx_j' \\ \quad & = 1 + x^Tx' + (x^Tx')(x^Tx') \end{split} \]

  • For \(\Phi_2\), transform + inner product can be carefully done in \(O(d)\) instead of \(O(d^2)\)
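
A quick numerical check of this identity (my own snippet; phi2 below keeps both \(x_ix_j\) and \(x_jx_i\), as noted above):

phi2 <- function(x) c(1, x, as.vector(outer(x, x)))        # (1, x_1..x_d, all x_i x_j)
k2   <- function(x, xp) 1 + sum(x * xp) + sum(x * xp)^2    # the O(d) shortcut
x  <- c(1, 2, 3); xp <- c(-1, 0.5, 2)
all.equal(sum(phi2(x) * phi2(xp)), k2(x, xp))              # TRUE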

Kernel: Transform + Inner Product

  • transform \(\Phi \iff\) kernel function: \(K_{\Phi}(x, x') \equiv \Phi(x)^T \Phi(x')\)
  • \(\Phi_2 \iff K_{\Phi_2}(x, x') = 1 + x^Tx' + (x^Tx')^2\)

  • quadratic coefficient \(q_{n,m} = y_n y_m z_n^T z_m = y_ny_m K(x_n, x_m)\)
  • optimal bias \(b\)? From SV(\(x_s\), \(y_s\))

\[ b = y_s - w^T z_s = y_s - \left(\sum_{n=1}^{N} \alpha_n y_n z_n\right)^T z_s = y_s - \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_s) \]

  • optimal hypothesis \(g_{SVM}\): for test input x,

\[ g_{SVM}(x) = \text{sign} (w^T \Phi(x) + b) = \text{sign} \left(\sum_{n=1}^{N} \alpha_n y_n K(x_n, x) + b\right) \]

No step needs to be carried out explicitly in the \(z\)-space.

Kernel Trick: plug in efficient kernel function to avoid dependence on \(\tilde{d}\)

Kernel SVM with QP

Kernel Hard-Margin SVM Algorithm

  1. \(q_{n,m} = y_n y_m K(x_n, x_m)\); \(p = -1_N\); \((A, c)\) for equality/bound constraints
  2. \(\alpha \leftarrow QP(Q_D, p, A, c)\)
  3. \(b \leftarrow y_s - \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_s)\) with SV \((x_s, y_s)\)
  4. return SVs and their \(\alpha_n\) as well as \(b\) such that for new \(x\), \(g_{SVM}(x) = \text{sign} \left(\sum_{n=1}^{N} \alpha_n y_n K(x_n, x) + b\right)\)

  • 1: time complexity \(O(N^2) \cdot\) (kernel evaluation)
  • 2: QP with \(N\) variables and \(N+1\) constraints
  • 3 and 4: time complexity \(O(\# SV) \cdot\) (kernel evaluation)

Kernel SVM: use computational shortcut to avoid \(\tilde{d}\) & predict with SV only.
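
A minimal prediction sketch for step 4 (the names X_sv, y_sv, alpha_sv, kern are mine; they are assumed to come out of the QP step above):

svm_predict <- function(x_new, X_sv, y_sv, alpha_sv, b, kern) {
  # g_SVM(x) = sign( sum over SVs of alpha_n y_n K(x_n, x) + b )
  s <- sum(sapply(seq_len(nrow(X_sv)),
                  function(n) alpha_sv[n] * y_sv[n] * kern(X_sv[n, ], x_new)))
  sign(s + b)
}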

Polynomial Kernel

General Poly-2 Kernel

\[ \begin{split} \Phi_2(x) = (1, x_1, \dots, x_d, x_1^2, x_1x_2, \dots, x_d^2) \quad & \iff K_{\Phi_2}(x, x') = 1 + x^Tx' + (x^Tx')^2 \\ \Phi_2(x) = (1, \sqrt{2}x_1, \dots, \sqrt{2} x_d, x_1^2, x_1x_2, \dots, x_d^2) \quad & \iff K_{\Phi_2}(x, x') = 1 + 2x^Tx' + (x^Tx')^2 \\ \Phi_2(x) = (1, \sqrt{2\gamma}x_1, \dots, \sqrt{2\gamma} x_d, \gamma x_1^2, \gamma x_1x_2, \dots, \gamma x_d^2) \quad & \iff K_{\Phi_2}(x, x') = 1 + 2\gamma x^Tx' + \gamma^2 (x^Tx')^2 \\ \end{split} \]

\[ K_2(x, x') = (1 + \gamma x^Tx')^2 \text{ with } \gamma > 0 \]

  • \(K_2\): somewhat ‘easier’ to calculate than \(K_{\Phi_2}\)
  • the differently scaled \(\Phi_2\)'s (\(K_{\Phi_2}\) vs. \(K_2\)): equivalent power, different inner product \(\implies\) different geometry.
    • different inner products define different distances; different distances mean different geometry, so the computed margin differs.
    • even in the same \(\mathcal{Z}\)-space, a different boundary may be obtained.
  • \(K_2\) commonly used.

Poly-2 Kernels in Action

  • \(g_{SVM}\) different, SVs different.
    • ‘hard’ to say which is better before learning.
  • change of kernel \(\iff\) change of margin definition

need selecting \(K\), just like selecting \(\Phi\).

General Polynomial Kernel

\[ \begin{split} K_2(x, x') \quad & = (\zeta + \gamma x^Tx')^2 \text{ with } \gamma > 0, \zeta \ge 0 \\ K_3(x, x') \quad & = (\zeta + \gamma x^Tx')^3 \text{ with } \gamma > 0, \zeta \ge 0 \\ K_Q(x, x') \quad & = (\zeta + \gamma x^Tx')^Q \text{ with } \gamma > 0, \zeta \ge 0 \end{split} \]

  • embeds \(\Phi_Q\) specially with parameters
  • allows computing Large-Margin polynomial classification without dependence on \(\tilde{d}\)
    • Will 10th order overfit? Probably, but the bottom line is that the large margin still helps harness the complexity.
  • SVM + Polynomial Kernel: Polynomial SVM
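
As a small helper (a sketch of mine; the parameter defaults are arbitrary), the general polynomial kernel in R:

poly_kernel <- function(x, xp, zeta = 1, gamma = 1, Q = 2) (zeta + gamma * sum(x * xp))^Q
# e.g. plug into svm_predict above: svm_predict(x_new, X_sv, y_sv, alpha_sv, b, kern = poly_kernel)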

Special Case Linear Kernel

\[ K_1(x, x') = (0 + 1 \cdot x^T x')^1 \]

  • \(K_1\): just usual inner product, called linear kernel
  • ‘even easier’: can be solved (often in primal form) efficiently.

linear first, remember? :)

Gaussian Kernel

Kernel of Infinite Dimensional Transform

Infinite dimensional \(\Phi(x)\)? Yes, if \(K(x, x')\) efficiently computable!

\[ \begin{split} \text{when } x = (x), K (x, x') \quad &= \quad \text{exp}(-(x-x')^2) \\ \quad &= \quad \text{exp}(-(x)^2) \text{exp}(-(x')^2) \text{exp}(2xx') \\ \quad &= \quad \text{exp}(-(x)^2) \text{exp}(-(x')^2) \left( \sum_{i=0}^{\infty} \frac{(2xx')^i}{i!}\right) \\ \quad &= \quad \sum_{i=0}^{\infty} \left( \text{exp}(-(x)^2) \text{exp}(-(x')^2) \sqrt{\frac{2^i}{i!}} \sqrt{\frac{2^i}{i!}} (x)^i (x')^i \right) \\ \quad &= \quad \Phi(x)^T \Phi(x') \end{split} \]

with infinite dimensional \(\Phi(x) = \text{exp}(-x^2) \cdot \left(1, \sqrt{\frac{2^1}{1!}}x, \sqrt{\frac{2^2}{2!}}x^2, \dots\right)\)

more generally, Gaussian kernel \(K(x, x') = \text{exp}(-\gamma \begin{Vmatrix} x-x' \end{Vmatrix}^2 )\) with \(\gamma > 0\)
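
In code this is a one-liner (a sketch; gamma is left as a user-chosen parameter):

gaussian_kernel <- function(x, xp, gamma = 1) exp(-gamma * sum((x - xp)^2))
# usable directly as the 'kern' argument of svm_predict above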

Hypothesis of Gaussian SVM

\[ \begin{split} g_{SVM}(x) &= \text{sign} \left(\sum_{SV} \alpha_n y_n K(x_n, x) + b\right) \\ &= \text{sign} \left(\sum_{SV} \alpha_n y_n \text{exp}(-\gamma \begin{Vmatrix} x - x_n \end{Vmatrix}^2) + b\right) \\ \end{split} \]

  • linear combination of Gaussians centered at the SVs \(x_n\)
  • also called Radial Basis Function (RBF) kernel

Gaussian SVM: find \(\alpha_n\) to combine Gaussian centered at \(x_n\) & achieve large margin in infinite-dim. space

Support Vector Mechanism

Large-Margin hyperplane + higher-order transforms with kernel trick:

  • # of hypotheses: not many
  • boundary: sophisticated
  • transformed vector \(z = \Phi(x) \implies\) efficient kernel \(K(x, x')\)
  • store optimal \(w \implies\) store a few SVs and \(\alpha_n\)

new possibility by Gaussian SVM: infinite-dimensional linear classification, with generalization ‘guarded by’ Large-Margin :)

Gaussian SVM in Action

  • large \(\gamma \implies\) sharp Gaussian \(\implies\) ‘overfit’
  • Warning: SVM can still overfit :(

Gaussian SVM: need careful selection of \(\gamma\).
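
As a hedged illustration (assuming the e1071 package, an R wrapper around LIBSVM, and synthetic data of my own): the larger \(\gamma\) is, the more the Gaussian SVM bends toward the training set.

library(e1071)
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)                     # synthetic inputs (my own toy data)
y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1, 1, -1))   # ring-shaped target concept
for (g in c(0.1, 1, 100)) {
  fit  <- svm(X, y, kernel = "radial", gamma = g, cost = 10)
  pred <- predict(fit, X)
  cat("gamma =", g, " #SV =", length(fit$index),
      " train accuracy =", mean(pred == y), "\n")
}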

Comparison of Kernels

Linear Kernel: Cons and Pros

\[ K_1(x, x') = x^T x' \]

Cons:

  • restricted - not always separable?!

Pros:

  • fast - with a special QP solver in primal
  • safe - linear first, remember
  • very explainable - \(w\) and SVs say something

Linear kernel: an important basic tool.

Polynomial Kernel: Cons and Pros

\[ K(x, x') = (\zeta + \gamma x^T x')^Q \]

Cons:

  • numerical difficulty for large \(Q\): \(|\zeta + \gamma x^Tx'| < 1 \implies K \rightarrow 0\); \(|\zeta + \gamma x^Tx'| > 1 \implies K \rightarrow\) big
  • three parameters \((\gamma, \zeta, Q)\) - more difficult to select

Pros:

  • less restricted than linear
  • strong physical control - ‘knows’ degree \(Q\)

polynomial kernel: perhaps small-\(Q\) only, i.e. when we already have a particular \(Q\) in mind.

sometimes efficiently done by linear on \(\Phi_Q(x)\)

Gaussian Kernel: Cons and Pros

\[ K(x, x') = \text{exp}(-\gamma \begin{Vmatrix} x - x' \end{Vmatrix}^2) \]

Cons:

  • mysterious - no \(w\)
  • slower than linear
  • too powerful!

Pros:

  • more powerful than linear/poly
  • bounded - less numerical difficulty than poly
  • one parameter only - easier to select than poly

Gaussian kernel: one of the most popular, but should be used with care.

Other Valid Kernels

  • kernel represents special similarity: \(\Phi(x)^T \Phi(x')\)
  • any similarity \(\implies\) valid kernel? not really
  • necessary and sufficient conditions for valid kernel: Mercer’s condition
    • symmetric
    • let \(k_{ij} = K(x_i, x_j)\); the Gram matrix \(K\) (which equals \(ZZ^T\)) must always be positive semi-definite

define your own kernel: possible, but hard.
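
A small numerical check of the second condition (my own sketch, reusing gaussian_kernel from above; a negative eigenvalue beyond tolerance would mean the ‘kernel’ is not valid):

is_valid_gram <- function(K, tol = 1e-10) {           # Mercer check on a finite sample
  isSymmetric(K) &&
    all(eigen(K, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}
X <- matrix(rnorm(20), ncol = 2)
K <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
           Vectorize(function(i, j) gaussian_kernel(X[i, ], X[j, ])))
is_valid_gram(K)                                      # TRUE for the Gaussian kernel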

Conclusions

  • Kernel Trick
    • kernel as shortcut of transform + inner product
  • Polynomial Kernel
    • embeds specially-scaled polynomial transform
  • Gaussian Kernel
    • embeds infinite dimensional transform
  • Comparison of Kernels
    • linear for efficiency or Gaussian for power

Lecture 4 Soft-Margin Support Vector Machine

Motivation and Primal Problem

Cons of Hard-Margin SVM

recall: SVM can still overfit :(

  • part of the reason: \(\Phi\)
    • \(\Phi_1\) looks reasonable, although it makes some mistakes.
    • \(\Phi_4\) separates the data perfectly, but looks weird and unreasonable.
  • the other part: insisting on being separable

If we always insist on separating the data (\(\implies\) shattering), we have the power to overfit to the noise.

Give Up on Some Examples

want: give up on some noisy examples

  • pocket: \(\underset{b,w}{\text{min}} \sum_{n=1}^{N} [y_n \ne \text{sign} (w^Tz_n + b)]\)
  • Hard-Margin SVM: \(\underset{b,w}{\text{min}} \frac{1}{2} w^T w\) s.t. \(y_n (w^T z_n + b) \ge 1\) for all \(n\)

Combination:

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [y_n \ne \text{sign} (w^Tz_n + b)] \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 \text{ for correct } n \\ & y_n (w^T z_n + b) \ge -\infty \text{ for incorrect } n \end{split} \]

\(C\): trade-off of large margin & noise tolerance

Soft-Margin SVM (1/2)

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [ y_n \ne \text{sign} (w^Tz_n + b) ] \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \infty \cdot [ y_n \ne \text{sign} (w^Tz_n + b) ] \end{split} \]

  • the Boolean error count \([\cdot]\): non-linear, so not a QP anymore :(
    • what about dual? kernel?
  • cannot distinguish a small error (slightly away from the fat boundary) from a large error (far away from the fat boundary)
  • record ‘margin violation’ by \(\xi_n\) - linear constraints
  • penalize with margin violation instead of error count - quadratic objective

\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]

Soft-Margin SVM (2/2)

  • record ‘margin violation’ by \(\xi_n\)
  • penalize with margin violation instead of error count

\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]

  • parameter \(C\): trade-off of Large-Margin and margin violation
    • large \(C\): want less margin violation
    • small \(C\): want large margin
  • QP of \(\tilde{d} + 1 + N\) variables, \(2N\) constraints
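
A side note of mine (not on the slide): for fixed \((b, w)\), the optimal slack is just the amount of violation \(\xi_n = \max(0,\ 1 - y_n(w^Tz_n + b))\), which is easy to compute:

# optimal slack xi_n for given (b, w): how far each example violates the fat boundary
slack <- function(Z, y, w, b) pmax(0, 1 - y * (as.matrix(Z) %*% w + b))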

next: remove dependence on \(\tilde{d}\) by Soft-Margin SVM primal \(\implies\) dual?

Dual Problem

Lagrange Dual

Primal:

\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]

Lagrange function with Lagrange multipliers \(\alpha_n\) and \(\beta_n\)

\[ \begin{split} \mathcal{L}(b, w, \xi, \alpha, \beta) = & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ & + \sum_{n=1}^{N} \alpha_n \cdot (1 - \xi_n - y_n (w^T z_n + b) ) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) \end{split} \]

want: Lagrange dual

\[ \underset{\alpha_n \ge 0, \beta_n \ge 0}{\text{max}} ( \underset{b, w, \xi}{\text{min}} \mathcal{L}(b, w, \xi, \alpha, \beta) ) \]

Simplify \(\xi_n\) and \(\beta_n\)

\[ \begin{split} \underset{\alpha_n \ge 0, \beta_n \ge 0}{\text{max}} ( \underset{b, w, \xi}{\text{min}} & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ & + \sum_{n=1}^{N} \alpha_n \cdot (1 - \xi_n - y_n (w^T z_n + b) ) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) ) \end{split} \]

  • \(\frac{\partial \mathcal{L}}{\partial \xi_n} = 0 = C - \alpha_n - \beta_n\)
  • no loss of optimality if solving with implicit constraint \(\beta_n = C - \alpha_n\) and explicit constraint \(0 \le \alpha_n \le C\)

Messages behind Soft-Margin SVM

Model Selection

Conclusions

  • Motivation and Primal Problem
  • Dual Problem
  • Messages behind Soft-Margin SVM
  • Model Selection