library(knitr)
opts_chunk$set(echo = TRUE, cache = TRUE)
library(dplyr)

Lecture 1 Linear Support Vector Machine

Course Introduction

Course Design

  • From foundations to techniques

Large-Margin Separating Hyperplane

Linear Classification Revisited

  • PLA/pocket: \(h(x) = \text{sign}(s)\)
    • plausible err = 0/1 (small flipping noise); minimized specially by PLA/pocket

linear (hyperplane) classifiers:

\[ h(x) = \text{sign} (w^T x) \]

Which Line Is Best

  • PLA? depends on randomness
  • VC bound? whichever you like!
    • \(E_{out}(w) \le \underbrace{E_{in}(w)}_{0} + \underbrace{\Omega(\mathcal{H})}_{d_{VC} = d + 1}\)
  • You? rightmost? Possibly?

Why Rightmost Hyperplane?

  • informal argument
    • if (Gaussian-like) noise on a future \(x \approx x_n\) (e.g. because of measurement error):
    • leftmost: a tiny amount of noise will make \(x\) cross the decision boundary.
    • difference between the left and right figures: how much measurement error can be tolerated.
  • \(x_n\) further from hyperplane \(\iff\) tolerate more noise
    • In machine learning foundations, we know noise is the cause of overfitting.
    • Then if the line can tolerate more noise, the line is better.
  • \(\iff\) more robust to overfitting.
  • distance to closest \(x_n\) \(\iff\) amount of noise tolerance \(\iff\) robustness of hyperplane

  • rightmost one: more robust because of larger distance to closest \(x_n\).

Fat Hyperplane

  • robust separating hyperplane: fat
    • far from both sides of examples
  • robustness \(\equiv\) fatness: distance to closest \(x_n\)

goal: find fattest separating hyperplane.

Large-Margin Separating Hyperplane

  • \(\underset{w}{\text{max}}\) fatness(w)
  • subject to:
    • \(w\) classifies every \((x_n, y_n)\) correctly
    • fatness(w) = \(\underset{n = 1, \dots, N}{\text{min}} \text{distance}(x_n, w)\)
  • fatness: formally called margin
  • correctness: \(y_{n} = \text{sign} (w^T x_{n})\)

So we have the following:

  • \(\underset{w}{\text{max}}\) fatness(w)
  • subject to:
    • every \(y_{n} w^T x_{n} > 0\)
    • margin(w) = \(\underset{n = 1, \dots, N}{\text{min}} \text{distance}(x_n, w)\)

goal: find Large-Margin separating hyperplane.

Standard Large-Margin Problem

Distance to Hyperplane: Preliminary

‘shorten’ \(x\) and \(w\)

  • distance treats \(w_{0}\) and \((w_1, \dots, w_d)\) differently (to be derived)
  • \(b = w_0\); \(w = [w_1, \dots, w_d]\); \(x = [x_1, \dots, x_d]\)

Distance to Hyperplane

want: distance \((x, b, w)\), with hyperplane \(w^T x' + b = 0\)

consider \(x', x''\) on hyperplane:

  1. \(w^Tx' = -b\), \(w^Tx'' = -b\)
  2. \(w \perp\) hyperplane \[ w^T (x'' - x') = 0 \]
  3. distance = length of the projection of \((x - x')\) onto the direction \(\perp\) to the hyperplane (i.e. onto \(w\)).

\[ \text{distance}(x, b, w) = \begin{vmatrix} \frac{w^T}{ \begin{Vmatrix} w \end{Vmatrix} } (x - x') \end{vmatrix} = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } \begin{vmatrix} w^T x + b \end{vmatrix} \]
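
As a quick sanity check (my own snippet, not from the lecture), the distance formula is a one-liner in R:

# point-to-hyperplane distance |w^T x + b| / ||w||, as derived above
dist_to_hyperplane <- function(x, b, w) abs(sum(w * x) + b) / sqrt(sum(w^2))
dist_to_hyperplane(x = c(2, 0), b = -1, w = c(1, -1))   # 1/sqrt(2), the margin of the toy problem below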

Distance to Separating Hyperplane

\[ \text{distance}(x, b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } \begin{vmatrix} w^T x + b \end{vmatrix} \]

  • separating hyperplane: for every \(n\)

\[ y_{n}(w^T x_n + b) > 0 \]

  • distance to separating hyperplane:

\[ \text{distance}(x_n, b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } y_n(w^T x_n + b) \]

  • In summary:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \text{margin}(b, w) \\ \text{subject to} &\quad \text{every } y_{n}(w^T x_n + b) > 0 \\ &\quad \text{margin}(b, w) = \underset{n=1, \dots, N}{\text{min}}\frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } y_n (w^T x_n + b) \end{split} \]

Margin of Special Separating Hyperplane

Starting from the problem above, notice the following two points:

  • \(w^Tx + b = 0\) same as \(3w^Tx + 3b = 0\): scaling does not matter
  • special scaling: only consider separating \((b, w)\) such that

\[ \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \implies \text{margin}(b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \]

  • We then have a new formulation:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad \text{every } y_{n}(w^T x_n + b) > 0 \\ &\quad \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \end{split} \]

  • We can simplify further, since the second condition is stronger than the first:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \end{split} \]

Standard Large-Margin Hyperplane Problem

  • original constraints: \(\underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1\)
    • want: optimal \((b, w)\) here (inside)
  • necessary constraints: \(y_n (w^T x_n + b) \ge 1\) for all \(n\)

  • if optimal \((b, w)\) outside, e.g. \(y_n (w^T x_n + b) > 1.126\) for all \(n\)
    • can scale \((b, w)\) to “more optimal” \((\frac{b}{1.126}, \frac{w}{1.126})\) (contradiction!)

So we can work with the relaxed constraints:

\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

final change: \(\text{max} \implies \text{min}\), remove the square root (\(\begin{Vmatrix} w \end{Vmatrix} = \sqrt{w^Tw}\)), and add \(\frac{1}{2}\) for convenience

\[ \begin{split} \underset{b,w}{\text{min}} &\quad \frac{1}{2} w^T w \\ \text{subject to} &\quad y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

Support Vector Machine

Solving a Particular Standard Problem

  • \(\underset{b,w}{\text{min}} \frac{1}{2} w^T w\)
  • subject to:
    • \(y_n (w^T x_n + b) \ge 1\) for all \(n\)
# toy data: two negative examples (crosses) and two positive examples (circles)
data <- data.frame(x0 = c(0, 2, 2, 3), x1 = c(0, 2, 0, 0), y = c(-1, -1, 1, 1))
plot(data$x0, data$x1, type = "n", xlim = c(-1, 4), ylim = c(-1, 4), asp = 1, xlab = "x_1", ylab = "x_2")
points(data$x0[1:2], data$x1[1:2], pch = 4, col = "red")   # y = -1
points(data$x0[3:4], data$x1[3:4], pch = 1, col = "blue")  # y = +1
abline(a = -1, b = 1)            # optimal hyperplane x_1 - x_2 - 1 = 0
abline(a = 0, b = 1, lty = 2)    # fat boundary through the negative support vectors
abline(a = -2, b = 1, lty = 2)   # fat boundary through the positive support vectors

  • \(g_{SVM} = \text{sign}(x_1 - x_2 - 1)\)

Support Vector Machine (SVM)

  • optimal solution: \((w_1 = 1, w_2 = -1, b = -1)\)
  • \(\text{margin}(b, w) = \frac{1}{\begin{Vmatrix}w\end{Vmatrix}} = \frac{1}{\sqrt{2}}\)
  • examples on the boundary: ‘locate’ the fattest hyperplane; other examples: not needed.
  • call a boundary example a support vector (candidate)

Support vector machine (SVM): learn fattest hyperplanes (with help of support vectors)

Solving General SVM

\[ \begin{split} \underset{b,w}{\text{min}} &\quad \frac{1}{2} w^T w \\ \text{subject to } &\quad y_n (w^T x_n + b) \ge 1 \end{split} \]

  • not easy manually, of course.
  • gradient descent? not easy with constraints
  • luckily:
    • (convex) quadratic objective function of \((b, w)\)
    • linear constraints of \((b, w)\)
  • quadratic programming (QP): ‘easy’ optimization problem.

Quadratic Programming

\[ \begin{split} \text{optimal}(b, w) &=? \\ \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \\ & \text{for } n = 1, 2, \dots, N \\ \end{split} \]

  • Generic form of quadratic programming

\[ \begin{split} \text{optimal } u \leftarrow & QP(Q, p, A, c) \\ \underset{u}{\text{min}} \quad & \frac{1}{2} u^T Q u + p^T u \\ \text{subject to } \quad & a_m^T u \ge c_m, \\ & \text{for } m = 1, 2, \dots, M \\ \end{split} \]

  • objective function:
    • \(u = \begin{bmatrix} b \\ w \end{bmatrix}\)
    • \(Q = \begin{bmatrix} 0 & 0^T \\ 0 & I \end{bmatrix}\)
    • \(p = 0_{d+1}\)
  • constraints:
    • \(a_n^T = y_n[1, x_n^T]\)
    • \(c_n = 1\)
    • \(M = N\)
  • SVM with general QP solver: easy if you’ve read the manual.
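
To make the mapping concrete, here is a minimal sketch of my own (assuming the quadprog package is available) that feeds the toy problem above into a generic QP solver; the tiny value in the first diagonal entry is only there because solve.QP insists on a positive-definite \(Q\), while the bias \(b\) carries no quadratic penalty.

library(quadprog)
X <- as.matrix(data[, c("x0", "x1")])    # toy data from the plot above
y <- data$y
Dmat <- diag(c(1e-8, 1, 1))              # Q: (almost) no penalty on b, identity on w
dvec <- rep(0, 3)                        # p = 0_{d+1}
Amat <- t(y * cbind(1, X))               # column n holds a_n = y_n [1, x_n^T]
bvec <- rep(1, nrow(X))                  # c_n = 1
u <- solve.QP(Dmat, dvec, Amat, bvec)$solution
round(u, 3)                              # expect (b, w_1, w_2) close to (-1, 1, -1)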

SVM with QP Solver

  • Linear Hard-Margin SVM Algorithm:
    1. \(Q, p, a^T_n, c_n\) as above
    2. \(\begin{bmatrix} b \\ w \end{bmatrix} \leftarrow \text{QP}(Q, p, A, c)\)
    3. return \((b, w)\) as \(g_{SVM}\)
  • Hard-Margin: nothing violates the ‘fat boundary’
  • linear: hyperplane in the original \(x_n\) space
  • want non-linear? \(z_n = \Phi(x_n)\) - remember?

Reasons Behind Large-Margin Hyperplane

Why Large-Margin Hyperplane

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

  • regularization: minimize \(E_{in}\), subject to the constraint \(w^Tw \le C\)
  • SVM: minimize \(w^Tw\), subject to the constraint \(E_{in} = 0\) [and more: \(y_n(w^Tx_n+b) \ge 1\)]

SVM (Large-Margin hyperplane): ‘weight-decay regularization’ within \(E_{in} = 0\)

Large-Margin Restricts Dichotomies

consider a ‘Large-Margin algorithm’ \(\mathcal{A}_{\rho}\): it either returns \(g\) with margin(\(g\)) \(\ge \rho\) (if such \(g\) exists), or returns nothing otherwise.

  • \(\mathcal{A}_0\): like PLA \(\implies\) shatter ‘general’ 3 inputs
  • \(\mathcal{A}_{1.126}\): more strict than SVM \(\implies\) cannot shatter any 3 inputs

fewer dichotomies \(\implies\) smaller VC dim \(\implies\) better generalization.

VC Dimension of Large-Margin Algorithm

fewer dichotomies \(\implies\) smaller ‘VC dim.’

  • considers \(d_{VC}(\mathcal{A}_{\rho})\) [data dependent, need more than VC]
  • instead of \(d_{VC}(\mathcal{H})\) [data independent, covered VC]

  • \(d_{VC}(\mathcal{A}_{\rho})\) when \(\mathcal{X}\) = unit circle in \(\mathbb{R}^2\), where \(\mathcal{X}\) is the input space.
  • \(\rho = 0\): just perceptrons (\(d_{VC} = 3\))
  • \(\rho \ge \frac{\sqrt{3}}{2}\): cannot shatter any 3 inputs (\(d_{VC} < 3\))
    • some inputs must be of distance \(\le \sqrt{3}\)

generally, when \(\mathcal{X}\) in radius-R hyperball:

\[ d_{VC}(\mathcal{A}_{\rho}) \le \text{min} (\frac{R^2}{\rho^2}, d) + 1 \le \underbrace{d+1}_{d_{VC}(\text{perceptron})} \]

Benefits of Large-Margin Hyperplanes

  • Large-Margin Hyperplanes: even fewer hypotheses, simple boundary
  • Hyperplanes: not many hypotheses, simple boundary
  • Hyperplanes + feature transform \(\Phi\): many hypotheses, sophisticated boundary
  • not many: good for \(d_{VC}\) and generalization
  • sophisticated: good for possibly better \(E_{in}\)

  • a new possibility: non-linear SVM

Large-Margin Hyperplanes + numerous feature transforms \(\Phi\): not many hypotheses, sophisticated boundary

Conclusions

  • Large-Margin Separating Hyperplane
    • intuitively more robust against noise
  • Standard Large-Margin Problem
    • minimize ‘length of w’ at special separating scale
  • Support Vector Machine
    • ‘easy’ via quadratic programming
  • Reasons Behind Large-Margin Hyperplane
    • fewer dichotomies and better generalization

Lecture 2 Dual Support Vector Machine

Motivation of Dual SVM

Non-Linear Support Vector Machine Revisited

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T \underbrace{z_n}_{\Phi(x_n)} + b) \ge 1 \text{ for all } n \end{split} \]

Non-Linear Hard-Margin SVM

  1. \(Q = \begin{bmatrix} 0 & 0^T_{\tilde{d}} \\ 0_{\tilde{d}} & I_{\tilde{d}} \end{bmatrix}\); \(p = 0_{\tilde{d}+1}\); \(a_n = y_n[1, z_n^T]\); \(c_n = 1\)
  2. \(\begin{bmatrix} b \\ w \end{bmatrix} \leftarrow \text{QP}(Q, p, A, c)\)
  3. return \(b \in \mathbb{R}\) & \(w \in \mathbb{R}^{\tilde{d}}\) with \(g_{SVM}(x) = \text{sign}(w^T \Phi(x) + b)\)
  • demanded: not many (Large-Margin), but sophisticated boundary (feature transform)
  • QP with \(\tilde{d}+1\) variables and \(N\) constraints - challenging if \(\tilde{d}\) is large, or even infinite?!

Goal: SVM without dependence on \(\tilde{d}\)

TODO: SVM without \(\tilde{d}\)

  • Original SVM
    • \(\tilde{d} + 1\) variables
    • \(N\) constraints
  • ‘Equivalent’ SVM
    • \(N\) variables
    • \(N+1\) constraints

Warning: Heavy Math!!!

  • introduce some necessary math without rigor to help understand SVM deeper
  • ‘claim’ some results if details unnecessary - like how we ‘claimed’ Hoeffding

‘Equivalent’ SVM: based on some dual problem of Original SVM

Key Tool: Lagrange Multipliers

  • regularization by constrained minimization of \(E_{in}\): \(\underset{w}{\text{min}}\ E_{in}(w) \text{ s.t. } w^Tw \le C\)
  • regularization by minimizing \(E_{aug}\): \(\underset{w}{\text{min}}\ E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^T w\)
  • \(C\) equivalent to some \(\lambda \ge 0\) by checking optimality condition

\[ \nabla E_{in}(w) + \frac{2\lambda}{N} w = 0 \]

  • regularization: view \(\lambda\) as given parameter instead of \(C\), and solve ‘easily’
  • dual SVM: view the \(\lambda\)'s as unknowns given the constraints, and solve for them as variables instead.

how many \(\lambda\)’s as variables? \(N\) - one per constraint.

Starting Point: Constrained to ‘Unconstrained’

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T \underbrace{z_n}_{\Phi(x_n)} + b) \ge 1 \text{ for all } n \end{split} \]

Lagrange Function: with Lagrange Multipliers \(\alpha_n\)

\[ \mathcal{L}(b, w, \alpha) = \underbrace{\frac{1}{2} w^T w}_{\text{objective}} + \sum_{n=1}^{N} \alpha_n \underbrace{(1 - y_n(w^T z_n + b))}_{\text{constraint}} \]

Claim

\[ \text{SVM } \equiv \underset{b,w}{\text{min}} (\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)) = \underset{b,w}{\text{min}} ( \infty \text{ if violate }; \frac{1}{2} w^T w \text{ if feasible}) \]

  • any violating \((b, w)\): \(\underset{\text{all } \alpha_n \ge 0}{\text{max}} (\square + \sum_n \alpha_n (\text{some positive})) \rightarrow \infty\)
  • any feasible \((b, w)\): \(\underset{\text{all } \alpha_n \ge 0}{\text{max}} (\square + \sum_n \alpha_n (\text{all non-positive})) = \square\)

The constraints are now hidden in the \(\text{max}\).

Lagrange Dual SVM

Lagrange Dual Problem

for any fixed \(\alpha'\) with all \(\alpha_n' \ge 0\),

\[ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) \ge \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha') \]

because \(\text{max} \ge \text{any}\).

for best \(\alpha' \ge 0\) on RHS,

\[ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) \ge \underbrace{ \underset{\text{all } \alpha_n' \ge 0}{\text{max}}\ \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha') }_{\text{Lagrange dual problem}} \]

because best is one of any.

If we solve the dual problem, we get a lower bound of the primal problem.

  • Lagrange dual problem:
    • outer maximization of \(\alpha\) on lower bound of original problem

Strong Duality of Quadratic Programming

\[ \underbrace{ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) }_{\text{equiv. to original (primal) SVM}} = \underbrace{ \underset{\text{all } \alpha_n \ge 0}{\text{max}}\ \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha) }_{\text{Lagrange dual problem}} \]

  • \(\ge\): weak duality
  • \(=\): strong duality, true for QP if
    • convex primal
    • feasible primal (true if \(\Phi\)-separable)
    • linear constraints
  • called constraint qualification

there exists a primal-dual optimal solution \((b, w, \alpha)\) that is optimal for both sides.

Solving Lagrange Dual: Simplification (1/2)

\[ \underset{\text{all } \alpha_n \ge 0}{\text{max}} \left( \underset{b,w}{\text{min}} \underbrace{ \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n + b)) }_{\mathcal{L}(b, w, \alpha)} \right) \]

  • inner problem ‘Unconstrained’, at optimal:
    • \(\frac{\partial \mathcal{L}(b, w, \alpha)}{\partial b} = 0 = -\sum_{n=1}^{N} \alpha_n y_{n}\)
  • no loss of optimality if solving with constraint \(\sum_{n=1}^{N} \alpha_n y_{n} = 0\)

Then we can remove \(b\).

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0}{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n)) \right) \]

Solving Lagrange Dual: Simplification (2/2)

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0}{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n)) \right) \]

  • inner problem ‘Unconstrained’, at optimal:
    • \(\frac{\partial \mathcal{L}(b, w, \alpha)}{\partial w_i} = 0 = w_i -\sum_{n=1}^{N} \alpha_n y_{n} z_{n,i}\)
  • no loss of optimality if solving with constraint \(w = \sum_{n=1}^{N} \alpha_n y_{n} z_n\)

\[ \begin{split} & \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n - w^T w \right) \\ \iff & \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \end{split} \]

KKT Optimality Conditions

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \]

  • If primal-dual optimal \((b, w, \alpha)\)
    • primal feasible: \(y_{n} (w^T z_n + b) \ge 1\)
    • dual feasible: \(\alpha_n \ge 0\)
    • dual-inner optimal: \(\sum y_n \alpha_n = 0; w = \sum \alpha_n y_n z_n\)
    • primal-inner optimal (at the optimum, all ‘Lagrange terms’ disappear): \(\alpha_n (1 - y_n (w^T z_n + b)) = 0\), which is called complementary slackness

called Karush-Kuhn-Tucker (KKT) conditions, necessary for optimality [& sufficient here].

Will use KKT to ‘solve’ \((b, w)\) from optimal \(\alpha\).

Solving Dual SVM

Dual Formulation of Support Vector Machine

\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \]

Standard Hard-Margin SVM dual

\[ \begin{split} \underset{\alpha}{\text{min}} & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n \\ \text{subject to } & \sum_{n=1}^{N} y_n \alpha_n = 0 \\ & \alpha_n \ge 0, \text{ for } n = 1, 2, \dots, N \end{split} \]

  • (convex) QP of \(N\) variables & \(N+1\) constraints, as promised
  • How to solve? yeah! we know QP! :)

Dual SVM with QP Solver

\[ \begin{split} \text{optimal } \alpha \leftarrow & QP(Q_D, p, A, c) \\ \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha + p^T \alpha \\ \text{subject to } \quad & a_m^T \alpha \ge c_m, \\ & \text{for } m = 1, 2, \dots, M \\ \end{split} \]

\[ \begin{split} q_{n,m} \quad & = y_n y_m z_n^T z_m \\ p \quad & = -1_N \\ a_{\ge} = y \quad &; c_{\ge} = 0; \\ a_{\le} = -y \quad &; c_{\le} = 0; \\ a_n^T = n \text{-th unit direction} \quad &; c_n = 0 \end{split} \]

Note: many solvers treat equality (\(a_{\ge}, a_{\le}\)) & bound constraints (\(a_n\)) specially for numerical stability.
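
For concreteness, a sketch of my own (again assuming the quadprog package): the hard-margin dual on the toy data from Lecture 1. A small ridge is added to \(Q_D\) because it is only positive semi-definite, while solve.QP requires positive definiteness.

library(quadprog)
X <- matrix(c(0, 0,  2, 2,  2, 0,  3, 0), ncol = 2, byrow = TRUE)  # same toy data as Lecture 1
y <- c(-1, -1, 1, 1); N <- nrow(X)
QD   <- (y %o% y) * (X %*% t(X))         # q_{n,m} = y_n y_m x_n^T x_m
Dmat <- QD + diag(1e-8, N)               # small ridge: Q_D is only positive semi-definite
dvec <- rep(1, N)                        # corresponds to p = -1_N
Amat <- cbind(y, diag(N))                # 1st column: y^T alpha = 0 (equality); rest: alpha_n >= 0
bvec <- rep(0, N + 1)
alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
sv    <- which(alpha > 1e-6)             # support vectors: alpha_n > 0
w     <- colSums(alpha * y * X)          # w = sum_n alpha_n y_n x_n
b     <- y[sv[1]] - sum(w * X[sv[1], ])  # b from any support vector
c(b = b, w = w)                          # expect roughly b = -1, w = (1, -1)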

Dual SVM with Special QP Solver

  • \(q_{n,m} = y_n y_m z_n^T z_m\), often non-zero, dense.
  • if \(N = 30000\), the dense \(Q_D\) (\(N\) by \(N\), symmetric) takes > 3G RAM
  • to scale up to large \(N\), need a special solver that
    • does not store the whole \(Q_D\)
    • utilizes the special constraints properly
  • usually better to use a special solver in practice.

Optimal \((b, w)\)

KKT conditions

  • if primal-dual optimal \((b, w, \alpha)\)
    • primal feasible: \(y_{n} (w^T z_n + b) \ge 1\)
    • dual feasible: \(\alpha_n \ge 0\)
    • dual-inner optimal: \(\sum y_n \alpha_n = 0; w = \sum \alpha_n y_n z_n\)
    • primal-inner optimal (at the optimum, all ‘Lagrange terms’ disappear): \(\alpha_n (1 - y_n (w^T z_n + b)) = 0\) (complementary slackness)
  • optimal \(\alpha \implies\) optimal \(w\)? easy above!
  • optimal \(\alpha \implies\) optimal \(b\)? primal feasibility only gives a range; complementary slackness gives an equality: if some \(\alpha_n > 0\), then \(b = y_n - w^T z_n\)

complementary slackness: \(\alpha_n > 0 \implies (z_n, y_n)\) is on the fat boundary (an SV!)

Messages behind Dual SVM

Support Vectors Revisited

  • on boundary: ‘locates’ fattest hyperplane; others: not needed.
  • examples with \(\alpha_n > 0\): on boundary
  • call \(\alpha_n > 0\) examples \((z_n, y_n)\) support vectors
  • SV (positive \(\alpha_n\)) \(\subseteq\) SV candidates (on boundary)

  • only SV needed to compute \(w: w = \sum_{n=1}^{N} \alpha_n y_n z_n = \sum_{SV} \alpha_n y_n z_n\)
  • only SV needed to compute \(b: b = y_n - w^T z_n\) with any SV \((z_n, y_n)\)

SVM: learn fattest hyperplane by identifying support vectors with dual optimal solution.

Representation of Fattest Hyperplane

  • SVM: \(w_{SVM}= \sum_{n=1}^{N} \alpha_n y_n z_n\), with \(\alpha_n\) from the dual solution
  • PLA: \(w_{PLA} = \sum_{n=1}^{N} \beta_n y_n z_n\), with \(\beta_n\) = # of mistake corrections on \((z_n, y_n)\)

\(w =\) linear combination of \(y_n z_n\)

  • also true for GD/SGD-based LogReg/LinReg when \(w_0 = 0\)
  • call \(w\) ‘represented’ by data

Philosophically speaking: how do we want to represent \(w\)?

  • SVM: represent \(w\) by SVs only.

Summary: Two Forms of Hard-Margin SVM

Primal Hard-Margin SVM:

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]

  • \(\tilde{d} + 1\) variables, \(N\) constraints: suitable when \(\tilde{d} + 1\) small
  • physical meaning: locate specially-scaled \((b, w)\)

Dual Hard-Margin SVM:

\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0; \\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]

  • \(N\) variables, \(N+1\) simple constraints: suitable when \(N\) small
  • physical meaning: locate SVs \((z_n, y_n)\) & their \(\alpha_n\)

Both eventually result in optimal \((b, w)\) for fattest hyperplane

\[ g_{SVM}(x) = \text{sign} (w^T \Phi(x) + b) \]

Are We Done Yet?

\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]

  • \(N\) variables, \(N+1\) simple constraints: no dependence on \(\tilde{d}\)?
  • \(q_{n,m} = y_n y_m z_n^T z_m\): inner product in \(\mathbb{R}^{\tilde{d}}\)
    • \(O(\tilde{d})\) via naive computation!

no dependence only if avoiding naive computation! (next lecture :))

Conclusions

  • Motivation of Dual SVM
    • want to remove dependence \(\tilde{d}\)
  • Lagrange Dual SVM
    • KKT conditions link primal/dual
  • Solving Dual SVM
    • another QP, better solved with special solver
  • Messages behind Dual SVM
    • SVs represent fattest hyperplane

Lecture 3 Kernel Support Vector Machine

Kernel Trick

Dual SVM Revisited

goal: SVM without dependence on \(\tilde{d}\)

half way done:

\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]

  • \(q_{n,m} = y_n y_m z_n^T z_m\): inner product in \(\mathbb{R}^{\tilde{d}}\)
  • need: \(z_n^T z_m = \Phi(x_n)^T \Phi(x_m)\) calculated faster than \(O(\tilde{d})\)
    • naive way: transform first, then take the inner product. With, say, 10000 dimensions, that costs about 10000 operations for the transform plus 10000 for the inner product.
    • can the two steps be combined into a computational shortcut?

Fast Inner Product for \(\Phi_2\)

  • 2nd order polynomial transform
    • \(\Phi_2(x) = (1, x_1, x_2, \dots, x_d, x_1^2, x_1x_2, \dots, x_1x_d, x_2x_1, x_2^2, \dots, x_2x_d, \dots, x_d^2)\)
    • Include both \(x_1x_2\) & \(x_2x_1\) for simplicity.

\[ \begin{split} \Phi_2(x)^T \Phi_2(x') \quad & = 1 + \sum_{i=1}^{d} x_ix_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_ix_jx_i'x_j' \\ \quad & = 1 + \sum_{i=1}^{d} x_ix_i' + \sum_{i=1}^{d} x_ix_i' \sum_{j=1}^{d} x_jx_j' \\ \quad & = 1 + x^Tx' + (x^Tx')(x^Tx') \end{split} \]

  • For \(\Phi_2\), transform + inner product can be carefully done in \(O(d)\) instead of \(O(d^2)\)
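
A quick numerical check of this identity (my own snippet; phi2 below keeps both \(x_ix_j\) and \(x_jx_i\), as noted above):

phi2 <- function(x) c(1, x, as.vector(outer(x, x)))        # (1, x_1..x_d, all x_i x_j)
k2   <- function(x, xp) 1 + sum(x * xp) + sum(x * xp)^2    # the O(d) shortcut
x  <- c(1, 2, 3); xp <- c(-1, 0.5, 2)
all.equal(sum(phi2(x) * phi2(xp)), k2(x, xp))              # TRUE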

Kernel: Transform + Inner Product

  • transform \(\Phi \iff\) kernel function: \(K_{\Phi}(x, x') \equiv \Phi(x)^T \Phi(x')\)
  • \(\Phi_2 \iff K_{\Phi_2}(x, x') = 1 + x^Tx' + (x^Tx')^2\)

  • quadratic coefficient \(q_{n,m} = y_n y_m z_n^T z_m = y_ny_m K(x_n, x_m)\)
  • optimal bias \(b\)? From SV(\(x_s\), \(y_s\))

\[ b = y_s - w^T z_s = y_s - \left(\sum_{n=1}^{N} \alpha_n y_n z_n\right)^T z_s = y_s - \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_s) \]

  • optimal hypothesis \(g_{SVM}\): for test input x,

\[ g_{SVM}(x) = \text{sign} (w^T \Phi(x) + b) = \text{sign} \left(\sum_{n=1}^{N} \alpha_n y_n K(x_n, x) + b\right) \]

No step needs to be carried out explicitly in the \(z\)-space.

Kernel Trick: plug in efficient kernel function to avoid dependence on \(\tilde{d}\)

Kernel SVM with QP

Kernel Hard-Margin SVM Algorithm

  1. \(q_{n,m} = y_n y_m K(x_n, x_m)\); \(p = -1_N\); \((A, c)\) for equality/bound constraints
  2. \(\alpha \leftarrow QP(Q_D, p, A, c)\)
  3. \(b \leftarrow y_s - \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_s)\) with SV \((x_s, y_s)\)
  4. return SVs and their \(\alpha_n\) as well as \(b\) such that for new \(x\), \(g_{SVM}(x) = \text{sign} \left(\sum_{n=1}^{N} \alpha_n y_n K(x_n, x) + b\right)\)

  • 1: time complexity \(O(N^2) \cdot\) (kernel evaluation)
  • 2: QP with \(N\) variables and \(N+1\) constraints
  • 3 and 4: time complexity \(O(\# SV) \cdot\) (kernel evaluation)

Kernel SVM: use computational shortcut to avoid \(\tilde{d}\) & predict with SV only.
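
A minimal prediction sketch for step 4 (the names X_sv, y_sv, alpha_sv, kern are mine; they are assumed to come out of the QP step above):

svm_predict <- function(x_new, X_sv, y_sv, alpha_sv, b, kern) {
  # g_SVM(x) = sign( sum over SVs of alpha_n y_n K(x_n, x) + b )
  s <- sum(sapply(seq_len(nrow(X_sv)),
                  function(n) alpha_sv[n] * y_sv[n] * kern(X_sv[n, ], x_new)))
  sign(s + b)
}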

Polynomial Kernel

General Poly-2 Kernel

\[ \begin{split} \Phi_2(x) = (1, x_1, \dots, x_d, x_1^2, x_1x_2, \dots, x_d^2) \quad & \iff K_{\Phi_2}(x, x') = 1 + x^Tx' + (x^Tx')^2 \\ \Phi_2(x) = (1, \sqrt{2}x_1, \dots, \sqrt{2} x_d, x_1^2, x_1x_2, \dots, x_d^2) \quad & \iff K_{\Phi_2}(x, x') = 1 + 2x^Tx' + (x^Tx')^2 \\ \Phi_2(x) = (1, \sqrt{2\gamma}x_1, \dots, \sqrt{2\gamma} x_d, \gamma x_1^2, \gamma x_1x_2, \dots, \gamma x_d^2) \quad & \iff K_{\Phi_2}(x, x') = 1 + 2\gamma x^Tx' + \gamma^2 (x^Tx')^2 \\ \end{split} \]

\[ K_2(x, x') = (1 + \gamma x^Tx')^2 \text{ with } \gamma > 0 \]

  • \(K_2\): somewhat ‘easier’ to calculate than \(K_{\Phi_2}\)
  • the differently scaled \(\Phi_2\)'s (\(K_{\Phi_2}\) vs. \(K_2\)): equivalent power, different inner product \(\implies\) different geometry.
    • different inner products define different distances; different distances mean different geometry, so the computed margin differs.
    • even in the same \(\mathcal{Z}\)-space, a different boundary may be obtained.
  • \(K_2\) commonly used.

Poly-2 Kernels in Action

  • \(g_{SVM}\) different, SVs different.
    • ‘hard’ to say which is better before learning.
  • change of kernel \(\iff\) change of margin definition

need selecting \(K\), just like selecting \(\Phi\).

General Polynomial Kernel

\[ \begin{split} K_2(x, x') \quad & = (\zeta + \gamma x^Tx')^2 \text{ with } \gamma > 0, \zeta \ge 0 \\ K_3(x, x') \quad & = (\zeta + \gamma x^Tx')^3 \text{ with } \gamma > 0, \zeta \ge 0 \\ K_Q(x, x') \quad & = (\zeta + \gamma x^Tx')^Q \text{ with } \gamma > 0, \zeta \ge 0 \end{split} \]

  • embeds \(\Phi_Q\) specially with parameters
  • allows computing Large-Margin polynomial classification without dependence on \(\tilde{d}\)
    • Will 10th order overfit? Probably, but the bottom line is that the large margin still helps harness the complexity.
  • SVM + Polynomial Kernel: Polynomial SVM
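
As a small helper (a sketch of mine; the parameter defaults are arbitrary), the general polynomial kernel in R:

poly_kernel <- function(x, xp, zeta = 1, gamma = 1, Q = 2) (zeta + gamma * sum(x * xp))^Q
# e.g. plug into svm_predict above: svm_predict(x_new, X_sv, y_sv, alpha_sv, b, kern = poly_kernel)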

Special Case Linear Kernel

\[ K_1(x, x') = (0 + 1 \cdot x^T x')^1 \]

  • \(K_1\): just usual inner product, called linear kernel
  • ‘even easier’: can be solved (often in primal form) efficiently.

linear first, remember? :)

Gaussian Kernel

Kernel of Infinite Dimensional Transform

Infinite dimensional \(\Phi(x)\)? Yes, if \(K(x, x')\) efficiently computable!

\[ \begin{split} \text{when } x = (x), K (x, x') \quad &= \quad \text{exp}(-(x-x')^2) \\ \quad &= \quad \text{exp}(-(x)^2) \text{exp}(-(x')^2) \text{exp}(2xx') \\ \quad &= \quad \text{exp}(-(x)^2) \text{exp}(-(x')^2) \left( \sum_{i=0}^{\infty} \frac{(2xx')^i}{i!}\right) \\ \quad &= \quad \sum_{i=0}^{\infty} \left( \text{exp}(-(x)^2) \text{exp}(-(x')^2) \sqrt{\frac{2^i}{i!}} \sqrt{\frac{2^i}{i!}} (x)^i (x')^i \right) \\ \quad &= \quad \Phi(x)^T \Phi(x') \end{split} \]

with infinite dimensional \(\Phi(x) = \text{exp}(-x^2) \cdot \left(1, \sqrt{\frac{2^1}{1!}}x, \sqrt{\frac{2^2}{2!}}x^2, \dots\right)\)

more generally, Gaussian kernel \(K(x, x') = \text{exp}(-\gamma \begin{Vmatrix} x-x' \end{Vmatrix}^2 )\) with \(\gamma > 0\)
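
In code this is a one-liner (a sketch; gamma is left as a user-chosen parameter):

gaussian_kernel <- function(x, xp, gamma = 1) exp(-gamma * sum((x - xp)^2))
# usable directly as the 'kern' argument of svm_predict above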

Hypothesis of Gaussian SVM

\[ \begin{split} g_{SVM}(x) &= \text{sign} \left(\sum_{SV} \alpha_n y_n K(x_n, x) + b\right) \\ &= \text{sign} \left(\sum_{SV} \alpha_n y_n \text{exp}(-\gamma \begin{Vmatrix} x - x_n \end{Vmatrix}^2) + b\right) \\ \end{split} \]

  • linear combination of Gaussians centered at the SVs \(x_n\)
  • also called Radial Basis Function (RBF) kernel

Gaussian SVM: find \(\alpha_n\) to combine Gaussian centered at \(x_n\) & achieve large margin in infinite-dim. space

Support Vector Mechanism

Large-Margin hyperplane + higher-order transforms with kernel trick:

  • # of hypotheses: not many
  • boundary: sophisticated
  • transformed vector \(z = \Phi(x) \implies\) efficient kernel \(K(x, x')\)
  • store optimal \(w \implies\) store a few SVs and \(\alpha_n\)

new possibility by Gaussian SVM: infinite-dimensional linear classification, with generalization ‘guarded by’ Large-Margin :)

Gaussian SVM in Action

  • large \(\gamma \implies\) sharp Gaussian \(\implies\) ‘overfit’
  • Warning: SVM can still overfit :(

Gaussian SVM: need careful selection of \(\gamma\).
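
As a hedged illustration (assuming the e1071 package, an R wrapper around LIBSVM, and synthetic data of my own): the larger \(\gamma\) is, the more the Gaussian SVM bends toward the training set.

library(e1071)
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)                     # synthetic inputs (my own toy data)
y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1, 1, -1))   # ring-shaped target concept
for (g in c(0.1, 1, 100)) {
  fit  <- svm(X, y, kernel = "radial", gamma = g, cost = 10)
  pred <- predict(fit, X)
  cat("gamma =", g, " #SV =", length(fit$index),
      " train accuracy =", mean(pred == y), "\n")
}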

Comparison of Kernels

Linear Kernel: Cons and Pros

\[ K_1(x, x') = x^T x' \]

Cons:

  • restricted - not always separable?!

Pros:

  • fast - with a special QP solver in primal
  • safe - linear first, remember
  • very explainable - \(w\) and SVs say something

Linear kernel: an important basic tool.

Polynomial Kernel: Cons and Pros

\[ K(x, x') = (\zeta + \gamma x^T x')^Q \]

Cons:

  • numerical difficulty for large \(Q\): \(|\zeta + \gamma x^Tx'| < 1 \implies K \rightarrow 0\); \(|\zeta + \gamma x^Tx'| > 1 \implies K \rightarrow\) big
  • three parameters \((\gamma, \zeta, Q)\) - more difficult to select

Pros:

  • less restricted than linear
  • strong physical control - ‘knows’ degree \(Q\)

polynomial kernel: perhaps small-\(Q\) only, i.e. when we already have a particular \(Q\) in mind.

sometimes efficiently done by linear on \(\Phi_Q(x)\)

Gaussian Kernel: Cons and Pros

\[ K(x, x') = \text{exp}(-\gamma \begin{Vmatrix} x - x' \end{Vmatrix}^2) \]

Cons:

  • mysterious - no \(w\)
  • slower than linear
  • too powerful!

Pros:

  • more powerful than linear/poly
  • bounded - less numerical difficulty than poly
  • one parameter only - easier to select than poly

Gaussian kernel: one of the most popular, but should be used with care.

Other Valid Kernels

  • kernel represents special similarity: \(\Phi(x)^T \Phi(x')\)
  • any similarity \(\implies\) valid kernel? not really
  • necessary and sufficient conditions for valid kernel: Mercer’s condition
    • symmetric
    • let \(k_{ij} = K(x_i, x_j)\); the Gram matrix \(K\) (which equals \(ZZ^T\)) must always be positive semi-definite

define your own kernel: possible, but hard.
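
A small numerical check of the second condition (my own sketch, reusing gaussian_kernel from above; a negative eigenvalue beyond tolerance would mean the ‘kernel’ is not valid):

is_valid_gram <- function(K, tol = 1e-10) {           # Mercer check on a finite sample
  isSymmetric(K) &&
    all(eigen(K, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}
X <- matrix(rnorm(20), ncol = 2)
K <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
           Vectorize(function(i, j) gaussian_kernel(X[i, ], X[j, ])))
is_valid_gram(K)                                      # TRUE for the Gaussian kernel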

Conclusions

  • Kernel Trick
    • kernel as shortcut of transform + inner product
  • Polynomial Kernel
    • embeds specially-scaled polynomial transform
  • Gaussian Kernel
    • embeds infinite dimensional transform
  • Comparison of Kernels
    • linear for efficiency or Gaussian for power

Lecture 4 Soft-Margin Support Vector Machine

Motivation and Primal Problem

Cons of Hard-Margin SVM

recall: SVM can still overfit :(

  • part of the reason: \(\Phi\)
    • \(\Phi_1\) looks reasonable, although it makes some mistakes.
    • \(\Phi_4\) separates the data perfectly, but looks weird and unreasonable.
  • the other part: insisting on being separable

If we always insist on separating the data (\(\implies\) shattering), we have the power to overfit to the noise.

Give Up on Some Examples

want: give up on some noisy examples

  • pocket: \(\underset{b,w}{\text{min}} \sum_{n=1}^{N} [y_n \ne \text{sign} (w^Tz_n + b)]\)
  • Hard-Margin SVM: \(\underset{b,w}{\text{min}} \frac{1}{2} w^T w\) s.t. \(y_n (w^T z_n + b) \ge 1\) for all \(n\)

Combination:

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [y_n \ne \text{sign} (w^Tz_n + b)] \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 \text{ for correct } n \\ & y_n (w^T z_n + b) \ge -\infty \text{ for incorrect } n \end{split} \]

\(C\): trade-off of large margin & noise tolerance

Soft-Margin SVM (1/2)

\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [ y_n \ne \text{sign} (w^Tz_n + b) ] \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \infty \cdot [ y_n \ne \text{sign} (w^Tz_n + b) ] \end{split} \]

  • the Boolean error count \([\cdot]\): non-linear, so not a QP anymore :(
    • what about dual? kernel?
  • cannot distinguish a small error (slightly away from the fat boundary) from a large error (far away from the fat boundary)
  • record ‘margin violation’ by \(\xi_n\) - linear constraints
  • penalize with margin violation instead of error count - quadratic objective

\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]

Soft-Margin SVM (2/2)

  • record ‘margin violation’ by \(\xi_n\)
  • penalize with margin violation instead of error count

\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]

  • parameter \(C\): trade-off of Large-Margin and margin violation
    • large \(C\): want less margin violation
    • small \(C\): want large margin
  • QP of \(\tilde{d} + 1 + N\) variables, \(2N\) constraints
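
A side note of mine (not on the slide): for fixed \((b, w)\), the optimal slack is just the amount of violation \(\xi_n = \max(0,\ 1 - y_n(w^Tz_n + b))\), which is easy to compute:

# optimal slack xi_n for given (b, w): how far each example violates the fat boundary
slack <- function(Z, y, w, b) pmax(0, 1 - y * (as.matrix(Z) %*% w + b))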

next: remove dependence on \(\tilde{d}\) by Soft-Margin SVM primal \(\implies\) dual?

Dual Problem

Lagrange Dual

Primal:

\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]

Lagrange function with Lagrange multipliers \(\alpha_n\) and \(\beta_n\)

\[ \begin{split} \mathcal{L}(b, w, \xi, \alpha, \beta) = & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ & + \sum_{n=1}^{N} \alpha_n \cdot (1 - \xi_n - y_n (w^T z_n + b) ) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) \end{split} \]

want: Lagrange dual

\[ \underset{\alpha_n \ge 0, \beta_n \ge 0}{\text{max}} ( \underset{b, w, \xi}{\text{min}} \mathcal{L}(b, w, \xi, \alpha, \beta) ) \]

Simplify \(\xi_n\) and \(\beta_n\)

\[ \begin{split} \underset{\alpha_n \ge 0, \beta_n \ge 0}{\text{max}} ( \underset{b, w, \xi}{\text{min}} & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ & + \sum_{n=1}^{N} \alpha_n \cdot (1 - \xi_n - y_n (w^T z_n + b) ) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) ) \end{split} \]

  • \(\frac{\partial \mathcal{L}}{\partial \xi_n} = 0 = C - \alpha_n - \beta_n\)
  • no loss of optimality if solving with implicit constraint \(\beta_n = C - \alpha_n\) and explicit constraint \(0 \le \alpha_n \le C\)

Messages behind Soft-Margin SVM

Model Selection

Conclusions

  • Motivation and Primal Problem
  • Dual Problem
  • Messages behind Soft-Margin SVM
  • Model Selection