```r
library(knitr)
opts_chunk$set(echo = TRUE, cache = TRUE)
library(dplyr)
```
Course Design
Linear Classification Revisited
linear (hyperplane) classifiers:
\[ h(x) = \text{sign} (w^T x) \]
Which Line Is Best
Why Rightmost Hyperplane?
distance to closest \(x_n\) \(\iff\) amount of noise tolerance \(\iff\) robustness of hyperplane
rightmost one: more robust because of larger distance to closest \(x_n\).
Fat Hyperplane
goal: find fattest separating hyperplane.
Large-Margin Separating Hyperplane
Calling the fatness of the hyperplane its margin, we have the following:
goal: find Large-Margin separating hyperplane.
Distance to Hyperplane: Preliminary
‘shorten’ \(x\) and \(w\): separate the bias term, i.e. take \(b = w_0\) out of \(w\) and drop \(x_0 = 1\), so that \(h(x) = \text{sign}(w^T x + b)\)
Distance to Hyperplane
want: distance \((x, b, w)\), with hyperplane \(w^T x' + b = 0\)
consider \(x', x''\) on the hyperplane: \(w^T x' = -b = w^T x''\), so \(w \perp (x'' - x')\), i.e. \(w\) is normal to the hyperplane; the distance of \(x\) is the projection of \(x - x'\) onto \(w\):
\[ \text{distance}(x, b, w) = \begin{vmatrix} \frac{w^T}{ \begin{Vmatrix} w \end{Vmatrix} } (x - x') \end{vmatrix} = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } \begin{vmatrix} w^T x + b \end{vmatrix} \]
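A quick numerical companion to this formula (a small R sketch; the hyperplane \((b, w)\) below is just an illustrative choice, not taken from the text):

```r
# distance(x, b, w) = |w^T x + b| / ||w||
dist_to_hyperplane <- function(x, b, w) abs(sum(w * x) + b) / sqrt(sum(w^2))

w <- c(1, -1); b <- -1                 # illustrative hyperplane x_1 - x_2 - 1 = 0
dist_to_hyperplane(c(0, 0), b, w)      # 1/sqrt(2), roughly 0.707
```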
Distance to Separating Hyperplane
\[ \text{distance}(x, b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } \begin{vmatrix} w^T x + b \end{vmatrix} \]
for a separating hyperplane, every example lies on the correct side:
\[ y_{n}(w^T x_n + b) > 0 \text{ for all } n \]
so the distance of each \(x_n\) to a separating hyperplane simplifies to
\[ \text{distance}(x_n, b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } y_n(w^T x_n + b) \]
\[ \begin{split} \underset{b,w}{\text{max}} &\quad \text{margin}(b, w) \\ \text{subject to} &\quad \text{every } y_{n}(w^T x_n + b) > 0 \\ &\quad \text{margin}(b, w) = \underset{n=1, \dots, N}{\text{min}}\frac{1}{ \begin{Vmatrix} w \end{Vmatrix} } y_n (w^T x_n + b) \end{split} \]
Margin of Special Separating Hyperplane
Starting from the problem above, note that scaling \((b, w)\) does not change the hyperplane, so we may consider the special scaling where \(\underset{n}{\text{min}}\, y_n (w^T x_n + b) = 1\); then:
\[ \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \implies \text{margin}(b, w) = \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \]
\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad \text{every } y_{n}(w^T x_n + b) > 0 \\ &\quad \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \end{split} \]
since \(\underset{n}{\text{min}}\, y_n (w^T x_n + b) = 1\) already implies \(y_n (w^T x_n + b) > 0\) for every \(n\), the first constraint can be dropped:
\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad \underset{n=1, \dots, N}{\text{min}} y_n (w^T x_n + b) = 1 \end{split} \]
Standard Large-Margin Hyperplane Problem
relax the scaling constraint to the necessary constraints \(y_n (w^T x_n + b) \ge 1\) for all \(n\); the optimum still attains \(\underset{n}{\text{min}}\, y_n (w^T x_n + b) = 1\) (otherwise \((b, w)\) could be scaled down to enlarge the margin), so the relaxed problem is equivalent:
\[ \begin{split} \underset{b,w}{\text{max}} &\quad \frac{1}{ \begin{Vmatrix} w \end{Vmatrix}} \\ \text{subject to} &\quad y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]
final change: maximizing \(\frac{1}{\begin{Vmatrix} w \end{Vmatrix}}\) is the same as minimizing \(\frac{1}{2} w^T w\), i.e. remove the square root in \(\begin{Vmatrix} w \end{Vmatrix} = \sqrt{w^T w}\) and add the constant \(\frac{1}{2}\) for convenience
\[ \begin{split} \underset{b,w}{\text{min}} &\quad \frac{1}{2} w^T w \\ \text{subject to} &\quad y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]
Solving a Particular Standard Problem
```r
# toy data: two examples of each class; columns x0, x1 hold the two input coordinates
data <- data.frame(x0 = c(0, 2, 2, 3), x1 = c(0, 2, 0, 0), y = c(-1, -1, 1, 1))

plot(data$x0, data$x1, type = "n", xlim = c(-1, 4), ylim = c(-1, 4), asp = 1,
     xlab = "x_1", ylab = "x_2")
points(data$x0[1:2], data$x1[1:2], pch = 4, col = "red")    # y = -1 examples
points(data$x0[3:4], data$x1[3:4], pch = 1, col = "blue")   # y = +1 examples
abline(a = -1, b = 1)              # optimal hyperplane x_1 - x_2 - 1 = 0
abline(a = 0, b = 1, lty = 2)      # fat boundary at distance 1/||w|| on each side
abline(a = -2, b = 1, lty = 2)
```
Support Vector Machine (SVM)
Support vector machine (SVM): learn fattest hyperplanes (with help of support vectors)
Solving General SVM
\[ \begin{split} \underset{b,w}{\text{min}} &\quad \frac{1}{2} w^T w \\ \text{subject to } &\quad y_n (w^T x_n + b) \ge 1 \end{split} \]
Quadratic Programming
\[ \begin{split} \text{optimal}(b, w) &=? \\ \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \\ & \text{for } n = 1, 2, \dots, N \\ \end{split} \]
\[ \begin{split} \text{optimal } u \rightarrow & QP(Q, p, A, c) \\ \underset{u}{\text{min}} \quad & \frac{1}{2} u^T Q u + p^T u \\ \text{subject to } \quad & a_m^T u \ge c_m, \\ & \text{for } m = 1, 2, \dots, M \\ \end{split} \]
SVM with QP Solver
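As a concrete sketch of this step, the primal problem maps to the generic QP form with \(u = (b, w)\), \(Q = \mathrm{diag}(0, I_d)\), \(p = 0\), \(a_n^T = y_n\,[1\ x_n^T]\), \(c_n = 1\). Below is one way to feed the toy data from the previous section to the off-the-shelf solver `quadprog::solve.QP` (an assumption: the text does not prescribe a solver, and `solve.QP` needs a strictly positive-definite \(Q\), hence the tiny ridge on the \(b\) coordinate):

```r
library(quadprog)

X <- as.matrix(data[, c("x0", "x1")])     # toy inputs from the plot above
y <- data$y                               # labels in {-1, +1}

# variables u = (b, w1, w2); objective (1/2) w^T w, so Q = diag(0, 1, 1)
Dmat <- diag(c(1e-8, 1, 1))               # tiny ridge on b for positive definiteness
dvec <- rep(0, 3)                         # p = 0

# constraints y_n (w^T x_n + b) >= 1, i.e. t(Amat) %*% u >= bvec
Amat <- t(cbind(y, y * X))                # one column per example
bvec <- rep(1, nrow(X))

sol <- solve.QP(Dmat, dvec, Amat, bvec)
round(sol$solution, 3)                    # roughly (b, w1, w2) = (-1, 1, -1): the line drawn above
```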
Why Large-Margin Hyperplane
\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \text{ for all } n \end{split} \]
| | minimize | constraint |
|---|---|---|
| regularization | \(E_{in}\) | \(w^Tw \le C\) |
| SVM | \(w^Tw\) | \(E_{in} = 0\) [and more: \(y_n(w^Tx_n+b) \ge 1\)] |
SVM (Large-Margin hyperplane): ‘weight-decay regularization’ within \(E_{in} = 0\)
Large-Margin Restricts Dichotomies
consider ‘Large-Margin algorithm’ \(\mathcal{A}_{\rho}\): either returns \(g\) with margin(\(g\)) \(\ge \rho\), or 0 otherwise.
fewer dichotomies \(\implies\) smaller VC dim \(\implies\) better generalization.
VC Dimension of Large-Margin Algorithm
fewer dichotomies \(\implies\) smaller ‘VC dim.’
instead of \(d_{VC}(\mathcal{H})\) [data independent, as covered in the VC analysis], consider a data-dependent \(d_{VC}(\mathcal{A}_{\rho})\)
generally, when \(\mathcal{X}\) in radius-R hyperball:
\[ d_{VC}(\mathcal{A}_{\rho}) \le \text{min} (\frac{R^2}{\rho^2}, d) + 1 \le \underbrace{d+1}_{d_{VC}(\text{perceptron})} \]
Benefits of Large-Margin Hyperplanes
| Large-Margin Hyperplanes | Hyperplanes | Hyperplanes + feature transform \(\Phi\) | |
|---|---|---|---|
| # of hypothesis | even fewer | not many | many |
| boundary | simple | simple | sophisticated |
sophisticated: good for possibly better \(E_{in}\)
a new possibility: non-linear SVM
| Large-Margin Hyperplanes + numerous feature transform \(\Phi\) | |
|---|---|
| # | not many |
| boundary | sophisticated |
Non-Linear Support Vector Machine Revisited
\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T \underbrace{z_n}_{\Phi(x_n)} + b) \ge 1 \text{ for all } n \end{split} \]
Non-Linear Hard-Margin SVM
Goal: SVM without dependence on \(\tilde{d}\)
TODO: SVM without \(\tilde{d}\)
Warning: Heavy Math!!!
‘Equivalent’ SVM: based on some dual problem of Original SVM
Key Tool: Lagrange Multipliers
| Regularization by Constrained-Minimizing \(E_{in}\) | Regularization by Minimizing \(E_{aug}\) |
|---|---|
| \(\underset{w}{\text{min}} E_{in}(w) \text{ s.t. } w^Tw \le C\) | \(\underset{w}{\text{min}} E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^T w\) |
for regularization, \(\lambda\) is given and the optimal \(w\) satisfies
\[ \nabla E_{in}(w) + \frac{2\lambda}{N} w = 0 \]
for dual SVM, the Lagrange multipliers are instead treated as unknown variables to solve, one per constraint: how many \(\lambda\)'s as variables? \(N\).
Starting Point: Constrained to ‘Unconstrained’
\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T \underbrace{z_n}_{\Phi(x_n)} + b) \ge 1 \text{ for all } n \end{split} \]
Lagrange Function: with Lagrange Multipliers \(\alpha_n\)
\[ \mathcal{L}(b, w, \alpha) = \underbrace{\frac{1}{2} w^T w}_{\text{objective}} + \sum_{n=1}^{N} \alpha_n \underbrace{(1 - y_n(w^T z_n + b))}_{\text{constraint}} \]
Claim
\[ \text{SVM } \equiv \underset{b,w}{\text{min}} (\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)) = \underset{b,w}{\text{min}} ( \infty \text{ if violate }; \frac{1}{2} w^T w \text{ if feasible}) \]
The constraints are now hidden inside the \(\text{max}\): any violation drives the inner maximum to \(\infty\), so the outer minimization automatically avoids infeasible \((b, w)\).
Lagrange Dual Problem
for any fixed \(\alpha'\) with all \(\alpha_n' \ge 0\),
\[ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) \ge \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha') \]
because \(\text{max} \ge \text{any}\).
for best \(\alpha' \ge 0\) on RHS,
\[ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) \ge \underbrace{ \underset{\text{all } \alpha_n \ge 0}{\text{max}}\ \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha) }_{\text{Lagrange dual problem}} \]
because best is one of any.
If we solve the dual problem, we get a lower bound of the primal problem.
Strong Duality of Quadratic Programming
\[ \underbrace{ \underset{b,w}{\text{min}} \left(\underset{\text{all } \alpha_n \ge 0}{\text{max}} \mathcal{L}(b, w, \alpha)\right) }_{\text{equiv. to original (primal) SVM}} = \underbrace{ \underset{\text{all } \alpha_n \ge 0}{\text{max}}\ \underset{b,w}{\text{min}}\ \mathcal{L}(b, w, \alpha) }_{\text{Lagrange dual problem}} \]
strong duality holds for this convex QP, so there exists a primal-dual optimal solution \((b, w, \alpha)\) attaining both sides.
Solving Lagrange Dual: Simplification (1/2)
\[ \underset{\text{all } \alpha_n \ge 0}{\text{max}} \left( \underset{b,w}{\text{min}} \underbrace{ \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n + b)) }_{\mathcal{L}(b, w, \alpha)} \right) \]
at the inner minimum \(\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0\); imposing this as a constraint lets us remove \(b\) from the objective:
\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0}{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n)) \right) \]
Solving Lagrange Dual: Simplification (2/2)
\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0}{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n (1 - y_n (w^Tz_n)) \right) \]
similarly \(\nabla_w \mathcal{L} = w - \sum_{n=1}^{N} \alpha_n y_n z_n = 0\), so \(w = \sum_{n=1}^{N} \alpha_n y_n z_n\) at the optimum; substituting,
\[ \begin{split} & \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} \left( \underset{b,w}{\text{min}} \frac{1}{2} w^Tw + \sum_{n=1}^{N} \alpha_n - w^T w \right) \\ \iff & \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \end{split} \]
KKT Optimality Conditions
\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \]
called Karush-Kuhn-Tucker (KKT) conditions, necessary for optimality [& sufficient here].
Will use KKT to ‘solve’ \((b, w)\) from optimal \(\alpha\).
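Collecting the constraints derived above, the KKT conditions at a primal-dual optimal \((b, w, \alpha)\) read:

\[ \begin{split} \text{primal feasible: } \quad & y_n (w^T z_n + b) \ge 1 \\ \text{dual feasible: } \quad & \alpha_n \ge 0 \\ \text{dual-inner optimal: } \quad & \sum_{n=1}^{N} y_n \alpha_n = 0, \quad w = \sum_{n=1}^{N} \alpha_n y_n z_n \\ \text{primal-inner optimal (complementary slackness): } \quad & \alpha_n \left(1 - y_n (w^T z_n + b)\right) = 0 \end{split} \]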
Dual Formulation of Support Vector Machine
\[ \underset{\text{all } \alpha_n \ge 0, \sum y_{n} \alpha_{n} = 0, w = \sum \alpha_n y_{n} z_n }{\text{max}} -\frac{1}{2} \begin{Vmatrix} \sum_{n=1}^{N} \alpha_n y_{n} z_n \end{Vmatrix}^2 + \sum_{n=1}^{N} \alpha_n \]
Standard Hard-Margin SVM dual
\[ \begin{split} \underset{\alpha}{\text{min}} & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n \\ \text{subject to } & \sum_{n=1}^{N} y_n \alpha_n = 0 \\ & \alpha_n \ge 0, \text{ for } n = 1, 2, \dots, N \end{split} \]
Dual SVM with QP Solver
\[ \begin{split} \text{optimal } \alpha \leftarrow & QP(Q_D, p, A, c) \\ \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha + p^T \alpha \\ \text{subject to } \quad & a_m^T \alpha \ge c_m, \\ & \text{for } m = 1, 2, \dots, M \\ \end{split} \]
\[ \begin{split} q_{n,m} \quad & = y_n y_m z_n^T z_m \\ p \quad & = -1_N \\ a_{\ge} = y \quad &; c_{\ge} = 0; \\ a_{\le} = -y \quad &; c_{\le} = 0; \\ a_n^T = n \text{-th unit direction} \quad &; c_n = 0 \end{split} \]
Note: many solvers treat equality (\(a_{\ge}, a_{\le}\)) & bound constraints (\(a_n\)) specially for numerical stability.
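A minimal sketch of this mapping with `quadprog::solve.QP` (again an assumed choice of solver; the first column of \(A\) is treated as the equality constraint via `meq = 1`, the remaining columns as the bound constraints \(\alpha_n \ge 0\)), using the linear kernel \(z_n = x_n\) on the earlier toy data:

```r
library(quadprog)

X <- as.matrix(data[, c("x0", "x1")])
y <- data$y
N <- nrow(X)

QD   <- outer(y, y) * (X %*% t(X))         # q_{n,m} = y_n y_m x_n^T x_m
Dmat <- QD + diag(1e-8, N)                 # ridge: Q_D is only positive semi-definite
dvec <- rep(1, N)                          # p = -1_N (solve.QP minimizes 1/2 a'Da - d'a)
Amat <- cbind(y, diag(N))                  # equality y^T alpha = 0, then alpha_n >= 0
bvec <- rep(0, N + 1)

alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution

w  <- as.vector(t(X) %*% (alpha * y))      # w = sum_n alpha_n y_n x_n
sv <- which(alpha > 1e-6)                  # support vectors: positive alpha_n
b  <- y[sv[1]] - sum(w * X[sv[1], ])       # b = y_s - w^T x_s for any SV
round(c(alpha, w, b), 3)                   # roughly alpha = (0.5, 0.5, 1, 0), w = (1, -1), b = -1
```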
Dual SVM with Special QP Solver
optimal \(w = \sum_{n=1}^{N} \alpha_n y_n z_n\) follows from the KKT conditions
complementary slackness \(\alpha_n (1 - y_n(w^T z_n + b)) = 0\) gives optimal \(b = y_n - w^T z_n\) for any \(\alpha_n > 0\)
moreover, \(\alpha_n > 0 \implies (z_n, y_n)\) is on the fat boundary (an SV!)
Support Vectors Revisited
SV (positive \(\alpha_n\)) \(\subseteq\) SV candidates (on boundary)
only SV needed to compute \(b: b = y_n - w^T z_n\) with any SV \((z_n, y_n)\)
SVM: learn fattest hyperplane by identifying support vectors with dual optimal solution.
Representation of Fattest Hyperplane
| SVM | PLA |
|---|---|
| \(w_{SVM}= \sum_{n=1}^{N} \alpha_n y_n z_n\) | \(w_{PLA} = \sum_{n=1}^{N} \beta_n y_n z_n\) |
| \(\alpha_n\) from dual solution | \(\beta_n\) by # mistake corrections |
\(w =\) linear combination of \(y_n z_n\)
Philosophically speaking: how do we want to represent \(w\)?
Summary: Two Forms of Hard-Margin SVM
| Primal Hard-Margin SVM | Dual Hard-Margin SVM |
|---|---|
| \(\begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w \\ \text{subject to } \quad & y_n (w^T x_n + b) \ge 1 \\ & \text{ for all } n \end{split}\) | \(\begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split}\) |
| \(\tilde{d} + 1\) variables | \(N\) variables |
| \(N\) constraints | \(N+1\) simple constraints |
| suitable when \(\tilde{d} + 1\) small | suitable when \(N\) small |
| physical meaning: locate specially-scaled \((b, w)\) | physical meaning: locate SVs \((z_n, y_n)\) & their \(\alpha_n\) |
Both eventually result in optimal \((b, w)\) for fattest hyperplane
\[ g_{SVM}(x) = \text{sign} (w^T \Phi(x) + b) \]
Are We Done Yet?
\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]
no dependence on \(\tilde{d}\)? not quite: computing \(q_{n,m} = y_n y_m z_n^T z_m\) naively still costs \(O(\tilde{d})\) per inner product; no dependence only if avoiding this naive computation! (next lecture :))
Dual SVM Revisited
goal: SVM without dependence on \(\tilde{d}\)
half way done:
\[ \begin{split} \underset{\alpha}{\text{min}} \quad & \frac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \\ \text{subject to } \quad & y^T \alpha = 0 ;\\ & \alpha_n \ge 0 \text{ for } n = 1, \dots, N \end{split} \]
Fast Inner Product for \(\Phi_2\)
\[ \begin{split} \Phi_2(x)^T \Phi_2(x') \quad & = 1 + \sum_{i=1}^{d} x_ix_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_ix_jx_i'x_j' \\ \quad & = 1 + \sum_{i=1}^{d} x_ix_i' + \sum_{i=1}^{d} x_ix_i' \sum_{j=1}^{d} x_jx_j' \\ \quad & = 1 + x^Tx' + (x^Tx')(x^Tx') \end{split} \]
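A quick numerical check of this identity (a sketch; \(\Phi_2\) lists all ordered pairs \(x_i x_j\), matching the double sum above):

```r
phi2 <- function(x) c(1, x, as.vector(outer(x, x)))        # (1, x_i, x_i x_j)
k2   <- function(x, xp) 1 + sum(x * xp) + sum(x * xp)^2    # O(d) instead of O(d^2)

x  <- c(1, 2, 3); xp <- c(-1, 0, 2)
all.equal(sum(phi2(x) * phi2(xp)), k2(x, xp))              # TRUE
```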
Kernel: Transform + Inner Product
\(\Phi_2 \iff K_{\Phi_2}(x, x') = 1 + x^Tx' + (x^Tx')^2\)
optimal bias \(b\)? From SV(\(x_s\), \(y_s\))
\[ b = y_s - w^T z_s = y_s - \left(\sum_{n=1}^{N} \alpha_n y_n z_n\right)^T z_s = y_s - \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_s) \]
\[ g_{SVM}(x) = \text{sign} (w^T \Phi(x) + b) = \text{sign} \left(\sum_{n=1}^{N} \alpha_n y_n K(x_n, x) + b\right) \]
Nothing ever needs to be computed explicitly in the \(z\)-space.
Kernel Trick: plug in efficient kernel function to avoid dependence on \(\tilde{d}\)
Kernel SVM with QP
| Kernel Hard-Margin SVM Algorithm | |
|---|---|
| 1. \(q_{n,m} = y_n y_m K(x_n, x_m)\); \(p = -1_N\); \((A, c)\) for equ./bound constraints | |
| 2. \(\alpha \leftarrow QP(Q_D, p, A, c)\) | |
| 3. \(b \leftarrow y_s - \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_s)\) with SV \((x_s, y_s)\) | |
| 4. return SVs and their \(\alpha_n\) as well as \(b\) such that for new \(x\), \(g_{SVM}(x) = \text{sign} (\sum_{n=1}^{N} \alpha_n y_n K(x_n, x) + b)\) | |
Kernel SVM: use computational shortcut to avoid \(\tilde{d}\) & predict with SV only.
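Putting the four steps together as an R sketch (still leaning on `quadprog::solve.QP`, an assumed stand-in for a special-purpose QP solver; the returned closure is \(g_{SVM}\) and touches the SVs only):

```r
library(quadprog)

kernel_svm <- function(X, y, K) {
  N    <- nrow(X)
  Kmat <- outer(seq_len(N), seq_len(N),
                Vectorize(function(n, m) K(X[n, ], X[m, ])))
  QD   <- outer(y, y) * Kmat                        # step 1: q_{n,m} = y_n y_m K(x_n, x_m)
  alpha <- solve.QP(QD + diag(1e-8, N), rep(1, N),  # step 2: alpha <- QP(Q_D, p, A, c)
                    cbind(y, diag(N)), rep(0, N + 1), meq = 1)$solution
  sv <- which(alpha > 1e-6)                         # support vectors
  s  <- sv[1]
  b  <- y[s] - sum(alpha[sv] * y[sv] *              # step 3: b from any SV
                   sapply(sv, function(n) K(X[n, ], X[s, ])))
  function(x) sign(sum(alpha[sv] * y[sv] *          # step 4: g_SVM(x), SVs only
                       sapply(sv, function(n) K(X[n, ], x))) + b)
}

K2 <- function(x, xp) (1 + sum(x * xp))^2           # poly-2 kernel with zeta = gamma = 1
g  <- kernel_svm(as.matrix(data[, c("x0", "x1")]), data$y, K2)
g(c(3, 1))                                          # predicted label of a new point
```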
General Poly-2 Kernel
\[ \begin{split} \Phi_2(x) = (1, x_1, \dots, x_d, x_1 x_1, x_1 x_2, \dots, x_d x_d) \quad & \iff K_{\Phi_2}(x, x') = 1 + x^Tx' + (x^Tx')^2 \\ \Phi_2(x) = (1, \sqrt{2}x_1, \dots, \sqrt{2} x_d, x_1 x_1, \dots, x_d x_d) \quad & \iff K_{\Phi_2}(x, x') = 1 + 2x^Tx' + (x^Tx')^2 \\ \Phi_2(x) = (1, \sqrt{2\gamma}x_1, \dots, \sqrt{2\gamma} x_d, \gamma x_1 x_1, \dots, \gamma x_d x_d) \quad & \iff K_{\Phi_2}(x, x') = 1 + 2\gamma x^Tx' + \gamma^2 (x^Tx')^2 \\ \end{split} \]
\[ K_2(x, x') = (1 + \gamma x^Tx')^2 \text{ with } \gamma > 0 \]
Poly-2 Kernels in Action
need to select \(K\), just like selecting \(\Phi\).
General Polynomial Kernel
\[ \begin{split} K_2(x, x') \quad & = (\zeta + \gamma x^Tx')^2 \text{ with } \gamma > 0, \zeta \ge 0 \\ K_3(x, x') \quad & = (\zeta + \gamma x^Tx')^3 \text{ with } \gamma > 0, \zeta \ge 0 \\ K_Q(x, x') \quad & = (\zeta + \gamma x^Tx')^Q \text{ with } \gamma > 0, \zeta \ge 0 \end{split} \]
Special Case Linear Kernel
\[ K_1(x, x') = (0 + 1 \cdot x^T x')^1 \]
linear first, remember? :)
Kernel of Infinite Dimensional Transform
Infinite dimensional \(\Phi(x)\)? Yes, if \(K(x, x')\) efficiently computable!
\[ \begin{split} \text{when } x = (x), K (x, x') \quad &= \quad \text{exp}(-(x-x')^2) \\ \quad &= \quad \text{exp}(-(x)^2) \text{exp}(-(x')^2) \text{exp}(2xx') \\ \quad &= \quad \text{exp}(-(x)^2) \text{exp}(-(x')^2) \left( \sum_{i=0}^{\infty} \frac{(2xx')^i}{i!}\right) \\ \quad &= \quad \sum_{i=0}^{\infty} \left( \text{exp}(-(x)^2) \text{exp}(-(x')^2) \sqrt{\frac{2^i}{i!}} \sqrt{\frac{2^i}{i!}} (x)^i (x')^i \right) \\ \quad &= \quad \Phi(x)^T \Phi(x') \end{split} \]
with infinite dimensional \(\Phi(x) = \text{exp}(-x^2) \cdot \left(1, \sqrt{\frac{2^1}{1!}}x, \sqrt{\frac{2^2}{2!}}x^2, \dots\right)\)
more generally, Gaussian kernel \(K(x, x') = \text{exp}(-\gamma \begin{Vmatrix} x-x' \end{Vmatrix}^2 )\) with \(\gamma > 0\)
Hypothesis of Gaussian SVM
\[ \begin{split} g_{SVM}(x) &= \text{sign} \left(\sum_{SV} \alpha_n y_n K(x_n, x) + b\right) \\ &= \text{sign} \left(\sum_{SV} \alpha_n y_n \text{exp}(-\gamma \begin{Vmatrix} x - x_n \end{Vmatrix}^2) + b\right) \\ \end{split} \]
Gaussian SVM: find \(\alpha_n\) to combine Gaussian centered at \(x_n\) & achieve large margin in infinite-dim. space
Support Vector Mechanism
| Large-Margin hyperplane + higher-order transforms with kernel trick | |
|---|---|
| # | not many |
| boundary | sophisticated |
new possibility by Gaussian SVM: infinite-dimensional linear classification, with generalization ‘guarded by’ Large-Margin :)
Gaussian SVM in Action
Gaussian SVM: need careful selection of \(\gamma\).
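For a hedged illustration of that selection, the sketch below uses the `e1071` package (an assumption: it wraps libsvm's soft-margin solver, and a very large `cost` approximates the hard-margin setting discussed so far) and varies \(\gamma\) on synthetic data with a circular boundary:

```r
library(e1071)

set.seed(1)
X <- matrix(runif(200, -1, 1), ncol = 2)
y <- factor(sign(X[, 1]^2 + X[, 2]^2 - 0.5))   # circular target boundary

for (gamma in c(1, 10, 1000)) {
  fit <- svm(X, y, kernel = "radial", gamma = gamma, cost = 1e5, scale = FALSE)
  cat("gamma =", gamma, ": #SV =", nrow(fit$SV),
      ", E_in =", mean(predict(fit, X) != y), "\n")
}
# small gamma: wide Gaussians, smooth boundary;
# huge gamma: sharp Gaussians centered at (almost) every point, E_in = 0 but likely overfit
```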
Linear Kernel: Cons and Pros
\[ K(x, x') = x^T x' \]
| Cons | Pros |
|---|---|
| restricted - not always separable?! | fast - with special QP solver in primal |
| | safe - linear first, remember? |
| | very explainable - \(w\) and SVs say something |
Linear kernel: an important basic tool.
Polynomial Kernel: Cons and Pros
\[ K(x, x') = (\zeta + \gamma x^T x')^Q \]
| Cons | Pros |
|---|---|
| numerical difficulty for large \(Q\): | less restricted than linear |
| \(|\zeta + \gamma x^Tx'| < 1: K \rightarrow 0\) | strong physical control - ‘knows’ degree Q |
| \(|\zeta + \gamma x^Tx'| > 1: K \rightarrow big\) | |
| three parameters \((\gamma, \zeta, Q)\) - more difficult to select | |
polynomial kernel: perhaps for small \(Q\) only, when we already have a specific \(Q\) in mind.
sometimes efficiently done by linear on \(\Phi_Q(x)\)
Gaussian Kernel: Cons and Pros
\[ K(x, x') = \text{exp}(-\gamma \begin{Vmatrix} x - x' \end{Vmatrix}^2) \]
| Cons | Pros |
|---|---|
| mysterious - no \(w\) | more powerful than linear/poly |
| slower than linear | bounded - less numerical difficulty than poly |
| too powerful! | one parameter only - easier to select than poly |
Gaussian kernel: one of the most popular kernels, but should be used with care.
Other Valid Kernels
define your own kernel: possible (a valid \(K\) must be symmetric and yield a positive semi-definite kernel matrix, i.e. Mercer's condition), but hard.
Cons of Hard-Margin SVM
recall: SVM can still overfit :(
If we always insist on separating the data (which means the hypothesis set can shatter it), we keep the power to overfit to noise.
Give Up on Some Examples
want: give up on some noisy examples
| Pocket | Hard-Margin SVM |
|---|---|
| \(\underset{b,w}{\text{min}} \sum_{n=1}^{N} [y_n \ne \text{sign} (w^Tz_n + b)]\) | \(\underset{b,w}{\text{min}} \frac{1}{2} w^T w\) |
| | s.t. \(y_n (w^T z_n + b) \ge 1\) for all \(n\) |
Combination:
\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [y_n \ne \text{sign} (w^Tz_n + b)] \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 \text{ for correct } n \\ & y_n (w^T z_n + b) \ge -\infty \text{ for incorrect } n \end{split} \]
\(C\): trade-off of large margin & noise tolerance
Soft-Margin SVM (1/2)
\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [ y_n \ne \text{sign} (w^Tz_n + b) ] \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \infty \cdot [ y_n \ne \text{sign} (w^Tz_n + b) ] \end{split} \]
\[ \begin{split} \underset{b,w}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]
Soft-Margin SVM (2/2)
\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]
next: remove dependence on \(\tilde{d}\) by Soft-Margin SVM primal \(\implies\) dual?
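Before going to the dual, a hedged sketch of the \(C\) trade-off using `e1071::svm` (an assumption: its `cost` argument plays the role of \(C\) in the formulation above):

```r
library(e1071)

set.seed(2)
X <- matrix(rnorm(100), ncol = 2)
y <- factor(ifelse(X[, 1] + X[, 2] + 0.5 * rnorm(50) > 0, 1, -1))   # noisy labels

for (C in c(0.1, 1, 100)) {
  fit <- svm(X, y, kernel = "linear", cost = C, scale = FALSE)
  cat("C =", C, ": #SV =", nrow(fit$SV),
      ", E_in =", mean(predict(fit, X) != y), "\n")
}
# small C: larger margin, more violations xi_n tolerated;
# large C: closer to hard-margin behaviour, less noise tolerance
```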
Lagrange Dual
Primal:
\[ \begin{split} \underset{b,w,\xi}{\text{min}} \quad & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ \text{s.t.} \quad & y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n \end{split} \]
Lagrange function with Lagrange multipliers \(\alpha_n\) and \(\beta_n\)
\[ \begin{split} \mathcal{L}(b, w, \xi, \alpha, \beta) = & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ & + \sum_{n=1}^{N} \alpha_n \cdot (1 - \xi_n - y_n (w^T z_n + b) ) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) \end{split} \]
want: Lagrange dual
\[ \underset{\alpha_n \ge 0, \beta_n \ge 0}{\text{max}} ( \underset{b, w, \xi}{\text{min}} \mathcal{L}(b, w, \xi, \alpha, \beta) ) \]
Simplify \(\xi_n\) and \(\beta_n\)
\[ \begin{split} \underset{\alpha_n \ge 0, \beta_n \ge 0}{\text{max}} ( \underset{b, w, \xi}{\text{min}} & \frac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \\ & + \sum_{n=1}^{N} \alpha_n \cdot (1 - \xi_n - y_n (w^T z_n + b) ) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) ) \end{split} \]