Lecture 5 Training versus Testing
Recap and Preview
If \(|\mathcal{H}| = M\) is finite and \(N\) is large enough, then for whatever \(g\) is picked by \(\mathcal{A}\), \(E_{out}(g) \approx E_{in}(g)\).
If \(\mathcal{A}\) finds one \(g\) with \(E_{in}(g) \approx 0\), then PAC guarantees that \(E_{out}(g) \approx 0\).
\[
E_{out}(h) \underbrace{\approx}_{test} E_{in}(h) \underbrace{\approx}_{train} 0
\]
Two Central Questions:
Can we make sure that \(E_{out}(g)\) is close enough to \(E_{in}(g)\)?
Can we make \(E_{in}(g)\) small enough?
What role does \(\underbrace{|\mathcal{H}|}_{M}\) play for these two questions?
Trade-off on \(M\)
small \(M\)
- Question 1: Yes! \(Pr(BAD) \leq 2 \cdot M \cdot \exp(\dots)\) is small.
- Question 2: No! too few choices!
large \(M\)
- Question 1: No! \(Pr(BAD) \leq 2 \cdot M \cdot \exp(\dots)\) is big.
- Question 2: Yes! many choices
What if \(M = \infty\)
Todo:
- establish a finite quantity that replaces \(M\) \[
Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 2 m_{\mathcal{H}} exp(-2 \epsilon^2 N)
\]
- justify the feasibility of learning for infinite \(M\)
- study \(m_{\mathcal{H}}\) to understand its trade-off for ‘right’ \(\mathcal{H}\)
These will be covered in next 3 lectures :-)
Effective Number of Lines
Where did \(M\) come from?
- BAD events \(B_m\): \(|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\)
- To give \(\mathcal{A}\) freedom of choice, we need to bound \(Pr(B_1 \cup B_2 \cup \dots \cup B_M)\)
- Worst case: all \(B_m\) non-overlapping, giving the union bound below; this bound fails when \(M\) is infinite. \[
Pr(B_1 \cup B_2 \cup \dots B_M) \leq Pr(B_1) + \dots + Pr(B_M)
\]
Where did the union bound fail?
- BAD events \(B_m\): \(|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\) overlap heavily for similar hypotheses \(h_1 \approx h_2\)
- Why?
- \(E_{out}(h_1) \approx E_{out}(h_2)\)
- for most \(D\), \(E_{in}(h_1) \approx E_{in}(h_2)\)
- union bound over-estimation
- Solution: group similar hypotheses by kind.
How Many lines Are There?
\(\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}\)
- how many lines? \(\infty\)
- how many kinds of lines from \(\mathbf{x_1}\) point of view? 2 kinds:
- \(h_1(x_1) = +1\) and \(h_2(x_1) = -1\)
- how many kinds of lines for 4 points? At most 14.
Effective Number of Lines
This means the maximum number of kinds of lines w.r.t. \(N\) inputs \(x_1, \dots, x_N\)
- must be \(\leq 2^N\)
- finite ‘grouping’ of infinitely-many lines \(\in \mathcal{H}\)
- wish:
\[
Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 2\ eff(N)\ exp(-2 \epsilon^2 N)
\]
If
- eff(N) can replace \(M\) and
- eff(N) << \(2^N\)
Then learning is possible for infinite lines :-)
Effective Number of Hypothesis
Dichotomy: mini hypothesis
Definition: a hypothesis ‘limited’ to the eyes of \(x_1, \dots, x_N\), i.e. \[
h(x_1, \dots, x_N) = (h(x_1), \dots, h(x_N)) \in \{\times, \circ\}^N
\]
- Dichotomy set \(\mathcal{H}(x_1, \dots, x_N)\): all dichotomies implemented by \(\mathcal{H}\) on \(x_1, \dots, x_N\).
- The dichotomy set has size at most \(2^N\), while \(\mathcal{H}\) can be infinite.
\(|\mathcal{H}(x_1, \dots, x_N)|\) is a candidate for replacing \(M\).
Growth Function
In order to remove the dependence of the dichotomy-set size on the particular inputs, define \(m_{\mathcal{H}}(N) = \underset{x_1, \dots, x_N}{\max} |\mathcal{H}(x_1, \dots, x_N)|\). For example, for 2D perceptrons, \(m_{\mathcal{H}}(1) = 2, m_{\mathcal{H}}(2) = 4, m_{\mathcal{H}}(3) = 8, m_{\mathcal{H}}(4) = 14\).
It is called growth function and bounded by \(2^N\).
Growth Function for Positive Rays
- \(\mathcal{X} = \mathbb{R}\)
- \(\mathcal{H}\) contains \(h\), where each \(h(x) = \text{sign}(x-a)\) for some threshold \(a\).
- positive half of 1D perceptrons.
- \(m_{\mathcal{H}}(N)=N+1 \ll 2^N\)
Growth Function for Positive Intervals
- \(\mathcal{X} = \mathbb{R}\)
- \(\mathcal{H}\) contains \(h\), where each \(h(x) = +1\) iff \(x \in [l,r)\), \(-1\) otherwise. \[
m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \frac{N^2}{2} + \frac{N}{2} + 1 \ll 2^N
\]
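The two growth functions above can be checked by brute force on small \(N\). Below is a quick sanity check I wrote (my own snippet, not from the lecture): it enumerates all dichotomies that positive rays and positive intervals can realize on \(N\) points and compares the counts with \(N+1\) and \(\binom{N+1}{2}+1\).

```python
# Brute-force the dichotomies of positive rays / positive intervals on N points
# and compare with the closed-form growth functions (my own sanity check).
import itertools
from math import comb

def dichotomies_positive_rays(xs):
    """All labelings sign(x - a) over thresholds a, on inputs xs."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in xs) for a in cuts}

def dichotomies_positive_intervals(xs):
    """All labelings '+1 iff x in [l, r)' over interval endpoints, on inputs xs."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    labelings = {tuple(+1 if l <= x < r else -1 for x in xs)
                 for l, r in itertools.combinations(cuts, 2)}
    labelings.add(tuple(-1 for _ in xs))   # empty interval: all -1
    return labelings

for N in range(1, 8):
    xs = list(range(N))                     # any N distinct points work
    rays = len(dichotomies_positive_rays(xs))
    ints = len(dichotomies_positive_intervals(xs))
    assert rays == N + 1
    assert ints == comb(N + 1, 2) + 1
    print(N, rays, ints)
```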
Growth Function for convex set
- If \(x_1, \dots, x_N\) lie on a big circle, then every dichotomy can be implemented (take the convex hull of the \(+1\) points as the positive region). \[
m_{\mathcal{H}}(N) = 2^N
\]
- call those \(N\) inputs ‘shattered’ by \(\mathcal{H}\).
Break Point
The 4 growth functions
- positive ray, \(m_{\mathcal{H}}(N)=N+1\)
- positive intervals, \(\frac{N^2}{2} + \frac{N}{2} + 1\)
- convex set, \(2^N\)
- 2D perceptrons: \(m_{\mathcal{H}}(N) < 2^N\) in some cases
Can we replace \(M\) by \(m_{\mathcal{H}}(N)\)? Polynomial good, exponential bad…
\[
Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \overset{?}{\leq} 2\ m_{\mathcal{H}}(N) \ exp(-2 \epsilon^2 N)
\]
Break point of 2D perceptrons
- three inputs: there ‘exists’ a set that can be shattered; four inputs: ‘for all’ sets, no shatter
- if no \(k\) inputs can be shattered by \(\mathcal{H}\), then \(k\) is a break point.
- any \(k+i\) (with \(i \geq 0\)) is also a break point.
- conjecture: break point \(k\), \(m_{\mathcal{H}}(N) = O(N^{k-1})\)
Conclusion
- Recap and Preview:
- two questions, \(E_{out} \approx E_{in}\) and \(E_{in}(g) \approx 0\)
- Effective No. of lines
- Effective No. of Hypothesis
- Break point
- When \(m_{\mathcal{H}}\) becomes non-exponential
Lecture 6 Theory of Generalization
Restriction of Break Point
Bounding Function: Basic Cases
- bounding function \(B(N, k)\): max possible \(m_{\mathcal{H}}\) when break point \(= k\)
- combinatorial quantity: max number of length-\(N\) \((\circ, \times)\) vectors while not shattering any length-\(k\) subvector.
- new goal: \(B(N, k) \leq \text{poly}(N)\)
Bounding Function: Inductive Cases
Estimating \(B(4, 3)\): for 4 points there are \(2^4 = 16\) possible \((\circ, \times)\) combinations, and each of them can either be in or not in the dichotomy set, so there are \(2^{16}\) candidate dichotomy sets. After exhaustively searching them, the largest set that shatters no 3 points has size 11.
Write \(B(4, 3) = 11 = 2\alpha + \beta\), where \(\alpha\) counts dichotomies on \((x_1, x_2, x_3)\) that appear in pairs (with both \(x_4 = \circ\) and \(x_4 = \times\)) and \(\beta\) counts those that appear alone. Since no 3 points are shattered, \(\alpha+\beta \leq B(3, 3)\).
If we only look at the paired (orange) part, no two of \((x_1, x_2, x_3)\) can be shattered (otherwise adding \(x_4\) would shatter 3 points). So \(\alpha \leq B(3, 2)\).
We have \(B(4, 3) \leq B(3, 3) + B(3, 2)\), and in general \(B(N, k) \leq B(N-1, k) + B(N-1, k-1)\).
Bounding Function: The Theorem
\[
B(N,k) \leq \sum_{i=0}^{k-1} \binom{N}{i}
\]
The bottom line: \(m_{\mathcal{H}}(N)\) is polynomial in \(N\) whenever a break point exists.
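A small sanity check (my own code, built on the base cases used in the lecture): compute an upper bound on \(B(N,k)\) with the recurrence and confirm it never exceeds \(\sum_{i=0}^{k-1}\binom{N}{i}\).

```python
# Upper-bound B(N, k) via the recurrence B(N,k) <= B(N-1,k) + B(N-1,k-1)
# and check it against the theorem's binomial sum (my own sanity check).
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    if k == 1:
        return 1            # no single point may be shattered: at most one dichotomy
    if N < k:
        return 2 ** N       # break point not reached yet: all dichotomies possible
    if N == k:
        return 2 ** N - 1   # exactly one dichotomy must be missing at the break point
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 12):
    for k in range(1, N + 1):
        assert B_upper(N, k) <= sum(comb(N, i) for i in range(k))

print(B_upper(4, 3))  # 11, matching the exhaustive search above
```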
Three Break Points
- positive rays: \(m_{\mathcal{H}}(2) = 3 < 2^2\): break point at 2.
- positive intervals: \(m_{\mathcal{H}}(3) = 7 < 2^3\): break point at 3.
- 2D perceptrons: \(m_{\mathcal{H}}(4) = 14 < 2^4\): break point at 4.
A pictorial Proof
want
\[
Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{out}(h)| > \epsilon) \leq 2\ m_{\mathcal{H}}(N) \ exp(-2 \epsilon^2 N)
\]
we can get \[
Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{out}(h)| > \epsilon) \leq 2\cdot 2\ m_{\mathcal{H}}(2N) \ exp(-2 \frac{1}{16} \epsilon^2 N)
\]
Step1: Replace \(E_{out}\) by \(E_{in}'\)
- How? sample verification set \(D'\) of size N to calculate \(E_{in}'\)
- a bad \(h\) w.r.t. \(E_{in}-E_{out}\) probably also looks bad w.r.t. \(E_{in}-E_{in}'\)
Step2: Decompose \(\mathcal{H}\) by kind
\[
BAD \leq 2Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2)
\leq 2m_{\mathcal{H}}(2N)Pr(fixed\ h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2)
\]
Step3: Use Hoeffding without Replacement
VC bound:
\[
BAD
\leq 2m_{\mathcal{H}}(2N)Pr(fixed\ h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2)
\leq 4m_{\mathcal{H}}(2N)exp(-\frac{1}{8}\epsilon^2N)
\]
Conclusion
- Restriction of Break Point: a break point at \(k\) also breaks all larger sizes, which greatly restricts \(m_{\mathcal{H}}(N)\)
- Bounding Function: Basic Cases, \(B(N,k)\) bounds \(m_{\mathcal{H}}(N)\)
- Bounding Function: Inductive Cases, \(B(N,k)\) is poly(N)
- A pictorial Proof: \(m_{\mathcal{H}}(N)\) can replace \(M\) with a few changes.
Lecture 7 The VC Dimension
Definition of VC Dimension
Recap: More on Growth Function
for break point \(k\): \(m_{\mathcal{H}}(N) \leq B(N,k)=\sum_{i=0}^{k-1} \binom{N}{i} = O(N^{k-1})\)
Recap: More on VC bound
For any \(g=\mathcal{A}(\mathcal{D}) \in \mathcal{H}\) and ‘statistically’ large \(\mathcal{D}\)
\[
\begin{split}
\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \\
\leq \mathbb{P}_{\mathcal{D}}(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{out}(h)| > \epsilon) \\
\leq 4m_{\mathcal{H}}(2N)exp(-\frac{1}{8}\epsilon^2N) \\
\leq 4(2N)^{k-1}exp(-\frac{1}{8}\epsilon^2N)
\end{split}
\]
If:
- \(m_{\mathcal{H}}\) breaks at \(k\), (good \(\mathcal{H}\))
- \(N\) large enough, (good \(\mathcal{D}\))
- \(\mathcal{A}\) picks a \(g\) with small \(E_{in}(g)\) (good \(\mathcal{A}\))
then: probably learned!
VC Dimension
Definition:
- \(d_{VC}(\mathcal{H})\), the VC dimension of \(\mathcal{H}\): the largest \(N\) for which \(m_{\mathcal{H}}(N) = 2^N\)
- the maximum number of inputs that \(\mathcal{H}\) can shatter
- \(N \leq d_{VC} \Longrightarrow \mathcal{H}\) can shatter some \(N\) inputs
- \(k > d_{VC} \Longrightarrow k\) is a break point
\[
m_{\mathcal{H}}(N) \leq N^{d_{VC}} \quad \text{if } N \geq 2 \text{ and } d_{VC} \geq 2
\]
The Four VC Dimensions
- positive ray, \(d_{VC} = 1\)
- positive intervals, \(d_{VC} = 2\)
- convex set, \(d_{VC} = \infty\)
- 2D perceptrons, \(d_{VC} = 3\)
VC Dimension and Learning
- regardless of learning algorithm
- regardless of input distribution \(P\)
- regardless of target function \(f\)
VC Dimension of Perceptrons
2D PLA revisited
- linearly separable \(\mathcal{D}\) -> PLA can converge -> \(T\) iterations -> \(E_{in}(g) = 0\)
- with \(x_n \sim P\) and \(y_n = f(x_n)\) -> \(\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq \dots\) because \(d_{VC} = 3\) -> \(N\) is large -> \(E_{in}(g) \approx E_{out}(g) \approx 0\)
VC Dimension of Perceptrons
- 1D (pos/neg rays): \(d_{VC}=2\)
- 2D: \(d_{VC}=3\)
- d-D: \(d_{VC} \overset{?}{=} d+1\)
- \(d_{VC} \geq d+1\)
- \(d_{VC} \leq d+1\)
\(d_{VC} \geq d+1\)
- there exist \(d+1\) inputs that can be shattered: take \(x_1 = (1, 0, \dots, 0)\) and \(x_{n+1} = (1, e_n^T)\) for \(n = 1, \dots, d\); the resulting matrix \(X\) is invertible, so for any desired labels \(y\) the weights \(w = X^{-1}y\) give \(\text{sign}(Xw) = y\).
\(d_{VC} \leq d+1\)
- linear dependence restricts the dichotomy: e.g. if \(x_4 = x_2 + x_3 - x_1\) and \(y_1 = -1\), \(y_2 = y_3 = +1\), then \[
w^Tx_4 = w^Tx_2 + w^Tx_3 - w^Tx_1 > 0
\] so \(y_4 = -1\) cannot be realized; hence no \(d+2\) inputs can be shattered.
Physical Intuition of VC Dimension
Degrees of Freedom
Two Old Friends
- Positive Rays, free parameters: \(a\)
- Positive Intervals, free parameters: \(l, r\)
- practical rule of thumb: \(d_{VC} \approx\) # of free parameters(but not always).
\(M\) and \(d_{VC}\)
- small \(d_{VC}\): question 1 yes (generalization likely), question 2 no (limited power)
- large \(d_{VC}\): question 1 no, question 2 yes (many choices)
- using the right \(d_{VC}\) (or \(\mathcal{H}\)) matters
Interpreting VC Dimension
VC Bound Rephrase: Penalty for Model Complexity
For any \(g=\mathcal{A}(\mathcal{D}) \in \mathcal{H}\) and ‘statistically’ large \(\mathcal{D}\)
\[
\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq
4(2N)^{d_{VC}}exp(-\frac{1}{8}\epsilon^2N)
= \delta
\] \[
\epsilon = \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}
\] with probability \(1-\delta\): generalization error \(|E_{in}(g)-E_{out}(g)| \leq \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}\) \[
E_{out}(g) \leq E_{in}(g)+\sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}
\]
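To get a feel for the numbers, here is a tiny snippet (my own, not from the lecture) that evaluates the penalty term for a fixed \(d_{VC}\) and a few sample sizes; the particular \(d_{VC}\) and \(\delta\) values are only for illustration.

```python
# Numeric illustration of Omega(N, H, delta) = sqrt((8/N) ln(4 (2N)^{d_vc} / delta))
import math

def vc_penalty(N, d_vc, delta=0.1):
    return math.sqrt(8.0 / N * math.log(4.0 * (2.0 * N) ** d_vc / delta))

d_vc = 3  # e.g. 2D perceptrons
for N in [100, 1000, 10000, 100000]:
    print(N, round(vc_penalty(N, d_vc), 3))   # penalty shrinks slowly as N grows
```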
THE VC message
\(\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}\) is called model complexity penalty.
- \(d_{VC} \uparrow\), \(E_{in} \downarrow\) but \(\Omega \uparrow\)
- \(d_{VC} \downarrow\), \(E_{in} \uparrow\) but \(\Omega \downarrow\)
- best \(d_{VC}\) is in the middle
- powerful \(\mathcal{H}\) not always good!
VC Bound Rephrase: Sample Complexity
- theory suggests needing \(N \approx 10000 \cdot d_{VC}\)
- practical rule of thumb: \(N \approx 10 \cdot d_{VC}\)
Looseness of VC Bound
- Hoeffding for unknown \(E_{out}\): any distribution, any target
- \(m_{\mathcal{H}}(N)\) instead of \(|\mathcal{H}(x_1, \dots, x_N)|\): any data
- \(N^{d_{VC}}\) instead of \(m_{\mathcal{H}}\), any \(\mathcal{H}\) of same \(d_{VC}\)
- union bound on worst cases, any choice made by \(\mathcal{A}\)
- but it is hard to do much better, and the bound is similarly loose for all models
- Philosophical message of VC bound is important for improving ML
Conclusions
- Definition VC Dimension: max non-break point
- VC Dimension of Perceptrons: \(d+1\)
- Physical Intuition of VC Dimension: \(\approx\) # free parameters
- Interpreting VC Dimension: model and sample complexity
Lecture 8 Noise and Error
Noise and Probabilistic Target
What if there is noise?
Noise
- noise in y: good customer mislabeled as bad
- noise in y: same customer different labels
- noise in x: inaccurate information
Does VC bound work under noise?
Probabilistic Marbles
- Probabilistic (noise) marbles
- marble \(x \sim P(x)\)
- probabilistic color, \(y \ne h(x)\) with \(y \sim P(y|x)\)
- VC holds for
- \(x \overset{i.i.d}{\sim} P(x)\)
- \(y \overset{i.i.d}{\sim} P(y|x)\)
- \((x, y) \overset{i.i.d}{\sim} P(x, y)\)
Target Distribution \(P(y|x)\)
- can be viewed as ‘ideal mini-target’ + noise, e.g.
- \(P(\circ|x) = 0.7, P(\times|x) = 0.3\)
- ideal mini-target \(f(x) = \circ\)
- ‘flipping’ noise level = 0.3
- deterministic target \(f\): special case of target distribution
- \(P(y|x) = 1\) for \(y = f(x)\)
- \(P(y|x) = 0\) for \(y \ne f(x)\)
- goal of learning: predict the ideal mini-target (w.r.t. \(P(y|x)\)) on often-seen inputs (w.r.t. \(P(x)\))
New Learning flow
VC still works, pocket algorithm explained :)
Error Measure
final hypothesis \(g \approx f\)
Point Error Measure
\[
E_{out}(g) = \underset{x \sim P}{\mathbb{E}} \underbrace{[g(x) \ne f(x)]}_{\text{err}(g(x), f(x))}
\]
\[
E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N}err(g(x_n), f(x_n))
\]
\[
E_{out}(g) = \underset{x \sim P}{\mathbb{E}}\ err(g(x), f(x))
\]
will mainly consider pointwise err for simplicity
Two Important Pointwise Error Measures
- 0/1 error
- \(err(\hat{y}, y) = [\hat{y} \ne y]\)
- correct or not
- often for classification
- squared error
- \(err(\hat{y}, y) = (\hat{y}-y)^2\)
- how far \(\hat{y}\) from \(y\)
- often for regression
How does err ‘guide’ learning?
Ideal Mini-Target
\[
P(y=1|x) = 0.2, P(y=2|x) = 0.7, P(y=3|x) = 0.1
\]
- 0/1 error, \(err(\tilde{y}, y) = [\tilde{y} \ne y]\)
- \(\tilde{y} = 1\), avg. err 0.8
- \(\tilde{y} = 2\), avg. err 0.3
- \(\tilde{y} = 3\), avg. err 0.9
- \(\tilde{y} = 1.9\), avg. err 1
- \(f(x) = \underset{y\in\mathcal{Y}}{\text{argmax}}\ P(y|x)\)
- squared error, \(err(\tilde{y}, y) = (\tilde{y}-y)^2\)
- \(\tilde{y} = 1\), avg. err 1.1
- \(\tilde{y} = 2\), avg. err 0.3
- \(\tilde{y} = 3\), avg. err 1.5
- \(\tilde{y} = 1.9\), avg. err 0.29
- \(f(x) = \sum_{y\in\mathcal{Y}} y \cdot P(y|x)\)
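The numbers above can be reproduced directly; the snippet below (my own) computes the expected 0/1 and squared errors of each candidate \(\tilde{y}\) under the given \(P(y|x)\).

```python
# Expected 0/1 and squared error for each candidate prediction y_tilde
# under P(y=1|x)=0.2, P(y=2|x)=0.7, P(y=3|x)=0.1 (my own snippet).
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_01(y_tilde):
    return sum(p for y, p in P.items() if y != y_tilde)

def avg_sqr(y_tilde):
    return sum(p * (y_tilde - y) ** 2 for y, p in P.items())

for y_tilde in [1, 2, 3, 1.9]:
    print(y_tilde, round(avg_01(y_tilde), 2), round(avg_sqr(y_tilde), 2))

# best choices: argmax_y P(y|x) = 2 for 0/1 error, sum_y y*P(y|x) = 1.9 for squared error
```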
Learning Flow with Error Measure
Extended VC theory/‘Philosophy’ works for most \(\mathcal{H}\) and err.
Algorithmic Error Measure
Choice of Error Measure
two types of error: false accept and false reject
0/1 error penalizes both types equally.
- supermarket: fingerprint for discount
- false reject: very unhappy customer, lose future business, (penalty 10)
- false accept: give away a minor discount, intruder left fingerprint (penalty 1)
- CIA: fingerprint for entrance
- false accept: very serious consequences! (penalty 1000)
- false reject: unhappy employee, but so what? :) (penalty 1)
Take-home Message for Now
err is application/user-dependent
Algorithm Error Measures \(\widehat{error}\)
- true: just err
- plausible:
- 0/1 minimum ‘flipping noise’ - NP-hard to optimize
- squared: minimum Gaussian noise
- friendly: easy to optimize for \(\mathcal{A}\)
- closed-form solution
- convex objective function
err: application goal
\(\widehat{err}\): a key part of many \(\mathcal{A}\)
Weighted Classification
\[
\begin{array}{c|c}
&
\begin{matrix}
h(x) \\
1 \qquad -1
\end{matrix}
\\
\hline
\begin{matrix}
f(x) & 1 \\
& -1
\end{matrix}
&
\begin{matrix}
0 & 1 \\
1000 & 0
\end{matrix}
\end{array}
\]
out-of-sample \(E_{out}(h)\) weights each error by
- 1, if \(y = +1\)
- 1000, if \(y = -1\)
in-sample \(E_{in}^w(h)\) weights the \(N\) examples in the same way
Minimizing \(E_{in}\) for weighted Classification
Naive Thoughts:
- PLA: the weights do not matter if the data are linearly separable (PLA still finds a zero-error separator).
- pocket: modify pocket-replacement rule
- if \(w_{t+1}\) reaches smaller weighted \(E_{in}^{w}\) than \(\hat{w}\), keep \(w_{t+1}\)
Systematic Route: Connect \(E_{in}^w\) and \(E_{in}^{0/1}\)
- idea: duplicate the data point with -1 label 1000 times.
weighted pocket algorithm
using ‘virtual copying’, weighted pocket algorithm include:
- weighted PLA: randomly check -1 example mistakes with 1000 times more probability
- weighted pocket replacement: if \(w_{t+1}\) reaches smaller \(E_{in}^{w}\) than \(\hat{w}\), replace \(\hat{w}\) by \(w_{t+1}\)
systematic route (called reduction): can be applied to many other algorithms
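A minimal sketch of the weighted PLA/pocket idea in code (my own toy implementation, assuming labels in \(\{-1,+1\}\) and weight 1000 on the \(-1\) examples); the ‘virtual copying’ shows up as sampling the heavier examples proportionally more often.

```python
# Weighted pocket sketch: weighted E_in plus weight-proportional example sampling.
import numpy as np

def weighted_ein(w, X, y, weights):
    mistakes = (np.sign(X @ w) != y)
    return np.sum(weights * mistakes) / np.sum(weights)

def weighted_pocket(X, y, weights, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()          # 'virtual copying': heavier examples
    w = np.zeros(X.shape[1])             # are checked proportionally more often
    w_hat, best = w.copy(), weighted_ein(w, X, y, weights)
    for _ in range(T):
        n = rng.choice(len(y), p=p)
        if np.sign(X[n] @ w) != y[n]:
            w = w + y[n] * X[n]          # PLA-style correction
            e = weighted_ein(w, X, y, weights)
            if e < best:                 # weighted pocket replacement rule
                w_hat, best = w.copy(), e
    return w_hat

# usage sketch: weights = np.where(y == -1, 1000.0, 1.0); g = weighted_pocket(X, y, weights)
```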
Aside: weighted classification is especially relevant for unbalanced data.
Conclusions
- Noise and Probabilistic Target
- can replace \(f(x)\) with \(P(y|x)\)
- Error Measure
- Algorithmic Error Measure
- user-dependent => plausible or friendly
- Weighted Classification
- easily done by virtual ‘example copying’
Lecture 9 Linear Regression
Linear Regression Problem
Credit Limit Problem
credit limit? $100,000
Linear Regression Hypothesis
- For \(x = (x_0, x_1, \dots, x_d)\) ‘features of customer’, approximate the desired credit limit with a weighted sum:
\[
y \approx \sum_{i = 0}^{d} w_i x_i
\]
The Error Measure
squared error \(err(\hat{y}, y) = (\hat{y} - y)^2\)
\[
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (\underbrace{h(x_n)}_{w^T x_n} - y_n)^2
\]
\[
E_{out}(w) = \underset{(x, y) \sim P}{\mathbb{E}} (w^T x - y)^2
\]
Linear Regression Algorithm
Matrix form of \(E_{in}(w)\)
\[
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2
= \frac{1}{N}
\begin{Vmatrix}
x_1^T w - y_1 \\
x_2^T w - y_2 \\
\dots \\
x_N^T w - y_N \\
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
Xw-y
\end{Vmatrix}^2
\]
- \(E_{in}(w)\): continuous, differentiable, convex
- necessary condition of ‘best’ \(w\) - not possible to ‘roll down’
- task: find \(w_{LIN}\) such that \(\nabla E_{in}(w_{LIN}) = 0\)
The Gradient \(\nabla E_{in}(w)\)
\[
E_{in}(w)
= \frac{1}{N}
\begin{Vmatrix}
Xw-y
\end{Vmatrix}^2
= \frac{1}{N}
(w^T X^T X w - 2 w^T X^T y + y^Ty)
= \frac{1}{N}
(w^T A w - 2 w^T b + c)
\]
For vector \(w\), we have
\[
\nabla E_{in}(w) =
\frac{1}{N} (2Aw - 2b)
=
\frac{2}{N} (X^TXw - X^Ty)
= 0
\]
Optimal Linear Regression Weights
- invertible \(X^T X\)
- easy! unique solution \(w_{LIN} = \underbrace{(X^T X)^{-1}X^T}_{\text{pseudo-inverse } X^+} y\)
- often the case because \(N \gg d+1\)
- singular \(X^T X\)
- practical suggestion:
- use well-implemented routine to solve \(X^+\) instead of \((X^T X)^{-1}X^T\) for numerical stability.
Linear Regression Algorithm
- from \(\mathcal{D}\), construct input matrix \(X\) and output vector \(y\) by
\[
X =
\underbrace{
\begin{bmatrix}
--x_1^T-- \\
--x_2^T-- \\
\dots \\
--x_N^T-- \\
\end{bmatrix}
}_{N\times(d+1)}
y =
\underbrace{
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N \\
\end{bmatrix}
}_{N\times1}
\]
- calculate pseudo-inverse \(\underbrace{X^+}_{(d+1)\times N}\)
- return \(\underbrace{w_{LIN}}_{(d+1) \times 1} = X^+y\)
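In code, the whole algorithm is a couple of lines (my own sketch, assuming \(X\) already contains the constant feature \(x_0 = 1\) as its first column):

```python
# Linear regression via the pseudo-inverse (my own sketch).
import numpy as np

def linear_regression(X, y):
    """Return w_LIN = pseudo-inverse(X) @ y."""
    return np.linalg.pinv(X) @ y   # well-implemented, numerically stable routine

# usage sketch:
# X: N x (d+1) input matrix, y: length-N target vector
# w_lin = linear_regression(X, y)
# E_in  = np.mean((X @ w_lin - y) ** 2)
```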
Generalization Issue
\[
\underbrace{w_{LIN}}_{(d+1) \times 1} = X^+y
\]
- No!
- analytic (closed-form) solution, ‘instantaneous’
- not improving \(E_{in}\)
- Yes!
- good \(E_{in}\)? Yes, optimal
- good \(E_{out}\)? yes, finite \(d_{vc}\) like perceptrons
- improving iteratively? somewhat, within an iterative pseudo-inverse routine
- if \(E_{out}(w_{LIN})\) is good, learning ‘happened’!
Benefit of analytic solution: ‘simpler-than-VC’ Guarantee
\[
\overline{E_{in}} = \underset{\mathcal{D} \sim P^N}{\mathcal{E}}{E_{in}(w_{LIN} w.r.t. \mathcal{D})}
\overset{\text{to be shown}}{=} \text{noise level} \cdot (1 - \frac{d+1}{N})
\]
We have:
\[
E_{in}(w_{LIN})
= \frac{1}{N}
\begin{Vmatrix}
y - \hat{y}
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
y - XX^{+}y
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
(I - XX^{+})y
\end{Vmatrix}^2
\]
We call \(XX^+\) the hat matrix \(H\) because it puts a hat (\(\wedge\)) on \(y\): \(\hat{y} = Hy\).
Geometric View of Hat Matrix
- in \(\mathbb{R}^N\)
- \(\hat{y} = Xw_{LIN}\) within the span of \(X\) columns
- \(y-\hat{y}\) smallest: \(y-\hat{y} \bot\) span
- \(H\): project \(y\) to \(\hat{y} \in\) span
- \(I-H\): transform \(y\) to \(y-\hat{y} \bot\) span
- claim: trace(\(I-H\)) = \(N-(d+1)\)
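The trace claim is easy to verify numerically; the snippet below (my own) builds a random full-column-rank \(X\) and checks \(\text{trace}(I - H) = N - (d+1)\).

```python
# Numeric check of trace(I - H) = N - (d + 1) for the hat matrix H = X X^+.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # full column rank (w.h.p.)
H = X @ np.linalg.pinv(X)                 # hat matrix
print(np.trace(np.eye(N) - H))            # approx N - (d + 1) = 15
```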
An Illustrative ‘Proof’
- if \(y\) comes from some ideal \(f(X) \in\) span plus noise
- the noise is transformed by \(I-H\) into \(y-\hat{y}\)
\[
E_{in}(w_{LIN})
= \frac{1}{N}
\begin{Vmatrix}
y - \hat{y}
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
(I-H)\text{noise}
\end{Vmatrix}^2
= \frac{1}{N}(N-(d+1))
\begin{Vmatrix}
\text{noise}
\end{Vmatrix}^2
\]
We have:
\[
\overline{E_{in}} =
\text{noise level} \cdot (1 - \frac{d+1}{N})
\]
\[
\overline{E_{out}} =
\text{noise level} \cdot (1 + \frac{d+1}{N})
\]
The Learning Curve
- Both converge to \(\sigma^2\) (noise level) for \(N \rightarrow \infty\)
- expected generalization error: \(\frac{2(d+1)}{N}\)
- similar to worst-case guarantee from VC
- linear regression (LinReg): learning happened!
Linear Regression for Binary Classification
Linear Classification v.s. Linear Regression
- Linear Classification:
- \(\mathcal{Y} = \{-1, +1\}\)
- \(h(x) = \text{sign}(w^Tx)\)
- \(\text{err}(\hat{y}, y) = [\hat{y} \ne y]\)
- NP-hard to solve in general
- Linear Regression
- \(\mathcal{Y} = \mathbb{R}\)
- \(h(x) = (w^Tx)\)
- \(\text{err}(\hat{y}, y) = (\hat{y} - y)^2\)
- efficient analytic solution
- \(\{-1, +1\} \subset \mathbb{R}\): linear regression for classification?
- run LinReg on binary classification data \(\mathcal{D}\) (efficient)
- return \(g(x) = \text{sign}(w_{LIN}^Tx)\)
- explanation of this heuristic?
Relation of Two Errors
- \(\text{err}_{0/1} = [sign(w^Tx) \ne y]\)
- \(\text{err}_{sqr} = (w^Tx-y)^2\)
- \(\text{err}_{0/1} \leq \text{err}_{sqr}\)
Linear Regression for Binary Classification
- \(\text{err}_{0/1} \leq \text{err}_{sqr}\)
\[
\text{classification} E_{out}(w) \overset{VC}{\leq}
\text{classification} E_{in}(w) + \sqrt{\dots\dots} \leq
\text{regression} E_{in}(w) + \sqrt{\dots\dots}
\]
- (loose) upper bound \(err_{sqr}\) as \(\widehat{err}\) to approximate \(\text{err}_{0/1}\)
- if we have small enough \(err_{sqr}\), the \(\text{classification} E_{out}\) will also be small
- trade bound tightness for efficiency
- \(w_{LIN}\): a useful baseline classifier, or an initial vector for PLA/pocket
Conclusions
- Linear Regression Problem
- use hyperplane to approximate real values
- Linear Regression Algorithm
- analytic solution with pseudo-inverse
- Generalization Issue
- \(E_{out} - E_{in} \approx \frac{2(d+1)}{N}\) on average
- Linear Regression for Binary Classification
- 0/1 error \(\leq\) squared error
Lecture 10 Logistic Regression
Logistic Regression Problem
Heart Attack Prediction Problem
binary classification: heart disease? yes or no ?
- ideal \(f(x) = \text{sign}(P(+1|x) - 0.5) \in \{-1, +1\}\)
‘Soft’ binary classification: similar problem heart attack? 80% risk
- \(f(x) = P(+1|x) \in [0, 1]\)
Soft Binary Classification
target function \(f(x) = P(+1|x)\)
- ideal (noiseless) data
- \((x_1, y_1' = 0.9 = P(+1|x_1))\)
- \((x_2, y_2' = 0.2 = P(+1|x_2))\)
- …
- \((x_N, y_N' = 0.6 = P(+1|x_N))\)
- actual (noisy) data
- \((x_1, y_1 = 1)\), with \(y_1\) generated from \(P(y|x_1)\)
- \((x_2, y_2 = 0)\), with \(y_2\) generated from \(P(y|x_2)\)
- …
- \((x_N, y_N = 0)\), with \(y_N\) generated from \(P(y|x_N)\)
We have same data as hard binary Classification, but different target function
Logistic Hypothesis
- For \(x = (x_0, x_1, \dots, x_d)\) ‘features of patient’, calculate a weighted ‘risk score’:
\[
s = \sum_{i=0}^{d} w_i x_i
\]
- convert the score to estimated probability by logistic function \(\theta(s)\)
- logistic hypothesis: \(h(x) = \theta(w^T x)\)
Logistic Function

\[
\theta(s) = \frac{e^s}{1+ e^s} = \frac{1}{1+e^{-s}}
\]
- it is a smooth, monotonic, sigmoid (‘S’-shaped) function of \(s\)
- logistic regression: use the following to approximate the target function \(f(x) = P(+1|x)\)
\[
h(x) = \frac{1}{1+ exp(-w^T x)}
\]
Logistic Regression Error
Three Linear Models
- Linear Classification
- \(h(x) = \text{sign}(s)\)
- plausible err = 0/1 (small flipping noise)
- Linear Regression
- \(h(x) = s\)
- friendly err = squared (easy to minimize)
- Logistic Regression
- \(h(x) = \theta(s)\)
- err = ?
Likelihood
\[
f(x) = P(+1|x) \iff
P(y|x) =
\begin{cases}
f(x) & \quad \text{for } y = +1 \\
1-f(x) & \quad \text{for } y = -1 \\
\end{cases}
\]
- Consider \(\mathcal{D} = \{(x_1, \circ), (x_2, \times), \dots, (x_N, \times)\}\)
- probability that \(f\) generates \(\mathcal{D}\) is
\[
P(x_1)P(\circ | x_1) \times
P(x_2)P(\times | x_2) \times
\dots
\times P(x_N)P(\times | x_N)
\]
- rewrite using \(f\), since \(P(\circ | x) = f(x)\) and \(P(\times | x) = 1 - f(x)\):
\[
P(x_1) f(x_1) \times
P(x_2) (1 - f(x_2)) \times
\dots
\times P(x_N) (1 - f(x_N))
\]
- Likelihood that \(h\) generates \(\mathcal{D}\) is
\[
P(x_1) h(x_1) \times
P(x_2) (1 - h(x_2)) \times
\dots
\times P(x_N) (1 - h(x_N))
\]
- if \(h \approx f\), then likelihood(\(h\)) \(\approx\) probability using \(f\)
- probability using \(f\) usually large
Likelihood of Logistic Hypothesis
- likelihood(\(h\)) \(\approx\) probability using \(f \approx\) large
- \(g = \underset{h}{\text{argmax}} \text{likelihood}(h)\)
- when \(h\) is logistic, \(h(x) = \theta(w^Tx)\), the symmetry \(1 - h(x) = h(-x)\) gives
\[
\text{likelihood}(h) = P(x_1) h(+x_1) \times P(x_2) h(-x_2) \times \cdots \times P(x_N) h(-x_N)
\]
So we have
\[
\text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)
\]
Cross-Entropy Error
\[
\underset{w}{\text{max}}\ \text{likelihood } (w) \propto \prod_{n=1}^{N} \theta(y_n w^T x_n)
\]
\[
\underset{w}{\text{max}}\ \text{ln } \prod_{n=1}^{N} \theta(y_n w^T x_n)
\]
\[
\underset{w}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} - \text{ln } \theta(y_n w^T x_n)
\]
\[
\underset{w}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} \text{ln } (1 + exp(-y_n w^T x_n))
\]
\[
\underset{w}{\text{min}}\ \frac{1}{N} \underbrace{\sum_{n=1}^{N} \text{err}(w, x_n, y_n)}_{E_{in}(w)}
\]
- cross-entropy error: \(\text{err}(w, x_n, y_n) = \text{ln } (1 + exp(-y_n w^T x_n))\)
Gradient of Logistic Regression Error
Minimizing \(E_{in}(w)\)
- \(E_{in}(w)\): continuous, differentiable, twice-differentiable, convex
- how to minimize? locate valley, want \(\nabla E_{in}(w) = 0\)
- first: derive \(\nabla E_{in}(w) = 0\)
The Gradient \(\nabla E_{in}(w)\)
\[
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \text{ln } (1 + exp(-y_n w^T x_n))
\]
\[
\nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} \theta(-y_n w^T x_n) (-y_n x_n)
\]
Note that in Andrew Ng's course, \(\nabla E_{in}(w)\) is written as follows. It is actually the same as above: substitute \(y_n = +1\) and \(y_n = -1\) and use the property \(h(-x) = 1 - h(x)\). Note, however, that in Andrew Ng's course \(y_n\) takes values 1 or 0, not \(\pm 1\).
\[
\nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} (h(x_n) - y_n) x_n
\]
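The two forms can be checked against each other numerically. The snippet below (my own) evaluates both on random data, mapping \(y \in \{-1,+1\}\) to \(y_{01} = (y+1)/2\) for the second form.

```python
# Numeric check that the +-1-label gradient equals the 0/1-label gradient.
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.standard_normal(d + 1)

grad_pm1 = (theta(-y * (X @ w)) * (-y)) @ X / N        # (1/N) sum theta(-y w'x)(-y x)
grad_01  = (theta(X @ w) - (y + 1) / 2) @ X / N        # (1/N) sum (h(x) - y01) x
print(np.allclose(grad_pm1, grad_01))                  # True
```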
Minimizing \(E_{in}(w)\)
\[
\nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} \theta(-y_n w^T x_n) (-y_n x_n) = 0
\]
- scaled \(\theta\)-weighted sum of \(-y_n x_n\)
- all \(\theta(-y_n w^T x_n) = 0\): only if \(y_n w^T x_n \gg 0\) for all \(n\), i.e. a linearly separable \(\mathcal{D}\)
- otherwise, setting the weighted sum to 0 is a non-linear equation in \(w\): no closed-form solution
PLA revisited: iterative optimization
PLA: start from some \(w_0\), and correct its mistakes on \(\mathcal{D}\)
For \(t = 0, 1, \dots\)
- find a mistake of \(w_t\) called \((x_{n(t)}, y_{n(t)})\)
\[
\text{sign}(w_t^Tx_{n(t)}) \ne y_{n(t)}
\]
- try to correct the mistake by
\[
w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)}
\]
when stop, return last \(w\) as \(g\)
For \(t = 0, 1, \dots\)
- (equivalently) pick some \(n\), and update \(w_t\) by
\[
w_{t+1} \leftarrow w_t + \underbrace{1}_{\eta} \cdot \underbrace{[\text{sign}(w_t^Tx_n) \ne y_n]\, y_n x_n}_{v}
\]
when stop, return last \(w\) as \(g\)
- \(v\): the update direction
- \(\eta\): how big a step to take in that direction
- choice of \((\eta, v)\) and stopping condition defines iterative optimization approach
Gradient Descent
Iterative Optimization
For \(t = 0, 1, \dots\)
\[
w_{t+1} \leftarrow w_t + \eta v
\]
when stop, return last \(w\) as \(g\)
- PLA: \(v\) comes from mistake correction
- smooth \(E_{in}(w)\) for logistic regression: choose \(v\) to make the ball roll ‘downhill’
- direction \(v\): (assumed) of unit length
- a greedy approach for some given \(\eta > 0\):
\[
\underset{|v| = 1}{\text{min}} E_{in} (\underbrace{w_t + \eta v}_{w_{t+1}})
\]
Linear Approximation
- still non-linear optimization, now with constraints
- not any easier than \(\text{min}_w E_{in} (w)\)
- local approximation by linear formula makes problem easier.
\[
E_{in} (w_t + \eta v) \approx E_{in}(w_t) + \eta v^T \nabla E_{in}(w_t)
\]
- if \(\eta\) really small (Taylor expansion)
an approximate greedy approach for some given small \(\eta\):
\[
\underset{|v| = 1}{\text{min}} E_{in} (w_t) + \eta v^T \nabla E_{in}(w_t)
\]
Gradient Descent
\[
v = - \frac{\nabla E_{in}(w_t)}{|\nabla E_{in}(w_t)|}
\]
- gradient descent: for small \(\eta\)
\[
w_{t+1} \leftarrow w_t - \eta \frac{\nabla E_{in}(w_t)}{|\nabla E_{in}(w_t)|}
\]
- a simple and popular optimization
Choice of \(\eta\)
- too small: too slow
- too large: too unstable
- just right: use changing \(\eta\)
\(\eta\) better be monotonic of \(\begin{Vmatrix}\nabla E_{in}(w_t)\end{Vmatrix}\)
Simple heuristic: let \(\eta\) scale with \(\begin{Vmatrix}\nabla E_{in}(w_t)\end{Vmatrix}\); the norm then cancels and the update becomes
\[
w_{t+1} \leftarrow w_t - \eta \nabla E_{in}(w_t)
\]
- the new \(\eta\) is called the fixed learning rate
Putting Everything Together
Initialize \(w_0\)
For \(t = 0, 1, \dots\)
- compute
\[
\nabla E_{in}(w_t) = 1 / N \sum_{n=1}^{N} \theta(-y_n w_t^T x_n) (-y_n x_n)
\]
- update by
\[
w_{t+1} \leftarrow w_t - \eta \nabla E_{in}(w_t)
\]
… until \(\nabla E_{in}(w_t) \approx 0\) or for enough iterations.
- Similar per-iteration time complexity to the pocket algorithm
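Putting the loop into code (my own sketch, assuming \(X\) includes the constant column and \(y \in \{-1,+1\}\)):

```python
# Gradient-descent logistic regression (my own sketch).
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, eta=0.1, T=2000):
    w = np.zeros(X.shape[1])
    for _ in range(T):
        grad = (theta(-y * (X @ w)) * (-y)) @ X / len(y)   # nabla E_in(w)
        if np.linalg.norm(grad) < 1e-6:                    # nabla E_in(w) ~ 0
            break
        w = w - eta * grad                                 # fixed learning rate step
    return w

# E_in(w) = np.mean(np.log(1 + np.exp(-y * (X @ w))))
```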
Conclusions
- Logistic Regression Problem
- \(P(+1|x)\) as target and \(\theta(w^Tx)\) as hypothesis
- Logistic Regression Error
- cross-entropy (negative log likelihood)
- Gradient of Logistic Regression Error
- \(\theta\)-weighted sum of data vectors
- Gradient Descent
- roll downhill by \(-\nabla E_{in}(w_t)\)
Lecture 11 Linear Models for Classification
Linear Models for Binary Classification
Linear Models Revisited
linear scoring function: \(s = w^Tx\)
- Linear Classification
- \(h(x) = \text{sign}(s)\)
- plausible err = 0/1 (discrete \(E_{in}(w)\) NP hard to solve)
- Linear Regression
- \(h(x) = s\)
- friendly err = squared (quadratic convex \(E_{in}(w)\): closed form solution)
- Logistic Regression
- \(h(x) = \theta(s)\)
- plausible err = cross-entropy (smooth convex \(E_{in}(w)\): gradient descent)
Can linear regression or logistic regression help linear classification?
- linear classification is hard to solve, but the other two are easy to solve
- can we use the other two to replace the first one?
Error Functions Revisited
linear scoring function: \(s = w^Tx\)
for binary classification \(y \in \{-1, +1\}\)
- Linear Classification
- \(h(x) = \text{sign}(s)\)
- err\((h, x, y) = [h(x) \ne y]\)
- \(\text{err}_{0/1}(s, y) = [\text{sign}(s) \ne y] = [\text{sign}(ys) \ne 1]\)
- Linear Regression
- \(h(x) = s\)
- err\((h, x, y) = (h(x) - y)^2\)
- \(\text{err}_{SQR}(s, y) = (s - y)^2 = (ys-1)^2\)
- Logistic Regression
- \(h(x) = \theta(s)\)
- err\((h, x, y) = -\text{ln } h(yx)\)
- \(\text{err}_{CE}(s, y) = \text{ln} (1+exp(-ys))\)
- \((ys)\): classfication correctness score
Visualizing Error Functions
- \(0/1\): 1 if and only if \(ys \leq 0\)
- sqr: large if \(ys \ll 1\), but it also over-charges \(ys \gg 1\); small \(\text{err}_{SQR} \rightarrow\) small \(\text{err}_{0/1}\), but not vice versa
- ce: monotonic in \(ys\); small \(\text{err}_{CE} \iff\) small \(\text{err}_{0/1}\)
- scaled ce: a proper upper bound of \(\text{err}_{0/1}\); small \(\text{err}_{SCE} \iff\) small \(\text{err}_{0/1}\)
- upper bound: useful for designing Algorithmic error \(\widehat{err}\)
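A small numeric version of this comparison (my own snippet): evaluate the errors over a range of \(ys\) and see that scaled cross-entropy upper-bounds 0/1 everywhere.

```python
# err_0/1, err_sqr and scaled err_CE as functions of the correctness score ys.
import numpy as np

ys = np.linspace(-3, 3, 7)
err_01  = (ys <= 0).astype(float)      # [sign(s) != y], counting ys <= 0 as an error
err_sqr = (ys - 1) ** 2
err_sce = np.log(1 + np.exp(-ys)) / np.log(2)   # scaled CE: >= err_01 for all ys

for row in zip(ys, err_01, err_sqr, err_sce):
    print(["%6.2f" % v for v in row])
```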
Theoretical Implication of Upper Bound
- For any \(ys\) where \(s = w^T x\)
\[
\begin{split}
\text{err}_{0/1}(s, y) \leq \text{err}_{SCE}(s, y) = \frac{1}{ln2} \text{err}_{CE}(s, y) \\
\iff \\
E_{in}^{0/1}(w) \leq E_{in}^{SCE}(w) = \frac{1}{ln2} E_{in}^{CE}(w) \\
E_{out}^{0/1}(w) \leq E_{out}^{SCE}(w) = \frac{1}{ln2} E_{out}^{CE}(w) \\
\end{split}
\]
\[
E_{out}^{0/1}(w) \leq E_{in}^{0/1}(w) + \Omega^{0/1}
\leq \frac{1}{ln2} E_{in}^{CE}(w) + \Omega^{0/1}
\]
\[
E_{out}^{0/1}(w) \leq \frac{1}{ln2} E_{out}^{CE}(w)
\leq \frac{1}{ln2} E_{in}^{CE}(w) + \Omega^{0/1}
\]
\[
\text{small } E_{in}^{CE}(w) \Rightarrow \text{small } E_{out}^{0/1}(w)
\]
- logistic/linear reg. for linear classification.
Regression for Classification
- run logistic/linear reg. on \(\mathcal{D}\) with \(y_n \in \{-1, +1\}\) to get \(w_{reg}\)
- return \(g(x) = \text{sign}(w_{reg}^T x)\)
- linear regression
- pros: ‘easiest’ optimization
- cons: loose bound of err 0/1 for large \(|ys|\)
- logistic regression
- pros: ‘easy’ optimization
- cons: loose bound of \(\text{err}_{0/1}\) for very negative \(ys\)
- PLA
- pros: efficient + strong guarantee if linear separable
- cons: works only if linear separable, otherwise needing pocket heuristic
- linear regression sometimes used to set \(w_0\) for PLA/pocket/logistic regression.
- logistic regression often preferred over pocket.
Stochastic Gradient Descent
Two Iterative Optimization Schemes
For \(t = 0, 1, \dots\)
\[
w_{t+1} \leftarrow w_t + \eta v
\]
when stop, return last \(w\) as \(g\)
- PLA: pick \((x_n, y_n)\) and decide \(w_{t+1}\) by one example
- \(O(1)\) time per iteration.
- Logistic regression (pocket)
- check \(\mathcal{D}\) and decide \(w_{t+1}\) by all examples
- \(O(N)\) time per iteration.
Is it possible to make logistic regression \(O(1)\)?
Logistic Regression Revisited
\[
w_{t+1} \leftarrow w_t + \eta \frac{1}{N} \sum_{n=1}^{N} \underbrace{\theta(-y_n w_t^{T} x_n) (y_n x_n)}_{- \nabla E_{in}(w_t)}
\]
Stochastic Gradient Descent (SGD)
Stochastic gradient = true gradient + zero-mean ‘noise’ directions
- idea: replace true gradient by stochastic gradient
- after enough steps, average true gradient \(\approx\) average stochastic gradient
- pros: simple and cheaper computation
- useful for big data and online learning
- cons: less stable in nature.
SGD logistic regression looks familiar?
\[
w_{t+1} \leftarrow w_t + \eta \underbrace{\theta(-y_n w_t^{T} x_n) (y_n x_n)}_{- \nabla \text{err}(w, x_n, y_n)}
\]
PLA Revisited
SGD logistic regression:
\[
w_{t+1} \leftarrow w_t + \eta \cdot \theta (-y_n w_t^{T} x_n) (y_n x_n)
\]
PLA:
\[
w_{t+1} \leftarrow w_t + 1 \cdot [y_n \ne \text{sign}(w_t^T x_n)] (y_n x_n)
\]
- SGD logistic regression \(\approx\) ‘soft’ PLA
- the more wrong an example is, the larger the update; the less wrong, the smaller the update
- PLA \(\approx\) SGD logistic regression with \(\eta = 1\) when \(w_t^T x_n\) large.
Two practical rule of thumb:
- stop condition? \(t\) large enough
- \(\eta\)? 0.1 when \(x\) in proper range
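In code, the SGD version differs from the batch version earlier only in using one randomly chosen example per update (my own sketch; the stopping rule ‘\(t\) large enough’ becomes a fixed number of steps):

```python
# SGD logistic regression ('soft PLA') sketch: one example per update, fixed eta.
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_sgd(X, y, eta=0.1, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                               # pick one example
        w = w + eta * theta(-y[n] * (X[n] @ w)) * y[n] * X[n]  # soft PLA update
    return w
```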
Multiclass via Logistic Regression
Multiclass Classification
- \(\mathcal{Y} = \{\square, \diamond, \triangle, \star\}\): 4-class classification.
- many applications in practice, especially for ‘recognition’.
- use tools for binary \(\{\times, \circ\}\) classification to handle \(\{\square, \diamond, \triangle, \star\}\)
One Class at a time
- \(\square\) or not? \(\{\square = \circ, \diamond = \triangle = \star = \times\}\)
- \(\diamond\) or not? \(\{\diamond = \circ, \square = \triangle = \star = \times\}\)
- \(\triangle\) or not? \(\{\triangle = \circ, \square = \diamond = \star = \times\}\)
- \(\star\) or not? \(\{\star = \circ, \square = \triangle = \diamond = \times\}\)
Multiclass Prediction: Combine Binary Classifiers
- in some regions, exactly one classifier says \(\circ\), so the class is clear
- but what about ties, or regions that no classifier claims?
One Class at a time softly
- \(P(\square|x)? \quad \{\square = \circ, \diamond = \triangle = \star = \times\}\)
- \(P(\diamond|x)? \quad \{\diamond = \circ, \square = \triangle = \star = \times\}\)
- \(P(\triangle|x)? \quad \{\triangle = \circ, \square = \diamond = \star = \times\}\)
- \(P(\star|x)? \quad \{\star = \circ, \square = \triangle = \diamond = \times\}\)
Multiclass Prediction: Combine Binary Classifiers
\[
g(x) = \text{argmax}_{k \in \mathcal{Y}} \theta(w_{[k]}^Tx)
\]
OR, equivalently
\[
g(x) = \text{argmax}_{k \in \mathcal{Y}} w_{[k]}^Tx
\]
One-Versus-All (OVA) Decomposition
- for \(k \in \mathcal{Y}\), obtain \(w_{[k]}\) by running logistic regression on
- \(\mathcal{D}_{[k]} = \{(x_n, y_n = 2[y_n = k] - 1)\}_{n=1}^N\)
- return \(g(x) = \text{argmax}_{k \in \mathcal{Y}} (w_{[k]}^T x)\)
- pros: efficient, can be coupled with any logistic regression-like approaches.
- cons: often unbalanced \(\mathcal{D}_{[k]}\) when \(K\) large
- logistic regression will produce naive predictor: predict everything to be \(\times\).
- extension: multinomial (‘coupled’) logistic regression.
OVA: a simple Multiclass meta-algorithm to keep in your toolbox.
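A compact sketch of OVA (my own code; `train_logreg` is a placeholder for any logistic-regression trainer, such as the GD/SGD sketches above):

```python
# One-Versus-All decomposition (my own sketch).
import numpy as np

def ova_train(X, y, classes, train_logreg):
    """Return {k: w_[k]} where w_[k] is trained on D_[k] with labels 2[y==k]-1."""
    return {k: train_logreg(X, 2.0 * (y == k) - 1.0) for k in classes}

def ova_predict(X, W):
    classes = list(W)
    scores = np.column_stack([X @ W[k] for k in classes])   # theta is monotonic,
    return np.array(classes)[np.argmax(scores, axis=1)]     # so argmax of w_[k]^T x suffices
```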
Multiclass via Binary Classification
Source of Unbalance: One Versus All
idea: make binary classification problems more balanced by one versus one.
One versus One at a Time
- \(\square\) or \(\diamond\)? \(\{\square = \circ, \diamond = \times, \star = \triangle = \text{nil}\}\)
- …
Multiclass Prediction: Combine Pairwise Classifiers
\[
g(x) = \text{tournament champion } \{w_{[k, l]}^T x\}
\]
voting of classifiers.
One-versus-one OVO Decomposition
- for \([k, l] \in \mathcal{Y} \times \mathcal{Y}\), obtain \(w_{[k, l]}\) by running logistic regression on
- \(\mathcal{D}_{[k, l]} = \{(x_n, y'_n = 2[y_n = k] - 1): y_n = k \text{ or } y_n = l\}\)
- return \(g(x) = \text{tournament champion } \{w_{[k, l]}^T x\}\)
- pros: efficient (‘smaller’ Training problems), stable, can be coupled with any binary classification approaches.
- cons: uses \(O(K^2)\) classifiers \(w_{[k, l]}\)
- more space, slower prediction, more training.
OVO: another simple Multiclass meta-algorithm to keep in your toolbox.
Conclusions
- Linear Models for Binary Classification
- three models useful in different ways
- Stochastic Gradient Descent
- follow negative stochastic gradient
- Multiclass via Logistic Regression
- predict with maximum estimated P(k|x)
- Multiclass via Binary Classification
- predict the tournament champion
Lecture 12 Nonlinear Transformation
Quadratic Hypothesis
Linear Hypothesis
- visually: ‘line’-like boundary
- mathematically: linear scores \(s = w^T x\)
pros and cons:
- theoretically: \(d_{VC}\) under control :-)
- practically: on some \(\mathcal{D}\), large \(E_{in}\) for every line :-(
How to break the limit of linear hypothesis?
Circular Separable
- \(\mathcal{D}\) not linear separable
- but circular separable by a circle of radius \(\sqrt{0.6}\) centered at origin:
\[
h_{SEP}(x) = \text{sign } (-x_1^2 - x_2^2 + 0.6)
\]
re-derive Circular-PLA, Circular-Regression, blahblah … all over again?
Circular Separable and Linear Separable
\[
h(x) =
\text{sign } (
\underbrace{0.6}_{\tilde{w_0}} \cdot \underbrace{1}_{\tilde{z_0}} +
\underbrace{(-1)}_{\tilde{w_1}} \cdot \underbrace{x_1^2}_{\tilde{z_1}} +
\underbrace{(-1)}_{\tilde{w_2}} \cdot \underbrace{x_2^2}_{\tilde{z_2}}
)
= \text{sign } ( \tilde{w}^T z )
\]
- \(\{(x_n, y_n)\}\) circular separable \(\implies \{(z_n, y_n)\}\) linear separable.
- \(x \in \mathcal{X} \overset{\Phi}{\mapsto} z \in \mathcal{Z}\): nonlinear feature transform \(\Phi\).
- circular separable in \(\mathcal{X} \mapsto\) linear separable in \(\mathcal{Z}\)
- What about vice versa?
Linear Hypothesis in Z space
\[
(z_0, z_1, z_2) = z = \Phi(x) = (1, x_1^2, x_2^2)
\]
\[
h(x) = \tilde{h}(z) = \text{sign}(\tilde{w}^T \Phi(x))
= \text{sign} (\tilde{w_0} + \tilde{w_1}x_1^2 + \tilde{w_2}x_2^2)
\]
\((\tilde{w_0}, \tilde{w_1}, \tilde{w_2})\)
- \((0.6, -1, -1)\): circle (\(\circ\) inside)
- \((-0.6, +1, +1)\): circle (\(\circ\) outside)
- \((0.6, -1, -2)\): ellipse
- \((0.6, -1, +2)\): hyperbola
- \((0.6, +1, +2)\): constant \(\circ\)
line in \(\mathcal{Z} \iff\) special quadratic curves in \(\mathcal{X}\) space.
- E.g., the center of a circle must be the origin
General Quadratic Hypothesis set
a bigger \(\mathcal{Z}\) with \(\Phi_2(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\)
Perceptrons in \(\mathcal{Z}\) space \(\iff\) quadratic hypothesis in \(\mathcal{X}\) space
\[
\mathcal{H}_{\Phi_2} = \{h(x): h(x) = \tilde{h}(\Phi_2(x)) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z}\}
\]
- Can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola
\[
\text{ellipse } 2 (x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1
\]
\[
\implies \tilde{w}^T = (33, -20, -4, 3, 2, 3)
\]
- includes lines and constant hypotheses as degenerate cases
- next: learn a good quadratic hypothesis \(g\)
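In code, the transform is just feature construction (my own sketch); any linear algorithm can then be run on \(\mathcal{Z}\) unchanged:

```python
# The quadratic transform Phi_2 for 2D inputs (my own sketch).
import numpy as np

def phi2(X):
    """Map rows (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# usage sketch: Z = phi2(X); learn w_tilde in Z-space; predict sign(phi2(X_new) @ w_tilde)
```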
Structured Hypothesis Sets
Polynomial Transform Revisited
\[
\begin{split}
\Phi_0(x) = 1 \\
\Phi_1(x) = (\Phi_0(x), \quad x_1, x_2, \dots, x_d) \\
\Phi_2(x) = (\Phi_1(x), x_1^2, x_1x_2, \dots, x_d^2) \\
\dots \dots \\
\Phi_Q(x) = (\Phi_{Q-1}(x), \quad x_1^Q, x_1^{Q-1}x_2, \dots, x_d^Q)
\end{split}
\]
\[
\mathcal{H}_{\Phi_0} \subset \mathcal{H}_{\Phi_1} \subset \mathcal{H}_{\Phi_2}
\subset \cdots \subset \mathcal{H}_{\Phi_Q}
\]
Structured Hypothesis Sets
\[
\mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots
\]
\[
d_{VC}(\mathcal{H}_0) \leq d_{VC}(\mathcal{H}_1) \leq d_{VC}(\mathcal{H}_2) \leq \cdots
\]
\[
E_{in}(g_0) \geq E_{in}(g_1) \geq E_{in}(g_2) \geq \cdots
\]
- Using a high-dimensional transform from the beginning is not a good idea.
Linear Model First
- tempting sin: use \(\mathcal{H}_{1126}\), low \(E_{in}(g_{1126})\) to fool your boss.
- really? :( a dangerous path of no return
- safe route: \(\mathcal{H}_1\) first
- if \(E_{in}(g_1)\) good enough, live happily thereafter :)
- otherwise, move rightward to a more complex transform, with nothing lost except some ‘wasted’ computation.
- linear model first: simple, efficient, safe and workable.
Conclusions
- Quadratic Hypothesis
- linear hypothesis on quadratic-transformed data
- Nonlinear Transform
- happy linear modeling after \(\mathcal{Z} = \Phi(\mathcal{X})\)
- Price of Nonlinear Transform
- computation/storage/model complexity
- Structured Hypothesis Sets
- linear/simple model first
Lecture 13 Hazard of Overfitting
Nonlinear regression/classification can be done via a nonlinear feature transform \(\Phi\) plus linear regression/classification, at the price of model complexity. However, this may cause overfitting. In this lecture:
- How does it happen?
- How to overcome it?
What is overfitting?
Bad Generalization
- regression for \(x \in \mathbb{R}\) with \(N = 5\) examples
- target \(f(x)\) = 2nd-order polynomial
- label \(y_n = f(x_n) +\) very small noise
- linear regression in \(\mathcal{Z}\)-space + \(\Phi\) = 4th order polynomial
- unique solution passing all examples \(\implies E_{in}(g) = 0\)
- \(E_{out}(g)\) huge
bad generalization: low \(E_{in}\), high \(E_{out}\).
Bad Generalization and Overfitting
- take \(d_{VC} = 1126\) for learning: bad generalization
- \(E_{in} - E_{out}\) large
- switch from \(d_{VC} = d_{VC}^{\star}\) to \(d_{VC} = 1126\): overfitting
- \(E_{in}\) down, \(E_{out}\) up.
- hard to fix
- switch from \(d_{VC} = d_{VC}^{\star}\) to \(d_{VC} = 1\): underfitting
- \(E_{in}\) up, \(E_{out}\) up.
- easy to fix
bad generalization: low \(E_{in}\), high \(E_{out}\). overfitting: lower \(E_{in}\), higher \(E_{out}\)
Cause of Overfitting: A Driving Analogy
‘good fit’ \(\implies\) ‘overfit’
| learning | driving |
| --- | --- |
| overfit | commit a car accident |
| use excessive \(d_{VC}\) | ‘drive too fast’ |
| noise | bumpy road |
| limited data size \(N\) | limited observations about road condition |
next: how does noise & data size affect overfitting?
The role of Noise and Data size
Case Study
Two settings: (a) 10th-order target function + noise; (b) 50th-order target function, noiseless.
In each setting, does overfitting occur when switching from the best \(g_2 \in \mathcal{H}_2\) to the best \(g_{10} \in \mathcal{H}_{10}\)?
10-th order target function + noise
\[
\begin{array}{c|c c}
& g_2 \in \mathcal{H}_2 & g_{10} \in \mathcal{H}_{10} \\
\hline
E_{in} & 0.050 & 0.034 \\
E_{out} & 0.127 & 9.0
\end{array}
\]
50-th order target function + noiseless
\[
\begin{array}{c|c c}
& g_2 \in \mathcal{H}_2 & g_{10} \in \mathcal{H}_{10} \\
\hline
E_{in} & 0.029 & 0.00001 \\
E_{out} & 0.120 & 7680
\end{array}
\]
Irony of Two learners
- Learner \(O\)verfit: pick \(g_{10} \in \mathcal{H}_{10}\)
- Learner \(R\)estricted: pick \(g_{2} \in \mathcal{H}_{2}\)
- when both know that the target is 10th-order
- \(R\) ‘gives up’ some ability to fit
But \(R\) wins in \(E_{out}\) by a lot. Philosophy: concede a little to gain a lot (step back in order to move forward)!
Learning Curves Revisited
- For \(\mathcal{H}_{2}\): \(E_{out}\) is only a little larger than \(E_{in}\).
- \(\mathcal{H}_{10}\): lower \(E_{out}\) as \(N \rightarrow \infty\), but much larger generalization error for small \(N\).
- Grey area: \(O\) overfits! \(E_{in} \downarrow\), \(E_{out} \uparrow\). Being too clever backfires.
- \(R\) always wins in \(E_{out}\) if \(N\) is small! Don't over-engineer; a simpler hypothesis often works better.
No Noise Case
- Learner \(O\)verfit: pick \(g_{10} \in \mathcal{H}_{10}\)
- Learner \(R\)estricted: pick \(g_{2} \in \mathcal{H}_{2}\)
- When both know that there is no noise
Is there really no noise? ‘Target complexity’ acts like noise.
Deterministic Noise
A Detailed Experiment
\[
y = f(x) + \epsilon \sim \text{Gaussian} (\sum_{q=0}^{Q_f} \alpha_{q} x^q, \sigma^2)
\]
- Gaussian iid noise \(\epsilon\) with noise \(\sigma^2\).
- Some ‘uniform’ distribution on \(f(x)\) with complexity level \(Q_{f}\)
- data size \(N\)
The Overfit Measure
- \(g_{2} \in \mathcal{H}_{2}\)
- \(g_{10} \in \mathcal{H}_{10}\)
- \(E_{in}(g_{10}) \leq E_{in}(g_{2})\) for sure.
Overfit measure \(E_{out}(g_{10}) - E_{out}(g_{2})\)
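A toy re-run of this experiment (my own sketch; the exact numbers will differ from the lecture's tables, but with small \(N\) and noise, \(g_{10}\) typically wins on \(E_{in}\) while losing badly on \(E_{out}\)):

```python
# Fit degree-2 and degree-10 polynomials to few noisy samples of a 10th-order target.
import numpy as np

rng = np.random.default_rng(0)
coef = rng.standard_normal(11)                       # a 10th-order target f
f = np.polynomial.polynomial.Polynomial(coef)

def experiment(N=15, sigma=1.0):
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)        # y = f(x) + noise
    x_test = rng.uniform(-1, 1, 2000)
    y_test = f(x_test) + sigma * rng.standard_normal(2000)
    for deg in (2, 10):
        g = np.polynomial.polynomial.Polynomial.fit(x, y, deg)
        e_in = np.mean((g(x) - y) ** 2)
        e_out = np.mean((g(x_test) - y_test) ** 2)
        print(deg, round(e_in, 3), round(e_out, 3))

experiment()
```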
The Results
- impact of \(\sigma^2\) versus \(N\)
- Less noise (smaller \(\sigma^2\)) and more data points \(N\) mean less overfitting.
- impact of \(Q_{f}\) versus \(N\)
- A less complex target function (smaller \(Q_{f}\)) and more data points \(N\) mean less overfitting.
- (figure note: the class logo comes from these result plots; the middle two panels illustrate overfitting)
Impact of Noise and Data Size
- impact of \(\sigma^2\) versus \(N\): stochastic noise
- impact of \(Q_{f}\) versus \(N\): deterministic noise
- four reasons of serious overfitting:
- data size \(N \downarrow\)
- stochastic noise \(\uparrow\)
- deterministic noise \(\uparrow\)
- excessive power \(\uparrow\), VC dimension is too high
overfitting ‘easily’ happens.
Deterministic Noise
- if \(f \notin \mathcal{H}\): something of \(f\) cannot be captured by \(\mathcal{H}\)
- deterministic noise: difference between best \(h^* \in \mathcal{H}\) and \(f\)
- acts like stochastic noise (the idea is not new: a pseudo-random generator is deterministic yet looks like noise)
- difference to stochastic noise:
- depends on \(\mathcal{H}\)
- fixed for a given x
Philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :)
Dealing with Overfitting
Driving Analogy Revisited
| learning | driving |
| --- | --- |
| overfit | commit a car accident |
| use excessive \(d_{VC}\) | ‘drive too fast’ |
| noise | bumpy road |
| limited data size \(N\) | limited observations about road condition |
| start from simple model | drive slowly |
| data cleaning/pruning | use more accurate road information |
| data hinting | exploit more road information |
| regularization | put on the brakes |
| validation | monitor the dashboard |
All very practical techniques to combat overfitting.
Data Cleaning/Pruning
- if the outlier (the ‘5’ at the top of the figure) can be ‘detected’ by
- too close to other \(\circ\), or too far from other \(\times\)
- being classified wrongly by the current classifier
- …
- possible action 1: correct the label (data cleaning)
- possible action 2: remove the example (data pruning)
Possibly helps: but effect varies
Data Hinting
- slightly shifted/rotated digits carry the same meaning
- possible action: add virtual examples by shifting/rotating the given digits (data hinting)
Possibly helps: but watch out - virtual example not \(\overset{iid}{\sim} P(x, y)\)!
Conclusions
- What is overfitting?
- lower \(E_{in}\), but higher \(E_{out}\)
- The role of Noise and Data size
- overfitting ‘easily’ happens
- Deterministic Noise
- what \(\mathcal{H}\) cannot capture acts like noise
- Dealing with Overfitting
- data cleaning/pruning/hinting, and more
Lecture 14 Regularization
Overfitting happens with excessive power, stochastic/deterministic noise, and limited data.
Regularized Hypothesis set
Regularization: The Magic
overfit \(\implies\) ‘regularized fit’
- idea: ‘step back’ from \(\mathcal{H}_{10}\) to \(\mathcal{H}_{2}\)
- name history: function approximation for ill-posed problems
how to step back?
Stepping Back as Constraint
\(Q\)-th order polynomial transform for \(x \in \mathbb{R}\):
\[
\Phi_{Q}(x) = (1, x, x^2, \dots, x^Q)
\]
plus linear regression; for brevity denote \(\tilde{w}\) simply by \(w\)
- hypothesis w in \(\mathcal{H}_{10}\): \(w_0 + w_1 x + w_2 x^2 + \dots + w_{10} x^{10}\)
- hypothesis w in \(\mathcal{H}_{2}\): \(w_0 + w_1 x + w_2 x^2\)
- that is, \(\mathcal{H}_{2} = \mathcal{H}_{10}\) AND ‘Constraint that \(w_3 = \dots = w_{10} = 0\)’
- step back = ‘constraints’
Regression with Constraint
- \(\mathcal{H}_{10} \equiv \{w \in \mathbb{R}^{10+1}\}\)
- regression with \(\mathcal{H}_{10}\): \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(\mathcal{H}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while } w_3 = \dots = w_{10} = 0\}\)
- regression with \(\mathcal{H}_{2}\):
- \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(s.t. w_3 = \dots = w_{10} = 0\)
step back = constrained optimization of \(E_{in}\)
Why not just use \(w \in \mathbb{R}^{2+1}\)?
Regression with Looser Constraint
- \(\mathcal{H}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while } w_3 = \dots = w_{10} = 0\}\)
- regression with \(\mathcal{H}_{2}\):
- \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(s.t. w_3 = \dots = w_{10} = 0\)
Relax to:
- \(\mathcal{H'}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while at least } 8 \text{ of the } w_{q} = 0\}\)
- regression with \(\mathcal{H'}_{2}\):
- \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(\text{s.t.} \sum_{q=0}^{10} [w_q \ne 0] \le 3\)
- more flexible than \(\mathcal{H}_2\): \(\mathcal{H}_{2} \subset \mathcal{H'}_{2}\)
- less risky than \(\mathcal{H}_{10}\): \(\mathcal{H'}_{2} \subset \mathcal{H}_{10}\)
Bad news for sparse hypothesis set \(\mathcal{H'}_{2}\): NP-hard to solve.
- counting non-zero weights is a discrete (Boolean) constraint, hence combinatorial.
Regression with Softer Constraint
Relax to:
\[
\begin{split}
\mathcal{H}(C) \equiv \{ w \in \mathbb{R}^{10+1} \text{ while } \|w\|^2 \le C \} \\
\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w) \quad \text{s.t. } \sum_{q=0}^{10} w_q^2 \le C
\end{split}
\]
- \(\mathcal{H}(C)\): overlaps but not exactly the same as \(\mathcal{H'}_2\)
- soft and smooth Structure over \(C \ge 0\):
- \(\mathcal{H}(0) \subset \mathcal{H}(1.126) \subset \dots \subset \mathcal{H}(1126) \subset \dots \subset \mathcal{H}(\infty) = \mathcal{H}_{10}\)
regularized hypothesis \(w_{REG}\): optimal solution from regularized hypothesis set \(\mathcal{H}(C)\)
Weight Decay Regularization
Matrix Form of Regularized Regression Problem
\[
\begin{split}
\underset{w\in\mathbb{R}^{Q+1}}{\text{min}} E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T z_n - y_n)^2 \\
\text{s.t. } \sum_{q=0}^{Q} w_q^2 \le C
\end{split}
\]
- \(\sum_{n=1}^{N} (w^T z_n - y_n)^2 = (Zw - y)^T(Zw - y)\)
- \(w^Tw \le C\): feasible \(w\) within a radius-\(\sqrt{C}\) hypersphere
How to solve the constrained optimization problem?
The Lagrange Multiplier
\[
\underset{w\in\mathbb{R}^{Q+1}}{\text{min}} E_{in}(w) = \frac{1}{N} (Zw - y)^T(Zw - y)
\text{s.t. } w^Tw \le C
\]
- decreasing direction: \(- \nabla E_{in}(w)\)
- normal vector of \(w^T w = C\): \(w\)
- if \(- \nabla E_{in}(w)\) and \(w\) not parallel: can decrease \(E_{in}(w)\) without violating the constraint
- at the optimal solution \(w_{REG}\), \(- \nabla E_{in}(w_{REG}) \propto w_{REG}\)
- want: find Lagrange multiplier \(\lambda > 0\) and \(w_{REG}\) such that \(\nabla E_{in}(w_{REG}) + \frac{2 \lambda}{N} w_{REG} = 0\)
Augmented Error
- if oracle tells you \(\lambda > 0\), then solving \(\nabla E_{in}(w_{REG}) + \frac{2 \lambda}{N} w_{REG} = 0\)
\[
\frac{2}{N}(Z^T Z w_{REG} - Z^T y) + \frac{2 \lambda}{N} w_{REG} = 0
\]
\[
w_{REG} \leftarrow (Z^T Z + \lambda I)^{-1} Z^T y
\]
- \((Z^T Z + \lambda I)\) always invertible.
- called ridge regression in Statistics.
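In code, ridge regression is one linear solve (my own sketch, with \(Z\) already containing the transformed features):

```python
# Ridge regression (weight-decay regularization): w_REG = (Z^T Z + lambda I)^{-1} Z^T y.
import numpy as np

def ridge_regression(Z, y, lam):
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# lam = 0 recovers plain linear regression (when Z^T Z is invertible);
# larger lam corresponds to a smaller effective constraint radius C.
```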
Another view, equivalent to minimizing augmented error \(E_{aug}(w)\):
\[
E_{in}(w) + \frac{\lambda}{N} w^{T} w
\]
- regularization with augmented error instead of constrained \(E_{in}\)
\[
w_{REG} \leftarrow \underset{w}{\text{argmin}}\ E_{aug}(w) \quad \text{for given } \lambda > 0 \text{ or } \lambda = 0
\]
- minimizing unconstrained \(E_{aug}\) effectively minimizes some \(C\)-constrained \(E_{in}\)
The Results
- \(\lambda = 0\), overfitting
- \(\lambda = 0.0001\), OK
- \(\lambda = 0.01\), OK
- \(\lambda = 1\), underfitting, complexity is not enough.
- Philosophy: a little regularization goes a long way.
call ‘\(+\frac{\lambda}{N} w^T w\)’ weight-decay regularization:
- larger \(\lambda \iff\) prefer shorter \(w \iff\) effectively smaller \(C\)
Works with any transform + linear model
Some details: Legendre Polynomials
\[
\underset{w\in\mathbb{R}^{Q+1}}{\text{min}}
\frac{1}{N} \sum_{n=1}^{N} (w^T \Phi(x_n) - y_n)^2 +
\frac{\lambda}{N} \sum_{q=0}^{Q} w_q^2
\]
naive polynomial transform:
\[
\Phi(x) = (1, x, x^2, \dots, x^Q)
\]
When \(x_n \in [-1, +1]\), the high-order terms \(x_n^q\) are really small, so representing the target needs large \(w_q\); the regularizer then penalizes those high-order dimensions too much.
normalized polynomial transform:
\[
(1, L_1(x), L_2(x), \dots, L_Q(x))
\]
orthonormal basis functions called Legendre Polynomials
Regularization and VC Theory
Regularization and VC Theory
- Regularization by constrained-minimizing \(E_{in}\)
\[
\underset{w}{\text{min}} E_{in}(w) \text{s.t. } w^T w \le C
\]
- Regularization by Minimizing \(E_{aug}\)
\[
\underset{w}{\text{min}} E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^Tw
\]
- VC Guarantee of constrained-minimizing \(E_{in}\)
\[
E_{out}(w) \le E_{in}(w) + \Omega(\mathcal{H}(C))
\]
minimizing \(E_{aug}\): indirectly getting VC guarantee without confining to \(\mathcal{H}(C)\).
Another View of Augmented Error
- Augmented Error: \(E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^Tw\)
VC Bound: \(E_{out}(w) \le E_{in}(w) + \Omega(\mathcal{H}(C))\)
- regularizer \(w^Tw = \Omega(w)\): complexity of a single hypothesis
- generalization price \(\Omega(\mathcal{H})\): complexity of a hypothesis set
if \(\frac{\lambda}{N}\Omega(w)\) ‘represents’ \(\Omega(\mathcal{H})\) well, \(E_{aug}\) is a better proxy of \(E_{out}\) than \(E_{in}\)
- minimizing \(E_{aug}\):
- (heuristically) operating with the better proxy;
- (technically) enjoying the flexibility of whole \(\mathcal{H}\)
Effective VC Dimension
\[
\underset{w\in\mathbb{R}^{\tilde{d}+1}}{\text{min}}\
E_{in}(w) + \frac{\lambda}{N}\Omega(w)
\]
- model complexity?
- \(d_{VC}(\mathcal{H}) = \tilde{d} + 1\), because {\(w\)} ‘all considered’ during minimization
- {\(w\)} ‘actually needed’: \(\mathcal{H}(C)\), with some \(C\) equivalent to \(\lambda\)
- \(d_{VC}(\mathcal{H}(C))\): effective VC dimension \(d_{EFF}(\mathcal{H}, \underbrace{\mathcal{A}}_{\text{min} E_{aug}})\)
explanation of regularization: \(d_{VC}(\mathcal{H})\) large, while \(d_{EFF}(\mathcal{H}, \mathcal{A})\) small if \(\mathcal{A}\) regularized.
General Regularizers
General Regularizers \(\Omega(w)\)
want: constraint in the ‘direction’ of target function
- target-dependent: some properties of target, if known (domain knowledge)
- symmetry regularizer: \(\sum [q \text{ is odd}]\, w_q^2\)
- plausible: direction towards smoother or simpler
- regularization is to overcome noise
- stochastic/deterministic noise both non-smooth
- sparsity (L1) regularizer: \(\sum |w_q|\)
- friendly: easy to optimize for \(\mathcal{A}\)
- weight-decay (L2) regularizer: \(\sum w_q^2\)
a bad regularizer? no worries, it is guarded by \(\lambda\) (a small or zero \(\lambda\) switches it off)
- augmented error = error \(\widehat{\text{err}}\) + regularizer \(\Omega\)
- regularizer: target-dependent, plausible or friendly. ringing a bell?
error measure: user-dependent, plausible or friendly.
L2 and L1 Regularizer
- L2 Regularizer
- \(\Omega(w) = \sum_{q=0}^{Q} w_q^2 = \begin{Vmatrix} w \end{Vmatrix}_2^2\)
- convex, differentiable everywhere
- easy to optimize
- L1 Regularizer
- \(\Omega(w) = \sum_{q=0}^{Q} |w_q| = \begin{Vmatrix} w \end{Vmatrix}_1\)
- convex, not differentiable everywhere
- sparsity in solution
- the optimal solution often sits at a ‘corner’ of the L1 ball, where some components \(w_q\) are exactly 0.
L1 useful if needing sparse solution.
The Optimal \(\lambda\)
- more noise \(\iff\) more regularization needed
- more bumpy road \(\iff\) putting brakes more
- noise unknown - important to make proper choices
How to choose? Stay tuned for the next lecture!
Conclusions
- Regularized Hypothesis set
- original \(\mathcal{H}\) + constraint
- Weight Decay Regularization
- add \(\frac{\lambda}{N} w^Tw\) in \(E_{aug}\)
- Regularization and VC Theory
- regularization decreases \(d_{\text{EFF}}\)
- General Regularizers
- target-dependent, plausible or friendly.
Lecture 15 Validation
Regularization minimizes the augmented error, where the added regularizer effectively limits model complexity.
Model Selection Problem
So Many Models Learned
Even just for Binary Classification …
- \(\mathcal{A} \in\) {PLA, pocket, linear regression, logistic regression}
- \(T \in\) {100, 1000, 10000}
- \(\eta \in\) {1, 0.01, 0.0001}
- \(\Phi \in\) {linear, quadratic, poly-10, Legendre-poly-10}
- \(\Omega(w) \in\) {L2 regularizer, L1 regularizer, symmetry regularizer}
- \(\lambda \in\) {0, 0.01, 1}
in addition to your favorite combination, you may need to try other combinations to get a good \(g\).
Model Selection Problem
How to select? Visually? No! See Lecture 12.
- not every problem can be visualized
- visual selection secretly pays for the high VC dimension of the human brain
Model Selection by Best \(E_{in}\)
- select by best \(E_{in}\)?
- \(m^* = \underset{1 \le m \le M}{\text{argmin}}\ (E_m = E_{in}(\mathcal{A}_m(\mathcal{D})))\)
- \(\Phi_{1126}\) always more preferred over \(\Phi_1\)
- \(\lambda = 0\) always more preferred over \(\lambda = 0.1\) - overfitting
- if \(\mathcal{A}_1\) minimizes over \(\mathcal{H}_1\) and \(\mathcal{A}_2\) minimizes over \(\mathcal{H}_2\),
- \(\implies g_{m*}\) achieves minimal \(E_{in}\) over \(\mathcal{H}_{1} \cup \mathcal{H}_{2}\)
- \(\implies\) Extra model complexity, learning pays \(d_{VC}(\mathcal{H}_{1} \cup \mathcal{H}_{2})\)
- bad generalization?
selecting by \(E_{in}\) is dangerous.
Model Selection by Best \(E_{test}\)
- select by best \(E_{test}\), which is evaluated on a fresh \(\mathcal{D}_{test}\)?
- \(m^* = \underset{1 \le m \le M}{\text{argmin}}\ (E_m = E_{test}(\mathcal{A}_m(\mathcal{D})))\)
- generalization guarantee (finite-bin Hoeffding):
- yes! strong guarantee! \[
E_{out}(g_{m*}) \le E_{test}(g_{m*}) + O\left(\sqrt{\frac{\log M}{N_{\text{test}}}}\right)
\]
- But where is \(\mathcal{D}_{test}\)? Your boss’s safe, maybe?
selecting by \(E_{test}\) is infeasible and cheating!
Comparison between \(E_{in}\) and \(E_{test}\)
- in-sample error \(E_{in}\)
- calculate from \(\mathcal{D}\)
- feasible on hand
- ‘contaminated’ as \(\mathcal{D}\) also used by \(\mathcal{A}_{m}\) to select \(g_{m}\)
- test error \(E_{test}\)
- calculate from \(\mathcal{D}_{test}\)
- infeasible in boss’s safe
- ‘clean’ as \(\mathcal{D}_{test}\) never used for selection before
- something in between: \(E_{val}\)
- calculated from \(\mathcal{D}_{val} \subset \mathcal{D}\)
- feasible on hand
- clean if \(\mathcal{D}_{val}\) never used by \(\mathcal{A}_{m}\) before.
selecting by \(E_{val}\): legal cheating.
Validation
Validation Set \(\mathcal{D}_{val}\)
\[
\underbrace{\mathcal{D}}_{\text{size } N} \rightarrow \underbrace{\mathcal{D}_{train}}_{\text{size } N-K} \cup \underbrace{\mathcal{D}_{val}}_{\text{size } K}
\]
- \(\mathcal{D}_{val} \subset \mathcal{D}\): called validation set - ‘on hand’ simulation of test set to connect \(E_{val}\) with \(E_{out}\):
- \(\mathcal{D}_{val} \underset{iid}{\sim} P(x, y)\): select \(K\) examples from \(\mathcal{D}\) at random
- to make sure \(\mathcal{D}_{val}\) ‘clean’: feed only \(\mathcal{D}_{train}\) to \(\mathcal{A}_m\) for model selection.
- \(g_{m} = \mathcal{A}_m(\mathcal{D})\)
- \(g_{m}^- = \mathcal{A}_m(\mathcal{D}_{train})\)
\[
E_{out}(g_m^-) \le E_{val}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right)
\]
Model Selection by Best \(E_{val}\)
\[
m^* = \underset{1 \le m \le M}{\text{argmin}} \big(E_m = E_{val}(\mathcal{A}_m(\mathcal{D}_{train}))\big)
\]
- generalization guarantee for all \(m\):
- \(E_{out}(g_m^-) \le E_{val}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right)\)
- heuristic gain from \(N-K\) to \(N\):
- \(E_{out}(\underbrace{g_{m*}}_{\mathcal{A}_{m*}(\mathcal{D})}) \le E_{out}(\underbrace{g_{m*}^-}_{\mathcal{A}_{m}(\mathcal{D}_{train})})\)
- This follows from the learning curve: training on more data (\(N\) instead of \(N-K\)) generally yields lower \(E_{out}\).
- The general flow:
- divide \(\mathcal{D}\) to \(\mathcal{D}_{val}\) and \(\mathcal{D}_{train}\).
- run each \(\mathcal{A}_m\) on \(\mathcal{D}_{train}\) to obtain \(M\) candidates \(g_1^-, \dots, g_M^-\)
- pick the best \((\mathcal{H}_{m^*}, \mathcal{A}_{m^*})\) by \(E_{val}\)
- re-run \(\mathcal{A}_{m^*}\) on the full \(\mathcal{D}\) to obtain \(g_{m^*}\) (see the sketch below)
\[
E_{out}(g_{m^*}) \le E_{out}(g_{m^*}^-) \le E_{val}(g_{m^*}^-) + O\left(\sqrt{\frac{\log M}{K}}\right)
\]
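A minimal sketch of this flow under an assumed interface: each candidate exposes hypothetical `.fit(X, y)` (returning the fitted model) and `.error(X, y)` methods; none of these names come from the lecture.

```python
import numpy as np

def select_by_validation(models, X, y, K):
    """Train each candidate on D_train, pick by E_val, then retrain the winner on all of D."""
    N = len(y)
    idx = np.random.permutation(N)
    val_idx, train_idx = idx[:K], idx[K:]        # random split: D_val of size K, D_train of size N-K
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    e_val = [m.fit(X_tr, y_tr).error(X_val, y_val) for m in models]   # E_val(g_m^-)
    best = int(np.argmin(e_val))                 # m*
    return models[best].fit(X, y)                # heuristic gain: A_{m*} re-run on the full D
```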
Validation in Practice
use validation to select between \(\mathcal{H}_{\Phi_5}\) and \(\mathcal{H}_{\Phi_{10}}\)
- in-sample: selection with \(E_{in}\)
- optimal: cheating-selection with \(E_{test}\)
- sub-\(g\): selection with \(E_{val}\) and report \(g_{m*}^-\)
- full-\(g\): selection with \(E_{val}\) and report \(g_{m*}\)
- \(E_{out}(g_{m*}) \le E_{out}(g_{m*}^-)\) indeed.
Why is sub-\(g\) sometimes worse than in-sample selection?
- when \(\mathcal{D}_{val}\) grows, \(\mathcal{D}_{train}\) shrinks, so every \(g_m^-\) is trained on fewer examples and gets worse.
The Dilemma about \(K\)
reasoning of validation:
\[
E_{out}(g) \underset{(\text{small } K)}{\approx} E_{out}(g^-)
\underset{(\text{large } K)}{\approx} E_{val}(g^-)
\]
- large \(K\): every \(E_{val} \approx E_{out}\), but all \(g_m^-\) much worse than \(g_m\)
- small \(K\): every \(g_m^- \approx g_m\), but \(E_{val}\) far from \(E_{out}\)
Practical rule of thumb: \(K = \frac{N}{5}\)
Fun Time
Validation need not slow things down, and may even speed training up, because each candidate model is trained on the smaller \(\mathcal{D}_{train}\).
Leave-One-Out Cross Validation
Extreme Case: \(K = 1\)
reasoning of validation:
\[
E_{out}(g) \underset{(\text{small } K)}{\approx} E_{out}(g^-)
\underset{(\text{large } K)}{\approx} E_{val}(g^-)
\]
- take \(K = 1\)? \(\mathcal{D}_{val}^{(n)} = \{(x_n, y_n)\}\) and \(E_{val}^{(n)}(g_n^-) = \text{err}(g_{n}^-(x_n), y_n) = e_n\)
- make \(e_n\) closer to \(E_{out}(g)\)? - average over possible \(E_{val}^{(n)}\)
- Leave-One-Out cross validation estimate:
\[
E_{loocv}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_{n}
= \frac{1}{N} \sum_{n=1}^{N} \text{err}(g_{n}^-(x_{n}), y_{n})
\]
- it estimates how well algorithm \(\mathcal{A}\) performs on data set \(\mathcal{D}\) (a small sketch follows this list).
- hope: \(E_{loocv}(\mathcal{H}, \mathcal{A}) \approx E_{out}(g)\)
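A minimal LOOCV sketch, assuming the same hypothetical interface as before plus a `.predict` method and an `err` function supplied by the caller:

```python
import numpy as np

def loocv_error(model, X, y, err):
    """E_loocv = (1/N) * sum_n err(g_n^-(x_n), y_n): leave one point out, train on the rest."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n                 # D_n = D minus the n-th example
        g_minus = model.fit(X[mask], y[mask])    # g_n^- (assume .fit returns the fitted model)
        e[n] = err(g_minus.predict(X[n:n+1]), y[n:n+1])
    return e.mean()
```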
Illustration of Leave-One-Out
\[
E_{loocv}(\text{linear}) = \frac{1}{3} (e_1 + e_2 + e_3)
\]
\[
E_{loocv}(\text{constant}) = \frac{1}{3} (e_1 + e_2 + e_3)
\]
- Which one would you choose?
- With only three data points, the constant model gives a lower \(E_{loocv}\) and is the better choice.
- \(m^* = \underset{1 \le m \le M}{\text{argmin}} \big(E_m = E_{loocv}(\mathcal{H}_m, \mathcal{A}_m)\big)\)
Theoretical Guarantee of Leave-One-Out Estimate
Does \(E_{loocv}(\mathcal{H}, \mathcal{A})\) say something about \(E_{out}(g)\)?
Yes! for average \(E_{out}\) on size \(N-1\) data.
\[
\underset{\mathcal{D}}{\mathcal{E}}\, E_{loocv}(\mathcal{H}, \mathcal{A})
= \underset{\mathcal{D}}{\mathcal{E}}\, \frac{1}{N} \sum_{n=1}^{N} e_n
= \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}}{\mathcal{E}}\, e_n
= \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}_n}{\mathcal{E}}\, \underset{(x_n, y_n)}{\mathcal{E}}\, \text{err}(g_n^-(x_n), y_n)
= \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}_n}{\mathcal{E}}\, E_{out}(g_n^-)
= \frac{1}{N} \sum_{n=1}^{N} \overline{E_{out}}(N-1)
= \overline{E_{out}}(N-1)
\]
expected \(E_{loocv}(\mathcal{H}, \mathcal{A})\) says something about expected \(E_{out}(g^-)\)
- Often called ‘almost unbiased estimate of \(E_{out}(g)\)’
Leave-One-Out in Practice
- Selecting by \(E_{in}\): result not smooth, overfits.
- Selecting by \(E_{loocv}\): smoother result, avoids overfitting.
- \(E_{loocv}\) works much better than \(E_{in}\).
V-Fold Cross Validation
Disadvantage of Leave-One-Out Estimate
Computation issue:
\[
E_{loocv}(\mathcal{H}, \mathcal{A})
= \frac{1}{N} \sum_{n=1}^{N} e_n
= \frac{1}{N} \sum_{n=1}^{N} \text{err} (g_n^-(x_n), y_n)
\]
- \(N\) ‘additional’ training per model, not always feasible in practice.
- e.g., with \(|\mathcal{D}| = 1000\), each candidate model needs 1000 extra trainings, each on 999 points.
- except in 'special cases' with an analytic shortcut, such as linear regression (a formula is sketched right after this list).
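As a side note not spelled out in the lecture: for plain linear regression with squared error, the leave-one-out errors can be obtained from a single fit via the hat matrix \(H = X(X^TX)^{-1}X^T\), with \(\hat{y} = Hy\):
\[
E_{loocv} = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{y_n - \hat{y}_n}{1 - H_{nn}} \right)^2
\]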
Stability issue - due to the variance of single-point estimates:
- the Leave-One-Out curve is jagged, which makes it risky for selecting the best model; we do not want to mistakenly pick a low point that was merely lucky to have a low \(E_{cv}\).
V-fold Cross Validation
how to decrease computation need for cross validation?
- essence of Leave-One-Out cross validation: partition \(\mathcal{D}\) into \(N\) parts, each time taking \(N-1\) for training and 1 for validation
- \(V\)-fold cross-validation: randomly partition \(\mathcal{D}\) into \(V\) equal parts, each time taking \(V-1\) parts for training and 1 for validation
\[
E_{cv}(\mathcal{H}, \mathcal{A})
= \frac{1}{V} \sum_{v=1}^{V} E_{val}^{(v)}(g_v^-)
\]
- selection by \(E_{cv}\): \(m^* = \underset{1 \le m \le M}{\text{argmin}} \big(E_m = E_{cv}(\mathcal{H}_m, \mathcal{A}_m)\big)\) (sketch below)
Practical rule of thumb: \(V = 10\)
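A minimal \(V\)-fold sketch under the same hypothetical fit/predict interface as the LOOCV sketch; it cuts the extra trainings from \(N\) to \(V\) per candidate model:

```python
import numpy as np

def vfold_cv_error(model, X, y, err, V=10):
    """E_cv = (1/V) * sum_v E_val^{(v)}(g_v^-) over a random partition of D into V folds."""
    N = len(y)
    folds = np.array_split(np.random.permutation(N), V)   # random partition of D into V parts
    e = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)    # V-1 parts for training
        g_minus = model.fit(X[train_idx], y[train_idx])
        e.append(err(g_minus.predict(X[val_idx]), y[val_idx]))
    return float(np.mean(e))
```

Selection by \(E_{cv}\) then simply takes the argmin of this estimate over the candidate models.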
Final Words on Validation
‘Selecting’ Validation Tool
- \(V\)-fold generally preferred over single validation if computation allows
- 5-fold or 10-fold generally works well: it is usually not necessary to trade up from \(V\)-fold to Leave-One-Out
Nature of Validation
- all training models: select among hypotheses, like qualification match for World cup
- all validation Schemes: select among finalists, like knock-out stage
- all testing methods: just evaluate
Validation results are still more optimistic than true testing
- validation is still selection, so it mildly 'contaminates' the data and adds back some model complexity.
Do not fool yourself and others, report test result, not best validation result.
Conclusions
- Model Selection Problem
- Dangerous by \(E_{in}\) and dishonest by \(E_{test}\)
- Validation
- Select with \(E_{val}(\mathcal{D}_{train})\) while returning \(\mathcal{A}_{m*}(\mathcal{D})\)
- Leave-One-Out Cross Validation
- huge computation for an almost unbiased estimate
- V-Fold Cross Validation
- reasonable computation and performance
Lecture 16 Three Learning Principles (tricks to keep up your sleeve)
Occam’s Razor
An explanation of the data should be made as simple as possible, but no simpler - Albert Einstein
entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity) - William of Occam (1287 - 1347)
‘Occam’s razor’ for trimming down unnecessary explanation.
Occam’s Razor for Learning
- The simplest model that fits the data is also the most plausible.
- Two questions:
- What does it mean for a model to be 'simple'?
- How do we know that simpler is better?
Simple Model
- Simple hypothesis \(h\)
- small \(\Omega(h)\) = looks simple
- specifically by few parameters
- simple model \(\mathcal{H}\)
- small \(\Omega(\mathcal{H})\) = not many
- contains a small number of hypotheses
- connection
- \(|\mathcal{H}|\) of size \(2^{\ell}\) \(\implies\) each \(h\) specified by \(\ell\) bits
- small \(\Omega(\mathcal{H}) \implies\) small \(\Omega(h)\)
- simple: small hypothesis/model complexity
- either we use simple hypothesis set
- or we use regularization
Simple is Better
In addition to the math proofs that you have seen, philosophically:
- simple \(\mathcal{H}\)
- \(\implies\) smaller \(m_{\mathcal{H}}(N)\)
- \(\implies\) less likely to fit a data set perfectly: only about \(\frac{m_{\mathcal{H}}(N)}{2^N}\) of labelings can be fit
- for an arbitrary data set whose outputs are generated by coin flips (messy data), there is no pattern to fit
- so when a simple model does fit the data, the fit is significant evidence of real structure
Sampling Bias
Presidential Story
- 1948 US President election: Truman versus Dewey
- a newspaper ran a phone poll of how people voted and, based on the polling, printed the headline 'Dewey Defeats Truman'
The Big Smile Came from …
Truman, and yes he won
suspects for the mistake:
- editorial bug? - no
- bad luck of polling (\(\delta\))? - no
hint: phones were expensive
Sampling Bias
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
- technical explanation: data from \(P_1(x, y)\) but test under \(P_2 \ne P_1\): VC fails
- Philosophical explanation: study Math hard but test English: no strong test guarantee
- ‘minor’ VC assumption: data and testing both iid from \(P\)
Sampling Bias in Learning
A True Personal Story
Dealing with Sampling Bias
- Practical rule of thumb: match test scenario as much as possible.
- e.g. if test: ‘last’ user records ‘after’ \(\mathcal{D}\)
- training: emphasize later examples (KDDCup 2011)
- validation: use ‘late’ user records
- last puzzle: what is the danger when learning 'credit card approval' from the bank's existing records?
- the existing records cover only previously approved customers, so we do not know whether new applicants will follow the behavior of the existing users.
Data Snooping
Visual Data Snooping
Visualize: \(\mathcal{X} = \mathbb{R}^2\)
- full \(\Phi_2: z = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\), \(d_{VC} = 6\)
- or \(z = (1, x_1^2, x_2^2)\), \(d_{VC} = 2\)?
- or better \(z = (1, x_1^2 + x_2^2)\), \(d_{VC} = 2\)?
- or even better \(z = \text{sign}(0.6 - x_1^2 - x_2^2)\)?
Careful about your brain’s ‘model complexity’.
- For VC-safety, \(\Phi\) shall be decided without ‘Snooping’ data.
Data Snooping by Mere Shifting-Scaling
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
Data Snooping by Data Reusing
Research Scenario
- benchmark data \(\mathcal{D}\)
- paper1: propose \(\mathcal{H}_1\) that works well on \(\mathcal{D}\)
- paper2: find room for improvement, propose \(\mathcal{H}_2\) - and publish only if better than \(\mathcal{H}_1\) on \(\mathcal{D}\)
- paper3: find room for improvement, propose \(\mathcal{H}_3\) - and publish only if better than \(\mathcal{H}_2\) on \(\mathcal{D}\)
- If all papers from the same author in one big paper: bad generalization due to \(d_{VC}(\cup_m \mathcal{H}_m)\)
- step-wise: later author snooped data by reading earlier papers
If you torture the data long enough, it will confess :)
Dealing with Data Snooping
- truth - very hard to avoid, unless being extremely honest
- extremely honest: lock your test data in safe
- less honest: reserve validation and use cautiously
- be blind: avoid making modeling decisions by looking at the data (try not to peek at the data before deciding)
- be suspicious: interpret research results (including your own) by proper feeling of contamination
One secret to winning KDDCups: a careful balance between data-driven modeling (snooping) and validation (no-snooping).
Power of Three
Three Related Fields
Power of Three
- Data Mining
- use (huge) data to find property that is interesting
- difficult to distinguish ML and DM in reality
- Artificial Intelligence
- Compute something that shows intelligent behavior
- ML is one possible route to realize AI
- Statistics
- use data to make inference about an unknown process
- statistics contains many useful tools for ML
Three Theoretical Bounds
- Hoeffding: \(\text{P[BAD]} \le 2 \text{exp} (-2 \epsilon^2 N)\)
- one hypothesis
- useful for verifying/testing
- Multi-Bin Hoeffding: \(\text{P[BAD]} \le 2 M \text{exp} (-2 \epsilon^2 N)\)
- \(M\) hypotheses
- useful for validation
- VC: \(\text{P[BAD]} \le 4 m_{\mathcal{H}}(2N) \text{exp} (\dots)\)
- all \(\mathcal{H}\)
- useful for training
Three Linear Models
- PLA/pocket: \(h(x) = sign(s)\)
- plausible err = 0/1 (small flipping noise), minimized specially
- Linear regression \(h(x) = s\)
- friendly err = squared (easy to minimize), minimized analytically
- logistic regression \(h(x) = \theta(s)\)
- plausible err = CE (maximum likelihood), minimized iteratively
Three Key Tools
- Feature Transformation
- \(E_{in}(w) \rightarrow E_{in}(\tilde{w})\), \(d_{VC}(\mathcal{H}) \rightarrow d_{VC}(\mathcal{H}_{\Phi})\)
- by using more complicated \(\Phi\)
- lower \(E_{in}\)
- higher \(d_{VC}\)
- Regularization
- \(E_{in}(w) \rightarrow E_{in}(w_{REG})\), \(d_{VC}(\mathcal{H}) \rightarrow d_{\text{EFF}}(\mathcal{H}, \mathcal{A})\)
- by augmenting regularizer \(\Omega\)
- lower \(d_{\text{EFF}}\)
- higher \(E_{in}\)
- Validation
- \(E_{in}(h) \rightarrow E_{val}(g)\), \(\mathcal{H} \rightarrow \{g_1^-, \dots, g_M^-\}\)
- by reserving \(K\) examples as \(\mathcal{D}_{val}\)
- fewer choices
- fewer examples
Three Learning Principles
- Occam's Razor
- Sampling Bias
- Data Snooping
Three Future Directions
- More Transform
- More Regularization
- Less Label
Conclusions
- Occam’s Razor
- Sampling Bias
- match test scenario as much as possible
- Data Snooping
- any use of data is ‘contamination’
- Power of Three
- relatives, bounds, models, tools, principles