Lecture 5 Training versus Testing
Recap and Preview
If \(|\mathcal{H}| = M\) is finite and \(N\) is large enough, then for whatever \(g\) is picked by \(\mathcal{A}\), \(E_{out}(g) \approx E_{in}(g)\).
If \(\mathcal{A}\) finds one \(g\) with \(E_{in}(g) \approx 0\), then PAC guarantees that \(E_{out}(g) \approx 0\).
\[
E_{out}(h) \underbrace{\approx}_{test} E_{in}(h) \underbrace{\approx}_{train} 0
\]
Two Central Questions:
Can we make sure that \(E_{out}(g)\) is close enough to \(E_{in}(g)\)?
Can we make \(E_{in}(g)\) small enough?
What role does \(\underbrace{|\mathcal{H}|}_{M}\) play for these two questions?
Trade-off on \(M\)
small \(M\)
- Question 1: Yes! \(Pr(BAD) \leq 2 \cdot M \cdot \exp(\dots)\) is small.
- Question 2: No! too few choices!
large \(M\)
- Question 1: No! \(Pr(BAD) \leq 2 \cdot M \cdot \exp(\dots)\) is big.
- Question 2: Yes! many choices
What if \(M = \infty\)
Todo:
- establish a finite quantity that replaces \(M\) \[
Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 2 m_{\mathcal{H}} exp(-2 \epsilon^2 N)
\]
- justify the feasibility of learning for infinite \(M\)
- study \(m_{\mathcal{H}}\) to understand its trade-off for ‘right’ \(\mathcal{H}\)
These will be covered in next 3 lectures :-)
Effective Number of Lines
Where did \(M\) come from?
- BAD events \(B_m\): \(|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\)
- To give \(\mathcal{A}\) freedom of choice, we need to bound \(Pr(B_1 \cup B_2 \cup \dots \cup B_M)\)
- Worst case: all \(B_m\) non-overlapping, giving the union bound below; this bound fails when \(M\) is infinite. \[
Pr(B_1 \cup B_2 \cup \dots B_M) \leq Pr(B_1) + \dots + Pr(B_M)
\]
Where did the union bound fail?
- BAD events \(B_m\): \(|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\) overlap heavily for similar hypotheses \(h_1 \approx h_2\)
- Why?
- \(E_{out}(h_1) \approx E_{out}(h_2)\)
- for most \(D\), \(E_{in}(h_1) \approx E_{in}(h_2)\)
- union bound over-estimation
- Solution: group similar hypotheses by kind.
How Many lines Are There?
\(\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}\)
- how many lines? \(\infty\)
- how many kinds of lines from \(\mathbf{x_1}\) point of view? 2 kinds:
- \(h_1(x_1) = +1\) and \(h_2(x_1) = -1\)
- how many kinds of lines for 4 points? At most 14.
Effective Number of Lines
This means the maximum number of kinds of lines w.r.t. \(N\) inputs \(x_1, \dots, x_N\)
- must be \(\leq 2^N\)
- finite ‘grouping’ of infinitely-many lines \(\in \mathcal{H}\)
- wish:
\[
Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 2\ eff(N)\ exp(-2 \epsilon^2 N)
\]
If
- eff(N) can replace \(M\) and
- eff(N) << \(2^N\)
Then learning is possible for infinite lines :-)
Effective Number of Hypothesis
Dichotomy: mini hypothesis
Definition: a hypothesis ‘limited’ to the eyes of \(x_1, \dots, x_N\), i.e. \[
h(x_1, \dots, x_N) = (h(x_1), \dots, h(x_N)) \in \{\times, \circ\}^N
\]
- Dichotomy set \(\mathcal{H}(x_1, \dots, x_N)\): all dichotomies implemented by \(\mathcal{H}\) on \(x_1, \dots, x_N\).
- The dichotomy set has size at most \(2^N\), while \(\mathcal{H}\) can be infinite.
\(|\mathcal{H}(x_1, \dots, x_N)|\) is a candidate for replacing \(M\).
Growth Function
In order to remove the dependence of the dichotomy-set size on the particular inputs, define \(m_{\mathcal{H}}(N) = \underset{x_1, \dots, x_N}{\max} |\mathcal{H}(x_1, \dots, x_N)|\). For example, for 2D perceptrons, \(m_{\mathcal{H}}(1) = 2, m_{\mathcal{H}}(2) = 4, m_{\mathcal{H}}(3) = 8, m_{\mathcal{H}}(4) = 14\).
It is called growth function and bounded by \(2^N\).
Growth Function for Positive Rays
- \(\mathcal{X} = \mathbb{R}\)
- \(\mathcal{H}\) contains \(h\), where each \(h(x) = \text{sign}(x-a)\) for some threshold \(a\).
- positive half of 1D perceptrons.
- \(m_{\mathcal{H}}(N)=N+1 \ll 2^N\)
Growth Function for Positive Intervals
- \(\mathcal{X} = \mathbb{R}\)
- \(\mathcal{H}\) contains \(h\), where each \(h(x) = +1\) iff \(x \in [l,r)\), \(-1\) otherwise. \[
m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \frac{N^2}{2} + \frac{N}{2} + 1 \ll 2^N
\]
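The two growth functions above can be checked by brute force on small \(N\). Below is a quick sanity check I wrote (my own snippet, not from the lecture): it enumerates all dichotomies that positive rays and positive intervals can realize on \(N\) points and compares the counts with \(N+1\) and \(\binom{N+1}{2}+1\).

```python
# Brute-force the dichotomies of positive rays / positive intervals on N points
# and compare with the closed-form growth functions (my own sanity check).
import itertools
from math import comb

def dichotomies_positive_rays(xs):
    """All labelings sign(x - a) over thresholds a, on inputs xs."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in xs) for a in cuts}

def dichotomies_positive_intervals(xs):
    """All labelings '+1 iff x in [l, r)' over interval endpoints, on inputs xs."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    labelings = {tuple(+1 if l <= x < r else -1 for x in xs)
                 for l, r in itertools.combinations(cuts, 2)}
    labelings.add(tuple(-1 for _ in xs))   # empty interval: all -1
    return labelings

for N in range(1, 8):
    xs = list(range(N))                     # any N distinct points work
    rays = len(dichotomies_positive_rays(xs))
    ints = len(dichotomies_positive_intervals(xs))
    assert rays == N + 1
    assert ints == comb(N + 1, 2) + 1
    print(N, rays, ints)
```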
Growth Function for convex set
- If \(x_1, \dots, x_N\) lie on a big circle, then every dichotomy can be implemented (take the convex hull of the \(+1\) points as the positive region). \[
m_{\mathcal{H}}(N) = 2^N
\]
- call those \(N\) inputs ‘shattered’ by \(\mathcal{H}\).
Break Point
The 4 growth functions
- positive ray, \(m_{\mathcal{H}}(N)=N+1\)
- positive intervals, \(\frac{N^2}{2} + \frac{N}{2} + 1\)
- convex set, \(2^N\)
- 2D perceptrons: \(m_{\mathcal{H}}(N) < 2^N\) in some cases
Can we replace \(M\) by \(m_{\mathcal{H}}(N)\)? Polynomial good, exponential bad…
\[
Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \overset{?}{\leq} 2\ m_{\mathcal{H}}(N) \ exp(-2 \epsilon^2 N)
\]
Break point of 2D perceptrons
- three inputs: there ‘exists’ a set that can be shattered; four inputs: ‘for all’ sets, no shatter
- if no \(k\) inputs can be shattered by \(\mathcal{H}\), then \(k\) is a break point.
- any \(k+i\) (with \(i \geq 0\)) is also a break point.
- conjecture: break point \(k\), \(m_{\mathcal{H}}(N) = O(N^{k-1})\)
Conclusion
- Recap and Preview:
- two questions, \(E_{out} \approx E_{in}\) and \(E_{in}(g) \approx 0\)
- Effective No. of lines
- Effective No. of Hypothesis
- Break point
- When \(m_{\mathcal{H}}\) becomes non-exponential
Lecture 6 Theory of Generalization
Restriction of Break Point
Bounding Function: Basic Cases
- bounding function \(B(N, k)\): max possible \(m_{\mathcal{H}}\) when break point \(= k\)
- combinatorial quantity: max number of length-\(N\) \((\circ, \times)\) vectors while not shattering any length-\(k\) subvector.
- new goal: \(B(N, k) \leq \text{poly}(N)\)
Bounding Function: Inductive Cases
Estimating \(B(4, 3)\): for 4 points there are \(2^4 = 16\) possible \((\circ, \times)\) combinations, and each of them can either be in or not in the dichotomy set, so there are \(2^{16}\) candidate dichotomy sets. After exhaustively searching them, the largest set that shatters no 3 points has size 11.
Write \(B(4, 3) = 11 = 2\alpha + \beta\), where \(\alpha\) counts dichotomies on \((x_1, x_2, x_3)\) that appear in pairs (with both \(x_4 = \circ\) and \(x_4 = \times\)) and \(\beta\) counts those that appear alone. Since no 3 points are shattered, \(\alpha+\beta \leq B(3, 3)\).
If we only look at the paired (orange) part, no two of \((x_1, x_2, x_3)\) can be shattered (otherwise adding \(x_4\) would shatter 3 points). So \(\alpha \leq B(3, 2)\).
We have \(B(4, 3) \leq B(3, 3) + B(3, 2)\), and in general \(B(N, k) \leq B(N-1, k) + B(N-1, k-1)\).
Bounding Function: The Theorem
\[
B(N,k) \leq \sum_{i=0}^{k-1} \binom{N}{i}
\]
The bottom line: \(m_{\mathcal{H}}(N)\) is polynomial in \(N\) whenever a break point exists.
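A small sanity check (my own code, built on the base cases used in the lecture): compute an upper bound on \(B(N,k)\) with the recurrence and confirm it never exceeds \(\sum_{i=0}^{k-1}\binom{N}{i}\).

```python
# Upper-bound B(N, k) via the recurrence B(N,k) <= B(N-1,k) + B(N-1,k-1)
# and check it against the theorem's binomial sum (my own sanity check).
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    if k == 1:
        return 1            # no single point may be shattered: at most one dichotomy
    if N < k:
        return 2 ** N       # break point not reached yet: all dichotomies possible
    if N == k:
        return 2 ** N - 1   # exactly one dichotomy must be missing at the break point
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 12):
    for k in range(1, N + 1):
        assert B_upper(N, k) <= sum(comb(N, i) for i in range(k))

print(B_upper(4, 3))  # 11, matching the exhaustive search above
```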
Three Break Points
- positive rays: \(m_{\mathcal{H}}(2) = 3 < 2^2\): break point at 2.
- positive intervals: \(m_{\mathcal{H}}(3) = 7 < 2^3\): break point at 3.
- 2D perceptrons: \(m_{\mathcal{H}}(4) = 14 < 2^4\): break point at 4.
A pictorial Proof
want
\[
Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{out}(h)| > \epsilon) \leq 2\ m_{\mathcal{H}}(N) \ exp(-2 \epsilon^2 N)
\]
we can get \[
Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{out}(h)| > \epsilon) \leq 2\cdot 2\ m_{\mathcal{H}}(2N) \ exp(-2 \frac{1}{16} \epsilon^2 N)
\]
Step1: Replace \(E_{out}\) by \(E_{in}'\)
- How? sample verification set \(D'\) of size N to calculate \(E_{in}'\)
- a bad \(h\) w.r.t. \(E_{in}-E_{out}\) probably also looks bad w.r.t. \(E_{in}-E_{in}'\)
Step2: Decompose \(\mathcal{H}\) by kind
\[
BAD \leq 2Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2)
\leq 2m_{\mathcal{H}}(2N)Pr(fixed\ h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2)
\]
Step3: Use Hoeffding without Replacement
VC bound:
\[
BAD
\leq 2m_{\mathcal{H}}(2N)Pr(fixed\ h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2)
\leq 4m_{\mathcal{H}}(2N)exp(-\frac{1}{8}\epsilon^2N)
\]
Conclusion
- Restriction of Break Point: a break point at \(k\) also breaks all larger sizes, which greatly restricts \(m_{\mathcal{H}}(N)\)
- Bounding Function: Basic Cases, \(B(N,k)\) bounds \(m_{\mathcal{H}}(N)\)
- Bounding Function: Inductive Cases, \(B(N,k)\) is poly(N)
- A pictorial Proof: \(m_{\mathcal{H}}(N)\) can replace \(M\) with a few changes.
Lecture 7 The VC Dimension
Definition of VC Dimension
Recap: More on Growth Function
for break point \(k\): \(m_{\mathcal{H}}(N) \leq B(N,k)=\sum_{i=0}^{k-1} \binom{N}{i} = O(N^{k-1})\)
Recap: More on VC bound
For any \(g=\mathcal{A}(\mathcal{D}) \in \mathcal{H}\) and ‘statistically’ large \(\mathcal{D}\)
\[
\begin{split}
\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \\
\leq \mathbb{P}_{\mathcal{D}}(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{out}(h)| > \epsilon) \\
\leq 4m_{\mathcal{H}}(2N)exp(-\frac{1}{8}\epsilon^2N) \\
\leq 4(2N)^{k-1}exp(-\frac{1}{8}\epsilon^2N)
\end{split}
\]
If:
- \(m_{\mathcal{H}}\) breaks at \(k\), (good \(\mathcal{H}\))
- \(N\) large enough, (good \(\mathcal{D}\))
- \(\mathcal{A}\) picks a \(g\) with small \(E_{in}(g)\) (good \(\mathcal{A}\))
then: probably learned!
VC Dimension
Definition:
- \(d_{VC}(\mathcal{H})\), the VC dimension of \(\mathcal{H}\): the largest \(N\) for which \(m_{\mathcal{H}}(N) = 2^N\)
- the maximum number of inputs that \(\mathcal{H}\) can shatter
- \(N \leq d_{VC} \Longrightarrow \mathcal{H}\) can shatter some \(N\) inputs
- \(k > d_{VC} \Longrightarrow k\) is a break point
\[
m_{\mathcal{H}}(N) \leq N^{d_{VC}} \quad \text{if } N \geq 2 \text{ and } d_{VC} \geq 2
\]
The Four VC Dimensions
- positive ray, \(d_{VC} = 1\)
- positive intervals, \(d_{VC} = 2\)
- convex set, \(d_{VC} = \infty\)
- 2D perceptrons, \(d_{VC} = 3\)
VC Dimension and Learning
- regardless of learning algorithm
- regardless of input distribution \(P\)
- regardless of target function \(f\)
VC Dimension of Perceptrons
2D PLA revisited
- linearly separable \(\mathcal{D}\) -> PLA can converge -> \(T\) iterations -> \(E_{in}(g) = 0\)
- with \(x_n \sim P\) and \(y_n = f(x_n)\) -> \(\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq \dots\) because \(d_{VC} = 3\) -> \(N\) is large -> \(E_{in}(g) \approx E_{out}(g) \approx 0\)
VC Dimension of Perceptrons
- 1D (pos/neg rays): \(d_{VC}=2\)
- 2D: \(d_{VC}=3\)
- d-D: \(d_{VC} \overset{?}{=} d+1\)
- \(d_{VC} \geq d+1\)
- \(d_{VC} \leq d+1\)
\(d_{VC} \geq d+1\)
- there exist \(d+1\) inputs that can be shattered: take \(x_1 = (1, 0, \dots, 0)\) and \(x_{n+1} = (1, e_n^T)\) for \(n = 1, \dots, d\); the resulting matrix \(X\) is invertible, so for any desired labels \(y\) the weights \(w = X^{-1}y\) give \(\text{sign}(Xw) = y\).
\(d_{VC} \leq d+1\)
- linear dependence restricts the dichotomy: e.g. if \(x_4 = x_2 + x_3 - x_1\) and \(y_1 = -1\), \(y_2 = y_3 = +1\), then \[
w^Tx_4 = w^Tx_2 + w^Tx_3 - w^Tx_1 > 0
\] so \(y_4 = -1\) cannot be realized; hence no \(d+2\) inputs can be shattered.
Physical Intuition of VC Dimension
Degrees of Freedom
Two Old Friends
- Positive Rays, free parameters: \(a\)
- Positive Intervals, free parameters: \(l, r\)
- practical rule of thumb: \(d_{VC} \approx\) # of free parameters(but not always).
\(M\) and \(d_{VC}\)
- small \(d_{VC}\): question 1 yes (generalization likely), question 2 no (limited power)
- large \(d_{VC}\): question 1 no, question 2 yes (many choices)
- using the right \(d_{VC}\) (or \(\mathcal{H}\)) matters
Interpreting VC Dimension
VC Bound Rephrase: Penalty for Model Complexity
For any \(g=\mathcal{A}(\mathcal{D}) \in \mathcal{H}\) and ‘statistically’ large \(\mathcal{D}\)
\[
\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq
4(2N)^{d_{VC}}exp(-\frac{1}{8}\epsilon^2N)
= \delta
\] \[
\epsilon = \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}
\] with probability \(1-\delta\): generalization error \(|E_{in}(g)-E_{out}(g)| \leq \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}\) \[
E_{out}(g) \leq E_{in}(g)+\sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}
\]
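To get a feel for the numbers, here is a tiny snippet (my own, not from the lecture) that evaluates the penalty term for a fixed \(d_{VC}\) and a few sample sizes; the particular \(d_{VC}\) and \(\delta\) values are only for illustration.

```python
# Numeric illustration of Omega(N, H, delta) = sqrt((8/N) ln(4 (2N)^{d_vc} / delta))
import math

def vc_penalty(N, d_vc, delta=0.1):
    return math.sqrt(8.0 / N * math.log(4.0 * (2.0 * N) ** d_vc / delta))

d_vc = 3  # e.g. 2D perceptrons
for N in [100, 1000, 10000, 100000]:
    print(N, round(vc_penalty(N, d_vc), 3))   # penalty shrinks slowly as N grows
```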
THE VC message
\(\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}\) is called model complexity penalty.
- \(d_{VC} \uparrow\), \(E_{in} \downarrow\) but \(\Omega \uparrow\)
- \(d_{VC} \downarrow\), \(E_{in} \uparrow\) but \(\Omega \downarrow\)
- best \(d_{VC}\) is in the middle
- powerful \(\mathcal{H}\) not always good!
VC Bound Rephrase: Sample Complexity
- theory suggests needing \(N \approx 10000 \cdot d_{VC}\)
- practical rule of thumb: \(N \approx 10 \cdot d_{VC}\)
Looseness of VC Bound
- Hoeffding for unknown \(E_{out}\): any distribution, any target
- \(m_{\mathcal{H}}(N)\) instead of \(|\mathcal{H}(x_1, \dots, x_N)|\): any data
- \(N^{d_{VC}}\) instead of \(m_{\mathcal{H}}\), any \(\mathcal{H}\) of same \(d_{VC}\)
- union bound on worst cases, any choice made by \(\mathcal{A}\)
- but it is hard to do much better, and the bound is similarly loose for all models
- Philosophical message of VC bound is important for improving ML
Conclusions
- Definition VC Dimension: max non-break point
- VC Dimension of Perceptrons: \(d+1\)
- Physical Intuition of VC Dimension: \(\approx\) # free parameters
- Interpreting VC Dimension: model and sample complexity
Lecture 8 Noise and Error
Noise and Probabilistic Target
What if there is noise?
Noise
- noise in y: good customer mislabeled as bad
- noise in y: same customer different labels
- noise in x: inaccurate information
Does VC bound work under noise?
Probabilistic Marbles
- Probabilistic (noise) marbles
- marble \(x \sim P(x)\)
- probabilistic color, \(y \ne h(x)\) with \(y \sim P(y|x)\)
- VC holds for
- \(x \overset{i.i.d}{\sim} P(x)\)
- \(y \overset{i.i.d}{\sim} P(y|x)\)
- \((x, y) \overset{i.i.d}{\sim} P(x, y)\)
Target Distribution \(P(y|x)\)
- can be viewed as ‘ideal mini-target’ + noise, e.g.
- \(P(\circ|x) = 0.7, P(\times|x) = 0.3\)
- ideal mini-target \(f(x) = \circ\)
- ‘flipping’ noise level = 0.3
- deterministic target \(f\): special case of target distribution
- \(P(y|x) = 1\) for \(y = f(x)\)
- \(P(y|x) = 0\) for \(y \ne f(x)\)
- goal of learning: predict the ideal mini-target (w.r.t. \(P(y|x)\)) on often-seen inputs (w.r.t. \(P(x)\))
New Learning flow
VC still works, pocket algorithm explained :)
Error Measure
final hypothesis \(g \approx f\)
Point Error Measure
\[
E_{out}(g) = \underset{x \sim P}{\mathbb{E}} \underbrace{[g(x) \ne f(x)]}_{\text{err}(g(x), f(x))}
\]
\[
E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N}err(g(x_n), f(x_n))
\]
\[
E_{out}(g) = \underset{x \sim P}{\mathbb{E}}\ err(g(x), f(x))
\]
will mainly consider pointwise err for simplicity
Two Important Pointwise Error Measures
- 0/1 error
- \(err(\hat{y}, y) = [\hat{y} \ne y]\)
- correct or not
- often for classification
- squared error
- \(err(\hat{y}, y) = (\hat{y}-y)^2\)
- how far \(\hat{y}\) from \(y\)
- often for regression
How does err ‘guide’ learning?
Ideal Mini-Target
\[
P(y=1|x) = 0.2, P(y=2|x) = 0.7, P(y=3|x) = 0.1
\]
- 0/1 error, \(err(\tilde{y}, y) = [\tilde{y} \ne y]\)
- \(\tilde{y} = 1\), avg. err 0.8
- \(\tilde{y} = 2\), avg. err 0.3
- \(\tilde{y} = 3\), avg. err 0.9
- \(\tilde{y} = 1.9\), avg. err 1
- \(f(x) = \underset{y\in\mathcal{Y}}{\text{argmax}}\ P(y|x)\)
- squared error, \(err(\tilde{y}, y) = (\tilde{y}-y)^2\)
- \(\tilde{y} = 1\), avg. err 1.1
- \(\tilde{y} = 2\), avg. err 0.3
- \(\tilde{y} = 3\), avg. err 1.5
- \(\tilde{y} = 1.9\), avg. err 0.29
- \(f(x) = \sum_{y\in\mathcal{Y}} y \cdot P(y|x)\)
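The numbers above can be reproduced directly; the snippet below (my own) computes the expected 0/1 and squared errors of each candidate \(\tilde{y}\) under the given \(P(y|x)\).

```python
# Expected 0/1 and squared error for each candidate prediction y_tilde
# under P(y=1|x)=0.2, P(y=2|x)=0.7, P(y=3|x)=0.1 (my own snippet).
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_01(y_tilde):
    return sum(p for y, p in P.items() if y != y_tilde)

def avg_sqr(y_tilde):
    return sum(p * (y_tilde - y) ** 2 for y, p in P.items())

for y_tilde in [1, 2, 3, 1.9]:
    print(y_tilde, round(avg_01(y_tilde), 2), round(avg_sqr(y_tilde), 2))

# best choices: argmax_y P(y|x) = 2 for 0/1 error, sum_y y*P(y|x) = 1.9 for squared error
```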
Learning Flow with Error Measure
Extended VC theory/‘Philosophy’ works for most \(\mathcal{H}\) and err.
Algorithmic Error Measure
Choice of Error Measure
two types of error: false accept and false reject
0/1 error penalizes both types equally.
- supermarket: fingerprint for discount
- false reject: very unhappy customer, lose future business, (penalty 10)
- false accept: give away a minor discount, intruder left fingerprint (penalty 1)
- CIA: fingerprint for entrance
- false accept: very serious consequences! (penalty 1000)
- false reject: unhappy employee, but so what? :) (penalty 1)
Take-home Message for Now
err is application/user-dependent
Algorithm Error Measures \(\widehat{error}\)
- true: just err
- plausible:
- 0/1 minimum ‘flipping noise’ - NP-hard to optimize
- squared: minimum Gaussian noise
- friendly: easy to optimize for \(\mathcal{A}\)
- closed-form solution
- convex objective function
err: application goal
\(\widehat{err}\): a key part of many \(\mathcal{A}\)
Weighted Classification
\[
\begin{array}{c|c}
&
\begin{matrix}
h(x) \\
1 \qquad -1
\end{matrix}
\\
\hline
\begin{matrix}
f(x) & 1 \\
& -1
\end{matrix}
&
\begin{matrix}
0 & 1 \\
1000 & 0
\end{matrix}
\end{array}
\]
out-of-sample \(E_{out}(h)\) weights each error by
- 1, if \(y = +1\)
- 1000, if \(y = -1\)
in-sample \(E_{in}^w(h)\) weights the \(N\) examples in the same way
Minimizing \(E_{in}\) for weighted Classification
Naive Thoughts:
- PLA: the weights do not matter if the data are linearly separable (PLA still finds a zero-error separator).
- pocket: modify pocket-replacement rule
- if \(w_{t+1}\) reaches smaller weighted \(E_{in}^{w}\) than \(\hat{w}\), keep \(w_{t+1}\)
Systematic Route: Connect \(E_{in}^w\) and \(E_{in}^{0/1}\)
- idea: duplicate the data point with -1 label 1000 times.
weighted pocket algorithm
using ‘virtual copying’, weighted pocket algorithm include:
- weighted PLA: randomly check -1 example mistakes with 1000 times more probability
- weighted pocket replacement: if \(w_{t+1}\) reaches smaller \(E_{in}^{w}\) than \(\hat{w}\), replace \(\hat{w}\) by \(w_{t+1}\)
systematic route (called reduction): can be applied to many other algorithms
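A minimal sketch of the weighted PLA/pocket idea in code (my own toy implementation, assuming labels in \(\{-1,+1\}\) and weight 1000 on the \(-1\) examples); the ‘virtual copying’ shows up as sampling the heavier examples proportionally more often.

```python
# Weighted pocket sketch: weighted E_in plus weight-proportional example sampling.
import numpy as np

def weighted_ein(w, X, y, weights):
    mistakes = (np.sign(X @ w) != y)
    return np.sum(weights * mistakes) / np.sum(weights)

def weighted_pocket(X, y, weights, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()          # 'virtual copying': heavier examples
    w = np.zeros(X.shape[1])             # are checked proportionally more often
    w_hat, best = w.copy(), weighted_ein(w, X, y, weights)
    for _ in range(T):
        n = rng.choice(len(y), p=p)
        if np.sign(X[n] @ w) != y[n]:
            w = w + y[n] * X[n]          # PLA-style correction
            e = weighted_ein(w, X, y, weights)
            if e < best:                 # weighted pocket replacement rule
                w_hat, best = w.copy(), e
    return w_hat

# usage sketch: weights = np.where(y == -1, 1000.0, 1.0); g = weighted_pocket(X, y, weights)
```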
Aside: weighted classification is especially relevant for unbalanced data.
Conclusions
- Noise and Probabilistic Target
- can replace \(f(x)\) with \(P(y|x)\)
- Error Measure
- Algorithmic Error Measure
- user-dependent => plausible or friendly
- Weighted Classification
- easily done by virtual ‘example copying’
Lecture 9 Linear Regression
Linear Regression Problem
Credit Limit Problem
credit limit? $100,000
Linear Regression Hypothesis
- For \(x = (x_0, x_1, \dots, x_d)\) ‘features of customer’, approximate the desired credit limit with a weighted sum:
\[
y \approx \sum_{i = 0}^{d} w_i x_i
\]
The Error Measure
squared error \(err(\hat{y}, y) = (\hat{y} - y)^2\)
\[
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (\underbrace{h(x_n)}_{w^T x_n} - y_n)^2
\]
\[
E_{out}(w) = \underset{(x, y) \sim P}{\mathbb{E}} (w^T x - y)^2
\]
Linear Regression Algorithm
Matrix form of \(E_{in}(w)\)
\[
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2
= \frac{1}{N}
\begin{Vmatrix}
x_1^T w - y_1 \\
x_2^T w - y_2 \\
\dots \\
x_N^T w - y_N \\
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
Xw-y
\end{Vmatrix}^2
\]
- \(E_{in}(w)\): continuous, differentiable, convex
- necessary condition of ‘best’ \(w\) - not possible to ‘roll down’
- task: find \(w_{LIN}\) such that \(\nabla E_{in}(w_{LIN}) = 0\)
The Gradient \(\nabla E_{in}(w)\)
\[
E_{in}(w)
= \frac{1}{N}
\begin{Vmatrix}
Xw-y
\end{Vmatrix}^2
= \frac{1}{N}
(w^T X^T X w - 2 w^T X^T y + y^Ty)
= \frac{1}{N}
(w^T A w - 2 w^T b + c)
\]
For vector \(w\), we have
\[
\nabla E_{in}(w) =
\frac{1}{N} (2Aw - 2b)
=
\frac{2}{N} (X^TXw - X^Ty)
= 0
\]
Optimal Linear Regression Weights
- invertible \(X^T X\)
- easy! unique solution \(w_{LIN} = \underbrace{(X^T X)^{-1}X^T}_{\text{pseudo-inverse } X^+} y\)
- often the case because \(N \gg d+1\)
- singular \(X^T X\)
- practical suggestion:
- use well-implemented routine to solve \(X^+\) instead of \((X^T X)^{-1}X^T\) for numerical stability.
Linear Regression Algorithm
- from \(\mathcal{D}\), construct input matrix \(X\) and output vector \(y\) by
\[
X =
\underbrace{
\begin{bmatrix}
--x_1^T-- \\
--x_2^T-- \\
\dots \\
--x_N^T-- \\
\end{bmatrix}
}_{N\times(d+1)}
y =
\underbrace{
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N \\
\end{bmatrix}
}_{N\times1}
\]
- calculate pseudo-inverse \(\underbrace{X^+}_{(d+1)\times N}\)
- return \(\underbrace{w_{LIN}}_{(d+1) \times 1} = X^+y\)
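In code, the whole algorithm is a couple of lines (my own sketch, assuming \(X\) already contains the constant feature \(x_0 = 1\) as its first column):

```python
# Linear regression via the pseudo-inverse (my own sketch).
import numpy as np

def linear_regression(X, y):
    """Return w_LIN = pseudo-inverse(X) @ y."""
    return np.linalg.pinv(X) @ y   # well-implemented, numerically stable routine

# usage sketch:
# X: N x (d+1) input matrix, y: length-N target vector
# w_lin = linear_regression(X, y)
# E_in  = np.mean((X @ w_lin - y) ** 2)
```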
Generalization Issue
\[
\underbrace{w_{LIN}}_{(d+1) \times 1} = X^+y
\]
- No!
- analytic (closed-form) solution, ‘instantaneous’
- not improving \(E_{in}\)
- Yes!
- good \(E_{in}\)? Yes, optimal
- good \(E_{out}\)? yes, finite \(d_{vc}\) like perceptrons
- improving iteratively? somewhat, within an iterative pseudo-inverse routine
- if \(E_{out}(w_{LIN})\) is good, learning ‘happened’!
Benefit of analytic solution: ‘simpler-than-VC’ Guarantee
\[
\overline{E_{in}} = \underset{\mathcal{D} \sim P^N}{\mathcal{E}}{E_{in}(w_{LIN} w.r.t. \mathcal{D})}
\overset{\text{to be shown}}{=} \text{noise level} \cdot (1 - \frac{d+1}{N})
\]
We have:
\[
E_{in}(w_{LIN})
= \frac{1}{N}
\begin{Vmatrix}
y - \hat{y}
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
y - XX^{+}y
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
(I - XX^{+})y
\end{Vmatrix}^2
\]
We call \(XX^+\) the hat matrix \(H\) because it puts a hat (\(\wedge\)) on \(y\): \(\hat{y} = Hy\).
Geometric View of Hat Matrix
- in \(\mathbb{R}^N\)
- \(\hat{y} = Xw_{LIN}\) within the span of \(X\) columns
- \(y-\hat{y}\) smallest: \(y-\hat{y} \bot\) span
- \(H\): project \(y\) to \(\hat{y} \in\) span
- \(I-H\): transform \(y\) to \(y-\hat{y} \bot\) span
- claim: trace(\(I-H\)) = \(N-(d+1)\)
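The trace claim is easy to verify numerically; the snippet below (my own) builds a random full-column-rank \(X\) and checks \(\text{trace}(I - H) = N - (d+1)\).

```python
# Numeric check of trace(I - H) = N - (d + 1) for the hat matrix H = X X^+.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # full column rank (w.h.p.)
H = X @ np.linalg.pinv(X)                 # hat matrix
print(np.trace(np.eye(N) - H))            # approx N - (d + 1) = 15
```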
An Illustrative ‘Proof’
- if \(y\) comes from some ideal \(f(X) \in\) span plus noise
- the noise is transformed by \(I-H\) into \(y-\hat{y}\)
\[
E_{in}(w_{LIN})
= \frac{1}{N}
\begin{Vmatrix}
y - \hat{y}
\end{Vmatrix}^2
= \frac{1}{N}
\begin{Vmatrix}
(I-H)\text{noise}
\end{Vmatrix}^2
= \frac{1}{N}(N-(d+1))
\begin{Vmatrix}
\text{noise}
\end{Vmatrix}^2
\]
We have:
\[
\overline{E_{in}} =
\text{noise level} \cdot (1 - \frac{d+1}{N})
\]
\[
\overline{E_{out}} =
\text{noise level} \cdot (1 + \frac{d+1}{N})
\]
The Learning Curve
- Both converge to \(\sigma^2\) (noise level) for \(N \rightarrow \infty\)
- expected generalization error: \(\frac{2(d+1)}{N}\)
- similar to worst-case guarantee from VC
- linear regression (LinReg): learning happened!
Linear Regression for Binary Classification
Linear Classification v.s. Linear Regression
- Linear Classification:
- \(\mathcal{Y} = \{-1, +1\}\)
- \(h(x) = \text{sign}(w^Tx)\)
- \(\text{err}(\hat{y}, y) = [\hat{y} \ne y]\)
- NP-hard to solve in general
- Linear Regression
- \(\mathcal{Y} = \mathbb{R}\)
- \(h(x) = (w^Tx)\)
- \(\text{err}(\hat{y}, y) = (\hat{y} - y)^2\)
- efficient analytic solution
- \(\{-1, +1\} \subset \mathbb{R}\): linear regression for classification?
- run LinReg on binary classification data \(\mathcal{D}\) (efficient)
- return \(g(x) = \text{sign}(w_{LIN}^Tx)\)
- explanation of this heuristic?
Relation of Two Errors
- \(\text{err}_{0/1} = [sign(w^Tx) \ne y]\)
- \(\text{err}_{sqr} = (w^Tx-y)^2\)
- \(\text{err}_{0/1} \leq \text{err}_{sqr}\)
Linear Regression for Binary Classification
- \(\text{err}_{0/1} \leq \text{err}_{sqr}\)
\[
\text{classification} E_{out}(w) \overset{VC}{\leq}
\text{classification} E_{in}(w) + \sqrt{\dots\dots} \leq
\text{regression} E_{in}(w) + \sqrt{\dots\dots}
\]
- (loose) upper bound \(err_{sqr}\) as \(\widehat{err}\) to approximate \(\text{err}_{0/1}\)
- if we have small enough \(err_{sqr}\), the \(\text{classification} E_{out}\) will also be small
- trade bound tightness for efficiency
- \(w_{LIN}\): a useful baseline classifier, or an initial vector for PLA/pocket
Conclusions
- Linear Regression Problem
- use hyperplane to approximate real values
- Linear Regression Algorithm
- analytic solution with pseudo-inverse
- Generalization Issue
- \(E_{out} - E_{in} \approx \frac{2(d+1)}{N}\) on average
- Linear Regression for Binary Classification
- 0/1 error \(\leq\) squared error
Lecture 10 Logistic Regression
Logistic Regression Problem
Heart Attack Prediction Problem
binary classification: heart disease? yes or no ?
- ideal \(f(x) = \text{sign}(P(+1|x) - 0.5) \in \{-1, +1\}\)
‘Soft’ binary classification: similar problem heart attack? 80% risk
- \(f(x) = P(+1|x) \in [0, 1]\)
Soft Binary Classification
target function \(f(x) = P(+1|x)\)
- ideal (noiseless) data
- \((x_1, y_1' = 0.9 = P(+1|x_1))\)
- \((x_2, y_2' = 0.2 = P(+1|x_2))\)
- …
- \((x_N, y_N' = 0.6 = P(+1|x_N))\)
- actual (noisy) data
- \((x_1, y_1 = 1)\), with \(y_1\) generated from \(P(y|x_1)\)
- \((x_2, y_2 = 0)\), with \(y_2\) generated from \(P(y|x_2)\)
- …
- \((x_N, y_N = 0)\), with \(y_N\) generated from \(P(y|x_N)\)
We have same data as hard binary Classification, but different target function
Logistic Hypothesis
- For \(x = (x_0, x_1, \dots, x_d)\) ‘features of patient’, calculate a weighted ‘risk score’:
\[
s = \sum_{i=0}^{d} w_i x_i
\]
- convert the score to estimated probability by logistic function \(\theta(s)\)
- logistic hypothesis: \(h(x) = \theta(w^T x)\)
Logistic Function

\[
\theta(s) = \frac{e^s}{1+ e^s} = \frac{1}{1+e^{-s}}
\]
- it is a smooth, monotonic, sigmoid (‘S’-shaped) function of \(s\)
- logistic regression: use the following to approximate the target function \(f(x) = P(+1|x)\)
\[
h(x) = \frac{1}{1+ exp(-w^T x)}
\]
Logistic Regression Error
Three Linear Models
- Linear Classification
- \(h(x) = \text{sign}(s)\)
- plausible err = 0/1 (small flipping noise)
- Linear Regression
- \(h(x) = s\)
- friendly err = squared (easy to minimize)
- Logistic Regression
- \(h(x) = \theta(s)\)
- err = ?
Likelihood
\[
f(x) = P(+1|x) \iff
P(y|x) =
\begin{cases}
f(x) & \quad \text{for } y = +1 \\
1-f(x) & \quad \text{for } y = -1 \\
\end{cases}
\]
- Consider \(\mathcal{D} = \{(x_1, \circ), (x_2, \times), \dots, (x_N, \times)\}\)
- probability that \(f\) generates \(\mathcal{D}\) is
\[
P(x_1)P(\circ | x_1) \times
P(x_2)P(\times | x_2) \times
\dots
\times P(x_N)P(\times | x_N)
\]
- rewrite using \(f\), since \(P(\circ | x) = f(x)\) and \(P(\times | x) = 1 - f(x)\):
\[
P(x_1) f(x_1) \times
P(x_2) (1 - f(x_2)) \times
\dots
\times P(x_N) (1 - f(x_N))
\]
- Likelihood that \(h\) generates \(\mathcal{D}\) is
\[
P(x_1) h(x_1) \times
P(x_2) (1 - h(x_2)) \times
\dots
\times P(x_N) (1 - h(x_N))
\]
- if \(h \approx f\), then likelihood(\(h\)) \(\approx\) probability using \(f\)
- probability using \(f\) usually large
Likelihood of Logistic Hypothesis
- likelihood(\(h\)) \(\approx\) probability using \(f \approx\) large
- \(g = \underset{h}{\text{argmax}} \text{likelihood}(h)\)
- when \(h\) is logistic, \(h(x) = \theta(w^Tx)\), the symmetry \(1 - h(x) = h(-x)\) gives
\[
\text{likelihood}(h) = P(x_1) h(+x_1) \times P(x_2) h(-x_2) \times \cdots \times P(x_N) h(-x_N)
\]
So we have
\[
\text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)
\]
Cross-Entropy Error
\[
\underset{w}{\text{max}}\ \text{likelihood } (w) \propto \prod_{n=1}^{N} \theta(y_n w^T x_n)
\]
\[
\underset{w}{\text{max}}\ \text{ln } \prod_{n=1}^{N} \theta(y_n w^T x_n)
\]
\[
\underset{w}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} - \text{ln } \theta(y_n w^T x_n)
\]
\[
\underset{w}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} \text{ln } (1 + exp(-y_n w^T x_n))
\]
\[
\underset{w}{\text{min}}\ \frac{1}{N} \underbrace{\sum_{n=1}^{N} \text{err}(w, x_n, y_n)}_{E_{in}(w)}
\]
- cross-entropy error: \(\text{err}(w, x_n, y_n) = \text{ln } (1 + exp(-y_n w^T x_n))\)
Gradient of Logistic Regression Error
Minimizing \(E_{in}(w)\)
- \(E_{in}(w)\): continuous, differentiable, twice-differentiable, convex
- how to minimize? locate valley, want \(\nabla E_{in}(w) = 0\)
- first: derive \(\nabla E_{in}(w) = 0\)
The Gradient \(\nabla E_{in}(w)\)
\[
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \text{ln } (1 + exp(-y_n w^T x_n))
\]
\[
\nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} \theta(-y_n w^T x_n) (-y_n x_n)
\]
Note that in Andrew Ng's course, \(\nabla E_{in}(w)\) is written as follows. It is actually the same as above: substitute \(y_n = +1\) and \(y_n = -1\) and use the property \(h(-x) = 1 - h(x)\). Note, however, that in Andrew Ng's course \(y_n\) takes values 1 or 0, not \(\pm 1\).
\[
\nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} (h(x_n) - y_n) x_n
\]
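The two forms can be checked against each other numerically. The snippet below (my own) evaluates both on random data, mapping \(y \in \{-1,+1\}\) to \(y_{01} = (y+1)/2\) for the second form.

```python
# Numeric check that the +-1-label gradient equals the 0/1-label gradient.
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.standard_normal(d + 1)

grad_pm1 = (theta(-y * (X @ w)) * (-y)) @ X / N        # (1/N) sum theta(-y w'x)(-y x)
grad_01  = (theta(X @ w) - (y + 1) / 2) @ X / N        # (1/N) sum (h(x) - y01) x
print(np.allclose(grad_pm1, grad_01))                  # True
```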
Minimizing \(E_{in}(w)\)
\[
\nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} \theta(-y_n w^T x_n) (-y_n x_n) = 0
\]
- scaled \(\theta\)-weighted sum of \(-y_n x_n\)
- all \(\theta(-y_n w^T x_n) = 0\): only if \(y_n w^T x_n \gg 0\) for all \(n\), i.e. a linearly separable \(\mathcal{D}\)
- otherwise, setting the weighted sum to 0 is a non-linear equation in \(w\): no closed-form solution
PLA revisited: iterative optimization
PLA: start from some \(w_0\), and correct its mistakes on \(\mathcal{D}\)
For \(t = 0, 1, \dots\)
- find a mistake of \(w_t\) called \((x_{n(t)}, y_{n(t)})\)
\[
\text{sign}(w_t^Tx_{n(t)}) \ne y_{n(t)}
\]
- try to correct the mistake by
\[
w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)}
\]
when stop, return last \(w\) as \(g\)
For \(t = 0, 1, \dots\)
- (equivalently) pick some \(n\), and update \(w_t\) by
\[
w_{t+1} \leftarrow w_t + \underbrace{1}_{\eta} \cdot \underbrace{[\text{sign}(w_t^Tx_n) \ne y_n]\, y_n x_n}_{v}
\]
when stop, return last \(w\) as \(g\)
- \(v\): the update direction
- \(\eta\): how big a step to take in that direction
- choice of \((\eta, v)\) and stopping condition defines iterative optimization approach
Gradient Descent
Iterative Optimization
For \(t = 0, 1, \dots\)
\[
w_{t+1} \leftarrow w_t + \eta v
\]
when stop, return last \(w\) as \(g\)
- PLA: \(v\) comes from mistake correction
- smooth \(E_{in}(w)\) for logistic regression: choose \(v\) to make the ball roll ‘downhill’
- direction \(v\): (assumed) of unit length
- a greedy approach for some given \(\eta > 0\):
\[
\underset{|v| = 1}{\text{min}} E_{in} (\underbrace{w_t + \eta v}_{w_{t+1}})
\]
Linear Approximation
- still non-linear optimization, now with constraints
- not any easier than \(\text{min}_w E_{in} (w)\)
- local approximation by linear formula makes problem easier.
\[
E_{in} (w_t + \eta v) \approx E_{in}(w_t) + \eta v^T \nabla E_{in}(w_t)
\]
- if \(\eta\) really small (Taylor expansion)
an approximate greedy approach for some given small \(\eta\):
\[
\underset{|v| = 1}{\text{min}} E_{in} (w_t) + \eta v^T \nabla E_{in}(w_t)
\]
Gradient Descent
\[
v = - \frac{\nabla E_{in}(w_t)}{|\nabla E_{in}(w_t)|}
\]
- gradient descent: for small \(\eta\)
\[
w_{t+1} \leftarrow w_t - \eta \frac{\nabla E_{in}(w_t)}{|\nabla E_{in}(w_t)|}
\]
- a simple and popular optimization
Choice of \(\eta\)
- too small: too slow
- too large: too unstable
- just right: use changing \(\eta\)
\(\eta\) better be monotonic of \(\begin{Vmatrix}\nabla E_{in}(w_t)\end{Vmatrix}\)
Simple heuristic: let \(\eta\) scale with \(\begin{Vmatrix}\nabla E_{in}(w_t)\end{Vmatrix}\); the norm then cancels and the update becomes
\[
w_{t+1} \leftarrow w_t - \eta \nabla E_{in}(w_t)
\]
- the new \(\eta\) is called the fixed learning rate
Putting Everything Together
Initialize \(w_0\)
For \(t = 0, 1, \dots\)
- compute
\[
\nabla E_{in}(w_t) = 1 / N \sum_{n=1}^{N} \theta(-y_n w_t^T x_n) (-y_n x_n)
\]
- update by
\[
w_{t+1} \leftarrow w_t - \eta \nabla E_{in}(w_t)
\]
… until \(\nabla E_{in}(w_t) \approx 0\) or for enough iterations.
- Similar per-iteration time complexity to the pocket algorithm
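Putting the loop into code (my own sketch, assuming \(X\) includes the constant column and \(y \in \{-1,+1\}\)):

```python
# Gradient-descent logistic regression (my own sketch).
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, eta=0.1, T=2000):
    w = np.zeros(X.shape[1])
    for _ in range(T):
        grad = (theta(-y * (X @ w)) * (-y)) @ X / len(y)   # nabla E_in(w)
        if np.linalg.norm(grad) < 1e-6:                    # nabla E_in(w) ~ 0
            break
        w = w - eta * grad                                 # fixed learning rate step
    return w

# E_in(w) = np.mean(np.log(1 + np.exp(-y * (X @ w))))
```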
Conclusions
- Logistic Regression Problem
- \(P(+1|x)\) as target and \(\theta(w^Tx)\) as hypothesis
- Logistic Regression Error
- cross-entropy (negative log likelihood)
- Gradient of Logistic Regression Error
- \(\theta\)-weighted sum of data vectors
- Gradient Descent
- roll downhill by \(-\nabla E_{in}(w_t)\)
Lecture 11 Linear Models for Classification
Linear Models for Binary Classification
Linear Models Revisited
linear scoring function: \(s = w^Tx\)
- Linear Classification
- \(h(x) = \text{sign}(s)\)
- plausible err = 0/1 (discrete \(E_{in}(w)\) NP hard to solve)
- Linear Regression
- \(h(x) = s\)
- friendly err = squared (quadratic convex \(E_{in}(w)\): closed form solution)
- Logistic Regression
- \(h(x) = \theta(s)\)
- plausible err = cross-entropy (smooth convex \(E_{in}(w)\): gradient descent)
Can linear regression or logistic regression help linear classification?
- linear classification is hard to solve, but the other two are easy to solve
- can we use the other two to replace the first one?
Error Functions Revisited
linear scoring function: \(s = w^Tx\)
for binary classification \(y \in \{-1, +1\}\)
- Linear Classification
- \(h(x) = \text{sign}(s)\)
- err\((h, x, y) = [h(x) \ne y]\)
- \(\text{err}_{0/1}(s, y) = [\text{sign}(s) \ne y] = [\text{sign}(ys) \ne 1]\)
- Linear Regression
- \(h(x) = s\)
- err\((h, x, y) = (h(x) - y)^2\)
- \(\text{err}_{SQR}(s, y) = (s - y)^2 = (ys-1)^2\)
- Logistic Regression
- \(h(x) = \theta(s)\)
- err\((h, x, y) = -\text{ln } h(yx)\)
- \(\text{err}_{CE}(s, y) = \text{ln} (1+exp(-ys))\)
- \((ys)\): classfication correctness score
Visualizing Error Functions
- \(0/1\): 1 if and only if \(ys \leq 0\)
- sqr: large if \(ys \ll 1\), but it also over-charges \(ys \gg 1\); small \(\text{err}_{SQR} \rightarrow\) small \(\text{err}_{0/1}\), but not vice versa
- ce: monotonic in \(ys\); small \(\text{err}_{CE} \iff\) small \(\text{err}_{0/1}\)
- scaled ce: a proper upper bound of \(\text{err}_{0/1}\); small \(\text{err}_{SCE} \iff\) small \(\text{err}_{0/1}\)
- upper bound: useful for designing Algorithmic error \(\widehat{err}\)
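A small numeric version of this comparison (my own snippet): evaluate the errors over a range of \(ys\) and see that scaled cross-entropy upper-bounds 0/1 everywhere.

```python
# err_0/1, err_sqr and scaled err_CE as functions of the correctness score ys.
import numpy as np

ys = np.linspace(-3, 3, 7)
err_01  = (ys <= 0).astype(float)      # [sign(s) != y], counting ys <= 0 as an error
err_sqr = (ys - 1) ** 2
err_sce = np.log(1 + np.exp(-ys)) / np.log(2)   # scaled CE: >= err_01 for all ys

for row in zip(ys, err_01, err_sqr, err_sce):
    print(["%6.2f" % v for v in row])
```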
Theoretical Implication of Upper Bound
- For any \(ys\) where \(s = w^T x\)
\[
\begin{split}
\text{err}_{0/1}(s, y) \leq \text{err}_{SCE}(s, y) = \frac{1}{ln2} \text{err}_{CE}(s, y) \\
\iff \\
E_{in}^{0/1}(w) \leq E_{in}^{SCE}(w) = \frac{1}{ln2} E_{in}^{CE}(w) \\
E_{out}^{0/1}(w) \leq E_{out}^{SCE}(w) = \frac{1}{ln2} E_{out}^{CE}(w) \\
\end{split}
\]
\[
E_{out}^{0/1}(w) \leq E_{in}^{0/1}(w) + \Omega^{0/1}
\leq \frac{1}{ln2} E_{in}^{CE}(w) + \Omega^{0/1}
\]
\[
E_{out}^{0/1}(w) \leq \frac{1}{ln2} E_{out}^{CE}(w)
\leq \frac{1}{ln2} E_{in}^{CE}(w) + \Omega^{0/1}
\]
\[
\text{small } E_{in}^{CE}(w) \Rightarrow \text{small } E_{out}^{0/1}(w)
\]
- logistic/linear reg. for linear classification.
Regression for Classification
- run logistic/linear reg. on \(\mathcal{D}\) with \(y_n \in \{-1, +1\}\) to get \(w_{reg}\)
- return \(g(x) = \text{sign}(w_{reg}^T x)\)
- linear regression
- pros: ‘easiest’ optimization
- cons: loose bound of err 0/1 for large \(|ys|\)
- logistic regression
- pros: ‘easy’ optimization
- cons: loose bound of \(\text{err}_{0/1}\) for very negative \(ys\)
- PLA
- pros: efficient + strong guarantee if linear separable
- cons: works only if linear separable, otherwise needing pocket heuristic
- linear regression sometimes used to set \(w_0\) for PLA/pocket/logistic regression.
- logistic regression often preferred over pocket.
Stochastic Gradient Descent
Two Iterative Optimization Schemes
For \(t = 0, 1, \dots\)
\[
w_{t+1} \leftarrow w_t + \eta v
\]
when stop, return last \(w\) as \(g\)
- PLA: pick \((x_n, y_n)\) and decide \(w_{t+1}\) by one example
- \(O(1)\) time per iteration.
- Logistic regression (pocket)
- check \(\mathcal{D}\) and decide \(w_{t+1}\) by all examples
- \(O(N)\) time per iteration.
Is it possible to make logistic regression \(O(1)\)?
Logistic Regression Revisited
\[
w_{t+1} \leftarrow w_t + \eta \frac{1}{N} \sum_{n=1}^{N} \underbrace{\theta(-y_n w_t^{T} x_n) (y_n x_n)}_{- \nabla E_{in}(w_t)}
\]
Stochastic Gradient Descent (SGD)
Stochastic gradient = true gradient + zero-mean ‘noise’ directions
- idea: replace true gradient by stochastic gradient
- after enough steps, average true gradient \(\approx\) average stochastic gradient
- pros: simple and cheaper computation
- useful for big data and online learning
- cons: less stable in nature.
SGD logistic regression looks familiar?
\[
w_{t+1} \leftarrow w_t + \eta \underbrace{\theta(-y_n w_t^{T} x_n) (y_n x_n)}_{- \nabla \text{err}(w, x_n, y_n)}
\]
PLA Revisited
SGD logistic regression:
\[
w_{t+1} \leftarrow w_t + \eta \cdot \theta (-y_n w_t^{T} x_n) (y_n x_n)
\]
PLA:
\[
w_{t+1} \leftarrow w_t + 1 \cdot [y_n \ne \text{sign}(w_t^T x_n)] (y_n x_n)
\]
- SGD logistic regression \(\approx\) ‘soft’ PLA
- the more wrong an example is, the larger the update; the less wrong, the smaller the update
- PLA \(\approx\) SGD logistic regression with \(\eta = 1\) when \(w_t^T x_n\) large.
Two practical rule of thumb:
- stop condition? \(t\) large enough
- \(\eta\)? 0.1 when \(x\) in proper range
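In code, the SGD version differs from the batch version earlier only in using one randomly chosen example per update (my own sketch; the stopping rule ‘\(t\) large enough’ becomes a fixed number of steps):

```python
# SGD logistic regression ('soft PLA') sketch: one example per update, fixed eta.
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_sgd(X, y, eta=0.1, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                               # pick one example
        w = w + eta * theta(-y[n] * (X[n] @ w)) * y[n] * X[n]  # soft PLA update
    return w
```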
Multiclass via Logistic Regression
Multiclass Classification
- \(\mathcal{Y} = \{\square, \diamond, \triangle, \star\}\): 4-class classification.
- many applications in practice, especially for ‘recognition’.
- use tools for binary \(\{\times, \circ\}\) classification to handle \(\{\square, \diamond, \triangle, \star\}\)
One Class at a time
- \(\square\) or not? \(\{\square = \circ, \diamond = \triangle = \star = \times\}\)
- \(\diamond\) or not? \(\{\diamond = \circ, \square = \triangle = \star = \times\}\)
- \(\triangle\) or not? \(\{\triangle = \circ, \square = \diamond = \star = \times\}\)
- \(\star\) or not? \(\{\star = \circ, \square = \triangle = \diamond = \times\}\)
Multiclass Prediction: Combine Binary Classifiers
- in some regions, exactly one classifier says \(\circ\), so the class is clear
- but what about ties, or regions that no classifier claims?
One Class at a time softly
- \(P(\square|x)? \quad \{\square = \circ, \diamond = \triangle = \star = \times\}\)
- \(P(\diamond|x)? \quad \{\diamond = \circ, \square = \triangle = \star = \times\}\)
- \(P(\triangle|x)? \quad \{\triangle = \circ, \square = \diamond = \star = \times\}\)
- \(P(\star|x)? \quad \{\star = \circ, \square = \triangle = \diamond = \times\}\)
Multiclass Prediction: Combine Binary Classifiers
\[
g(x) = \text{argmax}_{k \in \mathcal{Y}} \theta(w_{[k]}^Tx)
\]
OR, equivalently
\[
g(x) = \text{argmax}_{k \in \mathcal{Y}} w_{[k]}^Tx
\]
One-Versus-All (OVA) Decomposition
- for \(k \in \mathcal{Y}\), obtain \(w_{[k]}\) by running logistic regression on
- \(\mathcal{D}_{[k]} = \{(x_n, y_n = 2[y_n = k] - 1)\}_{n=1}^N\)
- return \(g(x) = \text{argmax}_{k \in \mathcal{Y}} (w_{[k]}^T x)\)
- pros: efficient, can be coupled with any logistic regression-like approaches.
- cons: often unbalanced \(\mathcal{D}_{[k]}\) when \(K\) large
- logistic regression will produce naive predictor: predict everything to be \(\times\).
- extension: multinomial (‘coupled’) logistic regression.
OVA: a simple Multiclass meta-algorithm to keep in your toolbox.
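A compact sketch of OVA (my own code; `train_logreg` is a placeholder for any logistic-regression trainer, such as the GD/SGD sketches above):

```python
# One-Versus-All decomposition (my own sketch).
import numpy as np

def ova_train(X, y, classes, train_logreg):
    """Return {k: w_[k]} where w_[k] is trained on D_[k] with labels 2[y==k]-1."""
    return {k: train_logreg(X, 2.0 * (y == k) - 1.0) for k in classes}

def ova_predict(X, W):
    classes = list(W)
    scores = np.column_stack([X @ W[k] for k in classes])   # theta is monotonic,
    return np.array(classes)[np.argmax(scores, axis=1)]     # so argmax of w_[k]^T x suffices
```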
Multiclass via Binary Classification
Source of Unbalance: One Versus All
idea: make binary classification problems more balanced by one versus one.
One versus One at a Time
- \(\square\) or \(\diamond\)? \(\{\square = \circ, \diamond = \times, \star = \triangle = \text{nil}\}\)
- …
Multiclass Prediction: Combine Pairwise Classifiers
\[
g(x) = \text{tournament champion } \{w_{[k, l]}^T x\}
\]
voting of classifiers.
One-versus-one OVO Decomposition
- for \([k, l] \in \mathcal{Y} \times \mathcal{Y}\), obtain \(w_{[k, l]}\) by running logistic regression on
- \(\mathcal{D}_{[k, l]} = \{(x_n, y'_n = 2[y_n = k] - 1): y_n = k \text{ or } y_n = l\}\)
- return \(g(x) = \text{tournament champion } \{w_{[k, l]}^T x\}\)
- pros: efficient (‘smaller’ Training problems), stable, can be coupled with any binary classification approaches.
- cons: uses \(O(K^2)\) classifiers \(w_{[k, l]}\)
- more space, slower prediction, more training.
OVO: another simple Multiclass meta-algorithm to keep in your toolbox.
Conclusions
- Linear Models for Binary Classification
- three models useful in different ways
- Stochastic Gradient Descent
- follow negative stochastic gradient
- Multiclass via Logistic Regression
- predict with maximum estimated P(k|x)
- Multiclass via Binary Classification
- predict the tournament champion
Lecture 12 Nonlinear Transformation
Quadratic Hypothesis
Linear Hypothesis
- visually: ‘line’-like boundary
- mathematically: linear scores \(s = w^T x\)
pros and cons:
- theoretically: \(d_{VC}\) under control :-)
- practically: on some \(\mathcal{D}\), large \(E_{in}\) for every line :-(
How to break the limit of linear hypothesis?
Circular Separable
- \(\mathcal{D}\) not linear separable
- but circular separable by a circle of radius \(\sqrt{0.6}\) centered at origin:
\[
h_{SEP}(x) = \text{sign } (-x_1^2 - x_2^2 + 0.6)
\]
re-derive Circular-PLA, Circular-Regression, blahblah … all over again?
Circular Separable and Linear Separable
\[
h(x) =
\text{sign } (
\underbrace{0.6}_{\tilde{w_0}} \cdot \underbrace{1}_{\tilde{z_0}} +
\underbrace{(-1)}_{\tilde{w_1}} \cdot \underbrace{x_1^2}_{\tilde{z_1}} +
\underbrace{(-1)}_{\tilde{w_2}} \cdot \underbrace{x_2^2}_{\tilde{z_2}}
)
= \text{sign } ( \tilde{w}^T z )
\]
- \(\{(x_n, y_n)\}\) circular separable \(\implies \{(z_n, y_n)\}\) linear separable.
- \(x \in \mathcal{X} \overset{\Phi}{\mapsto} z \in \mathcal{Z}\): nonlinear feature transform \(\Phi\).
- circular separable in \(\mathcal{X} \mapsto\) linear separable in \(\mathcal{Z}\)
- What about vice versa?
Linear Hypothesis in Z space
\[
(z_0, z_1, z_2) = z = \Phi(x) = (1, x_1^2, x_2^2)
\]
\[
h(x) = \tilde{h}(z) = \text{sign}(\tilde{w}^T \Phi(x))
= \text{sign} (\tilde{w_0} + \tilde{w_1}x_1^2 + \tilde{w_2}x_2^2)
\]
\((\tilde{w_0}, \tilde{w_1}, \tilde{w_2})\)
- \((0.6, -1, -1)\): circle (\(\circ\) inside)
- \((-0.6, +1, +1)\): circle (\(\circ\) outside)
- \((0.6, -1, -2)\): ellipse
- \((0.6, -1, +2)\): hyperbola
- \((0.6, +1, +2)\): constant \(\circ\)
line in \(\mathcal{Z} \iff\) special quadratic curves in \(\mathcal{X}\) space.
- E.g., the center of a circle must be the origin
General Quadratic Hypothesis set
a bigger \(\mathcal{Z}\) with \(\Phi_2(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\)
Perceptrons in \(\mathcal{Z}\) space \(\iff\) quadratic hypothesis in \(\mathcal{X}\) space
\[
\mathcal{H}_{\Phi_2} = \{h(x): h(x) = \tilde{h}(\Phi_2(x)) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z}\}
\]
- Can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola
\[
\text{ellipse } 2 (x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1
\]
\[
\implies \tilde{w}^T = (33, -20, -4, 3, 2, 3)
\]
- includes lines and constant hypotheses as degenerate cases
- next: learn a good quadratic hypothesis \(g\)
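In code, the transform is just feature construction (my own sketch); any linear algorithm can then be run on \(\mathcal{Z}\) unchanged:

```python
# The quadratic transform Phi_2 for 2D inputs (my own sketch).
import numpy as np

def phi2(X):
    """Map rows (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# usage sketch: Z = phi2(X); learn w_tilde in Z-space; predict sign(phi2(X_new) @ w_tilde)
```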
Structured Hypothesis Sets
Polynomial Transform Revisited
\[
\begin{split}
\Phi_0(x) = 1 \\
\Phi_1(x) = (\Phi_0(x), \quad x_1, x_2, \dots, x_d) \\
\Phi_2(x) = (\Phi_1(x), x_1^2, x_1x_2, \dots, x_d^2) \\
\dots \dots \\
\Phi_Q(x) = (\Phi_{Q-1}(x), \quad x_1^Q, x_1^{Q-1}x_2, \dots, x_d^Q)
\end{split}
\]
\[
\mathcal{H}_{\Phi_0} \subset \mathcal{H}_{\Phi_1} \subset \mathcal{H}_{\Phi_2}
\subset \cdots \subset \mathcal{H}_{\Phi_Q}
\]
Structured Hypothesis Sets
\[
\mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots
\]
\[
d_{VC}(\mathcal{H}_0) \leq d_{VC}(\mathcal{H}_1) \leq d_{VC}(\mathcal{H}_2) \leq \cdots
\]
\[
E_{in}(g_0) \geq E_{in}(g_1) \geq E_{in}(g_2) \geq \cdots
\]
- Using a high-dimensional transform from the beginning is not a good idea.
Linear Model First
- tempting sin: use \(\mathcal{H}_{1126}\), low \(E_{in}(g_{1126})\) to fool your boss.
- really? :( a dangerous path of no return
- safe route: \(\mathcal{H}_1\) first
- if \(E_{in}(g_1)\) good enough, live happily thereafter :)
- otherwise, move rightward to a more complex transform, with nothing lost except some ‘wasted’ computation.
- linear model first: simple, efficient, safe and workable.
Conclusions
- Quadratic Hypothesis
- linear hypothesis on quadratic-transformed data
- Nonlinear Transform
- happy linear modeling after \(\mathcal{Z} = \Phi(\mathcal{X})\)
- Price of Nonlinear Transform
- computation/storage/model complexity
- Structured Hypothesis Sets
- linear/simple model first
Lecture 13 Hazard of Overfitting
Nonlinear regression/classification can be done via a nonlinear feature transform \(\Phi\) plus linear regression/classification, at the price of model complexity. However, this may cause overfitting. In this lecture:
- How does it happen?
- How to overcome it?
What is overfitting?
Bad Generalization
- regression for \(x \in \mathbb{R}\) with \(N = 5\) examples
- target \(f(x)\) = 2nd-order polynomial
- label \(y_n = f(x_n) +\) very small noise
- linear regression in \(\mathcal{Z}\)-space + \(\Phi\) = 4th order polynomial
- unique solution passing all examples \(\implies E_{in}(g) = 0\)
- \(E_{out}(g)\) huge
bad generalization: low \(E_{in}\), high \(E_{out}\).
Bad Generalization and Overfitting
- take \(d_{VC} = 1126\) for learning: bad generalization
- \(E_{in} - E_{out}\) large
- switch from \(d_{VC} = d_{VC}^{\star}\) to \(d_{VC} = 1126\): overfitting
- \(E_{in}\) down, \(E_{out}\) up.
- hard to fix
- switch from \(d_{VC} = d_{VC}^{\star}\) to \(d_{VC} = 1\): underfitting
- \(E_{in}\) up, \(E_{out}\) up.
- easy to fix
bad generalization: low \(E_{in}\), high \(E_{out}\). overfitting: lower \(E_{in}\), higher \(E_{out}\)
Cause of Overfitting: A Driving Analogy
‘good fit’ \(\implies\) ‘overfit’
| learning | driving |
| --- | --- |
| overfit | commit a car accident |
| use excessive \(d_{VC}\) | ‘drive too fast’ |
| noise | bumpy road |
| limited data size \(N\) | limited observations about road condition |
next: how does noise & data size affect overfitting?
The role of Noise and Data size
Case Study
Two settings: (a) 10th-order target function + noise; (b) 50th-order target function, noiseless.
In each setting, does overfitting occur when switching from the best \(g_2 \in \mathcal{H}_2\) to the best \(g_{10} \in \mathcal{H}_{10}\)?
10-th order target function + noise
\[
\begin{array}{c|c c}
& g_2 \in \mathcal{H}_2 & g_{10} \in \mathcal{H}_{10} \\
\hline
E_{in} & 0.050 & 0.034 \\
E_{out} & 0.127 & 9.0
\end{array}
\]
50-th order target function + noiseless
\[
\begin{array}{c|c c}
& g_2 \in \mathcal{H}_2 & g_{10} \in \mathcal{H}_{10} \\
\hline
E_{in} & 0.029 & 0.00001 \\
E_{out} & 0.120 & 7680
\end{array}
\]
Irony of Two learners
- Learner \(O\)verfit: pick \(g_{10} \in \mathcal{H}_{10}\)
- Learner \(R\)estricted: pick \(g_{2} \in \mathcal{H}_{2}\)
- when both know that the target is 10th-order
- \(R\) ‘gives up’ some ability to fit
But \(R\) wins in \(E_{out}\) by a lot. Philosophy: concede a little to gain a lot (step back in order to move forward)!
Learning Curves Revisited
- For \(\mathcal{H}_{2}\): \(E_{out}\) is only a little larger than \(E_{in}\).
- \(\mathcal{H}_{10}\): lower \(E_{out}\) as \(N \rightarrow \infty\), but much larger generalization error for small \(N\).
- Grey area: \(O\) overfits! \(E_{in} \downarrow\), \(E_{out} \uparrow\). Being too clever backfires.
- \(R\) always wins in \(E_{out}\) if \(N\) is small! Don't over-engineer; a simpler hypothesis often works better.
No Noise Case
- Learner \(O\)verfit: pick \(g_{10} \in \mathcal{H}_{10}\)
- Learner \(R\)estricted: pick \(g_{2} \in \mathcal{H}_{2}\)
- When both know that there is no noise
Is there really no noise? ‘Target complexity’ acts like noise.
Deterministic Noise
A Detailed Experiment
\[
y = f(x) + \epsilon \sim \text{Gaussian} (\sum_{q=0}^{Q_f} \alpha_{q} x^q, \sigma^2)
\]
- Gaussian iid noise \(\epsilon\) with noise \(\sigma^2\).
- Some ‘uniform’ distribution on \(f(x)\) with complexity level \(Q_{f}\)
- data size \(N\)
The Overfit Measure
- \(g_{2} \in \mathcal{H}_{2}\)
- \(g_{10} \in \mathcal{H}_{10}\)
- \(E_{in}(g_{10}) \leq E_{in}(g_{2})\) for sure.
Overfit measure \(E_{out}(g_{10}) - E_{out}(g_{2})\)
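A toy re-run of this experiment (my own sketch; the exact numbers will differ from the lecture's tables, but with small \(N\) and noise, \(g_{10}\) typically wins on \(E_{in}\) while losing badly on \(E_{out}\)):

```python
# Fit degree-2 and degree-10 polynomials to few noisy samples of a 10th-order target.
import numpy as np

rng = np.random.default_rng(0)
coef = rng.standard_normal(11)                       # a 10th-order target f
f = np.polynomial.polynomial.Polynomial(coef)

def experiment(N=15, sigma=1.0):
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)        # y = f(x) + noise
    x_test = rng.uniform(-1, 1, 2000)
    y_test = f(x_test) + sigma * rng.standard_normal(2000)
    for deg in (2, 10):
        g = np.polynomial.polynomial.Polynomial.fit(x, y, deg)
        e_in = np.mean((g(x) - y) ** 2)
        e_out = np.mean((g(x_test) - y_test) ** 2)
        print(deg, round(e_in, 3), round(e_out, 3))

experiment()
```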
The Results
- impact of \(\sigma^2\) versus \(N\)
- Less noise (smaller \(\sigma^2\)) and more data points \(N\) mean less overfitting.
- impact of \(Q_{f}\) versus \(N\)
- A less complex target function (smaller \(Q_{f}\)) and more data points \(N\) mean less overfitting.
- (figure note: the class logo comes from these result plots; the middle two panels illustrate overfitting)
Impact of Noise and Data Size
- impact of \(\sigma^2\) versus \(N\): stochastic noise
- impact of \(Q_{f}\) versus \(N\): deterministic noise
- four reasons of serious overfitting:
- data size \(N \downarrow\)
- stochastic noise \(\uparrow\)
- deterministic noise \(\uparrow\)
- excessive power \(\uparrow\), VC dimension is too high
overfitting ‘easily’ happens.
Deterministic Noise
- if \(f \notin \mathcal{H}\): something of \(f\) cannot be captured by \(\mathcal{H}\)
- deterministic noise: difference between best \(h^* \in \mathcal{H}\) and \(f\)
- acts like stochastic noise (the idea is not new: a pseudo-random generator is deterministic yet looks like noise)
- difference to stochastic noise:
- depends on \(\mathcal{H}\)
- fixed for a given x
Philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :)
Dealing with Overfitting
Driving Analogy Revisited
| learning | driving |
| --- | --- |
| overfit | commit a car accident |
| use excessive \(d_{VC}\) | ‘drive too fast’ |
| noise | bumpy road |
| limited data size \(N\) | limited observations about road condition |
| start from simple model | drive slowly |
| data cleaning/pruning | use more accurate road information |
| data hinting | exploit more road information |
| regularization | put on the brakes |
| validation | monitor the dashboard |
All very practical techniques to combat overfitting.
Data Cleaning/Pruning
- if the outlier (the ‘5’ at the top of the figure) can be ‘detected’ by
- too close to other \(\circ\), or too far from other \(\times\)
- being classified wrongly by the current classifier
- …
- possible action 1: correct the label (data cleaning)
- possible action 2: remove the example (data pruning)
Possibly helps: but effect varies
Data Hinting
- slightly shifted/rotated digits carry the same meaning
- possible action: add virtual examples by shifting/rotating the given digits (data hinting)
Possibly helps: but watch out - virtual example not \(\overset{iid}{\sim} P(x, y)\)!
Conclusions
- What is overfitting?
- lower \(E_{in}\), but higher \(E_{out}\)
- The role of Noise and Data size
- overfitting ‘easily’ happens
- Deterministic Noise
- what \(\mathcal{H}\) cannot capture acts like noise
- Dealing with Overfitting
- data cleaning/pruning/hinting, and more
Lecture 14 Regularization
Overfitting happens with excessive power, stochastic/deterministic noise, and limited data.
Regularized Hypothesis set
Regularization: The Magic
overfit \(\implies\) ‘regularized fit’
- idea: ‘step back’ from \(\mathcal{H}_{10}\) to \(\mathcal{H}_{2}\)
- name history: function approximation for ill-posed problems
how to step back?
Stepping Back as Constraint
\(Q\)-th order polynomial transform for \(x \in \mathbb{R}\):
\[
\Phi_{Q}(x) = (1, x, x^2, \dots, x^Q)
\]
plus linear regression; for brevity denote \(\tilde{w}\) simply by \(w\)
- hypothesis w in \(\mathcal{H}_{10}\): \(w_0 + w_1 x + w_2 x^2 + \dots + w_{10} x^{10}\)
- hypothesis w in \(\mathcal{H}_{2}\): \(w_0 + w_1 x + w_2 x^2\)
- that is, \(\mathcal{H}_{2} = \mathcal{H}_{10}\) AND ‘Constraint that \(w_3 = \dots = w_{10} = 0\)’
- step back = ‘constraints’
Regression with Constraint
- \(\mathcal{H}_{10} \equiv \{w \in \mathbb{R}^{10+1}\}\)
- regression with \(\mathcal{H}_{10}\): \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(\mathcal{H}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while } w_3 = \dots = w_{10} = 0\}\)
- regression with \(\mathcal{H}_{2}\):
- \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(s.t. w_3 = \dots = w_{10} = 0\)
step back = constrained optimization of \(E_{in}\)
Why not just use \(w \in \mathbb{R}^{2+1}\)?
Regression with Looser Constraint
- \(\mathcal{H}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while } w_3 = \dots = w_{10} = 0\}\)
- regression with \(\mathcal{H}_{2}\):
- \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(s.t. w_3 = \dots = w_{10} = 0\)
Relax to:
- \(\mathcal{H'}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while at least } 8 \text{ of the } w_{q} = 0\}\)
- regression with \(\mathcal{H'}_{2}\):
- \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w)\)
- \(\text{s.t.} \sum_{q=0}^{10} [w_q \ne 0] \le 3\)
- more flexible than \(\mathcal{H}_2\): \(\mathcal{H}_{2} \subset \mathcal{H'}_{2}\)
- less risky than \(\mathcal{H}_{10}\): \(\mathcal{H'}_{2} \subset \mathcal{H}_{10}\)
Bad news for sparse hypothesis set \(\mathcal{H'}_{2}\): NP-hard to solve.
- counting non-zero weights is a discrete (Boolean) constraint, hence combinatorial.
Regression with Softer Constraint
Relax to:
\[
\begin{split}
\mathcal{H}(C) \equiv \{ w \in \mathbb{R}^{10+1} \text{ while } \|w\|^2 \le C \} \\
\underset{w\in\mathbb{R}^{10+1}}{\text{min}} E_{in}(w) \quad \text{s.t. } \sum_{q=0}^{10} w_q^2 \le C
\end{split}
\]
- \(\mathcal{H}(C)\): overlaps but not exactly the same as \(\mathcal{H'}_2\)
- soft and smooth Structure over \(C \ge 0\):
- \(\mathcal{H}(0) \subset \mathcal{H}(1.126) \subset \dots \subset \mathcal{H}(1126) \subset \dots \subset \mathcal{H}(\infty) = \mathcal{H}_{10}\)
regularized hypothesis \(w_{REG}\): optimal solution from regularized hypothesis set \(\mathcal{H}(C)\)
Weight Decay Regularization
Matrix Form of Regularized Regression Problem
\[
\begin{split}
\underset{w\in\mathbb{R}^{Q+1}}{\text{min}} E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T z_n - y_n)^2 \\
\text{s.t. } \sum_{q=0}^{Q} w_q^2 \le C
\end{split}
\]
- \(\sum_{n=1}^{N} (w^T z_n - y_n)^2 = (Zw - y)^T(Zw - y)\)
- \(w^Tw \le C\): feasible \(w\) within a radius-\(\sqrt{C}\) hypersphere
How to solve the constrained optimization problem?
The Lagrange Multiplier
\[
\underset{w\in\mathbb{R}^{Q+1}}{\text{min}} E_{in}(w) = \frac{1}{N} (Zw - y)^T(Zw - y)
\text{s.t. } w^Tw \le C
\]
- decreasing direction: \(- \nabla E_{in}(w)\)
- normal vector of \(w^T w = C\): \(w\)
- if \(- \nabla E_{in}(w)\) and \(w\) not parallel: can decrease \(E_{in}(w)\) without violating the constraint
- at the optimal solution \(w_{REG}\), \(- \nabla E_{in}(w_{REG}) \propto w_{REG}\)
- want: find Lagrange multiplier \(\lambda > 0\) and \(w_{REG}\) such that \(\nabla E_{in}(w_{REG}) + \frac{2 \lambda}{N} w_{REG} = 0\)
Augmented Error
- if oracle tells you \(\lambda > 0\), then solving \(\nabla E_{in}(w_{REG}) + \frac{2 \lambda}{N} w_{REG} = 0\)
\[
\frac{2}{N}(Z^T Z w_{REG} - Z^T y) + \frac{2 \lambda}{N} w_{REG} = 0
\]
\[
w_{REG} \leftarrow (Z^T Z + \lambda I)^{-1} Z^T y
\]
- \((Z^T Z + \lambda I)\) always invertible.
- called ridge regression in Statistics.
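In code, ridge regression is one linear solve (my own sketch, with \(Z\) already containing the transformed features):

```python
# Ridge regression (weight-decay regularization): w_REG = (Z^T Z + lambda I)^{-1} Z^T y.
import numpy as np

def ridge_regression(Z, y, lam):
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# lam = 0 recovers plain linear regression (when Z^T Z is invertible);
# larger lam corresponds to a smaller effective constraint radius C.
```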
Another view, equivalent to minimizing augmented error \(E_{aug}(w)\):
\[
E_{in}(w) + \frac{\lambda}{N} w^{T} w
\]
- regularization with augmented error instead of constrained \(E_{in}\)
\[
w_{REG} \leftarrow \underset{w}{\text{argmin}}\ E_{aug}(w) \quad \text{for given } \lambda > 0 \text{ or } \lambda = 0
\]
- minimizing unconstrained \(E_{aug}\) effectively minimizes some \(C\)-constrained \(E_{in}\)
The Results
- \(\lambda = 0\), overfitting
- \(\lambda = 0.0001\), OK
- \(\lambda = 0.01\), OK
- \(\lambda = 1\), underfitting, complexity is not enough.
- Philosophy: a little regularization goes a long way.
call ‘\(+\frac{\lambda}{N} w^T w\)’ weight-decay regularization:
- larger \(\lambda \iff\) prefer shorter \(w \iff\) effectively smaller \(C\)
Works with any transform + linear model
Some details: Legendre Polynomials
\[
\underset{w\in\mathbb{R}^{Q+1}}{\text{min}}
\frac{1}{N} \sum_{n=1}^{N} (w^T \Phi(x_n) - y_n)^2 +
\frac{\lambda}{N} \sum_{q=0}^{Q} w_q^2
\]
naive polynomial transform:
\[
\Phi(x) = (1, x, x^2, \dots, x^Q)
\]
When \(x_n \in [-1, +1]\), the high-order terms \(x_n^q\) are really small, so representing the target needs large \(w_q\); the regularizer then penalizes those high-order dimensions too much.
normalized polynomial transform:
\[
(1, L_1(x), L_2(x), \dots, L_Q(x))
\]
orthonormal basis functions called Legendre Polynomials
Regularization and VC Theory
Regularization and VC Theory
- Regularization by constrained-minimizing \(E_{in}\)
\[
\underset{w}{\text{min}} E_{in}(w) \text{s.t. } w^T w \le C
\]
- Regularization by Minimizing \(E_{aug}\)
\[
\underset{w}{\text{min}} E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^Tw
\]
- VC Guarantee of constrained-minimizing \(E_{in}\)
\[
E_{out}(w) \le E_{in}(w) + \Omega(\mathcal{H}(C))
\]
minimizing \(E_{aug}\): indirectly getting VC guarantee without confining to \(\mathcal{H}(C)\).
Another View of Augmented Error
- Augmented Error: \(E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^Tw\)
VC Bound: \(E_{out}(w) \le E_{in}(w) + \Omega(\mathcal{H}(C))\)
- regularizer \(w^Tw = \Omega(w)\): complexity of a single hypothesis
- generalization price \(\Omega(\mathcal{H})\): complexity of a hypothesis set
if \(\frac{\lambda}{N}\Omega(w)\) ‘represents’ \(\Omega(\mathcal{H})\) well, \(E_{aug}\) is a better proxy of \(E_{out}\) than \(E_{in}\)
- minimizing \(E_{aug}\):
- (heuristically) operating with the better proxy;
- (technically) enjoying the flexibility of whole \(\mathcal{H}\)
Effective VC Dimension
\[
\underset{w\in\mathbb{R}^{\tilde{d}+1}}{\text{min}}\
E_{in}(w) + \frac{\lambda}{N}\Omega(w)
\]
- model complexity?
- \(d_{VC}(\mathcal{H}) = \tilde{d} + 1\), because {\(w\)} ‘all considered’ during minimization
- {\(w\)} ‘actually needed’: \(\mathcal{H}(C)\), with some \(C\) equivalent to \(\lambda\)
- \(d_{VC}(\mathcal{H}(C))\): effective VC dimension \(d_{EFF}(\mathcal{H}, \underbrace{\mathcal{A}}_{\text{min} E_{aug}})\)
explanation of regularization: \(d_{VC}(\mathcal{H})\) large, while \(d_{EFF}(\mathcal{H}, \mathcal{A})\) small if \(\mathcal{A}\) regularized.
General Regularizers
General Regularizers \(\Omega(w)\)
want: constraint in the ‘direction’ of target function
- target-dependent: some properties of target, if known (domain knowledge)
- symmetry regularizer: \(\sum [q \text{ is odd}]\, w_q^2\)
- plausible: direction towards smoother or simpler
- regularization is to overcome noise
- stochastic/deterministic noise both non-smooth
- sparsity (L1) regularizer: \(\sum |w_q|\)
- friendly: easy to optimize for \(\mathcal{A}\)
- weight-decay (L2) regularizer: \(\sum w_q^2\)
a bad regularizer? no worries, it is guarded by \(\lambda\) (a small or zero \(\lambda\) switches it off)
- augmented error = error \(\widehat{\text{err}}\) + regularizer \(\Omega\)
- regularizer: target-dependent, plausible or friendly. ringing a bell?
error measure: user-dependent, plausible or friendly.
L2 and L1 Regularizer
- L2 Regularizer
- \(\Omega(w) = \sum_{q=0}^{Q} w_q^2 = \begin{Vmatrix} w \end{Vmatrix}_2^2\)
- convex, differentiable everywhere
- easy to optimize
- L1 Regularizer
- \(\Omega(w) = \sum_{q=0}^{Q} |w_q| = \begin{Vmatrix} w \end{Vmatrix}_1\)
- convex, not differentiable everywhere
- sparsity in solution
- the optimal solution often sits at a ‘corner’ of the L1 ball, where some components \(w_q\) are exactly 0.
L1 useful if needing sparse solution.
The Optimal \(\lambda\)
- more noise \(\iff\) more regularization needed
- more bumpy road \(\iff\) putting brakes more
- noise unknown - important to make proper choices
How to choose? Stay tuned for the next lecture!
Conclusions
- Regularized Hypothesis set
- original \(\mathcal{H}\) + constraint
- Weight Decay Regularization
- add \(\frac{\lambda}{N} w^Tw\) in \(E_{aug}\)
- Regularization and VC Theory
- regularization decreases \(d_{\text{EFF}}\)
- General Regularizers
- target-dependent, plausible or friendly.
Lecture 15 Validation
Regularization minimizes the augmented error, where the added regularizer effectively limits model complexity.
Model Selection Problem
So Many Models Learned
Even just for Binary Classification …
- \(\mathcal{A} \in\) {PLA, pocket, linear regression, logistic regression}
- \(T \in\) {100, 1000, 10000}
- \(\eta \in\) {1, 0.01, 0.0001}
- \(\Phi \in\) {linear, quadratic, poly-10, Legendre-poly-10}
- \(\Omega(w) \in\) {L2 regularizer, L1 regularizer, symmetry regularizer}
- \(\lambda \in\) {0, 0.01, 1}
in addition to your favorite combination, you may need to try other combinations to get a good \(g\).
Model Selection Problem
How to select? Visually? No! See Lecture 12.
- not every problem can be visualized
- visual selection secretly pays for the high VC dimension of the human brain
Model Selection by Best \(E_{in}\)
- select by best \(E_{in}\)?
- \(m^* = \underset{1 \le m \le M}{\text{argmin}}\ (E_m = E_{in}(\mathcal{A}_m(\mathcal{D})))\)
- \(\Phi_{1126}\) always more preferred over \(\Phi_1\)
- \(\lambda = 0\) always more preferred over \(\lambda = 0.1\) - overfitting
- if \(\mathcal{A}_1\) minimizes over \(\mathcal{H}_1\) and \(\mathcal{A}_2\) minimizes over \(\mathcal{H}_2\),
- \(\implies g_{m*}\) achieves minimal \(E_{in}\) over \(\mathcal{H}_{1} \cup \mathcal{H}_{2}\)
- \(\implies\) Extra model complexity, learning pays \(d_{VC}(\mathcal{H}_{1} \cup \mathcal{H}_{2})\)
- bad generalization?
selecting by \(E_{in}\) is dangerous.
Model Selection by Best \(E_{test}\)
- select by best \(E_{test}\), which is evaluated on a fresh \(\mathcal{D}_{test}\)?
- \(m^* = \underset{1 \le m \le M}{\text{argmin}}\ (E_m = E_{test}(\mathcal{A}_m(\mathcal{D})))\)
- generalization guarantee (finite-bin Hoeffding):
- yes! strong guarantee! \[
E_{out}(g_{m*}) \le E_{test}(g_{m*}) + O\left(\sqrt{\frac{\log M}{N_{\text{test}}}}\right)
\]
- But where is \(\mathcal{D}_{test}\)? Your boss’s safe, maybe?
selecting by \(E_{test}\) is infeasible and cheating!
Comparison between \(E_{in}\) and \(E_{test}\)
- in-sample error \(E_{in}\)
- calculate from \(\mathcal{D}\)
- feasible on hand
- ‘contaminated’ as \(\mathcal{D}\) also used by \(\mathcal{A}_{m}\) to select \(g_{m}\)
- test error \(E_{test}\)
- calculate from \(\mathcal{D}_{test}\)
- infeasible in boss’s safe
- ‘clean’ as \(\mathcal{D}_{test}\) never used for selection before
- something in between: \(E_{val}\)
- calculated from \(\mathcal{D}_{val} \subset \mathcal{D}\)
- feasible on hand
- clean if \(\mathcal{D}_{val}\) never used by \(\mathcal{A}_{m}\) before.
selecting by \(E_{val}\): legal cheating.
Validation
Validation Set \(\mathcal{D}_{val}\)
\[
\underbrace{\mathcal{D}}_{\text{size } N} \rightarrow \underbrace{\mathcal{D}_{train}}_{\text{size } N-K} \cup \underbrace{\mathcal{D}_{val}}_{\text{size } K}
\]
- \(\mathcal{D}_{val} \subset \mathcal{D}\): called validation set - ‘on hand’ simulation of test set to connect \(E_{val}\) with \(E_{out}\):
- \(\mathcal{D}_{val} \underset{iid}{\sim} P(x, y)\): select \(K\) examples from \(\mathcal{D}\) at random
- to make sure \(\mathcal{D}_{val}\) ‘clean’: feed only \(\mathcal{D}_{train}\) to \(\mathcal{A}_m\) for model selection.
- \(g_{m} = \mathcal{A}_m(\mathcal{D})\)
- \(g_{m}^- = \mathcal{A}_m(\mathcal{D}_{train})\)
\[
E_{out}(g_m^-) \le E_{val}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right)
\]
Model Selection by Best \(E_{val}\)
\[
m^* = \underset{1 \le m \le M}{\text{argmin}} \big(E_m = E_{val}(\mathcal{A}_m(\mathcal{D}_{train}))\big)
\]
- generalization guarantee for all \(m\):
- \(E_{out}(g_m^-) \le E_{val}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right)\)
- heuristic gain from \(N-K\) to \(N\):
- \(E_{out}(\underbrace{g_{m*}}_{\mathcal{A}_{m*}(\mathcal{D})}) \le E_{out}(\underbrace{g_{m*}^-}_{\mathcal{A}_{m}(\mathcal{D}_{train})})\)
- This follows from the learning curve: training on more data (\(N\) instead of \(N-K\)) generally yields lower \(E_{out}\).
- The general flow:
- divide \(\mathcal{D}\) to \(\mathcal{D}_{val}\) and \(\mathcal{D}_{train}\).
- run each \(\mathcal{A}_m\) on \(\mathcal{D}_{train}\) to obtain \(M\) candidates \(g_1^-, \dots, g_M^-\)
- pick the best \((\mathcal{H}_{m^*}, \mathcal{A}_{m^*})\) by \(E_{val}\)
- re-run \(\mathcal{A}_{m^*}\) on the full \(\mathcal{D}\) to obtain \(g_{m^*}\) (see the sketch below)
\[
E_{out}(g_{m^*}) \le E_{out}(g_{m^*}^-) \le E_{val}(g_{m^*}^-) + O\left(\sqrt{\frac{\log M}{K}}\right)
\]
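A minimal sketch of this flow under an assumed interface: each candidate exposes hypothetical `.fit(X, y)` (returning the fitted model) and `.error(X, y)` methods; none of these names come from the lecture.

```python
import numpy as np

def select_by_validation(models, X, y, K):
    """Train each candidate on D_train, pick by E_val, then retrain the winner on all of D."""
    N = len(y)
    idx = np.random.permutation(N)
    val_idx, train_idx = idx[:K], idx[K:]        # random split: D_val of size K, D_train of size N-K
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    e_val = [m.fit(X_tr, y_tr).error(X_val, y_val) for m in models]   # E_val(g_m^-)
    best = int(np.argmin(e_val))                 # m*
    return models[best].fit(X, y)                # heuristic gain: A_{m*} re-run on the full D
```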
Validation in Practice
use validation to select between \(\mathcal{H}_{\Phi_5}\) and \(\mathcal{H}_{\Phi_{10}}\)
- in-sample: selection with \(E_{in}\)
- optimal: cheating-selection with \(E_{test}\)
- sub-\(g\): selection with \(E_{val}\) and report \(g_{m*}^-\)
- full-\(g\): selection with \(E_{val}\) and report \(g_{m*}\)
- \(E_{out}(g_{m*}) \le E_{out}(g_{m*}^-)\) indeed.
Why is sub-\(g\) sometimes worse than in-sample selection?
- when \(\mathcal{D}_{val}\) grows, \(\mathcal{D}_{train}\) shrinks, so every \(g_m^-\) is trained on fewer examples and gets worse.
The Dilemma about \(K\)
reasoning of validation:
\[
E_{out}(g) \underset{(\text{small } K)}{\approx} E_{out}(g^-)
\underset{(\text{large } K)}{\approx} E_{val}(g^-)
\]
- large \(K\): every \(E_{val} \approx E_{out}\), but all \(g_m^-\) much worse than \(g_m\)
- small \(K\): every \(g_m^- \approx g_m\), but \(E_{val}\) far from \(E_{out}\)
Practical rule of thumb: \(K = \frac{N}{5}\)
Fun Time
Validation need not slow things down, and may even speed training up, because each candidate model is trained on the smaller \(\mathcal{D}_{train}\).
Leave-One-Out Cross Validation
Extreme Case: \(K = 1\)
reasoning of validation:
\[
E_{out}(g) \underset{(\text{small } K)}{\approx} E_{out}(g^-)
\underset{(\text{large } K)}{\approx} E_{val}(g^-)
\]
- take \(K = 1\)? \(\mathcal{D}_{val}^{(n)} = \{(x_n, y_n)\}\) and \(E_{val}^{(n)}(g_n^-) = \text{err}(g_{n}^-(x_n), y_n) = e_n\)
- make \(e_n\) closer to \(E_{out}(g)\)? - average over possible \(E_{val}^{(n)}\)
- Leave-One-Out cross validation estimate:
\[
E_{loocv}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_{n}
= \frac{1}{N} \sum_{n=1}^{N} \text{err}(g_{n}^-(x_{n}), y_{n})
\]
- it estimates how well algorithm \(\mathcal{A}\) performs on data set \(\mathcal{D}\) (a small sketch follows this list).
- hope: \(E_{loocv}(\mathcal{H}, \mathcal{A}) \approx E_{out}(g)\)
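A minimal LOOCV sketch, assuming the same hypothetical interface as before plus a `.predict` method and an `err` function supplied by the caller:

```python
import numpy as np

def loocv_error(model, X, y, err):
    """E_loocv = (1/N) * sum_n err(g_n^-(x_n), y_n): leave one point out, train on the rest."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n                 # D_n = D minus the n-th example
        g_minus = model.fit(X[mask], y[mask])    # g_n^- (assume .fit returns the fitted model)
        e[n] = err(g_minus.predict(X[n:n+1]), y[n:n+1])
    return e.mean()
```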
Illustration of Leave-One-Out
\[
E_{loocv}(\text{linear}) = \frac{1}{3} (e_1 + e_2 + e_3)
\]
\[
E_{loocv}(\text{constant}) = \frac{1}{3} (e_1 + e_2 + e_3)
\]
- Which one would you choose?
- With only three data points, the constant model gives a lower \(E_{loocv}\) and is the better choice.
- \(m^* = \underset{1 \le m \le M}{\text{argmin}} \big(E_m = E_{loocv}(\mathcal{H}_m, \mathcal{A}_m)\big)\)
Theoretical Guarantee of Leave-One-Out Estimate
Does \(E_{loocv}(\mathcal{H}, \mathcal{A})\) say something about \(E_{out}(g)\)?
Yes! for average \(E_{out}\) on size \(N-1\) data.
\[
\underset{\mathcal{D}}{\mathcal{E}}\, E_{loocv}(\mathcal{H}, \mathcal{A})
= \underset{\mathcal{D}}{\mathcal{E}}\, \frac{1}{N} \sum_{n=1}^{N} e_n
= \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}}{\mathcal{E}}\, e_n
= \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}_n}{\mathcal{E}}\, \underset{(x_n, y_n)}{\mathcal{E}}\, \text{err}(g_n^-(x_n), y_n)
= \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}_n}{\mathcal{E}}\, E_{out}(g_n^-)
= \frac{1}{N} \sum_{n=1}^{N} \overline{E_{out}}(N-1)
= \overline{E_{out}}(N-1)
\]
expected \(E_{loocv}(\mathcal{H}, \mathcal{A})\) says something about expected \(E_{out}(g^-)\)
- Often called ‘almost unbiased estimate of \(E_{out}(g)\)’
Leave-One-Out in Practice
- Selecting by \(E_{in}\): result not smooth, overfits.
- Selecting by \(E_{loocv}\): smoother result, avoids overfitting.
- \(E_{loocv}\) works much better than \(E_{in}\).
V-Fold Cross Validation
Disadvantage of Leave-One-Out Estimate
Computation issue:
\[
E_{loocv}(\mathcal{H}, \mathcal{A})
= \frac{1}{N} \sum_{n=1}^{N} e_n
= \frac{1}{N} \sum_{n=1}^{N} \text{err} (g_n^-(x_n), y_n)
\]
- \(N\) ‘additional’ training per model, not always feasible in practice.
- e.g., with \(|\mathcal{D}| = 1000\), each candidate model needs 1000 extra trainings, each on 999 points.
- except in 'special cases' with an analytic shortcut, such as linear regression (a formula is sketched right after this list).
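As a side note not spelled out in the lecture: for plain linear regression with squared error, the leave-one-out errors can be obtained from a single fit via the hat matrix \(H = X(X^TX)^{-1}X^T\), with \(\hat{y} = Hy\):
\[
E_{loocv} = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{y_n - \hat{y}_n}{1 - H_{nn}} \right)^2
\]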
Stability issue - due to the variance of single-point estimates:
- the Leave-One-Out curve is jagged, which makes it risky for selecting the best model; we do not want to mistakenly pick a low point that was merely lucky to have a low \(E_{cv}\).
V-fold Cross Validation
how to decrease computation need for cross validation?
- essence of Leave-One-Out cross validation: partition \(\mathcal{D}\) into \(N\) parts, each time taking \(N-1\) for training and 1 for validation
- \(V\)-fold cross-validation: randomly partition \(\mathcal{D}\) into \(V\) equal parts, each time taking \(V-1\) parts for training and 1 for validation
\[
E_{cv}(\mathcal{H}, \mathcal{A})
= \frac{1}{V} \sum_{v=1}^{V} E_{val}^{(v)}(g_v^-)
\]
- selection by \(E_{cv}\): \(m^* = \underset{1 \le m \le M}{\text{argmin}} \big(E_m = E_{cv}(\mathcal{H}_m, \mathcal{A}_m)\big)\) (sketch below)
Practical rule of thumb: \(V = 10\)
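A minimal \(V\)-fold sketch under the same hypothetical fit/predict interface as the LOOCV sketch; it cuts the extra trainings from \(N\) to \(V\) per candidate model:

```python
import numpy as np

def vfold_cv_error(model, X, y, err, V=10):
    """E_cv = (1/V) * sum_v E_val^{(v)}(g_v^-) over a random partition of D into V folds."""
    N = len(y)
    folds = np.array_split(np.random.permutation(N), V)   # random partition of D into V parts
    e = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)    # V-1 parts for training
        g_minus = model.fit(X[train_idx], y[train_idx])
        e.append(err(g_minus.predict(X[val_idx]), y[val_idx]))
    return float(np.mean(e))
```

Selection by \(E_{cv}\) then simply takes the argmin of this estimate over the candidate models.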
Final Words on Validation
‘Selecting’ Validation Tool
- \(V\)-fold generally preferred over single validation if computation allows
- 5-fold or 10-fold generally works well: it is usually not necessary to trade up from \(V\)-fold to Leave-One-Out
Nature of Validation
- all training models: select among hypotheses, like qualification match for World cup
- all validation Schemes: select among finalists, like knock-out stage
- all testing methods: just evaluate
Validation results are still more optimistic than true testing
- validation is still selection, so it mildly 'contaminates' the data and adds back some model complexity.
Do not fool yourself and others, report test result, not best validation result.
Conclusions
- Model Selection Problem
- Dangerous by \(E_{in}\) and dishonest by \(E_{test}\)
- Validation
- Select with \(E_{val}(\mathcal{D}_{train})\) while returning \(\mathcal{A}_{m*}(\mathcal{D})\)
- Leave-One-Out Cross Validation
- huge computation for an almost unbiased estimate
- V-Fold Cross Validation
- reasonable computation and performance
Lecture 16 Three Learning Principles (tricks to keep up your sleeve)
Occam’s Razor
An explanation of the data should be made as simple as possible, but no simpler - Albert Einstein
entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity) - William of Occam (1287 - 1347)
‘Occam’s razor’ for trimming down unnecessary explanation.
Occam’s Razor for Learning
- The simplest model that fits the data is also the most plausible.
- Two questions:
- What does it mean for a model to be 'simple'?
- How do we know that simpler is better?
Simple Model
- Simple hypothesis \(h\)
- small \(\Omega(h)\) = looks simple
- specifically by few parameters
- simple model \(\mathcal{H}\)
- small \(\Omega(\mathcal{H})\) = not many
- contains a small number of hypotheses
- connection
- \(|\mathcal{H}|\) of size \(2^{\ell}\) \(\implies\) each \(h\) specified by \(\ell\) bits
- small \(\Omega(\mathcal{H}) \implies\) small \(\Omega(h)\)
- simple: small hypothesis/model complexity
- either we use simple hypothesis set
- or we use regularization
Simple is Better
In addition to the math proofs that you have seen, philosophically:
- simple \(\mathcal{H}\)
- \(\implies\) smaller \(m_{\mathcal{H}}(N)\)
- \(\implies\) less likely to fit a data set perfectly: only about \(\frac{m_{\mathcal{H}}(N)}{2^N}\) of labelings can be fit
- for an arbitrary data set whose outputs are generated by coin flips (messy data), there is no pattern to fit
- so when a simple model does fit the data, the fit is significant evidence of real structure
Sampling Bias
Presidential Story
- 1948 US President election: Truman versus Dewey
- a newspaper ran a phone poll of how people voted and, based on the polling, printed the headline 'Dewey Defeats Truman'
The Big Smile Came from …
Truman, and yes he won
suspects for the mistake:
- editorial bug? - no
- bad luck of polling (\(\delta\))? - no
hint: phones were expensive
Sampling Bias
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
- technical explanation: data from \(P_1(x, y)\) but test under \(P_2 \ne P_1\): VC fails
- Philosophical explanation: study Math hard but test English: no strong test guarantee
- ‘minor’ VC assumption: data and testing both iid from \(P\)
Sampling Bias in Learning
A True Personal Story
Dealing with Sampling Bias
- Practical rule of thumb: match test scenario as much as possible.
- e.g. if test: ‘last’ user records ‘after’ \(\mathcal{D}\)
- training: emphasize later examples (KDDCup 2011)
- validation: use ‘late’ user records
- last puzzle: what is the danger when learning 'credit card approval' from the bank's existing records?
- the existing records cover only previously approved customers, so we do not know whether new applicants will follow the behavior of the existing users.
Data Snooping
Visual Data Snooping
Visualize: \(\mathcal{X} = \mathbb{R}^2\)
- full \(\Phi_2: z = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\), \(d_{VC} = 6\)
- or \(z = (1, x_1^2, x_2^2)\), \(d_{VC} = 2\)?
- or better \(z = (1, x_1^2 + x_2^2)\), \(d_{VC} = 2\)?
- or even better \(z = \text{sign}(0.6 - x_1^2 - x_2^2)\)?
Careful about your brain’s ‘model complexity’.
- For VC-safety, \(\Phi\) shall be decided without ‘Snooping’ data.
Data Snooping by Mere Shifting-Scaling
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
Data Snooping by Data Reusing
Research Scenario
- benchmark data \(\mathcal{D}\)
- paper1: propose \(\mathcal{H}_1\) that works well on \(\mathcal{D}\)
- paper2: find room for improvement, propose \(\mathcal{H}_2\) - and publish only if better than \(\mathcal{H}_1\) on \(\mathcal{D}\)
- paper3: find room for improvement, propose \(\mathcal{H}_3\) - and publish only if better than \(\mathcal{H}_2\) on \(\mathcal{D}\)
- If all papers from the same author in one big paper: bad generalization due to \(d_{VC}(\cup_m \mathcal{H}_m)\)
- step-wise: later author snooped data by reading earlier papers
If you torture the data long enough, it will confess :)
Dealing with Data Snooping
- truth - very hard to avoid, unless being extremely honest
- extremely honest: lock your test data in safe
- less honest: reserve validation and use cautiously
- be blind: avoid making modeling decisions by looking at the data (try not to peek at the data before deciding)
- be suspicious: interpret research results (including your own) by proper feeling of contamination
One secret to winning KDDCups: a careful balance between data-driven modeling (snooping) and validation (no-snooping).
Power of Three
Three Related Fields
Power of Three
- Data Mining
- use (huge) data to find property that is interesting
- difficult to distinguish ML and DM in reality
- Artificial Intelligence
- Compute something that shows intelligent behavior
- ML is one possible route to realize AI
- Statistics
- use data to make inference about an unknown process
- statistics contains many useful tools for ML
Three Theoretical Bounds
- Hoeffding: \(\text{P[BAD]} \le 2 \text{exp} (-2 \epsilon^2 N)\)
- one hypothesis
- useful for verifying/testing
- Multi-Bin Hoeffding: \(\text{P[BAD]} \le 2 M \text{exp} (-2 \epsilon^2 N)\)
- \(M\) hypotheses
- useful for validation
- VC: \(\text{P[BAD]} \le 4 m_{\mathcal{H}}(2N) \text{exp} (\dots)\)
- all \(\mathcal{H}\)
- useful for training
Three Linear Models
- PLA/pocket: \(h(x) = sign(s)\)
- plausible err = 0/1 (small flipping noise), minimized specially
- Linear regression \(h(x) = s\)
- friendly err = squared (easy to minimize), minimized analytically
- logistic regression \(h(x) = \theta(s)\)
- plausible err = CE (maximum likelihood), minimized iteratively
Three Key Tools
- Feature Transformation
- \(E_{in}(w) \rightarrow E_{in}(\tilde{w})\), \(d_{VC}(\mathcal{H}) \rightarrow d_{VC}(\mathcal{H}_{\Phi})\)
- by using more complicated \(\Phi\)
- lower \(E_{in}\)
- higher \(d_{VC}\)
- Regularization
- \(E_{in}(w) \rightarrow E_{in}(w_{REG})\), \(d_{VC}(\mathcal{H}) \rightarrow d_{\text{EFF}}(\mathcal{H}, \mathcal{A})\)
- by augmenting regularizer \(\Omega\)
- lower \(d_{\text{EFF}}\)
- higher \(E_{in}\)
- Validation
- \(E_{in}(h) \rightarrow E_{val}(g)\), \(\mathcal{H} \rightarrow \{g_1^-, \dots, g_M^-\}\)
- by reserving \(K\) examples as \(\mathcal{D}_{val}\)
- fewer choices
- fewer examples
Three Learning Principles
- Occam's Razor
- Sampling Bias
- Data Snooping
Three Future Directions
- More Transform
- More Regularization
- Less Label
Conclusions
- Occam’s Razor
- Sampling Bias
- match test scenario as much as possible
- Data Snooping
- any use of data is ‘contamination’
- Power of Three
- relatives, bounds, models, tools, principles