Week 2

1. When Can Machines Learn?

Hypothesis: ‘perceptron’

Each “tall” vector \(w\) represents a hypothesis \(h\); its score is the inner product with the “tall” vector \(x\).

A perceptron is equivalent to a linear (binary) classifier.

Perceptron Learning Algorithm

want \(g\) close to \(f\) (hard when \(f\) is unknown)

Ideally, \[ g(x_n) = f(x_n) = y_n \]

Difficulty: \(\mathcal{H}\) is an infinite space.

Idea: start from some \(g_0\), and correct its mistakes on \(\mathcal{D}\).

Algorithm:

  1. Find a mistake \((x_{n(t)}, y_{n(t)})\) of \(w_t\)
  2. Update \(w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}\)

Geometric explanation: when a positive example is misclassified, the angle between \(w\) and \(x\) is too big, so the update makes it smaller; when a negative example is misclassified, the update makes the angle bigger.

A fault confessed is half redressed.
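A minimal Python sketch of PLA (assuming numpy, inputs with \(x_0 = 1\) already prepended, and labels in \(\{-1, +1\}\); names and the iteration cap are illustrative):

```python
import numpy as np

def pla(X, y, max_iter=1000):
    """Perceptron Learning Algorithm (sketch).

    X: (N, d+1) inputs with x_0 = 1 prepended; y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])              # start from w_0 = 0
    for _ in range(max_iter):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:            # no mistakes left: done
            break
        n = mistakes[0]                   # find a mistake (x_n, y_n)
        w = w + y[n] * X[n]               # correct it: w <- w + y_n x_n
    return w
```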

Guarantee of PLA

The normalized inner product of \(w_f\) and \(w_t\) increases monotonically, so PLA stops after a finite number of iterations.

Non separable data

Pros: simple to implement, fast, works for any dimension \(d\).

Cons:

  • assumes linear separability
  • not fully sure how long halting takes (\(\rho\) depends on \(w_f\))

PLA Algorithm

Run enough iterations, keeping the best weights seen so far (the pocket idea).

Types of Learning

Learning with different output space

Binary classification * Credit card approval * Email spam/not spam * Patient sick/not sick * ad profitable/not profitable * answer correct/incorrect

Learning with different labels

Learning with different protocols

Learning with different input space

Lecture 4 Feasibility of Learning

Learning is impossible?

No free lunch!

Probability to the Rescue

Bin model.

in-sample \(\nu\) is likely close to the unknown \(\mu\).

Hoeffding’s Inequality:

\[ Pr(|\mu-\nu| \geq \epsilon) \leq 2\ exp(-2\epsilon^{2}N) \]

Probably Approximately Correct (PAC).
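A quick simulation sketch of the bin model (the values of \(\mu\), \(N\), \(\epsilon\) are arbitrary illustrative choices):

```python
import numpy as np

# Bin model: each sample of size N gives an in-sample fraction nu;
# Hoeffding says Pr(|mu - nu| >= eps) <= 2 exp(-2 eps^2 N).
rng = np.random.default_rng(0)
mu, N, eps, trials = 0.4, 100, 0.1, 100_000
nu = (rng.random((trials, N)) < mu).mean(axis=1)
bad = (np.abs(nu - mu) >= eps).mean()
print(bad, "<=", 2 * np.exp(-2 * eps**2 * N))   # empirical rate is below the bound
```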

Connection to Learning

Learning.

Unknown target \(f(x)\) plays the role of the unknown \(\mu\):

  • \(h\) is wrong (orange marble) \(\iff h(x) \ne f(x)\)
  • \(h\) is right (green marble) \(\iff h(x) = f(x)\)
  • check \(h\) on \(\mathcal{D} = \{(x_n, y_n)\}\)

Added components

\(E_{in}\): in-sample error

\(E_{out}\): out-of-sample error

The Formal Guarantee

  • Valid for all \(N\) and \(\epsilon\).
  • Doesn’t depend on \(E_{out}(h)\), \(f\) and \(P\) can stay unknown.
  • \(E_{in}(h) \approx E_{out}(h)\) is PAC.

Connection to real learning

BAD Data for Many \(h\)

no ‘freedom of choice’ for \(\mathcal{A}\); drawing such a dataset is like stepping on a landmine.

Bound of BAD data

If \(|\mathcal{H}| = M\) is finite, \(N\) large enough, for whatever \(g\) picked by \(\mathcal{A}\), \(E_{out}(g) \approx E_{in}(g)\).

If \(\mathcal{A}\) finds one \(g\) with \(E_{in}(g) \approx 0\), PAC guarantees \(E_{out}(g) \approx 0\).

Lecture 5 Training versus Testing

Recap and Preview

If \(|\mathcal{H}| = M\) is finite, \(N\) large enough, for whatever \(g\) picked by \(\mathcal{A}\), \(E_{out}(g) \approx E_{in}(g)\).

If \(\mathcal{A}\) finds one \(g\) with \(E_{in}(g) \approx 0\), PAC guarantees \(E_{out}(g) \approx 0\).

\[ E_{out}(h) \underbrace{\approx}_{test} E_{in}(h) \underbrace{\approx}_{train} 0 \]

Two Central Questions:

  1. Can we make sure that \(E_{out}(g)\) is close enough to \(E_{in}(g)\)?

  2. Can we make \(E_{in}(g)\) small enough?

What role does \(\underbrace{|\mathcal{H}|}_{M}\) play for these two questions?

Trade-off on \(M\)

  • small \(M\)

    1. Yes! \(Pr(BAD) \leq 2 \cdot M \cdot exp(\dots)\)
    2. No! too few choices!
  • large \(M\)

    1. No! \(Pr(BAD) \leq 2 \cdot M \cdot exp(\dots)\) is big.
    2. Yes! many choices
  • What if \(M = \infty\)

Todo:

  • establish a finite quantity that replaces \(M\) \[ Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 2 m_{\mathcal{H}} exp(-2 \epsilon^2 N) \]
  • justify the feasibility of learning for infinite \(M\)
  • study \(m_{\mathcal{H}}\) to understand its trade-off for ‘right’ \(\mathcal{H}\)

These will be covered in next 3 lectures :-)

Effective Number of Lines

Where did M come from

  • BAD events \(B_m\): \(|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\)
  • To give \(\mathcal{A}\) freedom of choice, need to bound \(Pr(B_1 \cup B_2 \cup \dots \cup B_M)\)
  • Worst case: all \(B_m\) non-overlapping; the union bound fails when \(M\) is infinite. \[ Pr(B_1 \cup B_2 \cup \dots \cup B_M) \leq Pr(B_1) + \dots + Pr(B_M) \]

Where did uniform bound fail?

  • BAD events \(B_m\): \(|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\), overlapping for similar hypotheses \(h_1 \approx h_2\)
  • Why?
    • \(E_{out}(h_1) \approx E_{out}(h_2)\)
    • for most \(\mathcal{D}\), \(E_{in}(h_1) \approx E_{in}(h_2)\)
  • the union bound over-estimates
  • Solution: group similar hypotheses by kind.

How Many lines Are There?

\(\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}\)

  • how many lines? \(\infty\)
  • how many kinds of lines from \(\mathbf{x_1}\) point of view? 2 kinds:
    • \(h_1(x_1) = +1\) and \(h_2(x_1) = -1\)
  • how many kinds of lines for 4 points? At most 14.

Effective Number of Lines

It means the maximum number of kinds of lines with respect to \(N\) inputs \(x_1, \dots, x_N\)

  • must be \(\leq 2^N\)
  • finite ‘grouping’ of infinitely-many lines \(\in \mathcal{H}\)
  • wish:

\[ Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 2\ eff(N)\ exp(-2 \epsilon^2 N) \]

If

  1. eff(N) can replace \(M\) and
  2. eff(N) << \(2^N\)

Then learning is possible for infinite lines :-)

Effective Number of Hypothesis

Dichotomy: mini hypothesis

  • Definition: hypothesis ‘limited’ to the eyes of \(x_1, \dots, x_n\), i.e. \[ h(x_1, \dots, x_n) = (h(x_1), \dots, h(x_n)) \in \{\times, \circ\}^N \]

  • Dichotomy set: \(\mathcal{H}(x_1, \dots, x_n)\): all dichotomies implemented by \(\mathcal{H}\) on \(x_1, \dots, x_n\).
  • Dichotomy set is bounded by \(2^N\), \(\mathcal{H}\) can be infinite.
  • \(|\mathcal{H}(x_1, \dots, x_n)|\) is a candidate for replacing \(M\).

Growth Function

To remove the dependence of the dichotomy-set size on the particular inputs, define \(m_{\mathcal{H}}(N) = \underset{x_1, \dots, x_N}{\max} |\mathcal{H}(x_1, \dots, x_N)|\). For example, for 2D perceptrons, \(m_{\mathcal{H}}(1) = 2, m_{\mathcal{H}}(2) = 4, m_{\mathcal{H}}(3) = 8, m_{\mathcal{H}}(4) = 14\).

It is called growth function and bounded by \(2^N\).

Growth Function for Positive Rays

  • \(\mathcal{X} = \mathbb{R}\)
  • \(\mathcal{H}\) contains \(h\), where each \(h(x) = \text{sign}(x-a)\) for some threshold \(a\).
  • positive half of 1D perceptrons.
  • \(m_{\mathcal{H}}(N)=N+1 \ll 2^N\)

Growth Function for Positive Intervals

  • \(\mathcal{X} = \mathbb{R}\)
  • \(\mathcal{H}\) contains \(h\), where each \(h(x) = +1\) iff \(x \in [\ell,r)\), \(-1\) otherwise. \[ m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \frac{N^2}{2} + \frac{N}{2} + 1 \ll 2^N \]

Growth Function for convex set

  • If \(x_1, \dots, x_N\) lie on a big circle, then every dichotomy can be implemented. \[ m_{\mathcal{H}}(N) = 2^N \]
  • call those \(N\) inputs ‘shattered’ by \(\mathcal{H}\).

Break Point

The 4 growth functions

  • positive ray, \(m_{\mathcal{H}}(N)=N+1\)
  • positive intervals, \(\frac{N^2}{2} + \frac{N}{2} + 1\)
  • convex set, \(2^N\)
  • 2D perceptrons: \(m_{\mathcal{H}}(N) < 2^N\) in some cases

Can we replace \(M\) by \(m_{\mathcal{H}}(N)\)? Polynomial good, exponential bad…

\[ Pr(|E_{in}(g)-E_{out}(g)| > \epsilon) \overset{?}{\leq} 2\ m_{\mathcal{H}}(N) \ exp(-2 \epsilon^2 N) \]

Break point of \(\mathcal{H}\)

  • 2D perceptrons: for three inputs, there ‘exists’ a placement that can be shattered; for four inputs, ‘for all’ placements, no shattering
  • if no \(k\) inputs can be shattered by \(\mathcal{H}\), then \(k\) is a break point
  • any \(k+i\) (\(i \geq 0\)) is also a break point
  • conjecture: with break point \(k\), \(m_{\mathcal{H}}(N) = O(N^{k-1})\)

Conclusion

  • Recap and Preview:
    • two questions, \(E_{out} \approx E_{in}\) and \(E_{in}(g) \approx 0\)
  • Effective No. of lines
  • Effective No. of Hypothesis
  • Break point
    • When \(m_{\mathcal{H}}\) becomes non-exponential

Lecture 6 Theory of Generalization

Restriction of Break Point

  • max possible \(m_{\mathcal{H}}(3)\) when the break point is \(k=2\): 4 dichotomies (4 dichotomies need not shatter any two points, but any 5 would)
  • the break point \(k\) restricts the maximum possible \(m_{\mathcal{H}}(N)\) a lot for \(N > k\)

  • idea: \[ m_{\mathcal{H}}(N) \leq \text{max possible } m_{\mathcal{H}}(N) \text{ given break point } k \leq \text{poly}(N) \]

Bounding Function: Basic Cases

  • bounding function \(B(N, k)\): max possible \(m_{\mathcal{H}}\) when break point \(= k\)
  • combinatorial quantity: max number of length-\(N\) vectors over \(\{\circ, \times\}\) such that no length-\(k\) sub-vector is shattered
  • new goal: \(B(N, k) \leq \text{poly}(N)\)

Bounding Function: Inductive Cases

  • Estimating \(B(4, 3)\): for 4 points there are \(2^4 = 16\) possible \((\circ, \times)\) patterns, and each can either be in or out of the dichotomy set, so there are \(2^{16}\) candidate sets; after exhaustively searching them, the winner has 11 dichotomies.

  • Estimating \(B(4, 3)\), \(B(4, 3) = 11 = 2\alpha + \beta\). \(\alpha+\beta\): dichotomies on \((x_1, x_2, x_3)\). \(\alpha+\beta \leq B(3, 3)\)

  • If we only look at the orange part, no two points can be shattered, so \(\alpha \leq B(3, 2)\).

  • We have \(B(4, 3) \leq B(3, 3) + B(3, 2)\). And \(B(N, k) \leq B(N-1, k) + B(N-1, k-1)\).

Bounding Function: The Theorem

\[ B(N,k) \leq \sum_{i=0}^{k-1} \binom{N}{i} \]

The bottom line is \(m_{\mathcal{H}}\) is poly(N) if break point exists.
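The induction \(B(N, k) \leq B(N-1, k) + B(N-1, k-1)\) is easy to compute; a small Python sketch (base cases follow from the definitions above) that also checks the closed form:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    """Bounding function via B(N,k) <= B(N-1,k) + B(N-1,k-1) (sketch)."""
    if k == 1:
        return 1                 # no single point may take both labels
    if N < k:
        return 2 ** N            # break point not reached: no restriction
    if N == k:
        return 2 ** N - 1        # everything except one full shattering
    return B(N - 1, k) + B(N - 1, k - 1)

# matches the closed form sum_{i=0}^{k-1} C(N, i)
assert B(4, 3) == 11
assert B(10, 3) == sum(comb(10, i) for i in range(3))
```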

Three Break Points

2D perceptrons: \(m_{\mathcal{H}}(4) = 14 < 2^4\): break point at 4.

A pictorial Proof

want

\[ Pr(\exists h \in \mathcal{H}\ s.t.\ |E_{in}(h)-E_{out}(h)| > \epsilon) \leq 2\ m_{\mathcal{H}}(N) \ exp(-2 \epsilon^2 N) \]

we can get \[ Pr(\exists h \in \mathcal{H}\ s.t.\ |E_{in}(h)-E_{out}(h)| > \epsilon) \leq 2\cdot 2\ m_{\mathcal{H}}(2N) \ exp(-2 \cdot \frac{1}{16} \epsilon^2 N) \]

Step 1: Replace \(E_{out}\) by \(E_{in}'\)

  • How? sample verification set \(D'\) of size N to calculate \(E_{in}'\)
  • Bad \(h\) of \(E_{in}-E_{out}\) can probably infer bad \(h\) of \(E_{in}-E_{in}'\)

Step 2: Decompose \(\mathcal{H}\) by kind

\[ BAD \leq 2Pr(\exists h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2) \leq 2m_{\mathcal{H}}(2N)Pr(fixed\ h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2) \]

Step 3: Use Hoeffding without Replacement

VC bound:

\[ BAD \leq 2m_{\mathcal{H}}(2N)Pr(fixed\ h \in \mathcal{H}\ s.t. |E_{in}(h)-E_{in}'(h)| > \epsilon/2) \leq 4m_{\mathcal{H}}(2N)exp(-\frac{1}{8}\epsilon^2N) \]

Conclusion

  • Restriction of Break Point: a break point at \(k\) also breaks all larger numbers of points
  • Bounding Function: Basic Cases, \(B(N,k)\) bounds \(m_{\mathcal{H}}(N)\)
  • Bounding Function: Inductive Cases, \(B(N,k)\) is poly(N)
  • A pictorial Proof: \(m_{\mathcal{H}}(N)\) can replace \(M\) with a few changes.

Lecture 7 The VC Dimension

Definition of VC Dimension

Recap: More on Growth Function

\(m_{\mathcal{H}}(N)\) of break point \(k \leq B(N,k)=\sum_{i=0}^{k-1} \binom{N}{i} \sim O(N^{k-1})\)

Recap: More on VC bound

For any \(g=\mathcal{A}(\mathcal{D}) \in \mathcal{H}\) and ‘statistically’ large \(\mathcal{D}\)

\[ \begin{split} \mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \\ \leq \mathbb{P}_{\mathcal{D}}(\exists h \in \mathcal{H}\ s.t.\ |E_{in}(h)-E_{out}(h)| > \epsilon) \\ \leq 4m_{\mathcal{H}}(2N)exp(-\frac{1}{8}\epsilon^2N) \\ \leq 4(2N)^{k-1}exp(-\frac{1}{8}\epsilon^2N) \end{split} \]

If:

  1. \(m_{\mathcal{H}}\) breaks at \(k\) (good \(\mathcal{H}\))
  2. \(N\) large enough (good \(\mathcal{D}\))
  3. \(\mathcal{A}\) picks a \(g\) with small \(E_{in}\) (good \(\mathcal{A}\)),

then learning is feasible.

VC Dimension

Definition:

  • \(d_{VC}(\mathcal{H})\), the VC dimension of \(\mathcal{H}\): the largest \(N\) for which \(m_{\mathcal{H}}(N) = 2^N\)
  • the most inputs that \(\mathcal{H}\) can shatter
  • \(N \leq d_{VC} \Longrightarrow \mathcal{H}\) can shatter some \(N\) inputs
  • \(k > d_{VC} \Longrightarrow k\) is a break point

\[ m_{\mathcal{H}}(N) \leq N^{d_{VC}} \text{ if } N \geq 2,\ d_{VC} \geq 2 \]

The Four VC Dimensions

  • positive ray, \(d_{VC} = 1\)
  • positive intervals, \(d_{VC} = 2\)
  • convex set, \(d_{VC} = \infty\)
  • 2D perceptrons, \(d_{VC} = 3\)

VC Dimension and Learning

  • regardless of learning algorithm
  • regardless of input distribution \(P\)
  • regardless of target function \(f\)


VC Dimension of Perceptrons

2D PLA revisited

  • linearly separable \(\mathcal{D}\) \(\Rightarrow\) PLA converges within \(T\) iterations \(\Rightarrow\) \(E_{in}(g) = 0\)
  • with \(x_n \sim P\) and \(y_n = f(x_n)\) \(\Rightarrow\) \(\mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq \dots\) because \(d_{VC} = 3\) \(\Rightarrow\) for large \(N\), \(E_{in}(g) \approx E_{out}(g) \approx 0\)

VC Dimension of Perceptrons

  • 1D (pos/neg rays): \(d_{VC}=2\)
  • 2D: \(d_{VC}=3\)
  • d-D: \(d_{VC} \overset{?}{=} d+1\)
    • \(d_{VC} \geq d+1\)
    • \(d_{VC} \leq d+1\)

\(d_{VC} \geq d+1\)

  • a special set of \(d+1\) inputs: the rows of an invertible \((d+1) \times (d+1)\) matrix \(X\); for any label vector \(y\), the weights \(w = X^{-1} y\) realize it, so some \(d+1\) inputs are shattered

\(d_{VC} \leq d+1\)

  • linear dependency restricts dichotomies: any \(d+2\) inputs are linearly dependent, e.g. \(x_4 = x_2 + x_3 - x_1\), so \[ w^Tx_4 = w^Tx_2 + w^Tx_3 - w^Tx_1 > 0 \] whenever \(w^Tx_2 > 0\), \(w^Tx_3 > 0\), and \(w^Tx_1 < 0\): the dichotomy labeling \(x_4\) as \(-1\) is then impossible, so no \(d+2\) inputs can be shattered

Physical Intuition of VC Dimension

Degrees of Freedom

Two Old Friends

  • Positive Rays, free parameters: \(a\)
  • Positive Intervals, free parameters: \(l, r\)
  • practical rule of thumb: \(d_{VC} \approx\) # of free parameters(but not always).

\(M\) and \(d_{VC}\)

  • small \(d_{VC}\)
  • large \(d_{VC}\)
  • using the right \(d_{VC}\) (or \(\mathcal{H}\))

Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity

For any \(g=\mathcal{A}(\mathcal{D}) \in \mathcal{H}\) and ‘statistival’ large \(\mathcal{D}\)

\[ \mathbb{P}_{\mathcal{D}}(|E_{in}(g)-E_{out}(g)| > \epsilon) \leq 4(2N)^{d_{VC}}exp(-\frac{1}{8}\epsilon^2N) = \delta \]

\[ \epsilon = \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})} \]

With probability \(1-\delta\), the generalization error satisfies \(|E_{in}(g)-E_{out}(g)| \leq \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}\), so

\[ E_{out}(g) \leq E_{in}(g)+\sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})} \]

THE VC message

\(\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N}ln(\frac{4(2N)^{d_{VC}}}{\delta})}\) is called model complexity penalty.

  • \(d_{VC} \uparrow\), \(E_{in} \downarrow\) but \(\Omega \uparrow\)
  • \(d_{VC} \downarrow\), \(E_{in} \uparrow\) but \(\Omega \downarrow\)
  • best \(d_{VC}\) is in the middle
  • powerful \(\mathcal{H}\) not always good!

VC Bound Rephrase: Sample Complexity

  • need \(N \approx 10000\, d_{VC}\) in theory (see the check below)
  • practical rule of thumb: \(N \approx 10\, d_{VC}\)
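A quick numeric check of the theoretical requirement, with the illustrative choices \(d_{VC} = 3\), \(\epsilon = 0.1\), \(\delta = 0.1\):

```python
import numpy as np

# VC bound 4 (2N)^(d_vc) exp(-eps^2 N / 8); find N that makes it <= delta = 0.1.
def vc_bound(N, d_vc=3, eps=0.1):
    return 4 * (2 * N) ** d_vc * np.exp(-(eps ** 2) * N / 8)

for N in [1_000, 10_000, 29_300, 100_000]:
    print(N, vc_bound(N))   # drops below 0.1 around N ~ 29,300 ~ 10000 * d_vc
```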

Looseness of VC Bound

  • Hoeffding for unknown \(E_{out}\): works for any distribution and any target
  • \(m_{\mathcal{H}}\) instead of \(|\mathcal{H}(x_1, \dots, x_n)|\), any data
  • \(N^{d_{VC}}\) instead of \(m_{\mathcal{H}}\), any \(\mathcal{H}\) of same \(d_{VC}\)
  • union bound on worst cases, any choice made by \(\mathcal{A}\)
  • but it is hardly better, and similarly loose for all models
  • Philosophical message of VC bound is important for improving ML

Conclusions

  • Definition VC Dimension: max non-break point
  • VC Dimension of Perceptrons: \(d+1\)
  • Physical Intuition of VC Dimension: \(\approx\) # free parameters
  • Interpreting VC Dimension: model and sample complexity

Lecture 8 Noise and Error

Noise and Probabilistic Target

What if there is noise?

Noise

  • noise in y: good customer mislabeled as bad
  • noise in y: same customer different labels
  • noise in x: inaccurate information

Does VC bound work under noise?

Probabilistic Marbles

  • Probabilistic (noise) marbles
    • marble \(x \sim P(x)\)
    • probabilistic color: orange \(\iff y \ne h(x)\), with \(y \sim P(y|x)\)
  • VC holds for
    • \(x \overset{i.i.d}{\sim} P(x)\)
    • \(y \overset{i.i.d}{\sim} P(y|x)\)
    • \((x, y) \overset{i.i.d}{\sim} P(x, y)\)

Target Distribution \(P(y|x)\)

  • can be viewed as ‘ideal mini-target’ + noise, e.g.
    • \(P(\circ|x) = 0.7, P(\times|x) = 0.3\)
    • ideal mini-target \(f(x) = \circ\)
    • ‘flipping’ noise level = 0.3
  • deterministic target \(f\): special case of target distribution
    • \(P(y|x) = 1\) for \(y = f(x)\)
    • \(P(y|x) = 0\) for \(y \ne f(x)\)
  • goal of learning: predict the ideal mini-target (w.r.t. \(P(y|x)\)) on often-seen inputs (w.r.t. \(P(x)\))

New Learning flow

VC still works, pocket algorithm explained :)

Error Measure

final hypothesis \(g \approx f\)

  • how well? previously, considered the out-of-sample measure \[ E_{out}(g) = \underset{x \sim P}{\mathcal{E}}[g(x) \ne f(x)] \]

  • more generally, error measure \(E(g, f)\)
  • naturally considered
    • out-of-sample: averaged over unknown \(x\)
    • pointwise: evaluated on one \(x\)
    • classification: \([\text{prediction} \ne \text{target}]\), often also called ‘0/1 error’

Pointwise Error Measure

\[ E_{out}(g) = \underset{X \sim P}{\mathcal{E}} \underbrace{[g(x) \ne f(x)]}_{error(g(x), f(x))} \]

  • in-sample

\[ E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N}err(g(x_n), f(x_n)) \]

  • out-of-sample

\[ E_{out}(g) = \underset{X \sim P}{\mathcal{E}} err(g(x), f(x)) \]

will mainly consider pointwise err for simplicity

Two Important Pointwise Error Measures

  • 0/1 error
    • \(\text{err}(\tilde{y}, y) = [\tilde{y} \ne y]\)
    • correct or not
    • often for classification
  • squared error
    • \(\text{err}(\tilde{y}, y) = (\tilde{y}-y)^2\)
    • how far \(\tilde{y}\) is from \(y\)
    • often for regression

How does err ‘guide’ learning?

Ideal Mini-Target

\[ P(y=1|x) = 0.2, P(y=2|x) = 0.7, P(y=3|x) = 0.1 \]

  • 0/1 error, \(err(\tilde{y}, y) = [\tilde{y} \ne y]\)
    • \(\tilde{y} = 1\), avg. err 0.8
    • \(\tilde{y} = 2\), avg. err 0.3
    • \(\tilde{y} = 3\), avg. err 0.9
    • \(\tilde{y} = 1.9\), avg. err 1
    • \(f(x) = \underset{y\in\mathcal{y}}{argmax}(P(y|x))\)
  • squared error, \(err(\tilde{y}, y) = (\tilde{y}-y)^2\)
    • \(\tilde{y} = 1\), avg. err 1.1
    • \(\tilde{y} = 2\), avg. err 0.3
    • \(\tilde{y} = 3\), avg. err 1.5
    • \(\tilde{y} = 1.9\), avg. err 0.29
    • \(f(x) = \sum_{y\in\mathcal{y}} y \cdot P(y|x)\)

Learning Flow with Error Measure

Extended VC theory/‘Philosophy’ works for most \(\mathcal{H}\) and err.

Algorithmic Error Measure

Choice of Error Measure

two types of error: false accept and false reject

0/1 error penalizes both types equally.

  • supermarket: fingerprint for discount
    • false reject: very unhappy customer, lose future business, (penalty 10)
    • false accept: give away a minor discount, intruder left fingerprint (penalty 1)
  • CIA: fingerprint for entrance
    • false accept: very serious consequences! (penalty 1000)
    • false reject: unhappy employee, but so what? :) (penalty 1)

Take-home Message for Now

err is application/user-dependent

Algorithm Error Measures \(\widehat{error}\)

  • true: just err
  • plausible:
    • 0/1 minimum ‘flipping noise’ - NP-hard to optimize
    • squared: minimum Gaussian noise
  • friendly: easy to optimize for \(\mathcal{A}\)
    • closed-form solution
    • convex objective function

err: application goal

\(\widehat{err}\): a key part of many \(\mathcal{A}\)

Weighted Classification

\[ \begin{array}{c|cc} & h(x) = +1 & h(x) = -1 \\ \hline f(x) = +1 & 0 & 1 \\ f(x) = -1 & 1000 & 0 \end{array} \]

out-of-sample error with these weights:

\[ E_{out}(h) = \underset{(x,y) \sim P}{\mathcal{E}} \left( \begin{cases} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{cases} \right) \cdot [y \ne h(x)] \]

in-sample \(E_{in}^{w}(h)\) is the analogous weighted average over \(\mathcal{D}\)

Minimizing \(E_{in}\) for weighted Classification

Naive Thoughts:

  • PLA: if \(\mathcal{D}\) is linearly separable, the weights don’t matter, since PLA finds \(w\) with \(E_{in}^{w} = 0\) anyway
  • pocket: modify the pocket-replacement rule
    • replace \(\hat{w}\) by \(w_{t+1}\) if \(w_{t+1}\) reaches a smaller \(E_{in}^{w}\) than \(\hat{w}\)

Systematic Route: Connect \(E_{in}^w\) and \(E_{in}^{0/1}\)

  • idea: virtually duplicate each data point with label \(-1\) 1000 times; then ordinary \(E_{in}^{0/1}\) on the copied data equals \(E_{in}^{w}\) on the original.

weighted pocket algorithm

using ‘virtual copying’, the weighted pocket algorithm includes:

  • weighted PLA: randomly check -1 example mistakes with 1000 times more probability
  • weighted pocket replacement: if \(w_{t+1}\) reaches smaller \(E_{in}^{w}\) than \(\hat{w}\), replace \(\hat{w}\) by \(w_{t+1}\)

systematic route (called reduction): can be applied to many other algorithms
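A sketch of the weighted in-sample error that virtual copying computes (the function name and the 1000:1 weights from the example above are illustrative):

```python
import numpy as np

def weighted_ein(w, X, y, neg_weight=1000):
    """E_in^w: mistakes on y = -1 examples count neg_weight times (sketch).

    Equivalent to the 0/1 error after virtually copying each y = -1
    example neg_weight times.
    """
    mistakes = (np.sign(X @ w) != y)
    weights = np.where(y == -1, neg_weight, 1)
    return (weights * mistakes).sum() / weights.sum()
```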

Aside: the same weighting idea matters for unbalanced data, where an unweighted classifier can look deceptively good by predicting the majority class.

Conclusions

  • Noise and Probabilistic Target
    • can replace \(f(x)\) with \(P(y|x)\)
  • Error Measure
    • affect ‘ideal’ target
  • Algorithmic Error Measure
    • user-dependent => plausible or friendly
  • Weighted Classification
    • easily done by virtual ‘example copying’

Lecture 9 Linear Regression

Linear Regression Problem

Credit Limit Problem

credit limit? $100,000

Linear Regression Hypothesis

  • For \(x = (x_0, x_1, \dots, x_d)\) ‘features of customer’, approximate the desired credit limit with a weighted sum:

\[ y \approx \sum_{i = 0}^{d} w_i x_i \]

  • linear regression hypothesis: \(h(x) = w^T x\)

  • \(h(x)\): like perceptron, but without the sign

The Error Measure

squared error \(\text{err}(\hat{y}, y) = (\hat{y} - y)^2\)

  • in sample

\[ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (\underbrace{h(x_n)}_{w^T x_n} - y_n)^2 \]

  • out-of-sample

\[ E_{out}(w) = \underset{(x, y) \sim P}{\mathcal{E}} (w^T x - y)^2 \]

Linear Regression Algorithm

Matrix form of \(E_{in}(w)\)

\[ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 = \frac{1}{N} \begin{Vmatrix} x_1^T w - y_1 \\ x_2^T w - y_2 \\ \dots \\ x_N^T w - y_N \\ \end{Vmatrix}^2 = \frac{1}{N} \begin{Vmatrix} Xw-y \end{Vmatrix}^2 \]

  • \(E_{in}(w)\): continuous, differentiable, convex
  • necessary condition of ‘best’ \(w\) - not possible to ‘roll down’
  • task: find \(w_{LIN}\) such that \(\nabla E_{in}(w_{LIN}) = 0\)

The Gradient \(\nabla E_{in}(w)\)

\[ E_{in}(w) = \frac{1}{N} \begin{Vmatrix} Xw-y \end{Vmatrix}^2 = \frac{1}{N} (w^T X^T X w - 2 w^T X^T y + y^Ty) = \frac{1}{N} (w^T A w - 2 w^T b + c) \]

For vector \(w\), we have

\[ \nabla E_{in}(w) = \frac{1}{N} (2Aw - 2b) = \frac{2}{N} (X^TXw - X^Ty) = 0 \]

Optimal Linear Regression Weights

  • invertible \(X^T X\)
    • easy! unique solution \(w_{LIN} = \underbrace{(X^T X)^{-1}X^T}_{\text{pseudo-inverse } X^+} y\)
    • often the case because \(N \gg d+1\)
  • singular \(X^T X\)
    • many optimal solutions
  • practical suggestion:
    • use well-implemented routine to solve \(X^+\) instead of \((X^T X)^{-1}X^T\) for numerical stability.

Linear Regression Algorithm

  1. from \(\mathcal{D}\), construct input matrix \(X\) and output vector \(y\) by

\[ X = \underbrace{ \begin{bmatrix} --x_1^T-- \\ --x_2^T-- \\ \vdots \\ --x_N^T-- \\ \end{bmatrix} }_{N\times(d+1)} \qquad y = \underbrace{ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \\ \end{bmatrix} }_{N\times1} \]

  2. calculate the pseudo-inverse \(\underbrace{X^+}_{(d+1)\times N}\)
  3. return \(\underbrace{w_{LIN}}_{(d+1) \times 1} = X^+y\)
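A minimal Python sketch of these steps, following the practical suggestion above to use a well-implemented routine (numpy's SVD-based pinv) rather than forming \((X^T X)^{-1}X^T\):

```python
import numpy as np

def linreg(X, y):
    """Linear regression: w_LIN = X^+ y (sketch).

    X: (N, d+1) input matrix with x_0 = 1 prepended; y: (N,) real targets.
    np.linalg.pinv computes the pseudo-inverse stably via SVD.
    """
    return np.linalg.pinv(X) @ y
```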

Generalization Issue

\[ \underbrace{w_{LIN}}_{(d+1) \times 1} = X^+y \]

  • No!
    • analytic (closed-form) solution, ‘instantaneous’
    • not improving \(E_{in}\)
  • Yes!
    • good \(E_{in}\)? Yes, optimal
    • good \(E_{out}\)? yes, finite \(d_{vc}\) like perceptrons
    • improving iteratively? somewhat, within an iterative pseudo-inverse routine
  • if \(E_{out}(w_{LIN})\) is good, learning ‘happened’!

Benefit of analytic solution: ‘simpler-than-VC’ Guarantee

\[ \overline{E_{in}} = \underset{\mathcal{D} \sim P^N}{\mathcal{E}}{E_{in}(w_{LIN} w.r.t. \mathcal{D})} \overset{\text{to be shown}}{=} \text{noise level} \cdot (1 - \frac{d+1}{N}) \]

We have:

\[ E_{in}(w_{LIN}) = \frac{1}{N} \begin{Vmatrix} y - \hat{y} \end{Vmatrix}^2 = \frac{1}{N} \begin{Vmatrix} y - XX^{+}y \end{Vmatrix}^2 = \frac{1}{N} \begin{Vmatrix} (I - XX^{+})y \end{Vmatrix}^2 \]

We call \(XX^+\) the hat matrix \(H\) because it puts a hat on \(y\): \(\hat{y} = Hy\).

Geometric View of Hat Matrix

  • in \(\mathbb{R}^N\)
    • \(\hat{y} = Xw_{LIN}\) within the span of \(X\) columns
    • \(y-\hat{y}\) smallest: \(y-\hat{y} \bot\) span
    • \(H\): project \(y\) to \(\hat{y} \in\) span
    • \(I-H\): transform \(y\) to \(y-\hat{y} \bot\) span
  • claim: trace(\(I-H\)) = \(N-(d+1)\)

An Illustrative ‘Proof’

  • if \(y\) comes from some ideal \(f(X) \in\) span, plus noise
  • the noise is transformed by \(I-H\) into \(y-\hat{y}\)

\[ E_{in}(w_{LIN}) = \frac{1}{N} \begin{Vmatrix} y - \hat{y} \end{Vmatrix}^2 = \frac{1}{N} \begin{Vmatrix} (I-H)\text{noise} \end{Vmatrix}^2 = \frac{1}{N}(N-(d+1)) \begin{Vmatrix} \text{noise} \end{Vmatrix}^2 \]

We have:

\[ \overline{E_{in}} = \text{noise level} \cdot (1 - \frac{d+1}{N}) \]

\[ \overline{E_{out}} = \text{noise level} \cdot (1 + \frac{d+1}{N}) \]

The Learning Curve

  • Both converge to \(\sigma^2\) (noise level) for \(N \rightarrow \infty\)
  • expected generalization error: \(\frac{2(d+1)}{N}\)
    • similar to worst-case guarantee from VC
  • linear regression (LinReg): learning happened!

Linear Regression for Binary Classification

Linear Classification vs. Linear Regression

  • Linear Classification:
    • \(\mathcal{Y} = \{-1, +1\}\)
    • \(h(x) = \text{sign}(w^Tx)\)
    • \(\text{err}(\hat{y}, y) = [\hat{y} \ne y]\)
    • NP-hard to solve in general
  • Linear Regression
    • \(\mathcal{Y} = \mathbb{R}\)
    • \(h(x) = (w^Tx)\)
    • \(\text{err}(\hat{y}, y) = (\hat{y} - y)^2\)
    • efficient analytic solution
  • \(\{-1, +1\} \subset \mathbb{R}\): linear regression for classification?
    • run LinReg on binary classification data \(\mathcal{D}\) (efficient)
    • return \(g(x) = \text{sign}(w_{LIN}^Tx)\)
  • explanation of this heuristic?

Relation of Two Errors

  • \(\text{err}_{0/1} = [sign(w^Tx) \ne y]\)
  • \(\text{err}_{sqr} = (w^Tx-y)^2\)
  • \(\text{err}_{0/1} \leq \text{err}_{sqr}\)

Linear Regression for Binary Classification

  • \(\text{err}_{0/1} \leq \text{err}_{sqr}\)

\[ \text{classification} E_{out}(w) \overset{VC}{\leq} \text{classification} E_{in}(w) + \sqrt{\dots\dots} \leq \text{regression} E_{in}(w) + \sqrt{\dots\dots} \]

  • (loose) upper bound \(err_{sqr}\) as \(\widehat{err}\) to approximate \(\text{err}_{0/1}\)
    • if we have small enough \(err_{sqr}\), the \(\text{classification} E_{out}\) will also be small
  • trade bound tightness for efficiency
  • \(w_{LIN}\): useful baseline classifier, or as initial PLA/pocket vector

Conclusions

  • Linear Regression Problem
    • use hyperplane to approximate real values
  • Linear Regression Algorithm
    • analytic solution with pseudo-inverse
  • Generalization Issue
    • \(E_{out} - E_{in} \approx \frac{2(d+1)}{N}\) on average
  • Linear Regression for Binary Classification
    • 0/1 error \(\leq\) squared error

Lecture 10 Logistic Regression

Logistic Regression Problem

Heart Attack Prediction Problem

binary classification: heart disease? yes or no ?

  • ideal \(f(x) = \text{sign}(P(+1|x) - 0.5) \in \{-1, +1\}\)

‘Soft’ binary classification: similar problem heart attack? 80% risk

  • \(f(x) = P(+1|x) \in [0, 1]\)

Soft Binary Classification

  • target function \(f(x) = P(+1|x)\)

  • ideal (noiseless) data
    • \((x_1, y_1' = 0.9 = P(+1|x_1))\)
    • \((x_2, y_2' = 0.2 = P(+1|x_2))\)
    • \((x_N, y_N' = 0.6 = P(+1|x_N))\)
  • actual (noisy) data
    • \((x_1, y_1' = 1)\), with \(y_1' \sim P(y|x_1)\)
    • \((x_2, y_2' = 0)\), with \(y_2' \sim P(y|x_2)\)
    • \((x_N, y_N' = 0)\), with \(y_N' \sim P(y|x_N)\)
  • We have same data as hard binary Classification, but different target function

Logistic Hypothesis

  • For \(x = (x_0, x_1, \dots, x_d)\) ‘features of patient’, calculate a weighted ‘risk score’:

\[ s = \sum_{i=0}^{d} w_i x_i \]

  • convert the score to estimated probability by logistic function \(\theta(s)\)
  • logistic hypothesis: \(h(x) = \theta(w^T x)\)

Logistic Function

\[ \theta(s) = \frac{e^s}{1+ e^s} = \frac{1}{1+e^{-s}} \]

  • a smooth, monotonic, sigmoid (S-shaped) function of \(s\)
  • logistic regression: use the following to approximate target function \(f(x) = P(y|x)\)

\[ h(x) = \frac{1}{1+ exp(-w^T x)} \]

Logistic Regression Error

Three Linear Models

  • Linear Classification
    • \(h(x) = \text{sign}(s)\)
    • plausible err = 0/1 (small flipping noise)
  • Linear Regression
    • \(h(x) = s\)
    • friendly err = squared (easy to minimize)
  • Logistic Regression
    • \(h(x) = \theta(s)\)
    • err = ?

Likelihood

\[ f(x) = P(+1|x) \iff P(y|x) = \begin{cases} f(x) & \quad \text{for } y = +1 \\ 1-f(x) & \quad \text{for } y = -1 \\ \end{cases} \]

  • Consider \(\mathcal{D} = \{(x_1, \circ), (x_2, \times), \dots, (x_N, \times)\}\)
  • probability that \(f\) generates \(\mathcal{D}\) is

\[ P(x_1)P(\circ | x_1) \times P(x_2)P(\times | x_2) \times \dots \times P(x_N)P(\times | x_N) \]

  • rewrite the above using \(f\): \(P(\circ|x) = f(x)\) and \(P(\times|x) = 1 - f(x)\)

\[ P(x_1) f(x_1) \times P(x_2) (1 - f(x_2)) \times \dots \times P(x_N) (1 - f(x_N)) \]

  • Likelihood that \(h\) generates \(\mathcal{D}\) is

\[ P(x_1) \times h(x_1) P(x_2) (1 - h(x_2)) \times \dots \times P(x_N) (1 - h(x_N)) \]

  • if \(h \approx f\), then likelihood(\(h\)) \(\approx\) probability using \(f\)
  • probability using \(f\) usually large

Likelihood of Logistic Hypothesis

  • likelihood(\(h\)) \(\approx\) probability using \(f \approx\) large
  • \(g = \underset{h}{\text{argmax}} \text{likelihood}(h)\)
  • when logistic: \(h(x) = \theta(w^Tx)\)
    • \(1-h(x) = h(-x)\)

\[ \text{likelihood}(h) = P(x_1) h(+x_1) P(x_2) h(-x_2) \cdots P(x_N) h(-x_N) \]

So we have

\[ \text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n) \]

Cross-Entropy Error

\[ \underset{w}{\text{max}}\ \text{likelihood } (w) \propto \prod_{n=1}^{N} \theta(y_n w^T x_n) \]

\[ \underset{w}{\text{max}}\ \text{ln } \prod_{n=1}^{N} \theta(y_n w^T x_n) \]

\[ \underset{w}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} - \text{ln } \theta(y_n w^T x_n) \]

\[ \underset{w}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} \text{ln } (1 + exp(-y_n w^T x_n)) \]

\[ \underset{w}{\text{min}}\ \frac{1}{N} \underbrace{\sum_{n=1}^{N} \text{err}(w, x_n, y_n)}_{E_{in}(w)} \]

  • cross-entropy error: \(\text{err}(w, x_n, y_n) = \text{ln } (1 + exp(-y_n w^T x_n))\)

Gradient of Logistic Regression Error

Minimizing \(E_{in}(w)\)

  • \(E_{in}(w)\): continuous, differentiable, twice-differentiable, convex
  • how to minimize? locate valley, want \(\nabla E_{in}(w) = 0\)
  • first: derive \(\nabla E_{in}(w)\)

The Gradient \(\nabla E_{in}(w)\)

\[ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \text{ln } (1 + exp(-y_n w^T x_n)) \]

\[ \nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} \theta(-y_n w^T x_n) (-y_n x_n) \]

Note that in Andrew Ng’s course, \(\nabla E_{in}(w)\) is written as below. It is actually the same as the expression above: substitute the two cases \(y_n = 1\) and \(y_n = -1\), and use the property \(h(-x) = 1 - h(x)\). Note, however, that in that course \(y_n\) is 1 or 0, not \(\pm 1\).

\[ \nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} (h(x_n) - y_n) x_n \]

Minimizing \(E_{in}(w)\)

\[ \nabla E_{in}(w) = 1 / N \sum_{n=1}^{N} \theta(-y_n w^T x_n) (-y_n x_n) = 0 \]

  • a scaled, \(\theta\)-weighted sum of \(-y_n x_n\)
    • all \(\theta(-y_n w^T x_n) = 0\) only if \(y_n w^T x_n \gg 0\): a linearly separable \(\mathcal{D}\)
    • weighted sum \(= 0\): a non-linear equation in \(w\), with no closed-form solution

PLA revisited: iterative optimization

PLA: start from some \(w_0\), and correct its mistakes on \(\mathcal{D}\)

For \(t = 0, 1, \dots\)

  1. find a mistake of \(w_t\) called \((x_{n(t)}, y_{n(t)})\)

\[ \text{sign}(w_t^Tx_{n(t)}) \ne y_{n(t)} \]

  2. try to correct the mistake by

\[ w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)} \]

when stop, return last \(w\) as \(g\)

For \(t = 0, 1, \dots\)

  1. (equivalently) pick some \(n\), and update \(w_t\) by

\[ w_{t+1} \leftarrow w_t + \underbrace{1}_{\eta} \cdot \underbrace{[\text{sign}(w_t^Tx_n) \ne y_n]\, y_n x_n}_{v} \]

when stop, return last \(w\) as \(g\)

  • \(v\): the update direction
  • \(\eta\): how big a step to take in that direction
  • choice of \((\eta, v)\) and stopping condition defines iterative optimization approach

Gradient Descent

Iterative Optimization

For \(t = 0, 1, \dots\)

\[ w_{t+1} \leftarrow w_t + \eta v \]

when stop, return last \(w\) as \(g\)

  • PLA: \(v\) comes from mistake correction
  • smooth \(E_{in}(w)\) for logistic regression: choose \(v\) to get the ball roll ‘downhill’
    • direction \(v\): (assumed) of unit length
  • a greedy approach for some given \(\eta > 0\):

\[ \underset{|v| = 1}{\text{min}} E_{in} (\underbrace{w_t + \eta v}_{w_{t+1}}) \]

Linear Approximation

  • still non-linear optimization, now with constraints
    • not any easier than \(\text{min}_w E_{in} (w)\)
  • local approximation by linear formula makes problem easier.

\[ E_{in} (w_t + \eta v) \approx E_{in}(w_t) + \eta v^T \nabla E_{in}(w_t) \]

  • if \(\eta\) really small (Taylor expansion)

an approximate greedy approach for some given small \(\eta\):

\[ \underset{|v| = 1}{\text{min}} E_{in} (w_t) + \eta v^T \nabla E_{in}(w_t) \]

Gradient Descent

  • optimal \(v\):

\[ v = - \frac{\nabla E_{in}(w_t)}{|\nabla E_{in}(w_t)|} \]

  • gradient descent: for small \(\eta\)

\[ w_{t+1} \leftarrow w_t - \eta \frac{\nabla E_{in}(w_t)}{|\nabla E_{in}(w_t)|} \]

  • a simple and popular optimization

Choice of \(\eta\)

  • too small: too slow
  • too large: too unstable
  • just right: use changing \(\eta\)

\(\eta\) had better be monotonic in \(\begin{Vmatrix}\nabla E_{in}(w_t)\end{Vmatrix}\)

Simple heuristic for changing \(\eta\): make the step scale with \(\begin{Vmatrix}\nabla E_{in}(w_t)\end{Vmatrix}\) by absorbing the norm into a fixed \(\eta\):

\[ w_{t+1} \leftarrow w_t - \eta \nabla E_{in}(w_t) \]

  • this \(\eta\) is called the fixed learning rate

Putting Everything Together

Initialize \(w_0\)

For \(t = 0, 1, \dots\)

  1. compute

\[ \nabla E_{in}(w_t) = 1 / N \sum_{n=1}^{N} \theta(-y_n w_t^T x_n) (-y_n x_n) \]

  2. update by

\[ w_{t+1} \leftarrow w_t - \eta \nabla E_{in}(w_t) \]

… until \(\nabla E_{in}(w_t) \approx 0\) or for enough iterations.

  • Similar time complexity to pocket per iteration
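A minimal Python sketch of the loop above (assuming numpy, labels in \(\{-1, +1\}\), and illustrative choices of \(\eta\) and \(T\)):

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def logreg_gd(X, y, eta=0.1, T=1000):
    """Fixed-learning-rate gradient descent for logistic regression (sketch).

    X: (N, d+1) with x_0 = 1 prepended; y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient: (1/N) sum_n theta(-y_n w^T x_n) (-y_n x_n)
        grad = (theta(-y * (X @ w)) * -y) @ X / len(y)
        w = w - eta * grad
    return w
```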

Conclusions

  • Logistic Regression Problem
    • \(P(+1|x)\) as target and \(\theta(w^Tx)\) as hypothesis
  • Logistic Regression Error
    • cross-entropy (negative log likelihood)
  • Gradient of Logistic Regression Error
    • \(\theta\)-weighted sum of data vectors
  • Gradient Descent
    • roll downhill by \(-\nabla E_{in}(w_t)\)

Lecture 11 Linear Models for Classification

Linear Models for Binary Classification

Linear Models Revisited

linear scoring function: \(s = w^Tx\)

  • Linear Classification
    • \(h(x) = \text{sign}(s)\)
    • plausible err = 0/1 (discrete \(E_{in}(w)\) NP hard to solve)
  • Linear Regression
    • \(h(x) = s\)
    • friendly err = squared (quadratic convex \(E_{in}(w)\): closed form solution)
  • Logistic Regression
    • \(h(x) = \theta(s)\)
    • plausible err = cross-entropy (smooth convex \(E_{in}(w)\): gradient descent)

Can linear regression or logistic regression help linear classification?

  • linear classification is hard to solve, but the other two are easy
  • can we use the other two in place of the first?

Error Functions Revisited

linear scoring function: \(s = w^Tx\)

for binary classification, \(\mathcal{Y} = \{-1, +1\}\)

  • Linear Classification
    • \(h(x) = \text{sign}(s)\)
    • err\((h, x, y) = [h(x) \ne y]\)
    • \(\text{err}_{0/1}(s, y) = [\text{sign}(s) \ne y] = [\text{sign}(ys) \ne 1]\)
  • Linear Regression
    • \(h(x) = s\)
    • err\((h, x, y) = (h(x) - y)^2\)
    • \(\text{err}_{SQR}(s, y) = (s - y)^2 = (ys-1)^2\)
  • Logistic Regression
    • \(h(x) = \theta(s)\)
    • err\((h, x, y) = -\text{ln } h(yx)\)
    • \(\text{err}_{CE}(s, y) = \text{ln} (1+exp(-ys))\)
  • \((ys)\): classification correctness score

Visualizing Error Functions

  • \(0/1\): 1 if and only if \(ys \leq 0\)
  • SQR: large if \(ys \ll 1\), but it over-charges \(ys \gg 1\); small \(\text{err}_{SQR} \rightarrow\) small \(\text{err}_{0/1}\)
  • CE: monotonic in \(ys\); small \(\text{err}_{CE} \iff\) small \(\text{err}_{0/1}\)
  • scaled CE: a proper upper bound of 0/1; small scaled CE \(\iff\) small \(\text{err}_{0/1}\)
  • upper bound: useful for designing algorithmic error \(\widehat{err}\)

Theoretical Implication of Upper Bound

  • For any \(ys\) where \(s = w^T x\)

\[ \begin{split} \text{err}_{0/1}(s, y) \leq \text{err}_{SCE}(s, y) = \frac{1}{ln2} \text{err}_{CE}(s, y) \\ \iff \\ E_{in}^{0/1}(w) \leq E_{in}^{SCE}(w) = \frac{1}{ln2} E_{in}^{CE}(w) \\ E_{out}^{0/1}(w) \leq E_{out}^{SCE}(w) = \frac{1}{ln2} E_{out}^{CE}(w) \\ \end{split} \]

  • VC on 0/1

\[ E_{out}^{0/1}(w) \leq E_{in}^{0/1}(w) + \Omega^{0/1} \leq \frac{1}{ln2} E_{in}^{CE}(w) + \Omega^{0/1} \]

  • VC on CE:

\[ E_{out}^{0/1}(w) \leq \frac{1}{ln2} E_{out}^{CE}(w) \leq \frac{1}{ln2} E_{in}^{CE}(w) + \frac{1}{ln2} \Omega^{CE} \]

  • Either way, we have

\[ \text{small } E_{in}^{CE}(w) \Rightarrow \text{small } E_{out}^{0/1}(w) \]

  • logistic/linear reg. for linear classification.

Regression for Classification

  1. run logistic/linear reg. on \(\mathcal{D}\) with \(y_n \in \{-1, +1\}\) to get \(w_{reg}\)
  2. return \(g(x) = \text{sign}(w_{reg}^T x)\)
  • linear regression
    • pros: ‘easiest’ optimization
    • cons: loose bound of err 0/1 for large \(|ys|\)
  • logistic regression
    • pros: ‘easy’ optimization
    • cons: loose bound of err 0/1 for very negative \(ys\)
  • PLA
    • pros: efficient + strong guarantee if linear separable
    • cons: works only if linear separable, otherwise needing pocket heuristic
  • linear regression sometimes used to set \(w_0\) for PLA/pocket/logistic regression.
  • logistic regression often preferred over pocket.

Stochastic Gradient Descent

Two Iterative Optimization Schemes

For \(t = 0, 1, \dots\)

\[ w_{t+1} \leftarrow w_t + \eta v \]

when stop, return last \(w\) as \(g\)

  • PLA: pick \((x_n, y_n)\) and decide \(w_{t+1}\) by one example
    • \(O(1)\) time per iteration.
  • Logistic regression (pocket)
    • check \(\mathcal{D}\) and decide \(w_{t+1}\) by all examples
    • \(O(N)\) time per iteration.

Is it possible to make logistic regression \(O(1)\)?

Logistic Regression Revisited

\[ w_{t+1} \leftarrow w_t + \eta \frac{1}{N} \sum_{n=1}^{N} \underbrace{\theta(-y_n w_t^{T} x_n) (y_n x_n)}_{- \nabla E_{in}(w_t)} \]

  • want: update direction \(v \approx - \nabla E_{in}(w_t)\) while computing \(v\) by one single \((x_n, y_n)\)

  • technique for removing \(\frac{1}{N} \sum_{n=1}^{N}\):
    • view it as an expectation \(\mathcal{E}\) over a uniform choice of \(n\)!
    • stochastic gradient: \(\nabla_w \text{err}(w, x_n, y_n)\) with random \(n\)
    • true gradient: \(\nabla_w E_{in}(w) = \underset{\text{random } n}{\mathcal{E}}\ \nabla_w \text{err}(w, x_n, y_n)\)

Stochastic Gradient Descent (SGD)

Stochastic gradient = true gradient + zero-mean ‘noise’ directions

  • idea: replace true gradient by stochastic gradient
  • after enough steps, average true gradient \(\approx\) average stochastic gradient
  • pros: simple and cheaper computation
    • useful for big data and online learning
  • cons: less stable by nature.

SGD logistic regression looks familiar?

\[ w_{t+1} \leftarrow w_t + \eta \underbrace{\theta(-y_n w_t^{T} x_n) (y_n x_n)}_{- \nabla \text{err}(w, x_n, y_n)} \]
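A matching SGD sketch (same conventions as the gradient-descent sketch above; one uniformly random example per update, stopping after a fixed number of steps):

```python
import numpy as np

def logreg_sgd(X, y, eta=0.1, T=10_000, seed=0):
    """SGD logistic regression: one random example per step (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    theta = lambda s: 1.0 / (1.0 + np.exp(-s))
    for _ in range(T):                    # stop when t is 'large enough'
        n = rng.integers(len(y))          # uniform random n
        w = w + eta * theta(-y[n] * (X[n] @ w)) * y[n] * X[n]
    return w
```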

PLA Revisited

SGD logistic regression:

\[ w_{t+1} \leftarrow w_t + \eta \cdot \theta (-y_n w_t^{T} x_n) (y_n x_n) \]

PLA:

\[ w_{t+1} \leftarrow w_t + 1 \cdot [y_n \ne \text{sign}(w_t^T x_n)] (y_n x_n) \]

  • SGD logistic regression \(\approx\) ‘soft’ PLA
    • more wrong, bigger update; less wrong, smaller update
  • PLA \(\approx\) SGD logistic regression with \(\eta = 1\) when \(|w_t^T x_n|\) is large.

Two practical rules of thumb:

  • stop condition? \(t\) large enough
  • \(\eta\)? 0.1 when \(x\) in proper range

Multiclass via Logistic Regression

Multiclass Classification

  • \(\mathcal{Y} = \{\star, \diamond, \circ, \times\}\): 4-class classification
  • many applications in practice, especially for ‘recognition’
  • idea: use the tool for \(\{\times, \circ\}\) on \(\{\star, \diamond, \circ, \times\}\)

One Class at a time

  • \(\square\) or not? \(\{\square = \circ,\ \diamond = \triangle = \star = \times\}\)
  • \(\diamond\) or not? \(\{\diamond = \circ,\ \square = \triangle = \star = \times\}\)
  • \(\triangle\) or not? \(\{\triangle = \circ,\ \square = \diamond = \star = \times\}\)
  • \(\star\) or not? \(\{\star = \circ,\ \square = \triangle = \diamond = \times\}\)

Multiclass Prediction: Combine Binary Classifiers

  • some regions are unambiguously assigned one class
  • but what about ties, or regions claimed by no classifier?

One Class at a time softly

  • \(P(\square|x)? \quad \{\square = \circ, \diamond = \triangle = \star = \times\}\)
  • \(P(\diamond|x)? \quad \{\diamond = \circ, \square = \triangle = \star = \times\}\)
  • \(P(\triangle|x)? \quad \{\triangle = \circ, \square = \diamond = \star = \times\}\)
  • \(P(\star|x)? \quad \{\star = \circ, \square = \triangle = \diamond = \times\}\)

Multiclass Prediction: Combine Binary Classifiers

\[ g(x) = \text{argmax}_{k \in \mathcal{Y}} \theta(w_{[k]}^Tx) \]

OR, equivalently

\[ g(x) = \text{argmax}_{k \in \mathcal{Y}} w_{[k]}^Tx \]

One-Versus-All (OVA) Decomposition

  1. for \(k \in \mathcal{Y}\), obtain \(w_{[k]}\) by running logistic regression on
    • \(\mathcal{D}_{[k]} = \{(x_n, y'_n = 2[y_n = k] - 1)\}_{n=1}^N\)
  2. return \(g(x) = \text{argmax}_{k \in \mathcal{Y}} (w_{[k]}^T x)\)
  • pros: efficient, can be coupled with any logistic regression-like approaches.
  • cons: often unbalanced \(\mathcal{D}_{[k]}\) when \(K\) large
    • logistic regression will produce naive predictor: predict everything to be \(\times\).
  • extension: multinomial (‘coupled’) logistic regression.

OVA: a simple Multiclass meta-algorithm to keep in your toolbox.
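A sketch of OVA as a meta-algorithm (here train_binary is a stand-in for any logistic-regression-like trainer returning a weight vector, e.g. the logreg_gd sketch from Lecture 10):

```python
import numpy as np

def ova_train(X, y, classes, train_binary):
    """One-versus-all: one binary problem per class k (sketch)."""
    # relabel: y'_n = 2[y_n = k] - 1, i.e. +1 for class k, -1 otherwise
    return {k: train_binary(X, np.where(y == k, 1, -1)) for k in classes}

def ova_predict(models, X):
    """Predict argmax_k w_[k]^T x (theta is monotone, so it can be skipped)."""
    classes = list(models)
    scores = np.stack([X @ models[k] for k in classes], axis=1)  # (N, K)
    return np.array([classes[i] for i in scores.argmax(axis=1)])
```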

Multiclass via Binary Classification

Source of Unbalance: One Versus All

idea: make binary classification problems more balanced by one versus one.

One versus One at a Time

  • \(\square\) or \(\diamond\)? \(\{\square = \circ,\ \diamond = \times,\ \star = \triangle = \text{nil}\}\)

Multiclass Prediction: Combine Pairwise Classifiers

\[ g(x) = \text{tournament champion } \{w_{[k, l]}^T x\} \]

voting of classifiers.

One-versus-one OVO Decomposition

  1. for \([k, l] \in \mathcal{Y} \times \mathcal{Y}\), obtain \(w_{[k, l]}\) by running logistic regression on
    • \(\mathcal{D}_{[k, l]} = \{(x_n, y'_n = 2[y_n = k] - 1): y_n = k \text{ or } y_n = l\}\)
  2. return \(g(x) = \text{tournament champion } \{w_{[k, l]}^T x\}\)
  • pros: efficient (‘smaller’ Training problems), stable, can be coupled with any binary classification approaches.
  • cons: uses \(O(K^2)\) classifiers \(w_{[k, l]}\)
    • more space, slower prediction, more training

OVO: another simple Multiclass meta-algorithm to keep in your toolbox.

  • Very handy and stable!

Conclusions

  • Linear Models for Binary Classification
    • three models useful in different ways
  • Stochastic Gradient Descent
    • follow negative stochastic gradient
  • Multiclass via Logistic Regression
    • predict with maximum estimated P(k|x)
  • Multiclass via Binary Classification
    • predict the tournament champion

Lecture 12 Nonlinear Transformation

Quadratic Hypothesis

Linear Hypothesis

  • visually: ‘line’-like boundary
  • mathematically: linear scores \(s = w^T x\)

benefits and limitations:

  • theoretically: \(d_{vc}\) under control, so generalization is guaranteed
  • practically: on some \(\mathcal{D}\), large \(E_{in}\) for every line.

How to break the limit of linear hypothesis?

Circular Separable

  • \(\mathcal{D}\) not linear separable
  • but circular separable by a circle of radius \(\sqrt{0.6}\) centered at origin:

\[ h_{SEP}(x) = \text{sign } (-x_1^2 - x_2^2 + 0.6) \]

re-derive Circular-PLA, Circular-Regression, blahblah … all over again?

Circular Separable and Linear Separable

\[ h(x) = \text{sign } ( \underbrace{0.6}_{\tilde{w_0}} \cdot \underbrace{1}_{\tilde{z_0}} + \underbrace{(-1)}_{\tilde{w_1}} \cdot \underbrace{x_1^2}_{\tilde{z_1}} + \underbrace{(-1)}_{\tilde{w_2}} \cdot \underbrace{x_2^2}_{\tilde{z_2}} ) = \text{sign } ( \tilde{w}^T z ) \]

  • \(\{(x_n, y_n)\}\) circular separable \(\implies \{(z_n, y_n)\}\) linear separable.
  • \(x \in \mathcal{X} \overset{\Phi}{\mapsto} z \in \mathcal{Z}\): nonlinear feature transform \(\Phi\).
  • circular separable in \(\mathcal{X} \mapsto\) linear separable in \(\mathcal{Z}\)
  • What about vice versa?

Linear Hypothesis in Z space

\[ (z_0, z_1, z_2) = z = \Phi(x) = (1, x_1^2, x_2^2) \]

\[ h(x) = \tilde{h}(z) = \text{sign}(\tilde{w}^T \Phi(x)) = \text{sign} (\tilde{w_0} + \tilde{w_1}x_1^2 + \tilde{w_2}x_2^2) \]

\((\tilde{w_0}, \tilde{w_1}, \tilde{w_2})\)

  • \((0.6, -1, -1)\): circle (\(\circ\) inside)
  • \((-0.6, +1, +1)\): circle (\(\circ\) outside)
  • \((0.6, -1, -2)\): ellipse
  • \((0.6, -1, +2)\): hyperbola
  • \((0.6, +1, +2)\): constant \(\circ\)

line in \(\mathcal{Z} \iff\) special quadratic curves in \(\mathcal{X}\) space.

  • E.g., the center of a circle must be the origin

General Quadratic Hypothesis set

a bigger \(\mathcal{Z}\) with \(\Phi_2(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\)

Perceptrons in \(\mathcal{Z}\) space \(\iff\) quadratic hypothesis in \(\mathcal{X}\) space

\[ \mathcal{H}(\Phi_2) = \{h(x): h(x) = \tilde{h}(\Phi_2(x)) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z}\} \]

  • Can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola

\[ \text{ellipse } 2 (x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1 \]

\[ \implies \tilde{w}^T = (33, -20, -4, 3, 2, 3) \]

  • Includes lines and constants as degenerate cases
  • next: learn a good quadratic hypothesis \(g\)

Nonlinear Transform

Good Quadratic Hypothesis

  • \(\mathcal{Z}\) space perceptrons \(\iff\) \(\mathcal{X}\) quadratic hypothesis
  • good perceptron \(\iff\) good quadratic hypothesis
  • separating perceptron \(\iff\) separating quadratic hypothesis
  • want: good perceptron in \(\mathcal{Z}\) space
  • know: get good perceptron in \(\mathcal{X}\) space with data \((x_n, y_n)\)
  • todo: get good perceptron in \(\mathcal{Z}\) space with data \((z_n = \Phi_2(x_n), y_n)\)

Nonlinear Transform Steps

  • transform original data \((x_n, y_n)\) to \((z_n = \Phi_2(x_n), y_n)\) by \(\Phi\)
  • Get a good perceptron \(\tilde{w}\) using \((z_n, y_n)\) and your favorite linear classification algorithm \(\mathcal{A}\)
  • return \(g(x) = \text{sign}(\tilde{w}^T \Phi(x))\)
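A sketch of these steps for \(\Phi_2\) on two-dimensional inputs (train_linear is a stand-in for any linear classification algorithm, e.g. the pla sketch from Week 2):

```python
import numpy as np

def phi2(X):
    """Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2) for 2-dimensional inputs."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2], axis=1)

def nonlinear_classifier(X, y, train_linear):
    """Transform, train linearly in Z-space, return g(x) = sign(w^T Phi(x))."""
    w_tilde = train_linear(phi2(X), y)
    return lambda X_new: np.sign(phi2(X_new) @ w_tilde)
```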

Nonlinear Model via Nonlinear \(\Phi\) + Linear Models

  • Two choices: feature transform \(\Phi\)
  • Linear model \(\mathcal{A}\), not just binary classification
  • Pandora’s box: can now freely do quadratic PLA, quadratic regression, cubic, …, polynomial regression.

Feature Transform

  • not new, not just polynomial:
  • In digits recognition:
    • raw (pixels) \(\overset{\text{domain knowledge}}{\implies}\) concrete (intensity, symmetry)
  • the force: too good to be true?

Price of Nonlinear Transform

Computation/Storage Price

\(Q\)-th order polynomial transform:

\[ \begin{split} ( 1, \\ x_1, x_2, \dots, x_d, \\ x_1^2, x_1x_2, \dots, x_d^2, \\ \dots, \\ x_1^Q, x_1^{Q-1}x_2, \dots, x_d^Q) \end{split} \]

  • \(\underbrace{1}_{\tilde{w_0}} + \underbrace{\tilde{d}}_{\text{others}}\) dimensions = # ways of \(\leq Q\)-combination for \(d\) kinds with repetitions
  • = \(\binom{Q+d}{Q} = \binom{Q+d}{d} = O(Q^d)\)
  • = effort needed to compute/store \(z = \Phi_{Q}(x)\) and the weights \(\tilde{w}\)
  • \(Q\) large \(\implies\) difficult to compute/store

Model Complexity Price

  • \(\underbrace{1}_{\tilde{w_0}} + \underbrace{\tilde{d}}_{\text{others}}\) dimensions = \(O(Q^d)\)
  • number of free parameters \(\tilde{w_i} = \tilde{d} + 1 \approx d_{VC}(\mathcal{H}_{\Phi_Q})\)
  • \(d_{VC}(\mathcal{H_{\Phi_{Q}}}) \leq \tilde{d} + 1\), why?
    • any \(\tilde{d} + 2\) inputs not shattered in \(\mathcal{Z}\)
    • \(\implies\) any \(\tilde{d} + 2\) inputs not shattered in \(\mathcal{X}\)
  • \(Q\) large \(\implies\) large \(d_{VC}\)

Generalization Issue

  • Which do you prefer?
    • \(\Phi_1\) (original \(x\)), visually preferable.
    • \(\Phi_4\), \(E_{in}(g) = 0\) but overkill.
  1. Can we make sure that \(E_{in}(g)\) is close enough to \(E_{out}(g)\)?
  2. Can we make \(E_{in}(g)\) small enough?
  • Higher \(\tilde{d}(Q)\), good for 2, bad for 1
  • Lower \(\tilde{d}(Q)\), good for 1, bad for 2
  • how to pick \(Q\). visually, maybe?

Danger of Visual Choices

  • First of all, can you really ‘visualize’ when \(\mathcal{X} = \mathbb{R}^{10}\)?

Visualize \(\mathcal{X} = \mathbb{R}^2\)

  • full \(\Phi_2: z = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\), \(d_{VC} = 6\)
  • or \(z = (1, x_1^2, x_2^2)\), \(d_{VC} = 3\)?
  • or better \(z = (1, x_1^2 + x_2^2)\), \(d_{VC} = 2\)?
  • or even better \(z = (\text{sign}(0.6 - x_1^2 - x_2^2))\)?

We use our own brains to pick the transform; that is human learning, not machine learning, and it is biased by our subjective look at the data.

Be careful about your brain's ‘model complexity’: the VC dimension only looks small because the VC dimension of the model in your head was never counted, so the results are too optimistic.

  • For VC-safety, \(\Phi\) shall be decided without ‘peeking’ data.

Structured Hypothesis Sets

Polynomial Transform Revisited

\[ \begin{split} \Phi_0(x) = 1 \\ \Phi_1(x) = (\Phi_0(x), \quad x_1, x_2, \dots, x_d) \\ \Phi_2(x) = (\Phi_1(x), x_1^2, x_1x_2, \dots, x_d^2) \\ \dots \dots \\ \Phi_Q(x) = (\Phi_{Q-1}(x), x_1^Q, x_1^{Q-1}x_2, \dots, x_d^Q) \end{split} \]

\[ \mathcal{H}_{\Phi_0} \subset \mathcal{H}_{\Phi_1} \subset \mathcal{H}_{\Phi_2} \subset \cdots \subset \mathcal{H}_{\Phi_Q} \]

Structured Hypothesis Sets

\[ \mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \]

\[ d_{VC}(\mathcal{H}_0) \leq d_{VC}(\mathcal{H}_1) \leq d_{VC}(\mathcal{H}_2) \leq \cdots \]

\[ E_{in}(g_0) \geq E_{in}(g_1) \geq E_{in}(g_2) \geq \cdots \]

  • Using a high-dimensional transform from the beginning is not a good idea.

Linear Model First

  • tempting sin: use \(\mathcal{H}_{1126}\) and report a low \(E_{in}(g_{1126})\) to fool your boss
    • really? :( a dangerous path of no return
  • safe route: \(\mathcal{H}_1\) first
    • if \(E_{in}(g_1)\) is good enough, live happily thereafter :)
    • otherwise, move rightward along the structure with nothing lost except ‘wasted’ computation
  • linear model first: simple, efficient, safe, and workable

Conclusions

  • Quadratic Hypothesis
    • linear hypothesis on quadratic-transformed data
  • Nonlinear Transform
    • happy linear modeling after \(\mathcal{Z} = \Phi(\mathcal{X})\)
  • Price of Nonlinear Transform
    • computation/storage/model complexity
  • Structured Hypothesis Sets
    • linear/simple model first

Lecture 13 Hazard of Overfitting

Nonlinear (regression, classification) = nonlinear feature transform \(\Phi\) plus linear (regression, classification), at the price of model complexity. However, it can cause overfitting. In this lecture:

What is overfitting?

Bad Generalization

  • regression for \(x \in \mathbb{R}\) with \(N = 5\) examples
  • target \(f(x)\) = 2nd-order polynomial
  • label \(y_n = f(x_n) +\) very small noise
  • linear regression in \(\mathcal{Z}\)-space + \(\Phi\) = 4th order polynomial
  • unique solution passing all examples \(\implies E_{in}(g) = 0\)
  • \(E_{out}(g)\) huge

bad generalization: low \(E_{in}\), high \(E_{out}\).

Bad Generalization and Overfitting

  • take \(d_{VC} = 1126\) for learning: bad generalization
    • \(E_{out} - E_{in}\) large
  • switch from \(d_{VC} = d_{VC}^{\star}\) to \(d_{VC} = 1126\): overfitting
    • \(E_{in}\) down, \(E_{out}\) up
    • hard to fix
  • switch from \(d_{VC} = d_{VC}^{\star}\) to \(d_{VC} = 1\): underfitting
    • \(E_{in}\) up, \(E_{out}\) up
    • easy to fix

bad generalization: low \(E_{in}\), high \(E_{out}\). overfitting: lower \(E_{in}\), higher \(E_{out}\)

Cause of Overfitting: A Driving Analogy

‘good fit’ \(\implies\) ‘overfit’

\[ \begin{array}{l|l} \text{Learning} & \text{Driving} \\ \hline \text{overfit} & \text{commit a car accident} \\ \text{use excessive } d_{VC} & \text{drive too fast} \\ \text{noise} & \text{bumpy road} \\ \text{limited data size } N & \text{limited observations about road condition} \end{array} \]

next: how does noise & data size affect overfitting?

The role of Noise and Data size

Case Study

Two settings: a 10th-order target function plus noise, versus a 50th-order target function with no noise.

How much overfitting when going from the best \(g_2 \in \mathcal{H}_2\) to the best \(g_{10} \in \mathcal{H}_{10}\)?

10-th order target function + noise

\[ \begin{array}{c|c c} & g_2 \in \mathcal{H}_2 & g_{10} \in \mathcal{H}_{10} \\ \hline E_{in} & 0.050 & 0.034 \\ E_{out} & 0.127 & 9.0 \end{array} \]

50-th order target function + noiseless

\[ \begin{array}{c|c c} & g_2 \in \mathcal{H}_2 & g_{10} \in \mathcal{H}_{10} \\ \hline E_{in} & 0.029 & 0.00001 \\ E_{out} & 0.120 & 7680 \end{array} \]

Irony of Two learners

  • Learner \(O\)verfit: pick \(g_{10} \in \mathcal{H}_{10}\)
  • Learner \(R\)estricted: pick \(g_{2} \in \mathcal{H}_{2}\)
  • When both know that the target is 10th-order
    • \(R\) ‘gives up’ some ability to fit

But \(R\) wins in \(E_{out}\) by a lot. Philosophy: retreat in order to advance!

Learning Curves Revisited

  • With \(\mathcal{H}_{2}\), \(E_{in}\) and \(E_{out}\) stay close to each other.
  • With \(\mathcal{H}_{10}\): lower \(E_{out}\) as \(N \to \infty\), but much larger generalization error for small \(N\).
  • Grey area: \(O\) overfits! \(E_{in} \downarrow\), \(E_{out} \uparrow\). Being too clever backfires!
  • \(R\) always wins in \(E_{out}\) if \(N\) is small! Don't overthink it; a simple hypothesis often works better.

No Noise Case

  • Learner \(O\)verfit: pick \(g_{10} \in \mathcal{H}_{10}\)
  • Learner \(R\)estricted: pick \(g_{2} \in \mathcal{H}_{2}\)
  • When both know that there is no noise
    • \(R\) still wins

Is there really no noise? ‘Target complexity’ acts like noise.

Deterministic Noise

A Detailed Experiment

\[ y = f(x) + \epsilon \sim \text{Gaussian} (\sum_{q=0}^{Q_f} \alpha_{q} x^q, \sigma^2) \]

  • Gaussian i.i.d. noise \(\epsilon\) with noise level \(\sigma^2\)
  • some ‘uniform’ distribution over targets \(f\) with complexity level \(Q_{f}\)
  • data size \(N\)

The Overfit Measure

  • \(g_{2} \in \mathcal{H}_{2}\)
  • \(g_{10} \in \mathcal{H}_{10}\)
  • \(E_{in}(g_{10}) \leq E_{in}(g_{2})\) for sure.

Overfit measure \(E_{out}(g_{10}) - E_{out}(g_{2})\)

The Results

  • impact of \(\sigma^2\) versus \(N\)
    • less noise (smaller \(\sigma^2\)) and more data points \(N\): less overfitting
  • impact of \(Q_{f}\) versus \(N\)
    • less complex target function (smaller \(Q_{f}\)) and more data points \(N\): less overfitting
  • the logo of this class
    • middle two figures: overfitting
    • left: nonlinear transformation
    • right: linear learning

Impact of Noise and Data Size

  • impact of \(\sigma^2\) versus \(N\): stochastic noise
  • impact of \(Q_{f}\) versus \(N\): deterministic noise
  • four causes of serious overfitting:
    • data size \(N \downarrow\)
    • stochastic noise \(\uparrow\)
    • deterministic noise \(\uparrow\)
    • excessive power \(\uparrow\), VC dimension is too high

overfitting ‘easily’ happens.

Deterministic Noise

  • if \(f \notin \mathcal{H}\): something of \(f\) cannot be captured by \(\mathcal{H}\)
  • deterministic noise: difference between best \(h^* \in \mathcal{H}\) and \(f\)
  • acts like stochastic noise: not a new idea, cf. pseudo-random generators
  • difference to stochastic noise:
    • depends on \(\mathcal{H}\)
    • fixed for a given x

Philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :)

Dealing with Overfitting

Driving Analogy Revisited

\[ \begin{array}{l|l} \text{Learning} & \text{Driving} \\ \hline \text{overfit} & \text{commit a car accident} \\ \text{use excessive } d_{VC} & \text{drive too fast} \\ \text{noise} & \text{bumpy road} \\ \text{limited data size } N & \text{limited observations about road condition} \\ \text{start from simple model} & \text{drive slowly} \\ \text{data cleaning/pruning} & \text{use more accurate road information} \\ \text{data hinting} & \text{exploit more road information} \\ \text{regularization} & \text{put on the brakes} \\ \text{validation} & \text{monitor the dashboard} \end{array} \]

All very practical techniques to combat overfitting.

Data Cleaning/Pruning

  • if we ‘detect’ the outlier (e.g., the stray ‘5’ at the top of the figure) by
    • being too close to other \(\circ\), or too far from other \(\times\)
    • being wrong according to the current classifier
  • possible action 1: correct the label (data cleaning)
  • possible action 2: remove the example (data pruning)

Possibly helps, but the effect varies.

Data Hinting

  • slightly shifted/rotated digits carry the same meaning
  • possible action: add virtual examples by shifting/rotating the given digits (data hinting)

Possibly helps, but watch out: virtual examples are not \(\overset{iid}{\sim} P(x, y)\)!

Conclusions

  • What is overfitting?
    • lower \(E_{in}\), but higher \(E_{out}\)
  • The role of Noise and Data size
    • overfitting ‘easily’ happens
  • Deterministic Noise
    • what \(\mathcal{H}\) cannot capture acts like noise
  • Dealing with Overfitting
    • data cleaning/pruning/hinting, and more

Lecture 14 Regularization

Overfitting happens with excessive power, stochastic/deterministic noise, and limited data.

Regularized Hypothesis set

Regularization: The Magic

overfit \(\implies\) ‘regularized fit’

  • idea: ‘step back’ from \(\mathcal{H}_{10}\) to \(\mathcal{H}_{2}\)
  • name history: function approximation for ill-posed problems

how to step back?

Stepping Back as Constraint

\(Q\)-th order polynomial transform for \(x \in \mathbb{R}\):

\[ \Phi_{Q}(x) = (1, x, x^2, \dots, x^Q) \]

plus linear regression; for brevity, write \(\tilde{w}\) as \(w\)

  • hypothesis \(w\) in \(\mathcal{H}_{10}\): \(w_0 + w_1 x + w_2 x^2 + \dots + w_{10} x^{10}\)
  • hypothesis \(w\) in \(\mathcal{H}_{2}\): \(w_0 + w_1 x + w_2 x^2\)
  • that is, \(\mathcal{H}_{2} = \mathcal{H}_{10}\) AND the constraint that \(w_3 = \dots = w_{10} = 0\)
  • stepping back = adding constraints

Regression with Constraint

  • \(\mathcal{H}_{10} \equiv \{w \in \mathbb{R}^{10+1}\}\)
    • regression with \(\mathcal{H}_{10}\): \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}}\ E_{in}(w)\)
  • \(\mathcal{H}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while } w_3 = \dots = w_{10} = 0\}\)
    • regression with \(\mathcal{H}_{2}\):
    • \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}}\ E_{in}(w)\)
    • \(\text{s.t. } w_3 = \dots = w_{10} = 0\)

step back = constrained optimization of \(E_{in}\)

Why don’t you just use \(w \in \mathbb{R}^{2+1}\)?

Regression with Looser Constraint

  • \(\mathcal{H}_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while } w_3 = \dots = w_{10} = 0\}\)
    • regression with \(\mathcal{H}_{2}\):
    • \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}}\ E_{in}(w)\)
    • \(\text{s.t. } w_3 = \dots = w_{10} = 0\)

Relax to:

  • \(\mathcal{H}'_{2} \equiv \{w \in \mathbb{R}^{10+1} \text{ while at least } 8 \text{ of } w_{q} = 0\}\)
    • regression with \(\mathcal{H}'_{2}\):
    • \(\underset{w\in\mathbb{R}^{10+1}}{\text{min}}\ E_{in}(w)\)
    • \(\text{s.t. } \sum_{q=0}^{10} [w_q \ne 0] \le 3\)
  • more flexible than \(\mathcal{H}_2\): \(\mathcal{H}_{2} \subset \mathcal{H}'_{2}\)
  • less risky than \(\mathcal{H}_{10}\): \(\mathcal{H}'_{2} \subset \mathcal{H}_{10}\)

Bad news for sparse hypothesis set \(\mathcal{H'}_{2}\): NP-hard to solve.

  • counting nonzero weights is a Boolean, discrete operation.

Regression with Softer Constraint

Relax to:

\[ \begin{split} \mathcal{H}(C) \equiv \{ w \in \mathbb{R}^{10+1} \text{ while } \|w\|^2 \le C \} \\ \underset{w\in\mathbb{R}^{10+1}}{\text{min}}\ E_{in}(w) \quad \text{s.t. } \sum_{q=0}^{10} w_q^2 \le C \end{split} \]

  • \(\mathcal{H}(C)\): overlaps with, but is not exactly the same as, \(\mathcal{H}'_2\)
  • soft and smooth structure over \(C \ge 0\):
    • \(\mathcal{H}(0) \subset \mathcal{H}(1.126) \subset \dots \subset \mathcal{H}(1126) \subset \dots \subset \mathcal{H}(\infty) = \mathcal{H}_{10}\)

regularized hypothesis \(w_{REG}\): optimal solution from regularized hypothesis set \(\mathcal{H}(C)\)

Weight-Decay Regularization

Matrix Form of Regularized Regression Problem

\[ \begin{split} \underset{w\in\mathbb{R}^{Q+1}}{\text{min}} E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T z_n - y_n)^2 \\ \text{s.t. } \sum_{q=0}^{Q} w_q^2 \le C \end{split} \]

  • \(\sum_{n=1}^{N} (w^T z_n - y_n)^2 = (Zw - y)^T(Zw - y)\)
  • \(w^Tw \le C\): feasible \(w\) within a radius-\(\sqrt{C}\) hypersphere

How to solve the constrained optimization problem?

The Lagrange Multiplier

\[ \underset{w\in\mathbb{R}^{Q+1}}{\text{min}}\ E_{in}(w) = \frac{1}{N} (Zw - y)^T(Zw - y) \quad \text{s.t. } w^Tw \le C \]

  • decreasing direction: \(- \nabla E_{in}(w)\)
  • normal vector of \(w^T w = C\): \(w\)
  • if \(- \nabla E_{in}(w)\) and \(w\) not parallel: can decrease \(E_{in}(w)\) without violating the constraint
  • at the optimal solution \(w_{REG}\): \(- \nabla E_{in}(w_{REG}) \propto w_{REG}\)
  • want: find Lagrange multiplier \(\lambda > 0\) and \(w_{REG}\) such that \(\nabla E_{in}(w_{REG}) + \frac{2 \lambda}{N} w_{REG} = 0\)

Augmented Error

  • if an oracle tells you \(\lambda > 0\), then just solve \(\nabla E_{in}(w_{REG}) + \frac{2 \lambda}{N} w_{REG} = 0\):

\[ \frac{2}{N}(Z^T Z w_{REG} - Z^T y) + \frac{2 \lambda}{N} w_{REG} = 0 \]

  • optimal solution:

\[ w_{REG} \leftarrow (Z^T Z + \lambda I)^{-1} Z^T y \]

  • \((Z^T Z + \lambda I)\) always invertible.
  • called ridge regression in Statistics.
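
A minimal numpy sketch of this closed-form solution (the function name is hypothetical):

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """w_REG = (Z^T Z + lambda * I)^{-1} Z^T y.
    Z: N x (Q+1) matrix of transformed inputs, y: N targets, lam >= 0."""
    d = Z.shape[1]
    # np.linalg.solve is preferred over forming the inverse explicitly
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```

With `lam = 0` this reduces to plain linear regression, provided \(Z^T Z\) is invertible.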

Another view, equivalent to minimizing augmented error \(E_{aug}(w)\):

\[ E_{in}(w) + \frac{\lambda}{N} w^{T} w \]

  • regularization with augmented error instead of constrained \(E_{in}\)

\[ w_{REG} \leftarrow \underset{w}{\text{argmin}}\ E_{aug}(w) \quad \text{for given } \lambda > 0 \text{ or } \lambda = 0 \]

  • minimizing unconstrained \(E_{aug}\) effectively minimizes some \(C\)-constrained \(E_{in}\)

The Results

  • \(\lambda = 0\): overfitting
  • \(\lambda = 0.0001\): OK
  • \(\lambda = 0.01\): OK
  • \(\lambda = 1\): underfitting; not enough complexity remains
  • Philosophy: a little regularization goes a long way.

call ‘\(+\frac{\lambda}{N} w^T w\)’ weight-decay regularization:

  • larger \(\lambda \iff\)
  • prefer shorter \(w \iff\)
  • effectively smaller \(C\) (see the sketch below)
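
A small self-contained sweep illustrating this effect (the noisy-sine setup is hypothetical, chosen only to make the trend visible):

```python
import numpy as np

# 20 noisy points from a sine curve, degree-10 polynomial features
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=20)
Z = np.vander(x, 11, increasing=True)  # columns 1, x, ..., x^10

for lam in (0.0, 0.0001, 0.01, 1.0):
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)
    # larger lambda <=> shorter w (effectively smaller C) and higher E_in
    print(f"lambda={lam}: |w| = {np.linalg.norm(w):.2f}, "
          f"E_in = {np.mean((Z @ w - y) ** 2):.4f}")
```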

Weight decay goes with any transform + linear model.

Some details: Legendre Polynomials

\[ \underset{w\in\mathbb{R}^{Q+1}}{\text{min}}\ \frac{1}{N} \sum_{n=1}^{N} (w^T \Phi(x_n) - y_n)^2 + \frac{\lambda}{N} \sum_{q=0}^{Q} w_q^2 \]

naive polynomial transform:

\[ \Phi(x) = (1, x, x^2, \dots, x^Q) \]

When \(x_n \in [-1, +1]\), \(x_n^q\) is really small for large \(q\), so a large \(w_q\) is needed for the term to matter.

  • Since a large \(w_q\) is needed for big \(q\), the regularizer penalizes the high-order dimensions too much.

normalized polynomial transform:

\[ (1, L_1(x), L_2(x), \dots, L_Q(x)) \]

orthonormal basis functions called Legendre Polynomials
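
A minimal sketch comparing the two transforms, using numpy's built-in Legendre basis (the feature-scale comparison is illustrative):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 50)
Q = 10
Z_naive = np.vander(x, Q + 1, increasing=True)       # 1, x, ..., x^Q
Z_legendre = np.polynomial.legendre.legvander(x, Q)  # L_0(x), ..., L_Q(x)

# high-order naive features shrink toward 0 on [-1, +1], so they need huge
# w_q to matter; Legendre features keep a comparable scale across orders
print(np.abs(Z_naive).mean(axis=0)[-3:])
print(np.abs(Z_legendre).mean(axis=0)[-3:])
```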

Regularization and VC Theory

Regularization and VC Theory

  • Regularization by constrained-minimizing \(E_{in}\)

\[ \underset{w}{\text{min}} E_{in}(w) \text{s.t. } w^T w \le C \]

  • Regularization by Minimizing \(E_{aug}\)

\[ \underset{w}{\text{min}} E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^Tw \]

  • VC Guarantee of constrained-minimizing \(E_{in}\)

\[ E_{out}(w) \le E_{in}(w) + \Omega(\mathcal{H}(C)) \]

minimizing \(E_{aug}\): indirectly getting VC guarantee without confining to \(\mathcal{H}(C)\).

Another View of Augmented Error

  • Augmented Error: \(E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^Tw\)
  • VC Bound: \(E_{out}(w) \le E_{in}(w) + \Omega(\mathcal{H}(C))\)

  • regularizer \(w^Tw = \Omega(w)\): complexity of a single hypothesis
  • generalization price \(\Omega(\mathcal{H})\): complexity of a hypothesis set
  • if \(\frac{\lambda}{N}\Omega(w)\) ‘represents’ \(\Omega(\mathcal{H})\) well, \(E_{aug}\) is a better proxy of \(E_{out}\) than \(E_{in}\)

  • minimizing \(E_{aug}\):
    • (heuristically) operating with the better proxy;
    • (technically) enjoying the flexibility of whole \(\mathcal{H}\)

Effective VC Dimension

\[ \underset{w\in\mathbb{R}^{\tilde{d}+1}}{\text{min}}\ E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N}\Omega(w) \]

  • model complexity?
    • \(d_{VC}(\mathcal{H}) = \tilde{d} + 1\), because {\(w\)} ‘all considered’ during minimization
  • {\(w\)} ‘actually needed’: \(\mathcal{H}(C)\), with some \(C\) equivalent to \(\lambda\)
  • \(d_{VC}(\mathcal{H}(C))\): effective VC dimension \(d_{EFF}(\mathcal{H}, \underbrace{\mathcal{A}}_{\text{min} E_{aug}})\)

explanation of regularization: \(d_{VC}(\mathcal{H})\) large, while \(d_{EFF}(\mathcal{H}, \mathcal{A})\) small if \(\mathcal{A}\) regularized.

General Regularizers

General Regularizers \(\Omega(w)\)

want: constraint in the ‘direction’ of target function

  • target-dependent: some properties of target, if known (domain knowledge)
    • symmetry regularizer: \(\sum_{q} [q \text{ is odd}]\, w_q^2\)
  • plausible: direction towards a smoother or simpler fit
    • regularization is to overcome noise
    • stochastic/deterministic noise are both non-smooth
    • sparsity (L1) regularizer: \(\sum |w_q|\)
  • friendly: easy to optimize
    • weight-decay (L2) regularizer: \(\sum w_q^2\)
  • a bad regularizer? no worries, guarded by \(\lambda\)

  • augmented error = error \(\widehat{\text{err}}\) + regularizer \(\Omega\)
  • regularizer: target-dependent, plausible or friendly. ringing a bell?
  • error measure: user-dependent, plausible or friendly.

L2 and L1 Regularizer

  • L2 Regularizer
    • \(\Omega(w) = \sum_{q=0}^{Q} w_q^2 = \|w\|_2^2\)
    • convex, differentiable everywhere
    • easy to optimize
  • L1 Regularizer
    • \(\Omega(w) = \sum_{q=0}^{Q} |w_q| = \|w\|_1\)
    • convex, but not differentiable everywhere
    • sparsity in solution
    • the optimal solution often sits at a corner, meaning some components \(w_q\) are exactly 0.

L1 is useful when a sparse solution is needed, as the sketch below suggests.
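
A quick empirical check of the sparsity contrast, assuming scikit-learn is available (the data and `alpha` values are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)  # only 2 relevant features

print(Ridge(alpha=1.0).fit(X, y).coef_)  # L2: all components shrunk, rarely exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1: many components exactly 0 (sparse)
```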

The Optimal \(\lambda\)

  • more noise \(\iff\) more regularization needed
    • more bumpy road \(\iff\) putting brakes more
  • noise unknown - important to make proper choices

How to choose? Stay tuned for the next lecture!

Conclusions

  • Regularized Hypothesis set
    • original \(\mathcal{H}\) + constraint
  • Weight-Decay Regularization
    • add \(\frac{\lambda}{N} w^Tw\) in \(E_{aug}\)
  • Regularization and VC Theory
    • regularization decreases \(d_{\text{EFF}}\)
  • General Regularizers
    • target-dependent, plausible or friendly.

Lecture 15 Validation

Recap: regularization minimizes the augmented error, where the added regularizer effectively limits model complexity.

Model Selection Problem

So Many Models Learned

Even just for Binary Classification …

  • \(\mathcal{A} \in\) {PLA, pocket, linear regression, logistic regression}
  • \(T \in\) {100, 1000, 10000}
  • \(\eta \in\) {1, 0.01, 0.0001}
  • \(\Phi \in\) {linear, quadratic, poly-10, Legendre-poly-10}
  • \(\Omega(w) \in\) {L2 regularizer, L1 regularizer, symmetry regularizer}
  • \(\lambda \in\) {0, 0.01, 1}

In addition to your favorite combination, you may need to try other combinations to get a good \(g\).

Model Selection Problem

  • Which one do you prefer? Linear \(\mathcal{H}_1\) or quadratic \(\mathcal{H}_2\)?

  • given: \(M\) models \(\mathcal{H}_1, \mathcal{H}_2, \dots, \mathcal{H}_M\), each with corresponding algorithm \(\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_M\)
  • goal: select \(\mathcal{H}_{m*}\) such that \(g_{m*} = \mathcal{A}_{m*}(\mathcal{D})\) is of low \(E_{out}(g_{m*})\)
  • unknown \(E_{out}\) due to unknown \(P(x)\) & \(P(y|x)\), as always.
  • arguably the most important practical problem of ML.

How to select? Visually? No! See Lecture 12.

  • Not everything can be visualized.
  • Visual selection secretly pays for the high VC dimension of the human brain.

Model Selection by Best \(E_{in}\)

  • select by best \(E_{in}\)?
    • \(m* = \underset{1 \le m \le M}{\text{argmin}} \left( E_m = E_{in}(\mathcal{A}_m(\mathcal{D})) \right)\)
  • \(\Phi_{1126}\) always preferred over \(\Phi_1\)
  • \(\lambda = 0\) always preferred over \(\lambda = 0.1\) - overfitting
  • if \(\mathcal{A}_1\) minimizes over \(\mathcal{H}_1\) and \(\mathcal{A}_2\) minimizes over \(\mathcal{H}_2\),
    • \(\implies g_{m*}\) achieves minimal \(E_{in}\) over \(\mathcal{H}_{1} \cup \mathcal{H}_{2}\)
    • \(\implies\) learning pays the extra model complexity price \(d_{VC}(\mathcal{H}_{1} \cup \mathcal{H}_{2})\)
    • bad generalization?

selecting by \(E_{in}\) is dangerous.

Model Selection by Best \(E_{test}\)

  • select by best \(E_{test}\), which is evaluated on a fresh \(\mathcal{D}_{test}\)?
    • \(m* = \underset{1 \le m \le M}{\text{argmin}} \left( E_m = E_{test}(\mathcal{A}_m(\mathcal{D})) \right)\)
  • generalization guarantee (finite-bin Hoeffding):
    • yes! strong guarantee! \[ E_{out}(g_{m*}) \le E_{test}(g_{m*}) + O(\sqrt{\frac{\text{log}M}{N_{\text{test}}}}) \]
  • But where is \(\mathcal{D}_{test}\)? Your boss’s safe, maybe?

selecting by \(E_{test}\) is infeasible and cheating!

Comparison between \(E_{in}\) and \(E_{test}\)

  • in-sample error \(E_{in}\)
    • calculate from \(\mathcal{D}\)
    • feasible on hand
    • ‘contaminated’ as \(\mathcal{D}\) also used by \(\mathcal{A}_{m}\) to select \(g_{m}\)
  • test error \(E_{test}\)
    • calculate from \(\mathcal{D}_{test}\)
    • infeasible in boss’s safe
    • ‘clean’ as \(\mathcal{D}_{test}\) never used for selection before
  • something in between: \(E_{val}\)
    • calculate from \(\mathcal{D}_{val} \subset \mathcal{D}\)
    • feasible on hand
    • clean if \(\mathcal{D}_{val}\) never used by \(\mathcal{A}_{m}\) before.

selecting by \(E_{val}\): legal cheating.

Validation

Validation Set \(\mathcal{D}_{val}\)

\[ \underbrace{\mathcal{D}}_{\text{size } N} \rightarrow \underbrace{\mathcal{D}_{train}}_{\text{size } N-K} \cup \underbrace{\mathcal{D}_{val}}_{\text{size } K} \]

  • \(\mathcal{D}_{val} \subset \mathcal{D}\): called validation set - ‘on hand’ simulation of test set to connect \(E_{val}\) with \(E_{out}\):
    • \(\mathcal{D}_{val} \overset{iid}{\sim} P(x, y)\): select \(K\) examples from \(\mathcal{D}\) at random
  • to make sure \(\mathcal{D}_{val}\) ‘clean’: feed only \(\mathcal{D}_{train}\) to \(\mathcal{A}_m\) for model selection.
  • \(g_{m} = \mathcal{A}_m(\mathcal{D})\)
  • \(g_{m}^- = \mathcal{A}_m(\mathcal{D}_{train})\)

\[ E_{out}(g_m^-) \le E_{val}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right) \]

Model Selection by Best \(E_{val}\)

\[ m* = \underset{1 \le m \le M}{\text{argmin}} \left( E_m = E_{val}(\mathcal{A}_m(\mathcal{D}_{train})) \right) \]

  • generalization guarantee for all \(m\):
    • \(E_{out}(g_m^-) \le E_{val}(g_m^-) + O(\sqrt{\frac{\log M}{K}})\)
  • heuristic gain from \(N-K\) to \(N\):
    • \(E_{out}(\underbrace{g_{m*}}_{\mathcal{A}_{m*}(\mathcal{D})}) \le E_{out}(\underbrace{g_{m*}^-}_{\mathcal{A}_{m*}(\mathcal{D}_{train})})\)
    • This follows from the learning curve: training on more data gives lower expected error.
  • The general flow (see the sketch after the bound below):
  1. divide \(\mathcal{D}\) into \(\mathcal{D}_{val}\) and \(\mathcal{D}_{train}\)
  2. obtain \(M\) models \(g_1^- \dots g_M^-\) by training on \(\mathcal{D}_{train}\)
  3. pick the best \(\mathcal{H}_{m*}\) by its validation error \(E_{m*}\)
  4. use the full \(\mathcal{D}\) to find \(g_{m*}\) in \(\mathcal{H}_{m*}\)

\[ E_{out}(g_{m*}) \le E_{out}(g_{m*}^-) \le E_{val}(g_{m*}^-) + O\left(\sqrt{\frac{\log M}{K}}\right) \]
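
A minimal sketch of this flow (all names hypothetical; each entry of `models` is a fit function returning a predictor \(g\)):

```python
import numpy as np

def select_by_validation(models, X, y, K, err):
    """Split D at random, train each model on D_train, pick the lowest
    E_val, then retrain the winner on all of D."""
    idx = np.random.permutation(len(y))
    val, train = idx[:K], idx[K:]
    E_val = [err(fit(X[train], y[train])(X[val]), y[val]) for fit in models]
    m_star = int(np.argmin(E_val))    # step 3: best H_{m*} by E_val
    return models[m_star](X, y)       # step 4: g_{m*} from the full D
```

For 0/1 error, `err` could be `lambda pred, truth: np.mean(pred != truth)`.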

Validation in Practice

use validation to select between \(\mathcal{H}_{\Phi_5}\) and \(\mathcal{H}_{\Phi_{10}}\)

  • in-sample: selection with \(E_{in}\)
  • optimal: cheating-selection with \(E_{test}\)
  • sub-\(g\): selection with \(E_{val}\) and report \(g_{m*}^-\)
  • full-\(g\): selection with \(E_{val}\) and report \(g_{m*}\)
    • \(E_{out}(g_{m*}) \le E_{out}(g_{m*}^-)\) indeed.

Why is sub-\(g\) sometimes worse than in-sample selection?

  • \(\mathcal{D}_{val} \uparrow\) implies \(\mathcal{D}_{train} \downarrow\), so each \(g_m^-\) is trained on less data.

The Dilemma about \(K\)

reasoning of validation:

\[ E_{out}(g) \underset{(\text{small } K)}{\approx} E_{out}(g^-) \underset{(\text{large } K)}{\approx} E_{val}(g^-) \]

  • large \(K\): every \(E_{val} \approx E_{out}\), but all \(g_m^-\) much worse than \(g_m\)
  • small \(K\): every \(g_m^- \approx g_m\), but \(E_{val}\) far from \(E_{out}\)

Practical rule of thumb: \(K = \frac{N}{5}\)

Fun Time

Validation might not be slow, because each model is trained on the smaller \(\mathcal{D}_{train}\).

Leave-One-Out Cross Validation

Extreme Case: \(K = 1\)

reasoning of validation:

\[ E_{out}(g) \underset{(\text{small } K)}{\approx} E_{out}(g^-) \underset{(\text{large } K)}{\approx} E_{val}(g^-) \]

  • take \(K = 1\)? \(\mathcal{D}_{val}^{(n)} = \{(x_n, y_n)\}\) and \(E_{val}^{(n)}(g_n^-) = \text{err}(g_{n}^-(x_n), y_n) = e_n\)
  • make \(e_n\) closer to \(E_{out}(g)\)? - average over possible \(E_{val}^{(n)}\)
  • Leave-One-Out cross validation estimate:

\[ E_{loocv}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_{n} = \frac{1}{N} \sum_{n=1}^{N} \text{err} (g_{n}^-(x_{n}), y_{n}) \]

  • It tells us whether algorithm \(\mathcal{A}\) performs well on data set \(\mathcal{D}\).
  • hope: \(E_{loocv}(\mathcal{H}, \mathcal{A}) \approx E_{out}(g)\)
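
A minimal sketch of computing \(E_{loocv}\) by brute force (names hypothetical; `fit(X, y)` returns a predictor):

```python
import numpy as np

def loocv_error(fit, X, y, err):
    """Average error of g_n^- on the single left-out point (x_n, y_n)."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n           # leave point n out
        g_minus = fit(X[mask], y[mask])    # train on the other N-1 points
        e[n] = err(g_minus(X[n:n+1]), y[n:n+1])
    return e.mean()
```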

Illustration of Leave-One-Out

\[ E_{loocv}(\text{linear}) = \frac{1}{3} (e_1 + e_2 + e_3) \]

\[ E_{loocv}(\text{constant}) = \frac{1}{3} (e_1 + e_2 + e_3) \]

  • Which one would you choose?
    • With only three data points, the constant hypothesis is likely better.
  • \(m* = \underset{1 \le m \le M}{\text{argmin}} \left( E_{m} = E_{loocv}(\mathcal{H}_m, \mathcal{A}_m) \right)\)

Theoretical Guarantee of Leave-One-Out Estimate

Does \(E_{loocv}(\mathcal{H}, \mathcal{A})\) say something about \(E_{out}(g)\)?

Yes! for average \(E_{out}\) on size \(N-1\) data.

\[ \begin{split} \underset{\mathcal{D}}{\mathcal{E}}\, E_{loocv}(\mathcal{H}, \mathcal{A}) = \underset{\mathcal{D}}{\mathcal{E}}\, \frac{1}{N} \sum_{n=1}^{N} e_n = \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}}{\mathcal{E}}\, e_n = \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}_n}{\mathcal{E}}\, \underset{(x_n, y_n)}{\mathcal{E}}\, \text{err} (g_n^-(x_n), y_n) \\ = \frac{1}{N} \sum_{n=1}^{N} \underset{\mathcal{D}_n}{\mathcal{E}}\, E_{out}(g_n^-) = \frac{1}{N} \sum_{n=1}^{N} \overline{E_{out}}(N-1) = \overline{E_{out}}(N-1) = \overline{E_{out}}(g^-) \end{split} \]

expected \(E_{loocv}(\mathcal{H}, \mathcal{A})\) says something about expected \(E_{out}(g^-)\)

  • Often called an ‘almost unbiased estimate’ of \(E_{out}(g)\).

Leave-One-Out in Practice

  • selecting by \(E_{in}\): not smooth, overfits
  • selecting by \(E_{loocv}\): smooth, avoids overfitting
  • \(E_{loocv}\) works much better than \(E_{in}\)

V-Fold Cross Validation

Disadvantage of Leave-One-Out Estimate

Computation issue:

\[ E_{loocv}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_n = \frac{1}{N} \sum_{n=1}^{N} \text{err} (g_n^-(x_n), y_n) \]

  • \(N\) ‘additional’ training per model, not always feasible in practice.
    • E.g., with \(|\mathcal{D}| = 1000\), we need 1000 extra trainings per model, each on 999 points.
  • except ‘special case’ like analytic solution for linear regression.

Stability issue - due to the variance of single-point estimates.

  • The Leave-One-Out curve has a jagged shape: not good for selecting the best model, since we may mistakenly pick a point that is just lucky to have a low \(E_{cv}\).

V-fold Cross Validation

how to decrease computation need for cross validation?

  • essence of Leave-One-Out cross validation: partition \(\mathcal{D}\) into \(N\) parts, taking \(N-1\) for training and \(1\) for validation in turn
  • \(V\)-fold cross-validation: random-partition \(\mathcal{D}\) into \(V\) equal parts, taking \(V-1\) for training and \(1\) for validation in turn

\[ E_{cv}(\mathcal{H}, \mathcal{A}) = \frac{1}{V} \sum_{v=1}^{V} E_{val}^{(v)}(g_v^-) \]

  • selection by \(E_{cv}\): \(m* = \underset{1 \le m \le M}{\text{argmin}} \left( E_{m} = E_{cv}(\mathcal{H}_m, \mathcal{A}_m) \right)\)

Practical rule of thumb: \(V = 10\)
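A minimal sketch of \(V\)-fold cross validation (names hypothetical, same `fit`/`err` conventions as the earlier sketches):

```python
import numpy as np

def vfold_cv_error(fit, X, y, err, V=10):
    """E_cv: random-partition D into V folds; each fold serves once as the
    validation set while the remaining V-1 folds are used for training."""
    folds = np.array_split(np.random.permutation(len(y)), V)
    E_vals = []
    for v in range(V):
        train = np.hstack([folds[u] for u in range(V) if u != v])
        g_minus = fit(X[train], y[train])
        E_vals.append(err(g_minus(X[folds[v]]), y[folds[v]]))
    return float(np.mean(E_vals))
```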

Final Words on Validation

‘Selecting’ Validation Tool

  • \(V\)-fold generally preferred over single validation if computation allows
  • 5-Fold or 10-Fold generally works well: not necessary to trade \(V\)-fold with Leave-One-Out

Nature of Validation

  • all training models: select among hypotheses, like the qualification stage of the World Cup
  • all validation schemes: select among finalists, like the knock-out stage
  • all testing methods: just evaluate

Validation is still more optimistic than testing.

  • Validation is still selection: it mildly contaminates the data and pays some model complexity.

Do not fool yourself and others, report test result, not best validation result.

Conclusions

  • Model Selection Problem
    • Dangerous by \(E_{in}\) and dishonest by \(E_{test}\)
  • Validation
    • Select with \(E_{val}(\mathcal{D}_{train})\) while returning \(\mathcal{A}_{m*}(\mathcal{D})\)
  • Leave-One-Out Cross Validation
    • huge computation for an almost unbiased estimate
  • V-Fold Cross Validation
    • reasonable computation and performance

Lecture 16 Three Learning Principles

Occam’s Razor

An explanation of the data should be made as simple as possible, but no simpler - Albert Einstein

entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity) - William of Occam (1287 - 1347)

‘Occam’s razor’ for trimming down unnecessary explanation.

  • Do not over-interpret the data!

Occam’s Razor for Learning

  • The simplest model that fits the data is also the most plausible.
  • Two questions:
    • What does it mean for a model to be simple?
    • How do we know that simpler is better?

Simple Model

  • Simple hypothesis \(h\)
    • small \(\Omega(h)\) = looks simple
    • specified by few parameters
  • simple model \(\mathcal{H}\)
    • small \(\Omega(\mathcal{H})\) = not many hypotheses
    • contains a small number of hypotheses
  • connection
    • \(|\mathcal{H}|\) of size \(2^{\ell}\) \(\implies\) each \(h\) specified by \(\ell\) bits
    • small \(\Omega(\mathcal{H}) \implies\) small \(\Omega(h)\)
  • simple: small hypothesis/model complexity
    • either we use simple hypothesis set
    • or we use regularization

Simple is Better

In addition to the math proofs you have seen, philosophically:

  • simple \(\mathcal{H}\)
  • \(\implies\) smaller \(m_{\mathcal{H}}(N)\)
    • consider an arbitrary data set whose outputs are generated by coin flips (messy, pattern-less data)
    • there is no regularity to fit
  • \(\implies\) less likely to fit such data perfectly: probability roughly \(\frac{m_{\mathcal{H}}(N)}{2^N}\)
  • so when a simple model does fit the data perfectly, the fit is significant

Sampling Bias

Presidential Story

  • 1948 US President election: Truman versus Dewey
  • a newspaper ran a phone poll of how people voted and printed the headline ‘Dewey Defeats Truman’ based on the polling

The Big Smile Came from …

Truman, and yes he won

suspects for the mistake:

  • editorial bug? - no
  • bad luck in polling (\(\delta\))? - no

hint: phones were expensive

Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

  • technical explanation: data from \(P_1(x, y)\) but test under \(P_2 \ne P_1\): VC fails
  • Philosophical explanation: study Math hard but test English: no strong test guarantee
  • ‘minor’ VC assumption: data and testing both iid from \(P\)

Sampling Bias in Learning

A True Personal Story

  • Netflix competition for a movie recommender system: 10% improvement = 1M US dollars
  • formed \(\mathcal{D}_{val}\); on my first shot, \(E_{val}(g)\) showed a 13% improvement
  • Why am I still teaching here?

  • validation: random examples within \(\mathcal{D}\);
  • test: ‘last’ user records ‘after’ \(\mathcal{D}\). The training and test data were not randomly sampled; they are ordered along the time axis.

Dealing with Sampling Bias

  • Practical rule of thumb: match the test scenario as much as possible.
    • Understand the test environment and make the training environment as close to it as possible.
  • e.g. if test: ‘last’ user records ‘after’ \(\mathcal{D}\)
    • training: emphasize later examples (KDDCup 2011)
    • validation: use ‘late’ user records
  • last puzzle: danger when learning ‘credit card approval’ with existing bank records?
    • We don’t know if the new user will follow the behavior of the existing users.

Data Snooping

Visual Data Snooping

Visualize: \(\mathcal{X} = \mathbb{R}^2\)

  • full \(\Phi_2: z = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)\), \(d_{VC} = 6\)
  • or \(z = (1, x_1^2, x_2^2)\), \(d_{VC} = 2\)?
  • or better \(z = (1, x_1^2 + x_2^2)\), \(d_{VC} = 2\)?
  • or even better \(z = \text{sign}(0.6 - x_1^2 - x_2^2)\)?

Careful about your brain’s ‘model complexity’.

  • For VC-safety, \(\Phi\) shall be decided without ‘Snooping’ data.

Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

  • 8 years of currency trading data
  • first 6 years for training, last 2 years for testing
  • x = previous days, y = the 21st day
  • snooping versus no snooping: superior profit possible

  • snooping: shift-scale all values by training + testing
  • no snooping: shift-scale all values by training only

Data Snooping by Data Reusing

Research Scenario

  • benchmark data \(\mathcal{D}\)
    • paper1: propose \(\mathcal{H}_1\) that works well on \(\mathcal{D}\)
    • paper2: find room for improvement, propose \(\mathcal{H}_2\) - and publish only if better than \(\mathcal{H}_1\) on \(\mathcal{D}\)
    • paper3: find room for improvement, propose \(\mathcal{H}_3\) - and publish only if better than \(\mathcal{H}_2\) on \(\mathcal{D}\)
  • if all papers were merged into one big paper by the same author: bad generalization due to \(d_{VC}(\cup_m \mathcal{H}_m)\)
  • step-wise: each later author snooped the data by reading earlier papers

If you torture the data long enough, it will confess :)

Dealing with Data Snooping

  • truth - very hard to avoid, unless being extremely honest
  • extremely honest: lock your test data in safe
  • less honest: reserve validation and use cautiously
  • be blind: avoid making modeling decision by data (作决定之前尽量不要看资料)
  • be suspicious: interpret research results (including your own) with a proper suspicion of contamination

One secret to winning KDDCups: a careful balance between data-driven modeling (snooping) and validation (no snooping).

Power of Three

Three Related Fields

Power of Three

  • Data Mining
    • use (huge) data to find property that is interesting
    • difficult to distinguish ML and DM in reality
  • Artificial Intelligence
    • Compute something that shows intelligent behavior
    • ML is one possible route to realize AI
  • Statistics
    • use data to make inference about an unknown process
    • statistics contains many useful tools for ML

Three Theoretical Bounds

  • Hoeffding: \(\text{P[BAD]} \le 2 \text{exp} (-2 \epsilon^2 N)\)
    • one hypothesis
    • useful for verifying/testing
  • Multi-Bin Hoeffding: \(\text{P[BAD]} \le 2 M \text{exp} (-2 \epsilon^2 N)\)
    • \(M\) hypotheses
    • useful for validation
  • VC: \(\text{P[BAD]} \le 4 m_{\mathcal{H}}(2N) \text{exp} (\dots)\)
    • all \(\mathcal{H}\)
    • useful for training

Three Linear Models

  • PLA/pocket: \(h(x) = \text{sign}(s)\)
    • plausible err = 0/1 (small flipping noise); minimized specially
  • linear regression: \(h(x) = s\)
    • friendly err = squared (easy to minimize); minimized analytically
  • logistic regression: \(h(x) = \theta(s)\)
    • plausible err = cross-entropy (maximum likelihood); minimized iteratively

Three Key Tools

  • Feature Transformation
    • \(E_{in}(w) \rightarrow E_{in}(\tilde{w})\), \(d_{VC}(\mathcal{H}) \rightarrow d_{VC}(\mathcal{H}_{\Phi})\)
    • by using more complicated \(\Phi\)
    • lower \(E_{in}\)
    • higher \(d_{VC}\)
  • Regularization
    • \(E_{in}(w) \rightarrow E_{in}(w_{REG})\), \(d_{VC}(\mathcal{H}) \rightarrow d_{\text{EFF}}(\mathcal{H}, \mathcal{A})\)
    • by augmenting regularizer \(\Omega\)
    • lower \(d_{\text{EFF}}\)
    • higher \(E_{in}\)
  • Validation
    • \(E_{in}(h) \rightarrow E_{val}(g)\), \(\mathcal{H} \rightarrow \{g_1^-, \dots, g_M^-\}\)
    • by reserving \(K\) examples as \(\mathcal{D}_{val}\)
    • fewer choices
    • fewer examples

Three Learning Principles

  • Occam’s Razor
    • simple is good
  • Sampling Bias
    • class matches exam
  • Data Snooping
    • honesty is the best policy

Three Future Directions

  • More Transform
  • More Regularization
  • Less Label

Conclusions

  • Occam’s Razor
    • simple, simple, simple!
  • Sampling Bias
    • match test scenario as much as possible
  • Data Snooping
    • any use of data is ‘contamination’
  • Power of Three
    • relatives, bounds, models, tools, principles