Week 1 gave us two very different mistake bounds in online learning. For finite hypothesis classes, Halving guarantees at most $\log_2 |\mathcal{H}|$ mistakes. For linear predictors with margin, the Perceptron guarantees at most $R^2 / \gamma^2$ mistakes. The first bound is intrinsic to the class but only works when $\mathcal{H}$ is finite. The second works for infinite classes, but it depends on geometric properties of the data, not only on the class itself.
This lecture isolates the combinatorial quantity that Halving uses on a finite pool: not the number of hypotheses, but the number of distinct labeling patterns the class can induce on finitely many points. That leads to the growth function, shattering, VC dimension, and finally the Sauer–Shelah lemma.
Recall the Halving algorithm for a finite class $\mathcal{H}$. After observing $(x_t, y_t)$, we delete every hypothesis that disagrees with the revealed label. The proof of the $\log_2 |\mathcal{H}|$ mistake bound only uses one fact: on every mistake, at least half of the currently surviving candidates are eliminated.
But if two hypotheses agree on every point that ever appears, then they are operationally indistinguishable: they survive or die together, and they make exactly the same predictions on the observed data. This shows that the raw cardinality $|\mathcal{H}|$ does not by itself determine how the class behaves on a finite sample. On the observed points, the algorithm tracks how many different behaviors the class can exhibit.
We want a complexity measure that is intrinsic to the class $\mathcal{H}$, works for finite and infinite classes alike, and says how rich the class can look on a finite sample. The lecture develops such a measure through restriction, growth, shattering, and VC dimension.
We now consider a restricted finite-pool version of online learning. Instead of allowing the adversary to choose arbitrary new points from $\mathcal{X}$, we fix a finite pool and only ask the learner to predict labels on points from that pool.
Fix a finite pool $C = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ and assume realizability: there exists some unknown $h^* \in \mathcal{H}$ such that the true label of each point is $h^*(x_i)$.
The adversary reveals the points of $C$ one at a time, in an arbitrary order. On each round, the learner predicts the label of the revealed point, then the true label is revealed.
In this setting, the learner only needs to distinguish hypotheses through their labels on the pool $C$. If two hypotheses agree on every point in $C$, then no interaction with this pool can ever tell them apart.
For this setting, the relevant object is the restriction of the class to the pool. Because we will later need both the underlying set and an ordering of its points, we write $C = (x_1, \ldots, x_n)$ when the order matters.
Let $C = (x_1, \ldots, x_n)$ be an ordered tuple of distinct points. The restriction of $\mathcal{H}$ to $C$ is
$$ \mathcal{H}|_C = \left\{ \bigl(h(x_1), \ldots, h(x_n)\bigr) : h \in \mathcal{H} \right\} \subseteq \{0,1\}^n. $$Its elements are the distinct binary labeling patterns that $\mathcal{H}$ can realize on the pool.
Even when $\mathcal{H}$ is infinite, the restriction $\mathcal{H}|_C$ is finite, because there are only $2^n$ binary labelings of $n$ points. This means that Halving can be run directly on restrictions.
Maintain a set $S_t \subseteq \mathcal{H}|_C$ of surviving patterns, initialized to all of $\mathcal{H}|_C$. On each round, predict the majority label among the surviving patterns at the revealed point, then delete every pattern that disagrees with the revealed true label.
Assume realizability on the pool $C$. If Halving is run on $\mathcal{H}|_C$, then the total number of mistakes satisfies
$$ M \le \log_2 \bigl|\mathcal{H}|_C\bigr|. $$Let $b^* = \bigl(h^*(x_1), \ldots, h^*(x_n)\bigr) \in \mathcal{H}|_C$ be the true pattern induced by the realizable target $h^*$. Because $h^*$ always predicts the correct labels, the pattern $b^*$ is never deleted.
Now consider any round on which the learner makes a mistake at point $x_j$. The learner predicted the majority label among the surviving patterns, so the true label is held by at most half of them. Therefore at least half of the current surviving set is deleted on that round.
After any round on which $m$ mistakes have been made, the current surviving set $S$ therefore satisfies $|S| \le \bigl|\mathcal{H}|_C\bigr| / 2^m$. Since $b^*$ always survives, we still have $|S| \ge 1$. Hence
$$ 1 \le \frac{\bigl|\mathcal{H}|_C\bigr|}{2^m}, \qquad\text{so}\qquad 2^m \le \bigl|\mathcal{H}|_C\bigr|. $$Taking $\log_2$ gives $m \le \log_2 \bigl|\mathcal{H}|_C\bigr|$, as claimed.
In this lecture we write threshold functions as $h_\theta(x) = \mathbf{1}[x \le \theta]$. This is the convention used in the week 2 slides, and it makes the restriction patterns on ordered points look like a block of $1$s followed by a block of $0$s.
Consider the threshold class $\mathcal{H} = \{x \mapsto \mathbf{1}[x \le \theta] : \theta \in \mathbb{R}\}$, and let $x_1 < x_2 < x_3$. The possible patterns are:
| Range of $\theta$ | $h(x_1)$ | $h(x_2)$ | $h(x_3)$ |
|---|---|---|---|
| $\theta < x_1$ | 0 | 0 | 0 |
| $x_1 \le \theta < x_2$ | 1 | 0 | 0 |
| $x_2 \le \theta < x_3$ | 1 | 1 | 0 |
| $\theta \ge x_3$ | 1 | 1 | 1 |
So $\mathcal{H}|_C = \{000, 100, 110, 111\}$ and $\bigl|\mathcal{H}|_C\bigr| = 4$. If the true pattern is $000$ and the reveal order is $x_2, x_1, x_3$, then a tie-breaking rule that predicts $1$ on ties makes two mistakes and then identifies the true pattern. This attains the upper bound $\log_2 4 = 2$.
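The walk-through above can be reproduced in code. The following is a minimal sketch of Halving on a restriction, assuming patterns are tuples in $\{0,1\}^n$, the reveal order is a list of coordinate indices, and ties are broken toward $1$ as in the example:

```python
# Halving on a restriction H|_C. Patterns are tuples in {0,1}^n; the
# reveal order is a sequence of coordinate indices. Ties in the majority
# vote are broken toward 1, matching the tie-breaking rule in the text.

def halving_mistakes(patterns, b_star, order):
    """Run Halving against the true pattern b_star; return the mistake count."""
    survivors = set(patterns)
    mistakes = 0
    for j in order:
        ones = sum(p[j] for p in survivors)
        prediction = 1 if 2 * ones >= len(survivors) else 0  # majority, ties -> 1
        if prediction != b_star[j]:
            mistakes += 1
        # delete every surviving pattern that disagrees with the true label
        survivors = {p for p in survivors if p[j] == b_star[j]}
    return mistakes

# The threshold example: patterns on x1 < x2 < x3, true pattern 000,
# reveal order x2, x1, x3 (coordinate indices 1, 0, 2).
patterns = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
```

Running `halving_mistakes(patterns, (0, 0, 0), [1, 0, 2])` returns `2`, matching the two mistakes in the example and attaining the bound $\log_2 4 = 2$.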
Theorem 1 bounds the number of mistakes on a specific pool by $\log_2 \bigl|\mathcal{H}|_C\bigr|$. The adversary, however, picks the pool, so the relevant quantity is the worst case of $\bigl|\mathcal{H}|_C\bigr|$ over all pools of size $n$. Three questions remain, each answered by one of the remaining sections: how large can this worst case be as a function of $n$, when does it attain the maximum value $2^n$, and how fast can it grow once that maximum is no longer attainable?
The restriction $\mathcal{H}|_C$ records the full behavior of $\mathcal{H}$ on a fixed finite sample. Its size is the number of different labelings the class can induce there.
For an ordered tuple $C = (x_1, \ldots, x_n)$ of distinct points, define
$$ \Gamma_{\mathcal{H}}(C) := \bigl|\mathcal{H}|_C\bigr|. $$This is the number of distinct labeling patterns that $\mathcal{H}$ realizes on the pool $C$. Reordering the points only permutes the coordinates of each pattern, so this count depends only on the underlying set $\{x_1, \ldots, x_n\}$; we will freely write $\Gamma_{\mathcal{H}}(C)$ when $C$ is given as a set.
The growth function of $\mathcal{H}$ is
$$ \Gamma_{\mathcal{H}}(n) := \max_{C:\, |C| = n} \Gamma_{\mathcal{H}}(C), $$where the maximum ranges over all $n$-point subsets of $\mathcal{X}$ (equivalently, over ordered $n$-tuples of distinct points).
The quantity $\Gamma_{\mathcal{H}}(C)$ depends on one specific pool. The quantity $\Gamma_{\mathcal{H}}(n)$ is the worst case over all pools of size $n$. This is the quantity that appears when the pool is chosen adversarially.
For every nonempty class $\mathcal{H}$ and every $n$, $1 \le \Gamma_{\mathcal{H}}(n) \le 2^n$. If $\mathcal{H}$ is finite, then also $\Gamma_{\mathcal{H}}(n) \le |\mathcal{H}|$. These bounds can be very loose: an infinite class may still have growth only polynomial in $n$.
Let $\mathcal{H}$ be the class of all binary functions $\mathcal{X} \to \{0,1\}$. Then on any $n$ distinct points, every one of the $2^n$ labelings is realizable. Therefore
$$ \Gamma_{\mathcal{H}}(n) = 2^n. $$This is the largest possible growth. The class has no structural constraints at all.
For the threshold class $\mathcal{H} = \{x \mapsto \mathbf{1}[x \le \theta] : \theta \in \mathbb{R}\}$, take ordered points $x_1 < \cdots < x_n$. The only possible patterns are
$$ 00\cdots0,\; 10\cdots0,\; 110\cdots0,\; \ldots,\; 11\cdots1. $$There are exactly $n+1$ such patterns, one for each place where the threshold may fall relative to the ordered sample. Hence
$$ \Gamma_{\mathcal{H}}(n) = n+1. $$Consider the interval class $\mathcal{H} = \{x \mapsto \mathbf{1}[a \le x \le b] : a,b \in \mathbb{R}\}$. On ordered points $x_1 < \cdots < x_n$, any realized positive set must be a single contiguous block:
$$ 00\cdots 0\,11\cdots 1\,00\cdots 0. $$A nonempty block is determined by its start position $i$ and end position $j$ with $1 \le i \le j \le n$, so there are $\binom{n+1}{2}$ nonempty blocks. Adding the all-zero labeling gives
$$ \Gamma_{\mathcal{H}}(n) = 1 + \binom{n+1}{2}. $$Thus intervals form another infinite class whose growth is polynomial, not exponential.
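Both growth formulas can be checked by brute force on small pools. The sketch below assumes the pool $(1, \ldots, n)$ and places candidate thresholds and interval endpoints in the gaps around the sample points, which is enough to realize every achievable pattern:

```python
# Brute-force verification of the growth-function formulas for
# thresholds (n + 1) and intervals (1 + C(n+1, 2)) on the pool 1..n.
from math import comb

def restriction_size(hypotheses, points):
    """|H|_C|: the number of distinct labeling patterns on the pool."""
    return len({tuple(h(x) for x in points) for h in hypotheses})

def growth_thresholds(n):
    # h_theta(x) = 1[x <= theta]; one candidate theta per gap/endpoint
    points = range(1, n + 1)
    thetas = [0.5 + i for i in range(n + 1)]
    return restriction_size([lambda x, t=t: int(x <= t) for t in thetas], points)

def growth_intervals(n):
    # h_{a,b}(x) = 1[a <= x <= b], endpoints on the same grid
    points = range(1, n + 1)
    ends = [0.5 + i for i in range(n + 1)]
    hyps = [lambda x, a=a, b=b: int(a <= x <= b)
            for a in ends for b in ends if a <= b]
    return restriction_size(hyps, points)
```

For `n = 5` these return `6` and `16`, matching $n + 1$ and $1 + \binom{6}{2}$.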
These examples show that cardinality does not distinguish these classes. The unrestricted class, thresholds, and intervals can all be infinite, but their growth functions behave differently.
The maximal possible growth on an $n$-point pool is $2^n$. This happens exactly when the class can realize every labeling of that pool.
A finite set $C = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ is shattered by $\mathcal{H}$ if for every labeling $(y_1, \ldots, y_n) \in \{0,1\}^n$, there exists some $h \in \mathcal{H}$ such that $h(x_i) = y_i$ for all $i$.
Let $C$ be a set of size $n$. Then $C$ is shattered by $\mathcal{H}$ if and only if $\Gamma_{\mathcal{H}}(C) = 2^n$. Consequently, $\Gamma_{\mathcal{H}}(n) = 2^n$ if and only if $\mathcal{H}$ shatters some $n$-point set.
There are exactly $2^n$ binary labelings of an $n$-point set. By definition, $C$ is shattered precisely when every one of those labelings is realized by some hypothesis in $\mathcal{H}$. That is equivalent to saying that the restriction $\mathcal{H}|_C$ contains all $2^n$ labelings, which is exactly the statement $\Gamma_{\mathcal{H}}(C) = 2^n$.
The second statement follows immediately by maximizing over all sets $C$ of size $n$.
On a shattered $n$-point set, every labeling is compatible with some hypothesis in the class. So on that set, knowing $\mathcal{H}$ does not rule out any labeling in advance.
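Since shattering is equivalent to $\Gamma_{\mathcal{H}}(C) = 2^{|C|}$, it can be tested by enumerating the restriction. A sketch using interval indicators (the endpoint grid below is an assumption that suffices for pools inside $\{1, 2, 3\}$):

```python
# Shattering as a computation: C is shattered iff the restriction
# contains all 2^n labeling patterns.

def restriction(hypotheses, points):
    """The set of labeling patterns the class realizes on the pool."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(hypotheses, points):
    return len(restriction(hypotheses, points)) == 2 ** len(points)

# Interval indicators with endpoints in the gaps around 1, 2, 3
ends = [0.5, 1.5, 2.5, 3.5]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in ends for b in ends if a <= b]
```

Intervals shatter the pair $\{1, 2\}$ but not the triple $\{1, 2, 3\}$, because the pattern $101$ is never realized.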
The VC dimension of $\mathcal{H}$, written $\mathrm{VCdim}(\mathcal{H})$, is the largest integer $d$ such that some set of size $d$ is shattered by $\mathcal{H}$. If $\mathcal{H}$ shatters sets of arbitrarily large finite size, we write $\mathrm{VCdim}(\mathcal{H}) = \infty$.
VC dimension records the largest set size on which the class can still realize all labelings. Up to size $d$, the class may behave like the unrestricted class on some set. Beyond size $d$, that is no longer possible.
A single point can be labeled either $0$ or $1$ by choosing the threshold on the appropriate side of it, so thresholds shatter every one-point set.
But no two ordered points $x_1 < x_2$ can be shattered. The possible patterns are only $00$, $10$, and $11$; the labeling $01$ is impossible. Therefore
$$ \mathrm{VCdim}(\mathcal{H}_{\mathrm{thr}}) = 1. $$Two ordered points $x_1 < x_2$ can be shattered by intervals: $00$ is realized by an interval disjoint from both points, $10$ by an interval containing only $x_1$, $01$ by an interval containing only $x_2$, and $11$ by an interval containing both.
But no three ordered points $x_1 < x_2 < x_3$ can be shattered, because the labeling $101$ is impossible for a single interval. Hence
$$ \mathrm{VCdim}(\mathcal{H}_{\mathrm{int}}) = 2. $$Let $\mathcal{H}_{\square}$ be the class of indicators of axis-aligned rectangles in $\mathbb{R}^2$. Then $\mathrm{VCdim}(\mathcal{H}_{\square}) = 4$.
For the lower bound, consider the four points $p_T = (0,2)$, $p_B = (0,-2)$, $p_L = (-2,0)$, and $p_R = (2,0)$. Given any subset $S$ of these four points, define
$$ x^-(S) = \begin{cases} -3, & p_L \in S, \\ -1, & p_L \notin S, \end{cases} \qquad x^+(S) = \begin{cases} 3, & p_R \in S, \\ 1, & p_R \notin S, \end{cases} $$ $$ y^-(S) = \begin{cases} -3, & p_B \in S, \\ -1, & p_B \notin S, \end{cases} \qquad y^+(S) = \begin{cases} 3, & p_T \in S, \\ 1, & p_T \notin S. \end{cases} $$Then the rectangle $R_S = [x^-(S), x^+(S)] \times [y^-(S), y^+(S)]$ contains exactly the points in $S$. So these four points are shattered, and $\mathrm{VCdim}(\mathcal{H}_{\square}) \ge 4$.
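The construction can be verified exhaustively over all $16$ subsets of the four points:

```python
# Check that for every subset S of the four diamond points, the
# rectangle R_S from the text contains exactly the points of S.
from itertools import chain, combinations

POINTS = {"T": (0, 2), "B": (0, -2), "L": (-2, 0), "R": (2, 0)}

def rect_for(S):
    """The rectangle R_S = [x-, x+] x [y-, y+] defined in the text."""
    x_lo = -3 if "L" in S else -1
    x_hi = 3 if "R" in S else 1
    y_lo = -3 if "B" in S else -1
    y_hi = 3 if "T" in S else 1
    return x_lo, x_hi, y_lo, y_hi

def captured(S):
    """The set of the four points actually contained in R_S."""
    x_lo, x_hi, y_lo, y_hi = rect_for(S)
    return {name for name, (x, y) in POINTS.items()
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi}

def four_points_shattered():
    names = list(POINTS)
    subsets = chain.from_iterable(combinations(names, k) for k in range(5))
    return all(captured(set(S)) == set(S) for S in subsets)
```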
For the upper bound, take any five points in $\mathbb{R}^2$. Select one point with minimal $x$-coordinate, one with maximal $x$-coordinate, one with minimal $y$-coordinate, and one with maximal $y$-coordinate. This yields a set $E$ of at most four selected points, so among the five original points there is at least one point $p$ that is not selected.
By construction, the coordinates of $p$ lie between the extreme $x$-coordinates and between the extreme $y$-coordinates of the selected set $E$. Therefore any axis-aligned rectangle containing all points of $E$ must also contain $p$. So the labeling that marks every point in $E$ by $1$ and the point $p$ by $0$ is impossible. Hence no set of five points is shattered, and $\mathrm{VCdim}(\mathcal{H}_{\square}) \le 4$.
Let $\mathcal{H}_{\mathrm{hs}} = \{x \mapsto \mathbf{1}[\langle w, x \rangle \ge 0] : w \in \mathbb{R}^d\}$. Then $\mathrm{VCdim}(\mathcal{H}_{\mathrm{hs}}) = d$.
For the lower bound, take the standard basis vectors $e_1, \ldots, e_d$. Let $(y_1, \ldots, y_d) \in \{0,1\}^d$ be any target labeling and set
$$ w_i = \begin{cases} 1, & y_i = 1, \\ -1, & y_i = 0. \end{cases} $$Since $\langle w, e_i \rangle = w_i$, we get $\mathbf{1}[\langle w, e_i \rangle \ge 0] = y_i$ for every $i$. Thus $e_1, \ldots, e_d$ are shattered, so $\mathrm{VCdim}(\mathcal{H}_{\mathrm{hs}}) \ge d$.
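The sign-vector construction can be checked mechanically over all $2^d$ labelings:

```python
# Verify that the standard basis e_1, ..., e_d is shattered by
# halfspaces w -> 1[<w, x> >= 0], using w_i = +1 or -1 per the text.
from itertools import product

def basis_is_shattered(d):
    for y in product([0, 1], repeat=d):
        w = [1 if label == 1 else -1 for label in y]
        # <w, e_i> = w_i, so the prediction on e_i is 1[w_i >= 0]
        predictions = tuple(int(w_i >= 0) for w_i in w)
        if predictions != y:
            return False
    return True
```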
For the upper bound, take any $d+1$ points $x_1, \ldots, x_{d+1} \in \mathbb{R}^d$. They are linearly dependent, so there exist coefficients $\alpha_1, \ldots, \alpha_{d+1}$, not all zero, such that
$$ \alpha_1 x_1 + \cdots + \alpha_{d+1} x_{d+1} = 0. $$We show that this dependence forces at least one impossible labeling.
First suppose that some $\alpha_i > 0$ and some $\alpha_j < 0$. Label the points with positive coefficient by $1$ and all other points by $0$. If some $w$ realized this labeling, then for every $i$ with $\alpha_i > 0$ we would have $\langle w, x_i \rangle \ge 0$, hence $\alpha_i \langle w, x_i \rangle \ge 0$; for every $i$ with $\alpha_i < 0$ we would have $\langle w, x_i \rangle < 0$, hence $\alpha_i \langle w, x_i \rangle > 0$; and every term with $\alpha_i = 0$ vanishes. Since at least one coefficient is negative, summing gives
$$ 0 = \left\langle w,\, \sum_{i=1}^{d+1} \alpha_i x_i \right\rangle = \sum_{i=1}^{d+1} \alpha_i \langle w, x_i \rangle > 0, $$a contradiction.
The remaining case is that all nonzero coefficients have the same sign. Multiplying the relation by $-1$ if necessary, we may assume $\alpha_i \ge 0$ for every $i$, with at least one strict inequality. Then the all-zero labeling is impossible. Indeed, if $h_w(x_i) = 0$ for every $i$, then $\langle w, x_i \rangle < 0$ for every $i$, so
$$ \sum_{i=1}^{d+1} \alpha_i \langle w, x_i \rangle < 0, $$again contradicting $\sum_{i=1}^{d+1} \alpha_i x_i = 0$.
In every case there is a labeling of the $d+1$ points that cannot be realized, so no set of size $d+1$ is shattered. Therefore $\mathrm{VCdim}(\mathcal{H}_{\mathrm{hs}}) \le d$, and the proof is complete.
Every VC-dimension computation has the same shape: a lower bound comes from exhibiting a shattered set, and an upper bound comes from finding one forbidden labeling on every larger set.
VC dimension tells us where full growth stops: if $\mathrm{VCdim}(\mathcal{H}) = d$, then $\Gamma_{\mathcal{H}}(n) = 2^n$ is possible up to $n = d$, but not beyond. The next question is quantitative: once full shattering stops, how large can the growth function still be?
The infinite classes computed earlier hint at the answer. For thresholds ($\mathrm{VCdim} = 1$), the growth function is $\Gamma_{\mathcal{H}}(n) = n + 1$, polynomial of degree $1$. For intervals ($\mathrm{VCdim} = 2$), it is $\Gamma_{\mathcal{H}}(n) = 1 + \binom{n+1}{2} = \Theta(n^2)$, polynomial of degree $2$. In both cases, once $n$ exceeds the VC dimension, the growth function is polynomial with degree matching the VC dimension. The Sauer–Shelah lemma says this is no coincidence.
If $\mathrm{VCdim}(\mathcal{H}) = d < \infty$, then for every $n$,
$$ \Gamma_{\mathcal{H}}(n) \le \sum_{k=0}^{d} \binom{n}{k}. $$In particular, for $n \ge d \ge 1$,
$$ \Gamma_{\mathcal{H}}(n) \le \left(\frac{en}{d}\right)^d. $$So once $n$ exceeds the VC dimension, the growth function can no longer remain exponential in $n$; it is forced down to polynomial growth of degree at most $d$.
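Both inequalities can be sanity-checked numerically. A small sketch comparing the binomial sum against the $(en/d)^d$ estimate, and against $2^n$ once $n$ exceeds $d$:

```python
# Numeric sanity check of the two Sauer-Shelah bounds for n >= d >= 1.
from math import comb, e

def binomial_sum(n, d):
    return sum(comb(n, k) for k in range(d + 1))

def sauer_bounds_consistent(max_n=30):
    """binomial sum <= (en/d)^d, and strictly below 2^n whenever n > d
    (the binomial terms with k > d are missing from the sum)."""
    return all(
        binomial_sum(n, d) <= (e * n / d) ** d
        and (n <= d or binomial_sum(n, d) < 2 ** n)
        for n in range(1, max_n + 1)
        for d in range(1, n + 1)
    )
```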
Below the VC dimension, the class may still realize all $2^n$ labelings on some $n$-point set; above it, Sauer–Shelah pins $\Gamma_{\mathcal{H}}(n) \le \sum_{k=0}^{d} \binom{n}{k}$, which is polynomial in $n$ of degree at most $d$. Plotting $\log_2 \Gamma_{\mathcal{H}}(n)$ against $n$ makes the transition visible: the curve can follow the line of slope $1$ up to $n = d$, and beyond that point it is forced to grow only like $d \log_2 n$.
The standard proof proceeds by induction on $n$. To isolate the combinatorial counting step, define
$$ \Gamma_d(n) := \max \bigl\{ \Gamma_{\mathcal{G}}(n) : \mathrm{VCdim}(\mathcal{G}) \le d \bigr\}. $$Proving Sauer–Shelah is equivalent to proving $\Gamma_d(n) \le \sum_{k=0}^{d} \binom{n}{k}$ for all $d$ and $n$.
For thresholds on $C = (1,2,3)$, the realized labelings are $000$, $100$, $110$, and $111$. Looking only at the prefix on $(1,2)$ gives:
| Labeling on $(1,2,3)$ | Prefix on $(1,2)$ | Type |
|---|---|---|
| $000$ | $(0,0)$ | Type A: only extends with $0$ |
| $100$ | $(1,0)$ | Type A: only extends with $0$ |
| $110$ and $111$ | $(1,1)$ | Type B: extends with both $0$ and $1$ |
There are three distinct prefixes on the first two coordinates, but only one of them creates an extra labeling when the third coordinate is added. The proof counts this phenomenon in general.
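The classification in the table can be computed for any restriction. A sketch that reproduces $a = 2$ Type A prefixes, $b = 1$ Type B prefix, and the count $a + 2b = 4$ for thresholds on $(1, 2, 3)$:

```python
# Split realized prefixes on the first n-1 coordinates by how many
# extensions they have at the last point: Type A extends with exactly
# one label, Type B with both.

def type_counts(patterns):
    extensions = {}
    for p in patterns:
        extensions.setdefault(p[:-1], set()).add(p[-1])
    b = sum(1 for labels in extensions.values() if labels == {0, 1})
    a = len(extensions) - b
    return a, b

# Thresholds on the pool (1, 2, 3)
patterns = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
```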
Fix a class $\mathcal{H}$ with $\mathrm{VCdim}(\mathcal{H}) \le d$ and an ordered pool $C = (x_1, \ldots, x_n)$. Let $C' = (x_1, \ldots, x_{n-1})$. A realized prefix on $C'$ is called Type B if it extends to both possible labels at $x_n$ inside $\mathcal{H}|_C$. Then the family of Type B prefixes has VC dimension at most $d-1$.
Suppose instead that the Type B prefixes shattered some set of $d$ points among $x_1, \ldots, x_{n-1}$. Then for every labeling of those $d$ points, there would be a Type B prefix realizing it. By definition of Type B, that prefix extends to both labels at $x_n$.
Therefore every labeling of those $d$ points together with either possible label at $x_n$ would be realized by some hypothesis in $\mathcal{H}$. That means $\mathcal{H}$ shatters a set of $d+1$ points, contradicting $\mathrm{VCdim}(\mathcal{H}) \le d$.
The key recurrence is
$$ \Gamma_d(n) \le \Gamma_d(n-1) + \Gamma_{d-1}(n-1). $$Fix a class $\mathcal{H}$ with $\mathrm{VCdim}(\mathcal{H}) \le d$ and a pool $C = (x_1, \ldots, x_n)$. Let $C' = (x_1, \ldots, x_{n-1})$. Partition the realized prefixes on $C'$ into two types: Type A prefixes extend to exactly one label at $x_n$ inside $\mathcal{H}|_C$, while Type B prefixes extend to both labels.
Let $a$ and $b$ denote the numbers of Type A and Type B prefixes, respectively. Then the number of labelings on $C$ is
$$ \Gamma_{\mathcal{H}}(C) = a + 2b = (a+b) + b. $$The first term $a+b$ is just the total number of distinct prefixes on $C'$, so $a+b \le \Gamma_d(n-1)$.
By Lemma 1, the Type B prefixes themselves form a class on $C'$ with VC dimension at most $d-1$. Hence $b \le \Gamma_{d-1}(n-1)$.
Therefore
$$ \Gamma_{\mathcal{H}}(C) \le \Gamma_d(n-1) + \Gamma_{d-1}(n-1). $$Taking the maximum over all choices of $\mathcal{H}$ and $C$ yields the stated recurrence.
The base cases are as follows. If $d = 0$, then no one-point set is shattered, so every hypothesis in the class must agree on every point, and $\Gamma_0(n) = 1$. If $n \le d$, then $\Gamma_d(n) \le 2^n = \sum_{k=0}^{n} \binom{n}{k} = \sum_{k=0}^{d} \binom{n}{k}$ (the extra terms $\binom{n}{k}$ for $k > n$ vanish).
Using the recurrence and the induction hypothesis,
$$ \Gamma_d(n) \le \Gamma_d(n-1) + \Gamma_{d-1}(n-1) \le \sum_{k=0}^{d} \binom{n-1}{k} + \sum_{k=0}^{d-1} \binom{n-1}{k}. $$Now apply Pascal's identity $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$:
$$ \sum_{k=0}^{d} \binom{n-1}{k} + \sum_{k=0}^{d-1} \binom{n-1}{k} = \sum_{k=0}^{d} \binom{n}{k}. $$This proves $\Gamma_d(n) \le \sum_{k=0}^{d} \binom{n}{k}$ for all $d$ and $n$. The simpler bound $\sum_{k=0}^{d} \binom{n}{k} \le (en/d)^d$ for $n \ge d \ge 1$ is a standard estimate on binomial sums.
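The induction can also be checked computationally: the table defined by taking the recurrence with equality, with value $1$ whenever $d = 0$ or $n = 0$, coincides exactly with the binomial sum (this is just Pascal's identity run forward):

```python
# The recurrence T(d, n) = T(d, n-1) + T(d-1, n-1) with base value 1
# for d = 0 or n = 0 reproduces sum_{k<=d} C(n, k) exactly.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def table(d, n):
    if d == 0 or n == 0:
        return 1
    return table(d, n - 1) + table(d - 1, n - 1)

def matches_binomial_sum(max_d=8, max_n=20):
    return all(
        table(d, n) == sum(comb(n, k) for k in range(d + 1))
        for d in range(max_d + 1)
        for n in range(max_n + 1)
    )
```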
We can now combine the fixed-pool Halving analysis with the Sauer–Shelah lemma. This yields a mistake bound expressed in terms of the hypothesis class through a single combinatorial parameter.
Let $\mathcal{H}$ have finite VC dimension $d$, and let $C \subseteq \mathcal{X}$ be a pool of size $n$. In the realizable online transductive setting, Halving on $\mathcal{H}|_C$ makes at most
$$ \log_2 \Gamma_{\mathcal{H}}(n) \le \log_2 \sum_{k=0}^{d} \binom{n}{k} $$mistakes. In particular, when $n \ge d \ge 1$,
$$ M \le d \log_2 \frac{en}{d}. $$Theorem 1 gives $M \le \log_2 \bigl|\mathcal{H}|_C\bigr| = \log_2 \Gamma_{\mathcal{H}}(C)$. Since $\Gamma_{\mathcal{H}}(C) \le \Gamma_{\mathcal{H}}(n)$, we have
$$ M \le \log_2 \Gamma_{\mathcal{H}}(n). $$Applying Sauer–Shelah yields $\Gamma_{\mathcal{H}}(n) \le \sum_{k=0}^{d} \binom{n}{k}$, and the final inequality uses the standard bound $\sum_{k=0}^{d} \binom{n}{k} \le (en/d)^d$.
For the unrestricted class, the learner faces $2^n$ possible labelings on an $n$-point pool, so the transductive mistake bound is linear in $n$. For a class of VC dimension $d$, the same quantity is at most polynomial in $n$ of degree $d$, so the mistake bound becomes $O(d \log n)$. This is the combinatorial core of learning theory.
Today we stayed in the online transductive setting: the pool is fixed in advance, the reveal order is adversarial, and realizability ensures that one hypothesis in $\mathcal{H}$ labels the whole pool correctly. Within that setting, the lecture introduced the combinatorial quantities used in the finite-pool analysis:
$$ \text{restriction} \to \text{growth function} \to \text{shattering} \to \text{VC dimension} \to \text{Sauer–Shelah.} $$This chain will reappear in the statistical learning model as well. Later in the course we will move from adversarial reveal order to i.i.d. samples from an unknown distribution and ask whether the same combinatorial quantities still control learning in that setting.
Week 2 packages a chain of definitions and theorems that can be formalized in Lean. Once restriction, growth, shattering, and VC dimension are defined carefully, the Halving analysis and Sauer–Shelah can be stated in the same development.