MLP Topic

UAT: Why MLPs Can Approximate Continuous Functions

The phrase “with enough expansion terms, we can fit almost anything” points to a shared structure: approximate a complex function by adding many simple functions. Fourier series use fixed sine and cosine bases; MLPs use learnable nonlinear basis functions. UAT is the expressiveness theorem for the latter.

One approximation template

f(x) ~= sum_{j=1}^N alpha_j phi_j(x)

Fourier: phi_j in {sin(kx), cos(kx)}
MLP:     phi_j(x) = sigma(w_j^T x + b_j)

01 / Intuition

A more precise version of the intuition

“Fit anything” is not an unconditional statement. A precise version names a function space, an error norm, a domain, and regularity conditions. A rich enough family is dense in that space, meaning the distance between a target f and a model g can be made smaller than any epsilon.

For every f in a function space F and every epsilon > 0,
there exists g in model class G such that

    distance(f, g) < epsilon.

Fourier series and UAT are both instances of this template. The difference is that Fourier bases are usually fixed, while MLP hidden units learn their location, direction, scale, and output weight.

02 / Theorem

A standard statement of UAT

Let K be a compact subset of R^d, and let C(K) be the space of continuous functions on K with the sup norm. For suitable activations, such as continuous non-polynomial activations, or bounded continuous nonconstant sigmoidal activations in Cybenko’s original theorem, single-hidden-layer networks are dense in C(K).

Given f in C(K) and epsilon > 0,
there exist N, alpha_j, w_j, b_j such that

sup_{x in K} | f(x) - sum_{j=1}^N alpha_j sigma(w_j^T x + b_j) | < epsilon.

The key word is existence. UAT says a wide enough MLP is expressive enough. It does not say training will find the parameters or that finite-sample generalization will be good.

03 / Proof

Cybenko-style proof: density by measure separation

The proof outline is functional-analytic. If MLPs were not dense, a nonzero measure would annihilate every hidden unit. But sigmoidal units approximate half-space indicators, forcing the measure of every half-space to be zero. That implies the measure itself is zero, a contradiction.

1

Assume the network span is not dense

Let H be the set of all finite sums sum alpha_j sigma(w_j^T x + b_j). If H is not dense in C(K), some continuous function cannot be approximated arbitrarily well by H.

2

Hahn-Banach plus Riesz representation

By Hahn-Banach separation, there is a nonzero continuous linear functional L with L(h)=0 for every h in H. The Riesz representation theorem writes L as integration against a finite signed Borel measure mu.

L(h) = int_K h(x) d mu(x)

Therefore for every w and b:
int_K sigma(w^T x + b) d mu(x) = 0.
3

Scale the sigmoid to get half-spaces

A sigmoidal activation approaches 1 as t goes to +infinity and 0 as t goes to -infinity. Replace sigma(w^T x+b) by sigma(lambda(w^T x+b)) and let lambda go to infinity. Away from the boundary hyperplane, it converges pointwise to a half-space indicator.

sigma(lambda(w^T x + b)) -> 1{w^T x + b > 0}

So, for almost every boundary shift b:
mu({x in K: w^T x + b > 0}) = 0.
4

Zero on all half-spaces implies zero measure

For any direction w, push mu forward to the line: nu_w(B)=mu({x: w^T x in B}). Since all half-lines have zero mass, nu_w=0. Therefore the Fourier transform of mu vanishes at every frequency. By uniqueness of finite measures, mu=0.

hat_mu(xi) = int_K exp(i xi^T x) d mu(x)
          = int_R exp(i t) d nu_xi(t)
          = 0 for every xi.

Thus mu = 0, contradicting that L was nonzero.

04 / Constructive Intuition

One-dimensional ReLU: approximate by piecewise-linear splines

For ReLU, the most concrete derivation starts in one dimension. The unit (x-t)_+ is a hinge that changes slope at t. Adding many hinges places many breakpoints, representing any continuous piecewise-linear function.

p(x) = beta_0 + beta_1 x + sum_{i=1}^{m-1} c_i (x - t_i)_+

If slopes on intervals are s_1, s_2, ..., s_m,
then c_i = s_{i+1} - s_i.

A continuous function on a closed interval is uniformly continuous. With a fine enough grid, the polygonal interpolation p satisfies sup_x |f(x)-p(x)| < epsilon. Since p is a finite linear combination of ReLU hinges, the constructive intuition is direct: more hidden units mean more breakpoints.

05 / Fourier Comparison

Fourier series: projection onto a fixed orthogonal basis

Fourier series use a fixed orthogonal basis, sin(kx) and cos(kx), on a periodic interval. For square-integrable periodic functions, the partial sum is a projection onto a finite-frequency subspace.

f(x) ~ a_0/2 + sum_{k=1}^N [a_k cos(kx) + b_k sin(kx)]

a_k = (1/pi) int_{-pi}^{pi} f(x) cos(kx) dx
b_k = (1/pi) int_{-pi}^{pi} f(x) sin(kx) dx

Orthogonality gives a clean error decomposition. With complex coefficients c_k=(1/2pi) int f(x)e^{-ikx}dx, Parseval decomposes energy by frequency; truncation error is the energy left in the discarded high frequencies.

||f - S_N f||_2^2 = 2pi * sum_{|k| > N} |c_k|^2
Dimension
Fourier series
MLP / UAT
Basis functions
Fixed sin(kx), cos(kx), or exp(ikx)
Learnable sigma(w_j^T x + b_j) units
Parameters
Usually only coefficients a_k and b_k
w_j, b_j, and alpha_j are all learned
Geometry
A sum of global frequencies
Movable, rotatable, scalable nonlinear ridge functions
Error norm
Often L2 convergence, with uniform convergence under stronger smoothness
Uniform sup-norm approximation on compact domains
Computation
Orthogonal projection with coefficients from integrals
Nonconvex optimization; existence does not mean SGD finds it
Risks
Periodic extension, low smoothness, and Gibbs effects
Overfitting, data limits, opacity, and optimization difficulty

06 / Key Limits

Neither UAT nor Fourier means “free fitting of everything”

Existence

UAT guarantees that parameters exist, not that training will find them.

Function space

Fourier results depend on periodicity, L2 assumptions, continuity, or smoothness.

Generalization

Small training error does not imply good out-of-sample behavior; validation and regularization still matter.

The precise one-sentence summary is: Fourier expansions and UAT both show that complex functions can be approximated by sums of simple functions; Fourier uses fixed orthogonal frequencies, while MLPs use learnable nonlinear units. The approximation idea is shared, but the assumptions, norms, computation, and failure modes differ.

References