Machine Learning
ML Basics: ERM, Validation, and Regularization
Reduce the sklearn workflow to empirical risk minimization and out-of-sample control.
Mechanism Lab
Animation: bias-variance as complexity changes
The animation shows training error falling, validation error becoming U-shaped, and regularization shifting the optimum.
Step 1 / 5
Data
Split into training and validation.
D = D_train union D_valid
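The split step can be sketched in a few lines of plain Python. This is a minimal illustration, not the page's own code: the helper name `split_train_valid` and its `valid_fraction` and `seed` parameters are hypothetical. In an sklearn workflow, `sklearn.model_selection.train_test_split` plays the same role.

```python
import random


def split_train_valid(rows, valid_fraction=0.2, seed=0):
    """Shuffle rows and split them into disjoint training and validation lists.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    # The first n_valid shuffled rows become validation, the rest training.
    return shuffled[n_valid:], shuffled[:n_valid]


data = list(range(100))            # stand-in for 100 labelled examples
d_train, d_valid = split_train_valid(data)
print(len(d_train), len(d_valid))
```

Every example lands in exactly one of the two sets, which is what D = D_train union D_valid expresses: the validation rows are never seen during fitting.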
01 / Intuition
Core Intuition
Training selects a prediction function from a function class.
Training error is not the target; out-of-sample risk is.
Validation and cross-validation reveal when a model has memorized noise; regularization discourages it from doing so.
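The gap between training error and out-of-sample error can be seen in a small experiment. The sketch below assumes synthetic data (a noisy sine curve) and uses polynomial degree as the complexity knob; both choices are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Draw n points from a noisy sine curve (synthetic, for illustration)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = make_data(30)    # small training set, easy to overfit
x_valid, y_valid = make_data(200)   # held-out data from the same distribution


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


degrees = list(range(1, 10))
train_mse, valid_mse = [], []
for d in degrees:
    # Least-squares fit of a degree-d polynomial on the training set only.
    coeffs = np.polyfit(x_train, y_train, deg=d)
    train_mse.append(mse(coeffs, x_train, y_train))
    valid_mse.append(mse(coeffs, x_valid, y_valid))

# Model selection uses validation error, never training error.
best_degree = degrees[int(np.argmin(valid_mse))]
for d, tr, va in zip(degrees, train_mse, valid_mse):
    print(f"degree={d}  train_mse={tr:.4f}  valid_mse={va:.4f}")
print("degree with lowest validation error:", best_degree)
```

Training error cannot increase with the degree, because each polynomial class contains the previous one. Validation error carries no such guarantee, which is why it, and not training error, is used to pick the degree.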
02 / Math
Empirical risk minimization
01 / Risk
The ideal target is expected loss under the unknown data distribution.
R(f) = E[L(Y, f(X))]
02 / Empirical risk
A sample average approximates that expectation.
R_hat(f) = (1/n) sum_i L(y_i, f(x_i))
03 / Regularization
Complexity penalties trade a little bias for lower variance.
min_f R_hat(f) + lambda * Omega(f)
03 / Code
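Before turning to sklearn, the objective min_f R_hat(f) + lambda * Omega(f) can be worked by hand on a toy problem. The sketch below assumes a one-parameter linear model f(x) = w * x, squared loss, and Omega(f) = w^2 (ridge regression); the function names and the four data points are made up for illustration.

```python
def empirical_risk(w, xs, ys):
    """R_hat(f) for f(x) = w * x under squared loss: (1/n) * sum (y_i - w*x_i)^2."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def ridge_fit(xs, ys, lam):
    """Minimize R_hat(f) + lam * w^2 in closed form.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + n*lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


# Toy data, invented for this example.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

weights = {}
for lam in (0.0, 0.1, 1.0):
    w = ridge_fit(xs, ys, lam)
    weights[lam] = w
    print(f"lambda={lam}  w={w:.4f}  empirical_risk={empirical_risk(w, xs, ys):.4f}")
```

As lambda grows, the fitted weight shrinks toward zero and the empirical risk on the training points rises. That is the trade described above: a little more bias in exchange for a less variable fit.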
sklearn baseline
Build an interpretable baseline before using a complex model.
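from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (feature matrix) and y (binary labels) are assumed to be loaded already.
# Scaling sits inside the pipeline so it is refit on each training fold,
# which keeps validation-fold statistics out of training.
model = make_pipeline(
    StandardScaler(),
    # C is the inverse regularization strength: larger C means a weaker L2 penalty.
    LogisticRegression(C=1.0, penalty="l2", max_iter=1000),
)

# 5-fold cross-validated ROC AUC: report the mean and the spread across folds.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())
04 / Case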
Case: paper acceptance risk prediction
- Features include author experience, topic, abstract length, and prior citations.
- Start with logistic regression as an interpretable baseline.
- Then test whether forests or neural models really improve validation performance.
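The comparison in the last step might look like the sketch below. It does not use real paper data: `make_classification` generates a synthetic stand-in, so the feature columns do not correspond to author experience, topic, or citations, and the resulting scores say nothing about the actual task. The model settings are likewise illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the paper features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

candidates = {
    # Interpretable baseline: scaled features plus L2-regularized logistic regression.
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # More flexible alternative to test against the baseline.
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Same folds (cv=5) and same metric for every candidate, so scores are comparable.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean AUC={scores.mean():.3f}  std={scores.std():.3f}")
```

A complex model earns its place only if its mean validation score beats the baseline by more than the fold-to-fold spread; otherwise the interpretable model is the safer choice.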
05 / Risks