## Predictive Learning from Data


LECTURE SET 4: Statistical Learning Theory (VC-theory)
Electrical and Computer Engineering

**OUTLINE**

- Objectives
- Inductive learning problem setting
- Keep-It-Direct Principle
- Statistical Learning Theory
- Applications
- Measuring the VC-dimension
- Summary and discussion

**Motivation + Objectives**

- Classical statistical theory works for:
  - parametric estimation (large samples)
  - nonparametric estimation (huge samples)
- But there is no theory for finite-sample problems
- Distinction between large samples and finite samples
- Goal: a mathematical theory for Predictive Learning
- Three aspects of a scientific theory: conceptual, technical, practical

**Characteristics of Scientific Theory**

- Problem setting
- Solution approach
- Mathematical proofs (technical analysis)
- Constructive methods
- Applications

Note: the problem setting and the solution approach are independent of each other.

**Inductive Learning Setting**

- The learning machine observes samples (x, y) and returns an estimated response
- Two modes of inference: identification vs. imitation
- Risk: R(w) = ∫ L(f(x, w), y) dP(x, y)

**The Problem of Inductive Learning**

- Given finite training samples Z = {(x_i, y_i), i = 1, 2, …, n}, choose from a given set of functions f(x, w) the one that best approximates the true output
  (in the sense of risk minimization)

**Concepts and Terminology**

- approximating functions f(x, w)
- (non-negative) loss function L(f(x, w), y)
- expected risk functional R(w) = ∫ L(f(x, w), y) dP(x, y)
- Goal: find the function f(x, w_0) minimizing R(w) when the joint distribution P(x, y) is unknown

**Empirical Risk Minimization**

- ERM principle in model-based learning
- Model parameterization: f(x, w)
- Loss function: L(f(x, w), y)
- Estimate the risk from data: R_emp(w) = (1/n) Σ_{i=1}^n L(f(x_i, w), y_i)
- Choose w* that minimizes R_emp
- Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite-sample settings

**Keep-It-Direct Principle**

- The goal of learning is generalization, rather than estimation of the true function (system identification)
- Keep-It-Direct Principle (Vapnik, 1995): do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step
- A good predictive model reflects certain properties of the unknown distribution P(x, y)
- Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application
- Hence the importance of formalizing application requirements as a learning problem

**Statistical Learning Theory**

- History and overview
- Conditions for consistency and convergence of ERM
- VC-dimension
- Generalization bounds
- Structural Risk Minimization (SRM) for learning with finite samples

**History and Overview**

- SLT, also known as VC-theory (Vapnik–Chervonenkis)
- A theory for estimating dependencies from finite samples (predictive learning setting)
- Based on the risk minimization approach
- All main results originally developed in the 1970s for
  classification (pattern recognition) (why?), but they remained largely unknown
- Renewed interest since the late 1990s, due to the practical success of Support Vector Machines (SVM)

**History and Overview (cont'd)**

Main conceptual contributions:

- Distinction between problem setting, inductive principle, and learning algorithms
- Direct approach to estimation with finite data (KID principle)
- Mathematical analysis of ERM (inductive setting)
- Two factors responsible for generalization:
  - empirical risk (fitting error)
  - complexity (capacity) of the approximating functions

VC-theory has four parts:

1. Conditions for consistency/convergence of ERM
2. Generalization bounds
3. Inductive principles (for finite samples)
4. Constructive methods (learning algorithms) for implementing (3)

**Consistency/Convergence of ERM**

- The empirical risk is known, but the expected risk is unknown
- Asymptotic consistency requirement: under what (general) conditions will models providing minimum empirical risk also provide minimum prediction risk as the number of samples grows large?
- Why is asymptotic analysis important?
  - it helps to develop useful concepts
  - necessary and sufficient conditions ensure that VC-theory is general and cannot be improved

**Consistency of ERM**

- Convergence of the empirical risk to the expected risk does not imply consistency of ERM
- Models estimated via ERM (w*) are always biased estimates of the functions minimizing the expected risk: E[R_emp(w*)] ≤ R(w_0) ≤ E[R(w*)], where w_0 minimizes the expected risk R(w)

**Key Theorem of VC-theory**

- For bounded loss functions, the ERM principle is consistent if and only if the empirical risk converges uniformly to the true risk, in the following sense: lim_{n→∞} P{ sup_w (R(w) − R_emp(w)) > ε } = 0 for all ε > 0
- Consistency is therefore determined by the worst-case approximating function, providing the largest discrepancy between the empirical risk and the true risk

Note: this condition is not useful in practice.
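The bias of ERM noted above, that the empirical risk of the ERM solution w* systematically underestimates its expected risk, can be checked numerically. A minimal sketch, assuming squared loss, polynomial approximating functions, and a sine target; none of these choices are prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_poly(x, y, degree):
    """ERM with squared loss over polynomials of the given degree: w* = argmin R_emp."""
    return np.polyfit(x, y, degree)

def risk(w, x, y):
    """Mean squared loss of the model w on the samples (x, y)."""
    return np.mean((np.polyval(w, x) - y) ** 2)

def target(x):
    # illustrative target function (not from the lecture)
    return np.sin(2 * np.pi * x)

n, degree, sigma = 30, 5, 0.2
gap = []
for _ in range(200):
    # draw a training set, run ERM, record R(w*) - R_emp(w*)
    x = rng.uniform(0.0, 1.0, n)
    y = target(x) + rng.normal(0.0, sigma, n)
    w_star = erm_poly(x, y, degree)
    x_test = rng.uniform(0.0, 1.0, 5000)   # large test set approximates expected risk
    y_test = target(x_test) + rng.normal(0.0, sigma, 5000)
    gap.append(risk(w_star, x_test, y_test) - risk(w_star, x, y))

print(np.mean(gap) > 0)  # True: R_emp(w*) underestimates R(w*) on average
```

The average gap is positive: minimizing the empirical risk makes R_emp(w*) an optimistic estimate, which is why uniform (worst-case) convergence, not pointwise convergence, is what consistency of ERM requires.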
We need conditions for consistency stated in terms of the general properties of a set of loss functions (approximating functions).

**Conditions for Consistency of ERM**

- Goal: derive conditions for consistency and fast convergence in terms of the properties of the loss functions
- Indicator 0/1 loss functions (binary classification)
- Each indicator function partitions a given sample into two subsets (two classes); each such partitioning is called a dichotomy
- The diversity of a set of functions is the number of different dichotomies that can be implemented on a random sample; the diversity is distribution-dependent

**The Growth Function**

- The Growth Function G(n) is the logarithm of the maximum number of dichotomies that can be induced on a sample of size n (only one such sample is assumed to exist); G(n) ≤ n ln 2, since the maximum possible number of dichotomies of n samples is 2^n
- The Growth Function is distribution-independent; it depends only on the properties of the set of functions
- Necessary and sufficient condition for consistency of ERM (and for a fast rate of convergence): lim_{n→∞} G(n)/n = 0

**Properties of the Growth Function (GF)**

- Theorem (Vapnik & Chervonenkis, 1968): the Growth Function is either linear, or bounded by a logarithmic function of the number of samples n: G(n) ≤ h(ln(n/h) + 1) for n > h
- The point n = h where the Growth Function starts to slow down is called the VC-dimension (of the set of indicator functions)

**VC-dimension**

- If the Growth Function stays linear for all n, G(n) = n ln 2, then the VC-dimension is infinite: lim_{n→∞} G(n)/n = ln 2 ≠ 0, and ERM is inconsistent. That is, any sample of size n can be split in all 2^n possible ways, so no valid generalization is possible
- Necessary and sufficient condition for consistency of ERM: the Vapnik–Chervonenkis (VC) dimension is finite (this condition is distribution-independent)

**VC-dimension (cont'd)**

The VC-dimension measures the ability of a set of functions to fit or 'explain' the available finite data.
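The dichotomies realizable by a set of indicator functions can be counted by brute force. A sketch, where the perceptron-based separability test and the particular point sets are illustrative choices, counting how many of the 2^n labelings of n points in the plane are realized by 2-D linear indicator functions:

```python
import itertools
import numpy as np

def separable(points, labels, max_epochs=500):
    """Perceptron test: converges within the cap iff the labeling is linearly separable."""
    X = np.hstack([points, np.ones((len(points), 1))])  # append bias coordinate
    y = np.where(np.array(labels) == 1, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # perceptron update
                mistakes += 1
        if mistakes == 0:
            return True              # a separating line was found
    return False                     # treated as non-separable after the cap

def n_dichotomies(points):
    """Count the labelings of `points` realizable by linear indicator functions."""
    return sum(separable(points, lab)
               for lab in itertools.product([0, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(n_dichotomies(three))  # 8 = 2^3: these three points are shattered
print(n_dichotomies(four))   # 14 < 2^4 = 16: the two 'XOR' labelings are not realizable
```

Three points in general position realize all 2^3 dichotomies, while this four-point set realizes only 14 of 16 (and no configuration of four points achieves all 16), consistent with h = d + 1 = 3 for linear indicator functions in the plane, as in the examples below.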
- VC-dimension of a set of indicator functions:
  - Shattering: if n samples can be separated by a set of indicator functions in all 2^n possible ways, then these samples are said to be shattered by this set of functions
  - A set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there do not exist h + 1 samples that can be shattered
- Examples of analytic calculation of the VC-dimension for simple sets of functions are shown next

**Linear Indicator Functions (d = 2)**

There exist 3 points that can be shattered, but 4 cannot. VC-dimension = 3; in general, h = d + 1.

**Spherical (Local) Functions (d = 2)**

There exist 3 points that can be shattered, but 4 cannot. VC-dimension = 3; in general, h = d + 1.

Note: in these examples, VC-dimension = DoF (number of parameters).

**Example of Infinite VC-dimension**

The set of indicator functions y = I(sin(ωx) > 0) has infinite VC-dimension.

**Linear Combination of Fixed Basis Functions**

- f(x, w) = Σ_{i=1}^m w_i g_i(x) is equivalent to a linear function in m-dimensional space
- VC-dimension = m + 1 (this assumes linear independence of the basis functions)
- In general, analytic estimation of the VC-dimension is hard
- The VC-dimension can be equal to, larger than, or smaller than the DoF

**Delta-Margin Hyperplanes**

- Consider linear slabs separating samples from two classes, such that the distance between the hyperplane and the closest data point is larger than some positive value Δ
- For large Δ, the VC-dimension may be smaller than d + 1

**VC-dimension and Falsifiability**

- A set of functions has VC-dimension h if:
  (a) it can explain (shatter) a set of h samples ~ there exist h samples that cannot falsify it, and
  (b) it cannot shatter h + 1 samples ~ any h + 1 samples falsify this set
- Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM)

**Philosophical Interpretation: VC-falsifiability**

- Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities)
- VC-theory: select the model
  that explains the available data and has low VC-dimension (i.e., can be easily falsified)
- This is the new principle of VC-falsifiability

**Generalization Bounds**

- Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) risk and the known empirical risk, as a function of the sample size n and the properties of the loss functions (approximating functions)
- Classification: the following bound holds with probability 1 − η for all approximating functions: R(w) ≤ R_emp(w) + (ε/2)(1 + sqrt(1 + 4 R_emp(w)/ε)), where ε = a1 [h(ln(a2 n/h) + 1) − ln(η/4)]/n is called the confidence interval
- Regression: the following bound holds with probability 1 − η for all approximating functions: R(w) ≤ R_emp(w) / (1 − c √ε)_+ , with ε as above and (u)_+ = max(u, 0)

**Practical VC Bound for Regression**

- A practical regression bound can be obtained by setting the confidence level (as a function of n) and the theoretical constants c = 1, a1 = a2 = 1: R(h) ≤ R_emp(h) / [1 − sqrt(p − p ln p + (ln n)/(2n))]_+ , where p = h/n
- Compare to the analytic bounds (SC, FPE) in Lecture Set 2
- Analysis of the denominator shows that h < 0.8 n for any estimator; in practice, h < 0.5 n for any estimator
- NOTE: these generalization bounds do not depend on dimension!

**Structural Risk Minimization**

- Analysis of the generalization bounds suggests that when n/h is large, the confidence interval is small; this leads to the parametric modeling approach (ERM)
- When n/h is not large (say, less than 20), both terms on the right-hand side of the VC bound need to be minimized, so the VC-dimension becomes a controlling variable
- SRM is a formal mechanism for controlling model complexity: the set of loss functions is given a nested structure S_1 ⊂ S_2 ⊂ … ⊂ S_k ⊂ …, with h_1 ≤ h_2 ≤ …, so that the structure provides a complexity ordering

**Structural Risk Minimization (cont'd)**

[Figure: an upper bound on the true risk and the empirical risk, as a function of the VC-dimension h, for fixed sample size n.]

[Figure: contrast of the SRM vs. ERM approach to data modeling.]

**SRM Approach**

- Use the VC-dimension as a controlling parameter for minimizing the VC bound
- Two general strategies for implementing SRM:
  1. Keep the confidence interval fixed and minimize the empirical risk (most statistical and neural network methods)
  2.
     Keep the empirical risk fixed (small) and minimize the confidence interval (Support Vector Machines)

**Common SRM Structures**

- Dictionary structure: the set of algebraic polynomials f_m(x, w) = Σ_{i=0}^m w_i x^i is a structure, since S_1 ⊂ S_2 ⊂ … More generally, f_m(x, w) = Σ_{i=1}^m w_i g_i(x), where {g_i(x)} is a set of basis functions (a dictionary). The number of terms (basis functions) m specifies an element of the structure. For fixed basis functions, VC-dimension = number of parameters

- Feature selection (aka subset selection): consider sparse polynomials f(x, w, k) = Σ_{i=1}^m w_i x^{k_i}, where each k_i is a (positive) integer. Each monomial is a feature, and the goal is to select the set of m features providing minimum empirical risk (MSE). This is a structure since S_1 ⊂ S_2 ⊂ … More generally, in the representation f(x, w) = Σ_{i=1}^m w_i g_{k_i}(x), the m basis functions are selected from a large (given) set of M functions. Note: this requires nonlinear optimization, and the VC-dimension is unknown

- Penalization: consider algebraic polynomials of fixed degree with ‖w‖² ≤ c_k, where c_1 < c_2 < … For each (positive) value c_k, this set of functions specifies an element of a structure. Minimization of the empirical risk (MSE) on each element of the structure is a constrained minimization problem. This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional R_pen(w) = R_emp(w) + λ_k ‖w‖², where the choice of λ_k corresponds to the constraint value c_k. Note: the VC-dimension is unknown

- Initialization structure: the structure is defined for a nonlinear optimization algorithm A fitting the training data using a set of functions f(x, w) with initial conditions w_0: S_k = [A: f(x, w), ‖w_0‖ < C_k], where C_1 < C_2 < …

- Early stopping rules: the structure is defined for a nonlinear optimization algorithm A fitting the training data; the structure is indexed by the number of iterative steps of algorithm A, starting from initial (small) parameter values w_0

- Margin-based structure: recall the large-margin separating hyperplanes for classification.
  Larger-margin hyperplanes form a subset of all smaller-margin hyperplanes: S_1 ⊂ S_2 ⊂ … for margins Δ_1 ≥ Δ_2 ≥ …, with the VC-dimension controlled by the margin size

**Example of SRM for Regression**

- Polynomial regression using different structures
- Regression problem: x-values uniform on [0, 1], additive Gaussian noise; training set ~ 40 samples, validation set ~ 40 samples
- Different structures:
  - dictionary (polynomial degrees 1 to 10)
  - penalization (degree 10 with a ridge penalty)
  - sparse polynomials (of degree 1 to 6)

[Figure: fitted models for the three structures; penalization uses lambda = 1.013e-005. Target function ~ red line; dictionary ~ black dashed; penalization ~ black dotted; feature selection ~ black solid.]

**Aside: Model Interpretation**

- Interpretation is outside the scope of VC-theory
- In the last example, different structures on the same set of functions lead to different interpretations
- Often, interpretation assumes the function identification setting
- In practice, interpretation always requires application domain knowledge

**Practical Implementation of SRM**

- Need to:
  - estimate the VC-dimension for an element of a structure S_k
  - minimize the empirical risk for each element S_k
- Both are possible for linear approximating functions
- Both are difficult for nonlinear parameterizations, hence the many heuristic learning algorithms

**SRM Structures: Summary**

- An SRM structure is a complexity ordering on a set of admissible models (approximating functions)
- There are many different structures on the same set of approximating functions. Which one is the 'best'?
  - depends on the properties of the application data
  - can be decided via empirical comparisons
- SRM is a mechanism for complexity control:
  - selecting the optimal complexity for a given data set
  - a new measure of complexity: the VC-dimension
  - model selection using analytic VC bounds

**Application: VC-based Model Selection**

See http://www.mitpressjournals.org/toc/neco/15/7

- Standard regression setting
- Statistical criteria:
  - Akaike Information Criterion (AIC): R_AIC(d) = R_emp(d) + (2d/n) σ²
  - Bayesian Information Criterion (BIC): R_BIC(d) = R_emp(d) + (ln n)(d/n) σ²
  where d ~ (effective) DoF and n ~ sample size
- Both require noise estimation

**Noise Estimation**

- Many methods require noise estimation
- One approach is to estimate the noise for each (fixed) model complexity, leading to multiplicative criteria
- Another approach is to estimate the noise first and then use it in the additive AIC or BIC criteria
- This study uses additive AIC/BIC, assuming a known noise variance
- These are compared to VC-based model selection (aka VM)

**Comparison for Univariate Regression**

- Target functions: continuous and discontinuous
- Approximating functions: algebraic polynomials and Fourier basis

[Figure: comparison of AIC, BIC & SRM (VM) for a sine-squared target function, polynomial regression, sample size n = 30, with noise: box plots of prediction risk (MSE) and selected DoF (~ h).]
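The comparison above can be prototyped directly. A sketch, assuming a specific noise level and target parameterization (the lecture's exact settings are not given), a known noise variance for the additive AIC/BIC criteria, h = d + 1 for polynomials of degree d, and the standard penalization form of the practical VC bound, r(p, n) = (1 − sqrt(p − p ln p + ln n/(2n)))⁻¹ with p = h/n:

```python
import numpy as np

rng = np.random.default_rng(1)

# Setting loosely follows the lecture's example: sine-squared target,
# polynomial regression, n = 30; the noise level sigma is an assumption.
n, sigma = 30, 0.2
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) ** 2 + rng.normal(0.0, sigma, n)

def emp_risk(d):
    """Empirical risk (MSE) of the least-squares polynomial of degree d."""
    w = np.polyfit(x, y, d)
    return np.mean((np.polyval(w, x) - y) ** 2)

def aic(r, dof):
    """Additive AIC with known noise variance."""
    return r + 2.0 * dof / n * sigma ** 2

def bic(r, dof):
    """Additive BIC with known noise variance."""
    return r + np.log(n) * dof / n * sigma ** 2

def vm(r, h):
    """Practical VC bound: penalize R_emp by 1 / (1 - sqrt(p - p ln p + ln n / 2n))."""
    p = h / n
    denom = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
    return np.inf if denom <= 0 else r / denom

degrees = range(1, 11)
risks = {d: emp_risk(d) for d in degrees}
best_aic = min(degrees, key=lambda d: aic(risks[d], d + 1))  # DoF = d + 1
best_bic = min(degrees, key=lambda d: bic(risks[d], d + 1))
best_vm = min(degrees, key=lambda d: vm(risks[d], d + 1))    # h = d + 1 for polynomials
print(best_aic, best_bic, best_vm)
```

All three criteria penalize the same empirical risks but by different amounts (at n = 30, BIC penalizes more heavily than AIC since ln 30 > 2, and the VM factor is multiplicative rather than additive), so they can select different degrees on the same data.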