# SGD with Large Step Sizes Learns Sparse Features

## WHY?

Setup:

Training data $$\{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathbb{R}^{d} \times Y$$

Large-scale ML: large dimension $$d$$, large number of training samples $$n$$.

Classic examples for fitting the data:

Least squares: $$\frac{1}{n} \| Ax - b \|_2^2 = \frac{1}{n} \sum_{i=1}^n (a_i^T x - b_i)^2 = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

Support Vector Machine (SVM): $$\frac{1}{2}\|x\|_2^2 + \frac{C}{n} \sum_{i=1}^n \max(0,\, 1 - y_i(x^T a_i + b))$$

Deep neural nets: $$\frac{1}{n} \sum_{i=1}^n \mathrm{loss}(y_i, \mathrm{DNN}(x; a_i)) = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

All of these optimization problems share the common finite-sum structure $$\frac{1}{n} \sum_{i=1}^n f_i(x)$$.

So instead of computing the gradient of all the $$f_i$$ at each step, we
pick only one, drawn from a uniform distribution.

## SGD vs GD

We want to minimize functions of the form $$\frac{1}{n} \sum_{i=1}^{n} f_i(x)$$

GD computes the gradient of every term, then updates the next iterate.

SGD picks a uniformly random index $$i(k) \in \{1, 2, \dots, n\}$$, computes
the gradient of $$f_{i(k)}$$ only, and then updates the next iterate:

$$x^{k+1} = x^{k} - t_k \nabla f_{i(k)}(x^k)$$

Key property: in expectation, $$\mathbb{E}[\nabla f_{i(k)}(x)] = \nabla f(x)$$,
i.e. the stochastic gradient is an unbiased estimator of the full gradient!

There are two options for choosing the random index $$i$$
(a minimal sketch of both is given at the end of these notes):
1. with replacement: a fresh i.i.d. draw at every step
2. without replacement: reshuffle the data once per epoch and sweep through it

All toolkits use Option 2!
All papers use Option 1, because it is easier to analyze!

## Large Step Sizes Induce Sparse Features?

"Sparse" in NN terminology means that, for a feature vector $$\psi(x)$$,
only a few features are active and the others are 0
(see the sparsity sketch at the end of these notes).

Using large step sizes keeps the iterates jumping around = `loss stabilization`:
the phase of training with large step sizes where the loss seems to be constant
on average. This phase induces a hidden dynamics toward simpler predictors.

The longer a large step size is used, the longer the implicit regularization
can operate and find sparse representations.

Justification: theoretically on simple NN models (diagonal linear networks,
ReLU networks) and qualitatively with stochastic processes.

Picture!

ResNet-18 trained on CIFAR-10 (60k 32x32 images) for 100 epochs;
left: training loss, right: test error.

Two key phases regarding the large step size:
1. shortly after the start of training, the loss remains roughly constant (loss stabilization)
2. despite no visible progress in the loss, running this phase for longer leads to better
generalization (hidden dynamics)


## To be continued...


## REFS
https://fa.bianp.net/teaching/2018/eecs227at/stochastic_gradient.html
https://www.youtube.com/watch?v=k3AiUhwHQ28
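
## Code sketches

To make the GD vs SGD comparison concrete, here is a minimal NumPy sketch on the
least-squares finite sum, implementing the update $$x^{k+1} = x^k - t_k \nabla f_{i(k)}(x^k)$$
with both sampling options. The toy data, step sizes, iteration counts, and function
names (`gd`, `sgd`) are made up for illustration; they are not taken from the paper
or the references.

```python
# Minimal sketch (not from the paper): GD vs. SGD on the least-squares finite sum
#   f(x) = 1/n * sum_i (a_i^T x - b_i)^2
# Toy data, step sizes, and iteration counts are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))                    # rows are the data points a_i
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def loss(x):
    # f(x) = 1/n * ||Ax - b||_2^2
    return np.mean((A @ x - b) ** 2)

def full_grad(x):
    # gradient of f(x), uses all n terms
    return 2.0 / n * A.T @ (A @ x - b)

def sample_grad(x, i):
    # gradient of the single term f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def gd(steps=500, t=0.1):
    x = np.zeros(d)
    for _ in range(steps):
        x = x - t * full_grad(x)               # full gradient at every step
    return x

def sgd(epochs=50, t=0.01, with_replacement=True):
    x = np.zeros(d)
    for _ in range(epochs):
        if with_replacement:                   # Option 1: fresh i.i.d. draw per step
            idx = rng.integers(0, n, size=n)
        else:                                  # Option 2: reshuffle once per epoch
            idx = rng.permutation(n)
        for i in idx:
            x = x - t * sample_grad(x, i)      # x^{k+1} = x^k - t_k * grad f_{i(k)}(x^k)
    return x

print("GD loss:            ", loss(gd()))
print("SGD (Option 1) loss:", loss(sgd(with_replacement=True)))
print("SGD (Option 2) loss:", loss(sgd(with_replacement=False)))
```

Both SGD variants only touch one term per update, which is why each step costs
$$1/n$$ of a GD step; the unbiasedness property above holds for Option 1, which
is why the analysis papers prefer it.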
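To make the notion of "sparse features" concrete, a tiny sketch that measures the
fraction of inactive (exactly zero) coordinates of a ReLU feature vector $$\psi(x)$$.
The random one-hidden-layer network, its width, and the seed are arbitrary choices
for illustration, not the setup of the paper.

```python
# Minimal sketch (not the paper's setup): what "sparse features" means.
# psi(x) is the ReLU feature vector of a random one-hidden-layer network;
# sparsity = fraction of coordinates of psi(x) that are exactly 0 (inactive units).
import numpy as np

rng = np.random.default_rng(1)
d, width = 32, 256
W = rng.normal(size=(width, d)) / np.sqrt(d)   # arbitrary random hidden weights

def psi(x):
    # ReLU feature map: negative pre-activations are clipped to exactly 0
    return np.maximum(W @ x, 0.0)

x = rng.normal(size=d)
sparsity = np.mean(psi(x) == 0.0)              # fraction of inactive units
print(f"inactive features: {sparsity:.0%} of {width} units")
# With random weights roughly half of the units are inactive; the claim of the
# notes above is that SGD with large step sizes drives this fraction much higher,
# i.e. it finds sparse representations.
```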