# SGD with Large Step Sizes Learns Sparse Features

## WHY?

Setup:

Training data $$\{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathbb{R}^{d} \times Y$$
large-scale ML: large dimension $$d$$, large number of training samples $$n$$

Classic examples for fitting the data:

Least squares: $$\frac{1}{n} \|Ax - b\|_2^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i^T x - b_i)^2 = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$

Support Vector Machine (SVM): $$\frac{1}{2} \|x\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \max(0, 1 - y_i(x^T a_i + b))$$

Deep Neural Nets: $$\frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(y_i, \mathrm{DNN}(x; a_i)) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$

All of these optimization problems share the common form $$\frac{1}{n} \sum_{i=1}^{n} f_i(x)$$.

So instead of computing the gradient of all the $$f_i$$ at each step, we
pick only one, sampled from the uniform distribution over $$\{1, \dots, n\}$$.
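
A minimal numpy sketch of this finite-sum structure, using the least-squares example from above (all names and sizes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20                      # number of samples, dimension
A = rng.normal(size=(n, d))          # data matrix with rows a_i
b = rng.normal(size=n)               # targets b_i

def f_i(x, i):
    """Loss of a single sample: (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) ** 2

def f(x):
    """Full objective: (1/n) sum_i f_i(x)."""
    return np.mean((A @ x - b) ** 2)

def grad_f_i(x, i):
    """Gradient of a single term: 2 (a_i^T x - b_i) a_i."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def grad_f(x):
    """Full gradient: (1/n) sum_i grad f_i(x) = (2/n) A^T (A x - b)."""
    return 2.0 / n * A.T @ (A @ x - b)

x = np.zeros(d)
i = rng.integers(n)                  # pick one index uniformly at random
print(f(x), f_i(x, i), grad_f_i(x, i)[:3])
```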

## SGD vs GD

We want to minimize functions of the form $$\frac{1}{n} \sum_{i=1}^{n} f_i(x)$$

GD computes the gradient of every term and then updates the next iterate.

SGD picks a uniformly random index $$i(k) \in \{1, 2, ..., n\}$$, takes $$f_{i(k)}$$,
computes its gradient, and then updates the next iterate:

    x^{k+1} = x^{k} - t_k * \nabla f_{i(k)}(x^k)

Key property: the expectation satisfies $$E[\nabla f_{i(k)}(x)] = \nabla f(x)$$ with
$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$, i.e. the stochastic gradient is an
unbiased estimator of the full gradient!
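
A quick numerical check of this unbiasedness, reusing the least-squares setup from the sketch above (names and values illustrative): averaging the per-sample gradients with uniform weights reproduces the full gradient, while one SGD step only touches a single term.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

grad_f_i = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]  # per-sample gradient
grad_f = lambda x: 2.0 / n * A.T @ (A @ x - b)           # full gradient

# Unbiasedness: the uniform average of the per-sample gradients
# equals the full gradient, i.e. E[grad f_{i(k)}(x)] = grad f(x).
x = rng.normal(size=d)
avg_grad = np.mean([grad_f_i(x, i) for i in range(n)], axis=0)
print(np.allclose(avg_grad, grad_f(x)))  # -> True

# One step of each method with step size t_k.
t = 0.1                                  # illustrative step size
i = rng.integers(n)                      # uniformly random index i(k)
x_sgd = x - t * grad_f_i(x, i)           # SGD: gradient of one sample
x_gd = x - t * grad_f(x)                 # GD: gradient of all n samples
```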

There are two options for randomly picking an index $$i$$ (sketched below):
    1. with replacement, drawing indices until the epoch is filled each time
    2. without replacement, reshuffling the data once per epoch

All toolkits use Option 2!
All papers use Option 1, because it is easier to analyze!
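
A sketch of the two index-sampling schemes in numpy (just the index generation; `n` and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # number of training samples (illustrative)

# Option 1: with replacement -- draw n independent uniform indices per epoch
# (the variant most convergence analyses assume).
epoch_with_replacement = rng.integers(n, size=n)

# Option 2: without replacement -- shuffle the indices once per epoch
# (the variant typical data loaders implement).
epoch_without_replacement = rng.permutation(n)

print(epoch_with_replacement)     # indices may repeat within the epoch
print(epoch_without_replacement)  # every index appears exactly once
```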

## Large Step Sizes Induce Sparse Features?

"Sparse" in NN terminology means that for a feature vector $$\psi(x)$$
only a few features are active and the others are 0.
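
A small sketch of what "sparse features" means in code: the fraction of inactive units in a one-hidden-layer ReLU feature map (the layer, sizes, and threshold are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(x, W, b):
    """One hidden ReLU layer: psi(x) = max(0, W x + b)."""
    return np.maximum(0.0, W @ x + b)

def sparsity(psi, eps=1e-8):
    """Fraction of (numerically) inactive entries of psi(x)."""
    return np.mean(np.abs(psi) <= eps)

d, m = 10, 100                 # input dimension, number of features
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.normal(size=d)

psi = relu_features(x, W, b)
print(f"fraction of inactive features: {sparsity(psi):.2f}")
```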

Using large step sizes often makes the iterates jump around = `loss stabilization`:
this is the phase of large step sizes where the loss seems to be on average
constant, yet it induces hidden dynamics toward simple predictors.

The longer a large step size is used, the better the implicit regularization
can operate and find sparse representations.
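
A sketch of the step-size schedule this suggests: keep a large constant step size for a long stabilization phase, then decay it (the values and epoch counts are made up for illustration, not taken from the paper):

```python
def step_size(epoch, large_lr=0.1, small_lr=0.01, decay_epoch=75):
    """Large constant step size for a long 'loss stabilization' phase,
    then a smaller one to actually converge (illustrative values)."""
    return large_lr if epoch < decay_epoch else small_lr

schedule = [step_size(e) for e in range(100)]
print(schedule[0], schedule[74], schedule[75])  # 0.1 0.1 0.01
```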

Justification: theoretically on simple NN models (diagonal linear networks,
ReLU networks) and qualitatively with stochastic processes.

    Picture!

Residual network with 18 layers (ResNet-18) trained on CIFAR-10 (60k 32x32 images)
for 100 epochs; left: training loss, right: test error.

Two key phases regarding the large step size:
    1. shortly after the start of training, the loss remains (on average) constant
       (loss stabilization)
    2. despite no visible progress, running this phase for longer leads to better
       generalization (hidden dynamics)


## To be continued...

## REFS
https://fa.bianp.net/teaching/2018/eecs227at/stochastic_gradient.html
https://www.youtube.com/watch?v=k3AiUhwHQ28