# Plan for the presentation

## Introduction
* Setup
* Method
* Main focus -> learning rate
* Conjecture -> a large step size learns sparse features & generalizes better
* But to show this we need to dive deeper into the dynamics of an SGD iteration
* SGD is GD with specific label noise

## Loss stabilization for quadratic loss
* Setup
* Iterates
* Proposition for loss stabilization
* Proof
* Explanation

## SGD dynamics
* Stochastic differential equations
* What this means for SGD
* Use for loss stabilization
* Measurement of sparse feature learning
* Feature sparsity coefficient

## Diagonal networks
* Setup
* Measuring
* Results

# SGD and GD have different implicit biases

## ReLU networks
* Setup
* Measuring
* Results

## Outlook to more complex cases
* Setup
* Datasets
* Warm-up step size
* Results

* That's it
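One way to see the "SGD is GD with specific label noise" bullet: for the squared loss, the gradient on a single random sample equals the full-batch GD gradient plus a mean-zero noise term, and that noise can be absorbed into a perturbation of the labels. A minimal NumPy sketch of the decomposition (not from the source; data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)  # current iterate

# Full-batch gradient of L(w) = ||Xw - y||^2 / (2n)
full_grad = X.T @ (X @ w - y) / n

# Single-sample SGD gradient for index i: x_i * (x_i . w - y_i).
# It equals full_grad plus a noise term that is mean-zero over the
# random choice of i -- the noise acts like a perturbation of y_i.
noise_mean = np.mean(
    [X[j] * (X[j] @ w - y[j]) - full_grad for j in range(n)], axis=0
)
print(np.allclose(noise_mean, 0.0))  # noise averages out over samples
```

This only checks the zero-mean decomposition; the talk's claim concerns the specific structure of that noise, which the later sections develop.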