notes

uni notes
git clone git://popovic.xyz/notes.git

commit 1daf477014ef3936383824f54e216ca907bf24bd
parent c4535b7714216664d1adb6b0f4ed3b8bbabeb70d
Author: miksa234 <milutin@popovic.xyz>
Date:   Mon, 22 Jan 2024 16:41:56 +0100

fixing up the summary

Diffstat:
M opt_sem/summary/main.tex | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/opt_sem/summary/main.tex b/opt_sem/summary/main.tex
@@ -8,10 +8,12 @@
 \tableofcontents
 
 \section{Introduction}
-Large step sizes may lead the loss to stabilize by making SGD bounce above a
-valley. Showcase is done with mean square error. Consider a family of
+Large step sizes may lead the loss to stabilize by making Stochastic Gradient
+Descent (SGD) bounce above a valley.
+
+The showcase is done with the mean squared error. Considering a family of
 prediction functions $\mathcal{H} := \{x \to h_\theta(x), \theta \in
-\mathbb{R}^{p}\}$. The training loss wrt. input/output samples $(x_i,
+\mathbb{R}^{p}\}$, the training loss wrt. input/output samples $(x_i,
 y_i)_{i=1}^{n} \in \mathbb{R}^{d}\times\mathbb{R}$ is
 \begin{align}
 \mathcal{L}(\theta) := \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x_i) -
@@ -28,7 +30,18 @@ consider a SGD recursion with step size $\eta > 0$, with initial $\theta_0
 \nabla_{\theta} h_{\theta_t}(x_{i_t}), \label{eq: sgd_it}
 \end{align}
 where $i_t \sim U(\{1,\ldots,n\})$, is a random variable following the
-discrete uniform distribution over a sample of indices.
+discrete uniform distribution over a sample of indices. The main focus of
+this summary is the parameter $\eta$, called the \textit{step size} or
+\textit{learning rate} in the literature. The authors of the paper
+\cite{andriushchenko2023sgd} conjecture that larger step sizes lead to a
+so-called \textit{loss stabilization}, where the loss almost surely stays
+around a level set. They argue that this loss stabilization also causes
+sparse feature learning, i.e. that the learned prediction function has a
+better \textit{prediction performance} on sparse (mostly zero) input data.
+They first prove their conjecture rigorously on a simplified model and then
+give empirical evidence for more complex cases. This summary tries to give
+an overview and the background required to understand what the paper is
+trying to achieve.
 \subsection{GD and SGD Relation}
 The authors highlight the importance of gradient and noise, by explaining the
 connection between the SGD dynamics and full batch GD plus a specific label
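
For reference, a worked-out sketch of the stochastic step labelled eq. (sgd_it)
above (its full right-hand side is truncated in the hunk): assuming the
per-sample loss is the squared error $\frac{1}{2}(h_\theta(x_i) - y_i)^2$
implied by the displayed $\mathcal{L}(\theta)$, one step of the recursion
presumably reads

\begin{align}
    \theta_{t+1}
    = \theta_t
    - \eta \, \big( h_{\theta_t}(x_{i_t}) - y_{i_t} \big) \,
    \nabla_{\theta} h_{\theta_t}(x_{i_t}),
\end{align}

i.e. a gradient step on the single randomly drawn sample $(x_{i_t}, y_{i_t})$
rather than on the full training loss, which is where the noise discussed in
the following subsection on the GD and SGD relation enters.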