
commit 43ceeff1b4ceac12b3aa3ab2d78b161d9964fe60
parent b226f751a4b9fadd7e6bc446ae8099f6ddaefae1
Author: miksa234 <milutin@popovic.xyz>
Date:   Mon,  1 Jan 2024 13:27:17 +0100

fo

Diffstat:
M opt_sem/summary/main.tex | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 134 insertions(+), 4 deletions(-)

diff --git a/opt_sem/summary/main.tex b/opt_sem/summary/main.tex
@@ -119,6 +119,7 @@
 $\{\nabla_{\theta}h_{\theta}(x_1),\ldots,\nabla_{\theta}h_{\theta}(x_n)\}$
 and has constant intensity corresponding to the loss stabilization at
 $\delta >0$. The following SDE model is proposed
 \begin{align}
+    \label{eq: sde-sgd-dynamics}
     d\theta_t = -\nabla_{\theta} \mathcal{L}(\theta_t)dt + \sqrt{\eta\delta}
     \phi_{\theta_t}(X)^{T}dB_t,
 \end{align}
@@ -127,7 +128,7 @@
 $\phi_{\theta}(X) := [\nabla_{\theta}h_{\theta}(x_i)^{T}]_{i=1}^{n} \in
 \mathbb{R}^{n\times p}$ referred to as the Jacobian. This SDE can be
 interpreted as the effective slow dynamics that drives the iterates while
 they bounce rapidly in some directions at the level set $\delta$.
-\subsection{Sparse Feature Learning}
+\section{Sparse Feature Learning}
 The section starts with a simple example on diagonal linear networks that
 exhibits a sparsity-inducing dynamics, and then widens to a general
 message about the implicit bias induced by the effective dynamics.
@@ -140,7 +141,7 @@
 linear predictor $\beta:=u\odot v \in \mathbb{R}^{d}$ is not in $(u, v)$.
 Hence we can see from this example a rich non-convex dynamics. Then
 $\nabla_u h_{u, v}(x) = v \odot x$ and the SDE model is
 \begin{align}
-    du_t = -\nabla_u \mathcal{L}(u_t, v_t) dt + \sqrt{\eta\delta} v_t \odot
+    du_t = -\nabla_u \mathcal{L}(u_t, v_t) dt + \sqrt{\eta\delta}\; v_t \odot
     [X^{T}dB_t],
 \end{align}
 where $(B_t)_{t\ge 0}$ is the standard Brownian motion in $\mathbb{R}^{n}$
@@ -162,16 +163,145 @@
 coordinates outside of the support of the sparsest signal and oscillates in
 the parameter space at level $\mathcal{O}(\sqrt{\eta\delta})$ on its
 support. Thereby decreasing the step size later leads to perfect recovery
 of the sparsest predictor. TODO: Experiments.
+\newline
+The diagonal linear nets show noisy dynamics which induce a sparsity bias,
+which, following HaoChen et al. (2021), is due to the term
+$v_t \odot [X^{T}dB_t]$: it has a shrinking effect on the coordinates (due
+to the element-wise multiplication). In general, Equation
+\ref{eq: sde-sgd-dynamics} has the same multiplicative structure w.r.t. the
+Jacobian $\phi_\theta(X)$. This suggests that the implicit bias of the noise
+can lead to a shrinkage effect applied to $\phi_\theta(X)$, which depends on
+the noise intensity $\delta$ and the step size of SGD. Also note the
+following property of Brownian motion: for $v \in \mathbb{R}^{n}$,
+$\langle v, B_t\rangle = \|v\|_2 W_t$ in distribution, where
+$(W_t)_{t\ge 0}$ is a one-dimensional Brownian motion. Thereby the process
+in Equation \ref{eq: sde-sgd-dynamics} is equivalent to the process whose
+$i$-th coordinate is driven by a noise proportional to
+$\|\phi_i\|dW_{t}^{i}$, where $\phi_i$ is the $i$-th column of
+$\phi_\theta(X)$. This SDE structure, similar to a geometric Brownian
+motion, is expected to induce a shrinkage of each multiplicative factor
+$(\|\nabla_\theta h(x_i)\|)_{i=1}^{n}$, hence the authors conjecture:
+\textit{The noise part of Equation \ref{eq: sde-sgd-dynamics} seeks to
+minimize the $l_2$-norm of the columns of $\phi_\theta(X)$.}
+\newline
+Also note that the fitting part of the dynamics prevents the Jacobian from
+collapsing entirely to zero, but as soon as columns are not needed to fit
+the signal, they can be reduced to zero. The implicit bias for different
+architectures is specified as follows:
+\begin{itemize}
+    \item \textbf{Diagonal linear networks:} $h_{u, v}(x) = \langle u \odot
+        v, x\rangle$ and the gradient $\nabla_{u, v}h_{u,v}(x) = [v\odot x,
+        u \odot x]$. For a generic data matrix $X$, minimizing the norm of
+        each column of $\phi_{u, v}(X)$ amounts to putting the maximal
+        number of zero coordinates in $u \odot v$ and hence to minimizing
+        $\|u \odot v\|_0$.
+    \item \textbf{ReLU networks:} Let $h_{a, W}(x) = \langle a,
+        \sigma(Wx)\rangle$, then $\nabla_{a}h_{a, W}(x) = \sigma(Wx)$ and
+        $\nabla_{w_j}h_{a, W}(x) = a_j x \mathbf{1}_{\langle w_j, x\rangle >
+        0}$. The implicit bias enables the learning of sparse data-active
+        features: activated neurons align to fit the data, reducing the
+        rank of $\phi_\theta(X)$.
+\end{itemize}
+
+The main insight is that the Jacobian can be significantly simplified during
+the loss stabilization phase. While the gradient part tries to fit the data
+and align neurons, the noise part tries to minimize the $l_2$-norm of the
+columns of $\phi(X)$. The combination thus suggests counting the average
+number of \textbf{distinct} (counting a group of aligned neurons as one),
+\textbf{non-zero} activations over the training set. This will be referred
+to as the \textbf{feature sparsity coefficient}. In the next section the
+authors show that the conjectured sparsity is observed empirically for a
+variety of models, where both the feature sparsity coefficient and the rank
+of $\phi_\theta(X)$ can be used as a good proxy to track the hidden progress
+during the loss stabilization phase.
+
+\section{Empirical Evidence}
+In the following section, results are presented for diagonal linear nets,
+simple ReLU nets and deep dense nets on CIFAR-10, CIFAR-100 and
+TinyImageNet. The common observations are
+\begin{enumerate}
+    \item \textbf{Loss stabilization:} Training loss stabilizes around a
+        high level set until the step size is decayed,
+    \item \textbf{Generalization benefit:} Longer loss stabilization leads
+        to better generalization,
+    \item \textbf{Sparse feature learning:} Longer loss stabilization leads
+        to sparse features.
+\end{enumerate}
+In some cases a large step size that would lead to loss stabilization could
+not be found, hence a \textit{warmup} step size schedule is used, which is
+explicitly mentioned where this is the case. The \textit{warmup} schedule
+increases the step size over the first part of training to make sure loss
+stabilization occurs.
+\newline
+To measure sparse feature learning, both the rank of the Jacobian
+$\phi_{\theta}(X)$ and the feature sparsity coefficient are measured.
+Specifically, the rank is computed over the iterations of each model,
+except for deep networks where this is computationally too expensive. This
+is done by thresholding the singular values of $\phi_\theta(X)$ normalized
+by the largest singular value. The reason for this is to ensure that a
+difference in rank is not due to different scales of $\phi_\theta(X)$ at
+different iterations. The Jacobian $\phi_\theta(X)$ is computed on a number
+of fresh samples equal to the number of parameters $|\theta|$, to make sure
+the rank deficiency does not come from $n \ll |\theta|$. \newline
+As for the measurement of the feature sparsity coefficient, the average
+fraction of \textbf{distinct} and non-zero \textbf{activations} at some
+layer over the training set is counted, where a value of $100\%$ means a
+completely dense feature vector and a value of $0\%$ a feature vector with
+all zeros. A pair of activations is counted as highly correlated if their
+Pearson correlation coefficient is at least $0.95$.
+\subsection{Diagonal Linear Networks}
+The setup for testing diagonal linear networks is the following. The inputs
+$(x_i)_{i=1}^{n}$ with $n=80$ are sampled from $\mathcal{N}(0,
+\mathbf{I}_d)$, where $\mathbf{I}_d$ is the identity matrix and $d=200$. The
+outputs are generated as $y_i = \langle \beta_* , x_i\rangle$, where
+$\beta_* \in \mathbb{R}^{d}$ is $r=20$ sparse. Four different SGD runs are
+considered: one with a small step size and three with an initially large
+step size decayed after $10\%$, $30\%$ and $50\%$ of the iterations,
+respectively.
+TODO: results and description of results
+\subsection{Simple ReLU Networks}
+\textbf{Two-layer ReLU networks in 1D.}
+Considered is a one-dimensional regression task with $12$ points. A ReLU
+network with $100$ neurons is trained with SGD using a long linear warmup,
+followed by step size decays at $2\%$ and $50\%$ of the iterations,
+respectively. Loss stabilization is observed at around $10^{-5}$, the
+predictor becomes simpler and is expected to generalize better, and both
+the rank and the feature sparsity coefficient decrease during loss
+stabilization. Because of the low dimensionality of the example it can be
+directly observed that the final predictor is sparse in terms of the number
+of distinct ReLU kinks.
-\section{Sparse Feature Learning}
+TODO: results and description of results
+
+\textbf{Three-layer ReLU networks in 1D.}
+For deeper ReLU networks a teacher-student setup is used, with a random
+three-layer teacher ReLU network with $2$ neurons at each hidden layer. The
+student network has $10$ neurons on each layer and is trained on $50$
+samples. This kind of setup is useful because it is known that the student
+network can implement the ground truth predictor but might not find it due
+to the small sample size. The model is trained using SGD with a medium
+constant step size and, in contrast, with a large step size with warmup,
+decayed after $10\%$, $30\%$ and $50\%$ of the iterations, respectively.
+
+TODO: results and description of results
+
+\subsection{Deep ReLU Networks}
+In this section a state-of-the-art example is considered to show sparse
+feature learning. Specifically, an image classification task is considered,
+where a DenseNet-100-12 is trained on CIFAR-10, CIFAR-100 and TinyImageNet
+using SGD with batch size $256$ and different step size schedules. An
+exponentially increasing step size schedule with exponent $1.05$ is used to
+establish loss stabilization. The rank of the Jacobian cannot be measured
+because the matrix dimensions are too large, hence only the feature
+sparsity coefficient is taken at two layers of the DenseNets: the end of
+super block 3 (middle of the network) and super block 4 (before the average
+pooling at the end of the network). Two cases are tested: a basic setting
+and a state-of-the-art setting with momentum and standard augmentation.
+
+TODO: Results and observations
-\section{Empirical Results}
 \section{Comparison of the Results}
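
The reduction of the noise term of Equation \ref{eq: sde-sgd-dynamics} to a
per-coordinate noise $\|\phi_i\| dW_t^{i}$ rests on a standard property of
Brownian motion. A minimal justification in the notation of the notes above
(a sketch added here, not part of the summarized argument):

\begin{align*}
    \langle v, B_t\rangle = \sum_{j=1}^{n} v_j B_t^{j},
    \qquad
    \operatorname{Var}\big(\langle v, B_t\rangle\big)
        = \sum_{j=1}^{n} v_j^{2}\, t = \|v\|_2^{2}\, t,
\end{align*}
so $\langle v, B_\cdot\rangle$ is a continuous centered Gaussian martingale
with quadratic variation $\|v\|_2^{2}\, t$ and, by L\'evy's characterization,
$\langle v, B_t\rangle = \|v\|_2 W_t$ for the standard one-dimensional
Brownian motion $W_t := \langle v, B_t\rangle / \|v\|_2$ (for $v \neq 0$).
Applied to the $i$-th coordinate of the noise term in Equation
\ref{eq: sde-sgd-dynamics}, with $\phi_i \in \mathbb{R}^{n}$ the $i$-th
column of $\phi_{\theta}(X)$,
\begin{align*}
    \sqrt{\eta\delta}\,\big(\phi_{\theta_t}(X)^{T}dB_t\big)_i
    = \sqrt{\eta\delta}\,\langle \phi_i, dB_t\rangle
    = \sqrt{\eta\delta}\,\|\phi_i\|_2\, dW_t^{i},
\end{align*}
where the one-dimensional Brownian motions $(W^{i})_{i=1}^{p}$ are standard
but in general correlated across coordinates.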
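
The per-sample gradient formulas stated in the ReLU item above,
$\nabla_{a}h_{a,W}(x) = \sigma(Wx)$ and
$\nabla_{w_j}h_{a,W}(x) = a_j x \mathbf{1}_{\langle w_j, x\rangle > 0}$, can
be checked numerically. A small Python sketch with arbitrary dimensions and
random data (all concrete values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
d_in, width = 5, 7                      # arbitrary sizes for the check
W = rng.normal(size=(width, d_in))
a = rng.normal(size=width)
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0.0)
h = lambda a, W: a @ relu(W @ x)        # h_{a,W}(x) = <a, sigma(Wx)>

# Closed-form per-sample gradients as stated in the ReLU item above.
grad_a = relu(W @ x)                    # shape (width,)
grad_W = np.outer(a * (W @ x > 0), x)   # row j: a_j * 1{<w_j, x> > 0} * x

# Central finite-difference check of grad_W (grad_a is analogous),
# assuming no pre-activation <w_j, x> lies within eps of zero.
eps = 1e-6
num = np.zeros_like(W)
for j in range(width):
    for k in range(d_in):
        Wp = W.copy(); Wp[j, k] += eps
        Wm = W.copy(); Wm[j, k] -= eps
        num[j, k] = (h(a, Wp) - h(a, Wm)) / (2 * eps)
print(np.max(np.abs(num - grad_W)))     # agreement up to floating-point error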
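
The sparsity-inducing effect of large-step SGD on diagonal linear networks
can be tried out directly with the data setup described above ($n=80$
Gaussian inputs in $d=200$ dimensions, an $r=20$ sparse target). The Python
sketch below runs plain single-sample SGD; the initialization scale, step
sizes and iteration counts are illustrative assumptions and may need tuning
so that the loss stabilizes rather than converges early or diverges.

import numpy as np

rng = np.random.default_rng(0)

# Data as in the diagonal linear network setup above: n = 80 inputs from
# N(0, I_d) with d = 200, labels from an r = 20 sparse ground truth
# (the concrete entries of beta_* are an assumption).
n, d, r = 80, 200, 20
X = rng.normal(size=(n, d))
beta_star = np.zeros(d)
beta_star[:r] = 1.0
y = X @ beta_star

# Diagonal linear network h_{u,v}(x) = <u * v, x>, squared loss.
u = np.full(d, 0.5)                      # initialization scale is an assumption
v = np.full(d, 0.5)

def run_sgd(u, v, eta, steps):
    """Single-sample SGD on 0.5 * (h_{u,v}(x_i) - y_i)^2, updating in place."""
    for t in range(steps):
        i = rng.integers(n)
        resid = X[i] @ (u * v) - y[i]
        gu = resid * v * X[i]            # chain rule: d h / d u = v * x
        gv = resid * u * X[i]            # chain rule: d h / d v = u * x
        u -= eta * gu
        v -= eta * gv
        if t % 10_000 == 0:
            loss = 0.5 * np.mean((X @ (u * v) - y) ** 2)
            print(f"step {t:6d}  eta {eta:.0e}  full-batch loss {loss:.4f}")

run_sgd(u, v, eta=1e-2, steps=50_000)    # large step: loss should stabilize
run_sgd(u, v, eta=1e-3, steps=50_000)    # decayed step: noise level drops

beta = u * v
print("largest |beta_j| outside the support:", np.abs(beta[r:]).max())
print("smallest |beta_j| on the support    :", np.abs(beta[:r]).min())

If the conjectured bias holds, the coordinates of $u \odot v$ outside the
support should be driven towards zero during the large-step phase, and the
decayed phase should then settle on an approximately $r$-sparse predictor.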
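
The rank measurement described above, thresholding the singular values of
$\phi_\theta(X)$ after normalizing by the largest singular value, can be
sketched as follows; the concrete threshold is an assumption, since the
summary does not state the value used.

import numpy as np

def numerical_rank(phi, rel_threshold=1e-3):
    """Numerical rank of a Jacobian phi = phi_theta(X) of shape (n, p):
    the number of singular values above rel_threshold times the largest
    singular value, so that the overall scale of phi does not matter.
    The value of rel_threshold is an assumption."""
    s = np.linalg.svd(phi, compute_uv=False)   # singular values, descending
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s / s[0] > rel_threshold))

# For a diagonal linear network as above, row i of the Jacobian is
# [v * x_i, u * x_i], so with data matrix X and parameters u, v it can be
# assembled as phi = np.concatenate([X * v, X * u], axis=1) and
# numerical_rank(phi) tracked over the course of training.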
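
The feature sparsity coefficient is only described informally above (the
average fraction of distinct, non-zero activations, with neurons grouped
when their Pearson correlation is at least $0.95$). The sketch below
implements one plausible reading of that description; the greedy grouping
and the normalization by the total number of units are assumptions, not the
paper's exact bookkeeping.

import numpy as np

def feature_sparsity_coefficient(acts, corr_threshold=0.95):
    """acts: array of shape (n_samples, n_units) with the post-ReLU
    activations of one layer over the training set. Returns a value in
    [0, 1]: roughly 1 for a completely dense feature vector with no
    aligned units, 0 if all activations are zero."""
    n, m = acts.shape

    # Group "aligned" units: a pair of units is treated as one distinct
    # feature if the Pearson correlation of their activation patterns over
    # the training set is at least corr_threshold (greedy grouping).
    C = np.atleast_2d(np.nan_to_num(np.corrcoef(acts, rowvar=False)))
    np.fill_diagonal(C, 1.0)
    group = np.full(m, -1)
    n_groups = 0
    for j in range(m):
        if group[j] == -1:
            members = np.where((C[j] >= corr_threshold) & (group == -1))[0]
            group[members] = n_groups
            n_groups += 1

    # For each training sample, count the groups that contain at least one
    # non-zero activation; average over samples and normalize by the total
    # number of units.
    active = acts != 0
    group_active = np.zeros((n, n_groups), dtype=bool)
    for g in range(n_groups):
        group_active[:, g] = active[:, group == g].any(axis=1)
    return group_active.sum(axis=1).mean() / m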