notes

uni notes
git clone git://popovic.xyz/notes.git

commit 81f951bc7e74bd9f6f5ca01fcc0124b1342cd5be
parent 82aca7867796956c571d522c1e6314c73e713714
Author: miksa234 <milutin@popovic.xyz>
Date:   Fri, 15 Dec 2023 10:08:13 +0000

new stuff

Diffstat:
M ricam_sem/summary_talk/main.tex | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
A ricam_sem/summary_talk/todo.md | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 185 insertions(+), 12 deletions(-)

diff --git a/ricam_sem/summary_talk/main.tex b/ricam_sem/summary_talk/main.tex
@@ -43,9 +43,28 @@ Shallow neural network coders are of the following form
 \vec{p} = (\vec{\alpha}, \mathbf{w}, \vec{\theta}) &\mapsto
 \left(\vec{x} \to \sum_{j=1}^{N} \alpha_j\sigma\left(
 \vec{\mathbf{w}}_j^{T}\vec{x} + \theta_j \right) \right),
 \end{align}
 where $\sigma$ is an activation function, such as tanh or sigmoid.
+\subsection{Deep neural networks}
+A standard DNN with $L$ layers is a function of $\vec{x} \in
+\mathbb{R}^{n}$ with parameters $\vec{p} := \left( \vec{\alpha}_l,
+\mathbf{w}_l, \vec{\theta}_l \right)_{l=1}^{L}$:
+\begin{align}
+    \vec{x} \to \Psi(\vec{x}) := \sum_{j_L=1}^{N_L} \alpha_{j_L,L}\sigma_L
+    \left( p_{j_L,L} \left( \sum_{j_{L-1}=1}^{N_{L-1}} \cdots
+    \left( \sum_{j_1=1}^{N_1} \alpha_{j_1,1}\sigma_1\left( p_{j_1,1}(\vec{x})
+    \right) \right) \right) \right),
+\end{align}
+where
+\begin{align}
+    p_{j,l}(\vec{x}) = \mathbf{w}_{j,l}^{T}\vec{x} + \theta_{j,l},
+\end{align}
+with $\alpha_{j,l}, \theta_{j,l} \in \mathbb{R}$ and $\vec{x},
+\mathbf{w}_{j,l} \in \mathbb{R}^{n}$ for all $l=1,\ldots,L$. It is
+probably not a Lipschitz-continuous immersion!
+
 \section{Solution}
 The solution involves either reconstructing the function or the coefficient:
 use Tikhonov regularization (TODO: Tikhonov regularization introduction!) or use
@@ -65,7 +84,7 @@ or alternatively state space regularization (some background on this)
 Alternatively, use iterative methods; Newton's iteration would look like the
 following
 \begin{align}
-    \vec{p}^{k+1} = \vec{p} - N'\left(p^{-k}\right)^{-1}\left(N(\vec{p}) -
+    \vec{p}\;^{k+1} = \vec{p}\;^{k} - N'\left(\vec{p}\;^{k}\right)^{-1}\left(N(\vec{p}\;^{k}) -
     \mathbf{y} \right),
 \end{align}
 where $N'$ is the Jacobian.
@@ -108,14 +127,14 @@ the coefficients.
     Let $F: \mathbf{X} \to \mathbf{Y}$ be linear with trivial nullspace and
     dense range, $\Psi:\mathcal{D} \subseteq P \to \mathbf{X}$ be a
     Lipschitz-differentiable immersion, $N = F\circ \Psi$ and
-    $N(\vec{p}^{\dagger}) = \mathbf{y}$.
-    Also let $\vec{p}^{0} \in \mathcal{D}(\Psi)$ be sufficiently close to
-    $\vec{p}^{\dagger}$. Then the Gauss-Newton's iteration
+    $N(\vec{p}\;^{\dagger}) = \mathbf{y}$.
+    Also let $\vec{p}\;^{0} \in \mathcal{D}(\Psi)$ be sufficiently close to
+    $\vec{p}\;^{\dagger}$. Then the Gauss-Newton iteration
 \begin{align}
-        \vec{p}^{k+1} = \vec{p}^{k} - N'(\vec{p})^{\dagger}
-        \left( N\left( \vec{p}^{k} \right) - \mathbf{y} \right)
+        \vec{p}\;^{k+1} = \vec{p}\;^{k} - N'(\vec{p}\;^{k})^{\dagger}
+        \left( N\left( \vec{p}\;^{k} \right) - \mathbf{y} \right)
 \end{align}
-    is well-defined and converges to $\vec{p}^{\dagger}$
+    is well-defined and converges to $\vec{p}\;^{\dagger}$.
 \end{theorem}
 \begin{proof}
     Verification of a sort of Newton-Mysovskii conditions using that
@@ -141,13 +160,130 @@ Convergence is based on the immersion property of the network functions
 \end{align}
 For $\alpha_i \neq 0$, this, in particular, requires that the functions
 \begin{align}
-    &p = \sum_{i=1}^{n}w_s^{i}x_i + \theta_s \\
-    & \frac{\partial \Psi}{\partial \alpha_s} =\sigma(p),\quad
-    \frac{\partial \Psi}{\partial w_s^{t}} =\sigma'(p)x_t,\quad
-    \frac{\partial \Psi}{\partial \theta_s} =\sigma'(p)
+    &\rho = \sum_{i=1}^{n}w_s^{i}x_i + \theta_s \\
+    & \frac{\partial \Psi}{\partial \alpha_s} = \sigma(\rho),\quad
+    \frac{\partial \Psi}{\partial w_s^{t}} = \sigma'(\rho)x_t,\quad
+    \frac{\partial \Psi}{\partial \theta_s} = \sigma'(\rho)
 \end{align}
 are \textbf{linearly independent} and that $\alpha_s \neq 0$;
 \textbf{sparse} coefficients cannot be recovered.
+\subsection{Linear independence problem}
+The question is whether the functions
+\begin{align}
+    \frac{\partial \Psi}{\partial \alpha_s} ,\quad
+    \frac{\partial \Psi}{\partial w_s^{t}} ,\quad
+    \frac{\partial \Psi}{\partial \theta_s}
+\end{align}
+are linearly independent.
+A partial answer for $\frac{\partial \Psi}{\partial \alpha_s}(\vec{x}) =
+\sigma\left( \sum_{i=1}^{n} w_s^{i}x_i + \theta_s \right)$ is given by the
+Lamperski (2022) theorem:
+\begin{theorem}
+    For all activation functions \textit{HardShrink, HardSigmoid, HardTanh,
+    HardSwish, LeakyReLU, PReLU, ReLU, ReLU6, RReLU, SoftShrink, Threshold,
+    LogSigmoid, Sigmoid, SoftPlus, Tanh and TanhShrink, and the PyTorch
+    functions CELU, ELU, SELU}, the shallow neural network functions formed by
+    \textbf{randomly generated vectors} $(\mathbf{w}, \vec{\theta})$ are
+    \textbf{linearly independent}.
+\end{theorem}
+The proof is in A. Lamperski 2022, "Neural Network Independence Properties
+with Applications to Adaptive Control". But here we need more: that the first
+derivatives of the sigmoid functions, and the first moments of the first
+derivatives, together with the functions above, are linearly independent.
+The answer is not fully satisfactory because this is not known; more or less,
+we can prove with probability $1$ that the functions above are linearly
+independent.
+
+\subsection{Symmetries}
+For the sigmoid function we have some obvious symmetries, because
+\begin{align}
+    \sigma'(\mathbf{w}^{T}_j \vec{x} + \theta_j)
+    = \sigma'\left(-\mathbf{w}_j^{T}\vec{x} - \theta_j \right),
+\end{align}
+or, in another formulation,
+\begin{align}
+    \Psi'(\vec{\alpha}, \mathbf{w}, \vec{\theta}) = \Psi'(\vec{\alpha},
+    -\mathbf{w}, -\vec{\theta}).
+\end{align}
+Conjecture: the obvious symmetries coincide with the "random set" from
+Lamperski 2022. The summary of the theorem:
+\begin{theorem}
+    Assume that the activation functions are locally linearly independent.
+    Then the Gauss-Newton method converges.
+\end{theorem}
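As a quick numerical illustration of this linear independence condition (a sketch of mine, not part of the commit; the sizes, the choice of sigmoid and all variable names are arbitrary), one can draw random $(\mathbf{w}, \vec{\theta})$, sample the functions $\sigma(\rho_s)$, $\sigma'(\rho_s)x_t$ and $\sigma'(\rho_s)$ at random points and look at the numerical rank of the sample matrix; full column rank indicates that the sampled functions are linearly independent on the sample.

# Sketch: check linear independence of sigma(rho_s), sigma'(rho_s)*x_t, sigma'(rho_s)
# for randomly generated (w, theta) by sampling them and computing a numerical rank.
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 2, 5, 2000                    # input dimension, neurons, sample points
w = rng.normal(size=(N, n))             # randomly generated weights w_s
theta = rng.normal(size=N)              # randomly generated offsets theta_s
X = rng.uniform(-3, 3, size=(M, n))     # sample points x in R^n

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid

def dsigma(z):
    s = sigma(z)
    return s * (1.0 - s)                # derivative of the sigmoid

rho = X @ w.T + theta                   # rho_s(x) = w_s^T x + theta_s, shape (M, N)

cols = [sigma(rho)]                                   # dPsi/dalpha_s
cols += [dsigma(rho) * X[:, [t]] for t in range(n)]   # dPsi/dw_s^t
cols += [dsigma(rho)]                                 # dPsi/dtheta_s
A = np.hstack(cols)                                   # shape (M, N*(n+2))

print("columns:", A.shape[1], "numerical rank:", np.linalg.matrix_rank(A, tol=1e-8))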
+\section{Results}
+\subsection{Numerical results (simplified)}
+The simplification is
+\begin{align}
+    &N = F \circ \Psi \\
+    &\mathbf{y}^{\dagger} = F\Psi(\vec{p}\;^{\dagger}) \qquad \text{is attainable.}
+\end{align}
+Then the Gauss-Newton method is
+\begin{align}
+    \vec{p}\;^{k+1} = \vec{p}\;^{k} - \Psi'\left(\vec{p}\;^{k}\right)^{\dagger}
+    \left( \Psi(\vec{p}\;^{k}) - \Psi^{\dagger} \right), \qquad k \in
+    \mathbb{N}_0.
+\end{align}
+Do some numerical results or explain the ones in the talk.
+\subsection{Landweber iteration}
+Instead of the Gauss-Newton iteration we consider the Landweber iteration
+\begin{align}
+    \vec{p}\;^{k+1} = \vec{p}\;^{k} - \lambda \Psi'\left(\vec{p}\;^{k}\right)^{*}
+    \left( \Psi(\vec{p}\;^{k}) - \Psi^{\dagger} \right), \qquad k \in
+    \mathbb{N}_0.
+\end{align}
+It needs about 500 iterations.
+\subsection{The catch}
+The observed convergence rate of the Gauss-Newton method changes completely if
+the solution is not attainable. The conjecture is that the non-convergence is
+due to multiple solutions.
+Also, the implementation of the simplified Gauss-Newton requires inversion of
+$F$, which is not done in practice; this also applies to Landweber.
+
+\subsection{Alternative to DNNs}
+Instead of using deep neural networks, for which we do not know whether the
+immersion is invertible, we consider quadratic neural network functions,
+defined as follows:
+\begin{align}
+    \Psi(\vec{x}) := \sum_{j=1}^{N} \alpha_j\sigma\left(\vec{x}^{T}A_j\vec{x}
+    + \mathbf{w}_j^{T}\vec{x} + \theta_j \right),
+\end{align}
+with $\alpha_j, \theta_j \in \mathbb{R}$, $\mathbf{w}_j \in \mathbb{R}^{n}$
+and $A_j \in \mathbb{R}^{n \times n}$. We can also constrain the class of
+$A_j$ and $\mathbf{w}_j$, which leads us to circular networks, circular
+affine, elliptic, parabolic, \ldots
+\begin{theorem}
+    Quadratic neural network functions satisfy the universal approximation
+    property.
+\end{theorem}
+The immersion property of circular network functions
+\begin{align}
+    \Psi(\vec{x}) := \sum_{j=1}^{N} \alpha_j\sigma\left(r_j\vec{x}^{T}\vec{x}
+    + \mathbf{w}_j^{T}\vec{x} + \theta_j \right)
+\end{align}
+means that
+\begin{align}
+    \text{span}\{\partial_{p_i}\Psi(\vec{p})\;:\;i=1,\ldots,n_*\} \qquad
+    \text{has rank } n_*.
+\end{align}
+For $\alpha_i \neq 0$, this in particular requires that the functions
+\begin{align}
+    \frac{\partial \Psi}{\partial r_s} = \sigma'\left( \rho \right)
+    \vec{x}^{T}\vec{x}, \quad
+    \frac{\partial \Psi}{\partial \alpha_s} = \sigma\left( \rho \right),
+    \quad
+    \frac{\partial \Psi}{\partial w_s^{t}} = \sigma'\left( \rho \right)
+    x_t, \quad
+    \frac{\partial \Psi}{\partial \theta_s} = \sigma'\left( \rho \right)
+\end{align}
+are \textbf{linearly independent}.
+\newline
+
+A result for these types of networks is that the Shepp-Logan phantom can be
+represented with \textbf{10 nodes} of elliptic neurons and \textbf{one layer},
+whereas for affine networks, both shallow and deep, we need infinitely many
+neurons. Here: a figure of TensorFlow approximations of a circle with 15
+linear neurons.
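The same kind of check can be done for the circular network functions above, sampling the derivative functions $\sigma'(\rho)\vec{x}^{T}\vec{x}$, $\sigma(\rho)$, $\sigma'(\rho)x_t$ and $\sigma'(\rho)$ for randomly chosen $(r, \mathbf{w}, \vec{\theta})$; the constant factors $\alpha_s \neq 0$ are dropped since they do not affect linear independence. Again only a rough sketch of mine, not the talk's code.

# Sketch: rank check for the circular-network derivative functions
# sigma'(rho)*x^T x, sigma(rho), sigma'(rho)*x_t, sigma'(rho) with random parameters.
import numpy as np

rng = np.random.default_rng(2)
n, N, M = 2, 4, 3000                      # input dimension, neurons, sample points
X = rng.uniform(-2, 2, size=(M, n))
r = rng.normal(size=N)                    # radial coefficients r_j
w = rng.normal(size=(N, n))               # weights w_j
theta = rng.normal(size=N)                # offsets theta_j

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid

def dsigma(z):
    s = sigma(z)
    return s * (1.0 - s)                  # derivative of the sigmoid

q = np.sum(X * X, axis=1, keepdims=True)  # x^T x, shape (M, 1)
rho = q * r + X @ w.T + theta             # rho_j(x) = r_j x^T x + w_j^T x + theta_j

cols = [dsigma(rho) * q]                              # dPsi/dr_s (up to alpha_s)
cols += [sigma(rho)]                                  # dPsi/dalpha_s
cols += [dsigma(rho) * X[:, [t]] for t in range(n)]   # dPsi/dw_s^t (up to alpha_s)
cols += [dsigma(rho)]                                 # dPsi/dtheta_s (up to alpha_s)
J = np.hstack(cols)                                   # shape (M, n_*), n_* = N*(n+3)

print("n_* =", J.shape[1], "numerical rank =", np.linalg.matrix_rank(J, tol=1e-8))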
+    %\printbibliography
diff --git a/ricam_sem/summary_talk/todo.md b/ricam_sem/summary_talk/todo.md
@@ -0,0 +1,37 @@
+TODO:
+
+Structure:
+* Introduction to the problem
+    * What are we trying to solve
+* Background
+    * Coding
+    * Neural networks
+    * Decomposition cases
+    * Gauss-Newton and Landweber
+    * Tikhonov regularization
+    * State space regularization
+    * Symmetries
+* Theoretical results
+    * Gauss-Newton type method for the problem
+    * Convergence of Gauss-Newton
+    * Newton's method with the NN operator and linear independence
+    * Results on linear independence
+* Experimental results
+    * Gauss-Newton
+    * Landweber
+    * Circular NNs with results
+
+We need
+    * more on coding
+    * Tikhonov regularization
+    * State space regularization
+    * more on the decomposition cases
+    * proof of local convergence of GN
+    * proof of the linear independence theorem
+    * proof of GN convergence with linear independence
+    * reproduction of the numerical results for GN
+    * reproduction of the numerical results for Landweber
+    * reproduction of the numerical results for circular networks
+    * (reproduction or more in-depth explanations)
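For the simplified Gauss-Newton and Landweber iterations in main.tex above, a minimal toy version could look like the following (my own sketch with invented data and a small shallow tanh network, not the talk's implementation): the data is attainable, the Gauss-Newton step uses the pseudoinverse of the Jacobian, and the Landweber step uses its transpose with a step size chosen from the norm of the initial Jacobian.

# Toy sketch: simplified Gauss-Newton and Landweber iterations for a shallow
# network Psi(p)(x) = sum_j alpha_j * sigma(w_j^T x + theta_j) with attainable data.
import numpy as np

rng = np.random.default_rng(1)
n, N, M = 2, 3, 200
X = rng.uniform(-2, 2, size=(M, n))            # evaluation points

def sigma(z):
    return np.tanh(z)

def dsigma(z):
    return 1.0 - np.tanh(z) ** 2

def unpack(p):
    alpha = p[:N]
    w = p[N:N + N * n].reshape(N, n)
    theta = p[N + N * n:]
    return alpha, w, theta

def Psi(p):
    alpha, w, theta = unpack(p)
    return sigma(X @ w.T + theta) @ alpha      # network values on the sample points

def jacobian(p):
    alpha, w, theta = unpack(p)
    rho = X @ w.T + theta                      # shape (M, N)
    d_alpha = sigma(rho)                                     # dPsi/dalpha_s
    d_w = (dsigma(rho) * alpha)[:, :, None] * X[:, None, :]  # dPsi/dw_s^t
    d_theta = dsigma(rho) * alpha                            # dPsi/dtheta_s
    return np.hstack([d_alpha, d_w.reshape(M, N * n), d_theta])

p_true = rng.normal(size=N + N * n + N)
y = Psi(p_true)                                    # attainable right-hand side
p0 = p_true + 0.1 * rng.normal(size=p_true.size)   # start close to the solution

# Gauss-Newton:  p <- p - Psi'(p)^dagger (Psi(p) - y)
p = p0.copy()
for k in range(20):
    p = p - np.linalg.pinv(jacobian(p)) @ (Psi(p) - y)
print("Gauss-Newton residual:", np.linalg.norm(Psi(p) - y))

# Landweber:  p <- p - lam * Psi'(p)^T (Psi(p) - y)  (typically many more iterations)
p = p0.copy()
lam = 1.0 / np.linalg.norm(jacobian(p0), 2) ** 2   # small step size for stability
for k in range(5000):
    p = p - lam * jacobian(p).T @ (Psi(p) - y)
print("Landweber residual:", np.linalg.norm(Psi(p) - y))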