notes

uni notes
git clone git://popovic.xyz/notes.git

commit 81f951bc7e74bd9f6f5ca01fcc0124b1342cd5be
parent 82aca7867796956c571d522c1e6314c73e713714
Author: miksa234 <milutin@popovic.xyz>
Date:   Fri, 15 Dec 2023 10:08:13 +0000

new stuff

Diffstat:
M ricam_sem/summary_talk/main.tex | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
A ricam_sem/summary_talk/todo.md | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 185 insertions(+), 12 deletions(-)

diff --git a/ricam_sem/summary_talk/main.tex b/ricam_sem/summary_talk/main.tex
@@ -43,9 +43,28 @@ Shallow neural network coders are of the following form
 \vec{p} = (\vec{\alpha}, \mathbf{w}, \vec{\theta}) &\mapsto
 \left(\vec{x} \to \sum_{j=1}^{N} \alpha_j\sigma\left(
 \vec{\mathbf{w}}_j^{T}\vec{x} + \theta_j \right) \right),
 \end{align}
 where $\sigma$ is an activation function, such as tanh or sigmoid.
+\subsection{Deep neural networks}
+A standard DNN with $L$ layers is a function of $\vec{x} \in
+\mathbb{R}^{n}$ with parameters $\vec{p} := \left( \vec{\alpha}_l,
+\mathbf{w}_l, \vec{\theta}_l \right)_{l=1}^{L}$:
+\begin{align}
+    \vec{x} \to \Psi(\vec{x}) := \sum_{j_L=1}^{N_L} \alpha_{j_L,L}\sigma_L
+    \left( p_{j_L,L} \left( \sum_{j_{L-1}=1}^{N_{L-1}} \cdots
+    \left( \sum_{j_1=1}^{N_1} \alpha_{j_1,1}\sigma_1\left( p_{j_1,1}(\vec{x})
+    \right) \right) \right) \right),
+\end{align}
+where
+\begin{align}
+    p_{j,l}(\vec{x}) = \mathbf{w}_{j,l}^{T}\vec{x} + \theta_{j,l},
+\end{align}
+with $\alpha_{j,l}, \theta_{j,l} \in \mathbb{R}$ and $\vec{x},
+\mathbf{w}_{j,l} \in \mathbb{R}^{n}$ for all $l=1,\ldots,L$. It is
+probably not a Lipschitz-continuous immersion!
+
 \section{Solution}
 The solution involves either reconstructing the function or the coefficient:
 use Tikhonov regularization (TODO: Tikhonov regularization introduction!) or use
@@ -65,7 +84,7 @@ or alternatively state space regularization (some background on this)
 Alternatively, use iterative methods; Newton's iteration would look like the
 following
 \begin{align}
-    \vec{p}^{k+1} = \vec{p} - N'\left(p^{-k}\right)^{-1}\left(N(\vec{p}) -
+    \vec{p}\;^{k+1} = \vec{p}\;^{k} - N'\left(\vec{p}\;^{k}\right)^{-1}\left(N(\vec{p}\;^{k}) -
     \mathbf{y} \right),
 \end{align}
 where $N'$ is the Jacobian.
@@ -108,14 +127,14 @@ the coefficients.
     Let $F: \mathbf{X} \to \mathbf{Y}$ be linear with trivial nullspace and
     dense range, $\Psi:\mathcal{D} \subseteq P \to \mathbf{X}$ be a
     Lipschitz-differentiable immersion, $N = F\circ \Psi$ and
-    $N(\vec{p}^{\dagger}) = \mathbf{y}$.
-    Also let $\vec{p}^{0} \in \mathcal{D}(\Psi)$ be sufficiently close to
-    $\vec{p}^{\dagger}$. Then the Gauss-Newton's iteration
+    $N(\vec{p}\;^{\dagger}) = \mathbf{y}$.
+    Also let $\vec{p}\;^{0} \in \mathcal{D}(\Psi)$ be sufficiently close to
+    $\vec{p}\;^{\dagger}$. Then the Gauss-Newton iteration
 \begin{align}
-        \vec{p}^{k+1} = \vec{p}^{k} - N'(\vec{p})^{\dagger}
-        \left( N\left( \vec{p}^{k} \right) - \mathbf{y} \right)
+        \vec{p}\;^{k+1} = \vec{p}\;^{k} - N'(\vec{p}\;^{k})^{\dagger}
+        \left( N\left( \vec{p}\;^{k} \right) - \mathbf{y} \right)
 \end{align}
-    is well-defined and converges to $\vec{p}^{\dagger}$
+    is well-defined and converges to $\vec{p}\;^{\dagger}$.
 \end{theorem}
 \begin{proof}
     Verification of a sort of Newton-Mysovskii conditions using that
@@ -141,13 +160,130 @@ Convergence is based on the immersion property of the network functions
 \end{align}
 For $\alpha_i \neq 0$, this, in particular, requires that the functions
 \begin{align}
-    &p = \sum_{i=1}^{n}w_s^{i}x_i + \theta_s \\
-    & \frac{\partial \Psi}{\partial \alpha_s} =\sigma(p),\quad
-    \frac{\partial \Psi}{\partial w_s^{t}} =\sigma'(p)x_t,\quad
-    \frac{\partial \Psi}{\partial \theta_s} =\sigma'(p)
+    &\rho = \sum_{i=1}^{n}w_s^{i}x_i + \theta_s \\
+    & \frac{\partial \Psi}{\partial \alpha_s} = \sigma(\rho),\quad
+    \frac{\partial \Psi}{\partial w_s^{t}} = \sigma'(\rho)x_t,\quad
+    \frac{\partial \Psi}{\partial \theta_s} = \sigma'(\rho)
 \end{align}
 are \textbf{linearly independent} and that $\alpha_s \neq 0$;
 \textbf{sparse} coefficients cannot be recovered.
+\subsection{Linear independence problem}
+The question is whether the functions
+\begin{align}
+    \frac{\partial \Psi}{\partial \alpha_s} ,\quad
+    \frac{\partial \Psi}{\partial w_s^{t}} ,\quad
+    \frac{\partial \Psi}{\partial \theta_s}
+\end{align}
+are linearly independent.
+A partial answer for $\frac{\partial \Psi}{\partial \alpha_s}(\vec{x}) =
+\sigma\left( \sum_{i=1}^{n} w_s^{i}x_i + \theta_s \right)$ is given by the
+Lamperski (2022) theorem:
+\begin{theorem}
+    For all activation functions \textit{HardShrink, HardSigmoid, HardTanh,
+    HardSwish, LeakyReLU, PReLU, ReLU, ReLU6, RReLU, SoftShrink, Threshold,
+    LogSigmoid, Sigmoid, SoftPlus, Tanh and TanhShrink, and the PyTorch
+    functions CELU, ELU, SELU}, the shallow neural network functions formed by
+    \textbf{randomly generated vectors} $(\mathbf{w}, \vec{\theta})$ are
+    \textbf{linearly independent}.
+\end{theorem}
+The proof is in A. Lamperski 2022, "Neural Network Independence Properties
+with Applications to Adaptive Control". But here we need more: that the first
+derivatives of the sigmoid functions, and the first moments of the first
+derivatives, together with the functions above, are linearly independent.
+The answer is not fully satisfactory because this is not known; more or less,
+we can prove with probability $1$ that the functions above are linearly
+independent.
+
+\subsection{Symmetries}
+For the sigmoid function we have some obvious symmetries, because
+\begin{align}
+    \sigma'(\mathbf{w}^{T}_j \vec{x} + \theta_j)
+    = \sigma'\left(-\mathbf{w}_j^{T}\vec{x} - \theta_j \right),
+\end{align}
+or, in another formulation,
+\begin{align}
+    \Psi'(\vec{\alpha}, \mathbf{w}, \vec{\theta}) = \Psi'(\vec{\alpha},
+    -\mathbf{w}, -\vec{\theta}).
+\end{align}
+Conjecture: the obvious symmetries coincide with the "random set" from
+Lamperski 2022. The summary of the theorem:
+\begin{theorem}
+    Assume that the activation functions are locally linearly independent.
+    Then the Gauss-Newton method converges.
+\end{theorem}
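As a quick numerical illustration of this linear independence condition (a sketch of mine, not part of the commit; the sizes, the choice of sigmoid and all variable names are arbitrary), one can draw random $(\mathbf{w}, \vec{\theta})$, sample the functions $\sigma(\rho_s)$, $\sigma'(\rho_s)x_t$ and $\sigma'(\rho_s)$ at random points and look at the numerical rank of the sample matrix; full column rank indicates that the sampled functions are linearly independent on the sample.

# Sketch: check linear independence of sigma(rho_s), sigma'(rho_s)*x_t, sigma'(rho_s)
# for randomly generated (w, theta) by sampling them and computing a numerical rank.
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 2, 5, 2000                    # input dimension, neurons, sample points
w = rng.normal(size=(N, n))             # randomly generated weights w_s
theta = rng.normal(size=N)              # randomly generated offsets theta_s
X = rng.uniform(-3, 3, size=(M, n))     # sample points x in R^n

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid

def dsigma(z):
    s = sigma(z)
    return s * (1.0 - s)                # derivative of the sigmoid

rho = X @ w.T + theta                   # rho_s(x) = w_s^T x + theta_s, shape (M, N)

cols = [sigma(rho)]                                   # dPsi/dalpha_s
cols += [dsigma(rho) * X[:, [t]] for t in range(n)]   # dPsi/dw_s^t
cols += [dsigma(rho)]                                 # dPsi/dtheta_s
A = np.hstack(cols)                                   # shape (M, N*(n+2))

print("columns:", A.shape[1], "numerical rank:", np.linalg.matrix_rank(A, tol=1e-8))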
+\section{Results}
+\subsection{Numerical results (simplified)}
+The simplification is
+\begin{align}
+    &N = F \circ \Psi \\
+    &\mathbf{y}^{\dagger} = F\Psi(\vec{p}\;^{\dagger}) \qquad \text{is attainable.}
+\end{align}
+Then the Gauss-Newton method is
+\begin{align}
+    \vec{p}\;^{k+1} = \vec{p}\;^{k} - \Psi'\left(\vec{p}\;^{k}\right)^{\dagger}
+    \left( \Psi(\vec{p}\;^{k}) - \Psi^{\dagger} \right), \qquad k \in
+    \mathbb{N}_0.
+\end{align}
+Do some numerical results or explain the ones in the talk.
+\subsection{Landweber iteration}
+Instead of the Gauss-Newton iteration we consider the Landweber iteration
+\begin{align}
+    \vec{p}\;^{k+1} = \vec{p}\;^{k} - \lambda \Psi'\left(\vec{p}\;^{k}\right)^{*}
+    \left( \Psi(\vec{p}\;^{k}) - \Psi^{\dagger} \right), \qquad k \in
+    \mathbb{N}_0.
+\end{align}
+It needs about 500 iterations.
+\subsection{The catch}
+The observed convergence rate of the Gauss-Newton method changes completely if
+the solution is not attainable. The conjecture is that the non-convergence is
+due to multiple solutions.
+Also, the implementation of the simplified Gauss-Newton requires inversion of
+$F$, which is not done in practice; this also applies to Landweber.
+
+\subsection{Alternative to DNNs}
+Instead of using deep neural networks, for which we do not know whether the
+immersion is invertible, we consider quadratic neural network functions,
+defined as follows:
+\begin{align}
+    \Psi(\vec{x}) := \sum_{j=1}^{N} \alpha_j\sigma\left(\vec{x}^{T}A_j\vec{x}
+    + \mathbf{w}_j^{T}\vec{x} + \theta_j \right),
+\end{align}
+with $\alpha_j, \theta_j \in \mathbb{R}$, $\mathbf{w}_j \in \mathbb{R}^{n}$
+and $A_j \in \mathbb{R}^{n \times n}$. We can also constrain the class of
+$A_j$ and $\mathbf{w}_j$, which leads us to circular networks, circular
+affine, elliptic, parabolic, \ldots
+\begin{theorem}
+    Quadratic neural network functions satisfy the universal approximation
+    property.
+\end{theorem}
+The immersion property of circular network functions
+\begin{align}
+    \Psi(\vec{x}) := \sum_{j=1}^{N} \alpha_j\sigma\left(r_j\vec{x}^{T}\vec{x}
+    + \mathbf{w}_j^{T}\vec{x} + \theta_j \right)
+\end{align}
+means that
+\begin{align}
+    \text{span}\{\partial_{p_i}\Psi(\vec{p})\;:\;i=1,\ldots,n_*\} \qquad
+    \text{has rank } n_*.
+\end{align}
+For $\alpha_i \neq 0$, this in particular requires that the functions
+\begin{align}
+    \frac{\partial \Psi}{\partial r_s} = \sigma'\left( \rho \right)
+    \vec{x}^{T}\vec{x}, \quad
+    \frac{\partial \Psi}{\partial \alpha_s} = \sigma\left( \rho \right),
+    \quad
+    \frac{\partial \Psi}{\partial w_s^{t}} = \sigma'\left( \rho \right)
+    x_t, \quad
+    \frac{\partial \Psi}{\partial \theta_s} = \sigma'\left( \rho \right)
+\end{align}
+are \textbf{linearly independent}.
+\newline
+
+A result for these types of networks is that the Shepp-Logan phantom can be
+represented with \textbf{10 nodes} of elliptic neurons and \textbf{one layer},
+whereas for affine networks, both shallow and deep, we need infinitely many
+neurons. Here: a figure of TensorFlow approximations of a circle with 15
+linear neurons.
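The same kind of check can be done for the circular network functions above, sampling the derivative functions $\sigma'(\rho)\vec{x}^{T}\vec{x}$, $\sigma(\rho)$, $\sigma'(\rho)x_t$ and $\sigma'(\rho)$ for randomly chosen $(r, \mathbf{w}, \vec{\theta})$; the constant factors $\alpha_s \neq 0$ are dropped since they do not affect linear independence. Again only a rough sketch of mine, not the talk's code.

# Sketch: rank check for the circular-network derivative functions
# sigma'(rho)*x^T x, sigma(rho), sigma'(rho)*x_t, sigma'(rho) with random parameters.
import numpy as np

rng = np.random.default_rng(2)
n, N, M = 2, 4, 3000                      # input dimension, neurons, sample points
X = rng.uniform(-2, 2, size=(M, n))
r = rng.normal(size=N)                    # radial coefficients r_j
w = rng.normal(size=(N, n))               # weights w_j
theta = rng.normal(size=N)                # offsets theta_j

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid

def dsigma(z):
    s = sigma(z)
    return s * (1.0 - s)                  # derivative of the sigmoid

q = np.sum(X * X, axis=1, keepdims=True)  # x^T x, shape (M, 1)
rho = q * r + X @ w.T + theta             # rho_j(x) = r_j x^T x + w_j^T x + theta_j

cols = [dsigma(rho) * q]                              # dPsi/dr_s (up to alpha_s)
cols += [sigma(rho)]                                  # dPsi/dalpha_s
cols += [dsigma(rho) * X[:, [t]] for t in range(n)]   # dPsi/dw_s^t (up to alpha_s)
cols += [dsigma(rho)]                                 # dPsi/dtheta_s (up to alpha_s)
J = np.hstack(cols)                                   # shape (M, n_*), n_* = N*(n+3)

print("n_* =", J.shape[1], "numerical rank =", np.linalg.matrix_rank(J, tol=1e-8))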
+    %\printbibliography
diff --git a/ricam_sem/summary_talk/todo.md b/ricam_sem/summary_talk/todo.md
@@ -0,0 +1,37 @@
+TODO:
+
+Structure:
+* Introduction to the problem
+    * What are we trying to solve
+* Background
+    * Coding
+    * Neural networks
+    * Decomposition cases
+    * Gauss-Newton and Landweber
+    * Tikhonov regularization
+    * State space regularization
+    * Symmetries
+* Theoretical results
+    * Gauss-Newton type method for the problem
+    * Convergence of Gauss-Newton
+    * Newton's method with the NN operator and linear independence
+    * Results on linear independence
+* Experimental results
+    * Gauss-Newton
+    * Landweber
+    * Circular NNs with results
+
+We need
+    * more on coding
+    * Tikhonov regularization
+    * State space regularization
+    * more on the decomposition cases
+    * proof of local convergence of GN
+    * proof of the linear independence theorem
+    * proof of GN convergence with linear independence
+    * reproduction of the numerical results for GN
+    * reproduction of the numerical results for Landweber
+    * reproduction of the numerical results for circular networks
+    * (reproduction or more in-depth explanations)
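For the simplified Gauss-Newton and Landweber iterations in main.tex above, a minimal toy version could look like the following (my own sketch with invented data and a small shallow tanh network, not the talk's implementation): the data is attainable, the Gauss-Newton step uses the pseudoinverse of the Jacobian, and the Landweber step uses its transpose with a step size chosen from the norm of the initial Jacobian.

# Toy sketch: simplified Gauss-Newton and Landweber iterations for a shallow
# network Psi(p)(x) = sum_j alpha_j * sigma(w_j^T x + theta_j) with attainable data.
import numpy as np

rng = np.random.default_rng(1)
n, N, M = 2, 3, 200
X = rng.uniform(-2, 2, size=(M, n))            # evaluation points

def sigma(z):
    return np.tanh(z)

def dsigma(z):
    return 1.0 - np.tanh(z) ** 2

def unpack(p):
    alpha = p[:N]
    w = p[N:N + N * n].reshape(N, n)
    theta = p[N + N * n:]
    return alpha, w, theta

def Psi(p):
    alpha, w, theta = unpack(p)
    return sigma(X @ w.T + theta) @ alpha      # network values on the sample points

def jacobian(p):
    alpha, w, theta = unpack(p)
    rho = X @ w.T + theta                      # shape (M, N)
    d_alpha = sigma(rho)                                     # dPsi/dalpha_s
    d_w = (dsigma(rho) * alpha)[:, :, None] * X[:, None, :]  # dPsi/dw_s^t
    d_theta = dsigma(rho) * alpha                            # dPsi/dtheta_s
    return np.hstack([d_alpha, d_w.reshape(M, N * n), d_theta])

p_true = rng.normal(size=N + N * n + N)
y = Psi(p_true)                                    # attainable right-hand side
p0 = p_true + 0.1 * rng.normal(size=p_true.size)   # start close to the solution

# Gauss-Newton:  p <- p - Psi'(p)^dagger (Psi(p) - y)
p = p0.copy()
for k in range(20):
    p = p - np.linalg.pinv(jacobian(p)) @ (Psi(p) - y)
print("Gauss-Newton residual:", np.linalg.norm(Psi(p) - y))

# Landweber:  p <- p - lam * Psi'(p)^T (Psi(p) - y)  (typically many more iterations)
p = p0.copy()
lam = 1.0 / np.linalg.norm(jacobian(p0), 2) ** 2   # small step size for stability
for k in range(5000):
    p = p - lam * jacobian(p).T @ (Psi(p) - y)
print("Landweber residual:", np.linalg.norm(Psi(p) - y))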