report.tex - vrproject - Complex Network Analysis VR-project

report.tex (12353B)
      1 \documentclass[a4paper]{article}
      2 \usepackage[T1]{fontenc}
      3 \usepackage[utf8]{inputenc}
      4 \usepackage{mlmodern}
      5 \usepackage{graphicx}
      6 \usepackage{geometry}
      7 \usepackage{float}
      8 \geometry{a4paper, top=15mm}
      9 \usepackage[parfill]{parskip}
     10 \usepackage[colorlinks=true,naturalnames=true,plainpages=false,pdfpagelabels=true]{hyperref}
     11 
     12 \usepackage[backend=biber, sorting=none]{biblatex}
     13 \addbibresource{uni.bib}
     14 
     15 \usepackage{amssymb}
     16 \usepackage{amsthm}
     17 \usepackage{mathtools}
     18 
     19 
     20 
     21 \title{University of Vienna\\
     22 \vspace{1cm}Seminar:\\Complex Network Analysis\\
     23 \vspace{0.5cm}Python Package-Dependency Network
     24 }
     25 \author{Milutin Popović}
     26 
     27 \begin{document}
     28 \maketitle
     29 
     30 \section{Introduction}
     31 
     32 The following project will map python packages and their dependencies into a
     33 time dependent directed graph(complex network), analyze the graph and
     34 visualize it Virtual-Reality (VR). Most python packages have a given set of
     35 dependencies that are needed to use the specific package. Both the package
     36 and the dependency will represent a vertex. The directional edges between
     37 packages will represent their dependence. Note that it is also possible for a
     38 package to have no dependency, in this case it will have no outgoing edges
     39 and the construction of the graph will avoid self loops for such specific
     40 cases. The main analysis of the graph will evolve around the growth exponent
     41 of nodes. Based on the covariance of the fit of the cumulative logarithmic
     42 growth function. The analysis will separate the best performing
     43 packages(nodes) released not before 2016.
     44 
     45 \section{Data}
     46 At the time of this doing this project the number of nodes was around
     47 $3.6\cdot10^{5}$ and the number of links was around $10^{6}$. The data is
     48 sourced from the Python Package Index(PYPI) main website \cite{pypi}. The
     49 gathering of the data resorts to first of all finding out the names of all
     50 the packages then with this information browse the \textit{json}-quivery of
     51 PYPI for every package. The provided data by the Python Package Index has a
     52 \textit{json} type structure for all packages, which includes both the
     53 release date and the dependencies of a specific package(and a lot of other
     54 information too).
     55 
     56 
     57 Querying through all packages
     58 and writing an edge-list in the package-dependency format gives a directed
     59 network as is.
     60 
     61 The important thing is that, given the release information for each package
     62 the graph can be restructured by time steps from the time the first package
     63 was released to the time the last package was released. The reconstruction
     64 will create a network for each month from around the start of the year 2005
     65 to the middle of the year 2022. The edge-list as described above together
     66 with the time stamps of the package releases will be treated as a single data
     67 entry, even though it is not because the edge list represents a connection
     68 between a package and its dependency while the release date is only of the
     69 package and not not its dependency. The first sorting will create a file for
     70 the package) inside the file. The second sorting will go through all of the
     71 files(months) again and check if the dependency of a package was indeed
     72 released, if not it will buffer it and delete the connection to the package
     73 in the current file. Later on when the actual dependency is released, only
     74 then the deleted entry (package-repository) would be added from the buffer to
     75 the currently selected file.
     76 
     77 This turns out this is a sorting problem requiring a lot of buffer and
     78 relying on loops, where python may not the best language for the task. That
     79 said I would advise against executing this code since it needs to sort and
     80 resort $10^{6}$ links based on their time stamp.
     81 
     82 The data and code for sorting can be found under my Github instance
     83 \cite{sorting}.
     84 
     85 \section{Implementation and Analysis}
     86 The implementation of the whole project can be found under my Github instance
     87 \cite{impl}.
     88 
     89 To first of all build a general understanding of then network, a simple
     90 degree distribution observation of the last time step (current day network)
     91 was made. As most complex networks in the internet, this network also follows
     92 a power law distribution. In the figure bellow the degree distribution is
     93 show for the network, without incorporating directionality of the network.
     94 \begin{figure}[H]
     95     \centering
     96     \includegraphics[width=0.8\textwidth]{../pres/pics/dist_u.png}
     97     \caption{The left picture shows the power law distribution, the right
     98     the cumulative degree distribution, both in log-log scales and both
     99     without incorporating directionality. The red dotted line is the regression fit}
    100     \label{dist_u}
    101 \end{figure}
    102 The second case differentiates between outgoing links and incoming links,
    103 the observations are rather similar
    104 \begin{figure}[H]
    105     \centering
    106     \includegraphics[width=0.8\textwidth]{../pres/pics/dist_d.png}
    107     \caption{The left picture shows the power law distribution of in-degree
    108         represented by green squares, out-degree represented by blue
    109         triangles. The right picture shows the cumulative degree
    110         distribution, both in log-log scales. The blue and green dotted lines are
    111         regression lines of in and out degree respectively}
    112     \label{dist_u}
    113 \end{figure}
    114 Another rather interesting thing, is to look at how well different nodes are
    115 ranked in the network based on centrality measures. The figure bellow shows
    116 such analysis.
    117 \begin{figure}[H]
    118     \centering
    119     \includegraphics[width=1\textwidth]{../pres/pics/sneak_peak.png}
    120     \caption{Centrality measures of the network}
    121 \end{figure}
    122 
    123 As for the analysis of the time evolving network, the given tools by the most
    124 common network analysis python programs do not have a way of dealing with
    125 such networks. That is why a child class of the class \texttt{nx.DiGraph} is
    126 invented, calling it \texttt{TDiGraph}. The general concept of this class is
    127 that it incorporates all the class methods of \texttt{nx.DiGraph} and adds an
    128 additional input and three more methods. The additional input is a list of
    129 tuples containing a time stamped edge list in the format \texttt{(package ->
    130 dependency, at time $t$)}. This allows us to create first of all two methods,
    131 starting with a graph at $t=T_i$ the method \texttt{forward} would add all
    132 edges corresponding to time $t=T_{i+1}$ and label the network to be at time
    133 $t=T_{i+1}$. The second method does exactly the reverse and removes nodes
    134 labeling the network to the time $t=T_{i-1}$ this method is called
    135 \texttt{backward}. The third method allows to instantly go to a specific time
    136 from $t=T_k$ to $t=T_i$ for both $k<i$ and $i>k$, doing nothing at $i=k$.
    137 Specifically the first two methods, \texttt{forward} and \texttt{backward}
    138 allow for the iteration over the network. This is especially practical since at
    139 each time step different analytical information is obtained. For example degree
    140 growth over the months shown in the figure bellow.
    141 \begin{figure}[H]
    142     \centering
    143     \includegraphics[width=0.8\textwidth]{../pres/pics/deg_growth.png}
    144     \caption{Node growth in the network over the months}
    145 \end{figure}
    146 Now that the basics are out of the way, a rather interesting prediction to
    147 make is to separate python packages that have performed based on their
    148 \textbf{in}-degree growth rather well and were created in the last couple of
    149 years. The first problem encountered, is that the degree growth of a specific
    150 node has a lot of noise, compared to theoretical models. To avoid this it is
    151 more intuitive to consider the cumulative degree growth, in this case the
    152 curve will be somewhat smoother. The second problem is that even thought this
    153 is a scale free network there are all kinds of degree growths in this network
    154 which do not coincide with theory in \cite{barabasi}. The theory fits a model $k_i(t) =
    155 \left(t/t_i\right)^{\beta_i}$ of degree growth, as mentioned above
    156 the analysis considers the cumulative degree growth and additionally will fit
    157 in the standardized log scale. Then the function fit is linear $\beta_i \cdot x
    158 + c$ for given $x$. The main objective is to find all the $\beta_i$'s
    159 corresponding to the nodes. Since there are all kind of degree growths, it is
    160 worth only considering certain fits satisfying that $\text{CoV}(\beta)$ is
    161 under a certainly well chosen threshold, then  a wide separation of nodes is
    162 already established. It is also worth to motion here, that some test
    163 statistic can be used such as the $\chi^2$ test. But when dealing with a
    164 large number of fits, for the sake of computational cost only a $\text{CoV}$
    165 threshold is considered. Together with considering only packages that were
    166 released not before 2016 the following packages are given by the analysis,
    167 shown in the figure below.
    168 \begin{figure}[H]
    169     \centering
    170     \includegraphics[width=1\textwidth]{../pres/pics/deg_2016_growth.png}
    171     \caption{Top performing packages based on \textbf{in}-degree growth of
    172     nodes, released in not before 2016 in a y-log axis}
    173     \label{fig:findings}
    174 \end{figure}
    175 
    176 The final thing now is to visualize the network in Virtual Reality. For this
    177 a layout needed to be created, since the standard spring layout or similar
    178 does not represent the ``pseudo degree hierarchy'' of the network. The main idea is
    179 to visualize the findings in an intuitive way. Based on this a layout
    180 evolving around the degree distribution of the network was made. To make the
    181 explanation comprehensible the reader should think about a bar diagram
    182 representing a power law distribution, the first few bars would be
    183 very big while the last would be very small. The idea is to take all the
    184 nodes in one bar and map them uniformly on a circle starting from the first
    185 bar to the last bar. The disks would be constructed in such a way that they
    186 will visually fit all the nodes inside, i.e. the radius of the disk would
    187 depend on the node size. Have now $n$ circles/disks corresponding to $n$
    188 bars. To make a three dimensional visualization possible, a stacking of these
    189 disks based on a distance function is made. The most visually compelling is
    190 the $\sqrt{\cdot}$ distance function between the disks, so the $n$-th layer
    191 would be $\sqrt{n}$-away from the first layer where the first layer is the
    192 layer with most nodes. The network would look something like the figure above
    193 \begin{figure}[H]
    194     \centering
    195     \includegraphics[width=0.8\textwidth]{../pres/pics/plot_sqrt.png}
    196     \caption{Scale-Free network with layout based on degree distribution}
    197 \end{figure}
    198 
    199 In the Virtual-Reality setting, a cut-of of the network is made, removing all
    200 nodes, that have an in-degree below $10$ and painting the predicted nodes in
    201 \ref{fig:findings} red.
    202 
    203 \section{Discussion}
    204 In summary, based on an idea, data from the Python Package Index \cite{pypi}
    205 was mapped to a complex network. With the help of release information of each
    206 package, a time evolving network was created. For the sake of dealing with
    207 such graphs a very practical python child class called \texttt{TDiGraph} was
    208 created. It is verified that the network is a scale-free network following a
    209 power-law distribution. Additionally both in the static case and in the
    210 dynamic case of the network, a node degree analysis was made. In the static
    211 case, based on centrality measures. In the dynamic case based on the degree
    212 growth exponent and a release of the node not before 2016. Lastly for the
    213 VR-setting a specific type of layout was constructed based on the networks
    214 degree distribution.
    215 
    216 The initial idea for the VR-setting was to really visualize the time
    217 evolution, in a sense  make a ``gif'' in VR. This ``gif'' would start with
    218 a graph $t=T_{2005}$ add nodes and links until $t=T_{2022}$ last step. While
    219 for each step the networks layout needs to be recalculated, because the
    220 setting of the degrees of each time step is different. This kind of structure
    221 would visualize the ``coming up'' of the predicted nodes in \ref{fig:findings}
    222 labeled in red, as they rise through the layers of the network. However this
    223 could not be realized since the unreal engine of the VR-framework did not
    224 allow for node-link removal nor appendage while changing structure.
    225 
    226 An additional tweak one could go through is to include user downloads for
    227 each package, this data is also provided by PYPI. The number of user
    228 downloads can be incorporated to construct a weighted graph, which would open
    229 doors to further analysis.
    230 
    231 \nocite{barabasi}
    232 \printbibliography
    233 \end{document}
	vrproject Complex Network Analysis VR-project
	git clone git://popovic.xyz/vrproject.git
	Log \| Files \| Refs \| LICENSE