report.tex (12353B)
1 \documentclass[a4paper]{article} 2 \usepackage[T1]{fontenc} 3 \usepackage[utf8]{inputenc} 4 \usepackage{mlmodern} 5 \usepackage{graphicx} 6 \usepackage{geometry} 7 \usepackage{float} 8 \geometry{a4paper, top=15mm} 9 \usepackage[parfill]{parskip} 10 \usepackage[colorlinks=true,naturalnames=true,plainpages=false,pdfpagelabels=true]{hyperref} 11 12 \usepackage[backend=biber, sorting=none]{biblatex} 13 \addbibresource{uni.bib} 14 15 \usepackage{amssymb} 16 \usepackage{amsthm} 17 \usepackage{mathtools} 18 19 20 21 \title{University of Vienna\\ 22 \vspace{1cm}Seminar:\\Complex Network Analysis\\ 23 \vspace{0.5cm}Python Package-Dependency Network 24 } 25 \author{Milutin Popović} 26 27 \begin{document} 28 \maketitle 29 30 \section{Introduction} 31 32 The following project will map python packages and their dependencies into a 33 time dependent directed graph(complex network), analyze the graph and 34 visualize it Virtual-Reality (VR). Most python packages have a given set of 35 dependencies that are needed to use the specific package. Both the package 36 and the dependency will represent a vertex. The directional edges between 37 packages will represent their dependence. Note that it is also possible for a 38 package to have no dependency, in this case it will have no outgoing edges 39 and the construction of the graph will avoid self loops for such specific 40 cases. The main analysis of the graph will evolve around the growth exponent 41 of nodes. Based on the covariance of the fit of the cumulative logarithmic 42 growth function. The analysis will separate the best performing 43 packages(nodes) released not before 2016. 44 45 \section{Data} 46 At the time of this doing this project the number of nodes was around 47 $3.6\cdot10^{5}$ and the number of links was around $10^{6}$. The data is 48 sourced from the Python Package Index(PYPI) main website \cite{pypi}. The 49 gathering of the data resorts to first of all finding out the names of all 50 the packages then with this information browse the \textit{json}-quivery of 51 PYPI for every package. The provided data by the Python Package Index has a 52 \textit{json} type structure for all packages, which includes both the 53 release date and the dependencies of a specific package(and a lot of other 54 information too). 55 56 57 Querying through all packages 58 and writing an edge-list in the package-dependency format gives a directed 59 network as is. 60 61 The important thing is that, given the release information for each package 62 the graph can be restructured by time steps from the time the first package 63 was released to the time the last package was released. The reconstruction 64 will create a network for each month from around the start of the year 2005 65 to the middle of the year 2022. The edge-list as described above together 66 with the time stamps of the package releases will be treated as a single data 67 entry, even though it is not because the edge list represents a connection 68 between a package and its dependency while the release date is only of the 69 package and not not its dependency. The first sorting will create a file for 70 the package) inside the file. The second sorting will go through all of the 71 files(months) again and check if the dependency of a package was indeed 72 released, if not it will buffer it and delete the connection to the package 73 in the current file. Later on when the actual dependency is released, only 74 then the deleted entry (package-repository) would be added from the buffer to 75 the currently selected file. 76 77 This turns out this is a sorting problem requiring a lot of buffer and 78 relying on loops, where python may not the best language for the task. That 79 said I would advise against executing this code since it needs to sort and 80 resort $10^{6}$ links based on their time stamp. 81 82 The data and code for sorting can be found under my Github instance 83 \cite{sorting}. 84 85 \section{Implementation and Analysis} 86 The implementation of the whole project can be found under my Github instance 87 \cite{impl}. 88 89 To first of all build a general understanding of then network, a simple 90 degree distribution observation of the last time step (current day network) 91 was made. As most complex networks in the internet, this network also follows 92 a power law distribution. In the figure bellow the degree distribution is 93 show for the network, without incorporating directionality of the network. 94 \begin{figure}[H] 95 \centering 96 \includegraphics[width=0.8\textwidth]{../pres/pics/dist_u.png} 97 \caption{The left picture shows the power law distribution, the right 98 the cumulative degree distribution, both in log-log scales and both 99 without incorporating directionality. The red dotted line is the regression fit} 100 \label{dist_u} 101 \end{figure} 102 The second case differentiates between outgoing links and incoming links, 103 the observations are rather similar 104 \begin{figure}[H] 105 \centering 106 \includegraphics[width=0.8\textwidth]{../pres/pics/dist_d.png} 107 \caption{The left picture shows the power law distribution of in-degree 108 represented by green squares, out-degree represented by blue 109 triangles. The right picture shows the cumulative degree 110 distribution, both in log-log scales. The blue and green dotted lines are 111 regression lines of in and out degree respectively} 112 \label{dist_u} 113 \end{figure} 114 Another rather interesting thing, is to look at how well different nodes are 115 ranked in the network based on centrality measures. The figure bellow shows 116 such analysis. 117 \begin{figure}[H] 118 \centering 119 \includegraphics[width=1\textwidth]{../pres/pics/sneak_peak.png} 120 \caption{Centrality measures of the network} 121 \end{figure} 122 123 As for the analysis of the time evolving network, the given tools by the most 124 common network analysis python programs do not have a way of dealing with 125 such networks. That is why a child class of the class \texttt{nx.DiGraph} is 126 invented, calling it \texttt{TDiGraph}. The general concept of this class is 127 that it incorporates all the class methods of \texttt{nx.DiGraph} and adds an 128 additional input and three more methods. The additional input is a list of 129 tuples containing a time stamped edge list in the format \texttt{(package -> 130 dependency, at time $t$)}. This allows us to create first of all two methods, 131 starting with a graph at $t=T_i$ the method \texttt{forward} would add all 132 edges corresponding to time $t=T_{i+1}$ and label the network to be at time 133 $t=T_{i+1}$. The second method does exactly the reverse and removes nodes 134 labeling the network to the time $t=T_{i-1}$ this method is called 135 \texttt{backward}. The third method allows to instantly go to a specific time 136 from $t=T_k$ to $t=T_i$ for both $k<i$ and $i>k$, doing nothing at $i=k$. 137 Specifically the first two methods, \texttt{forward} and \texttt{backward} 138 allow for the iteration over the network. This is especially practical since at 139 each time step different analytical information is obtained. For example degree 140 growth over the months shown in the figure bellow. 141 \begin{figure}[H] 142 \centering 143 \includegraphics[width=0.8\textwidth]{../pres/pics/deg_growth.png} 144 \caption{Node growth in the network over the months} 145 \end{figure} 146 Now that the basics are out of the way, a rather interesting prediction to 147 make is to separate python packages that have performed based on their 148 \textbf{in}-degree growth rather well and were created in the last couple of 149 years. The first problem encountered, is that the degree growth of a specific 150 node has a lot of noise, compared to theoretical models. To avoid this it is 151 more intuitive to consider the cumulative degree growth, in this case the 152 curve will be somewhat smoother. The second problem is that even thought this 153 is a scale free network there are all kinds of degree growths in this network 154 which do not coincide with theory in \cite{barabasi}. The theory fits a model $k_i(t) = 155 \left(t/t_i\right)^{\beta_i}$ of degree growth, as mentioned above 156 the analysis considers the cumulative degree growth and additionally will fit 157 in the standardized log scale. Then the function fit is linear $\beta_i \cdot x 158 + c$ for given $x$. The main objective is to find all the $\beta_i$'s 159 corresponding to the nodes. Since there are all kind of degree growths, it is 160 worth only considering certain fits satisfying that $\text{CoV}(\beta)$ is 161 under a certainly well chosen threshold, then a wide separation of nodes is 162 already established. It is also worth to motion here, that some test 163 statistic can be used such as the $\chi^2$ test. But when dealing with a 164 large number of fits, for the sake of computational cost only a $\text{CoV}$ 165 threshold is considered. Together with considering only packages that were 166 released not before 2016 the following packages are given by the analysis, 167 shown in the figure below. 168 \begin{figure}[H] 169 \centering 170 \includegraphics[width=1\textwidth]{../pres/pics/deg_2016_growth.png} 171 \caption{Top performing packages based on \textbf{in}-degree growth of 172 nodes, released in not before 2016 in a y-log axis} 173 \label{fig:findings} 174 \end{figure} 175 176 The final thing now is to visualize the network in Virtual Reality. For this 177 a layout needed to be created, since the standard spring layout or similar 178 does not represent the ``pseudo degree hierarchy'' of the network. The main idea is 179 to visualize the findings in an intuitive way. Based on this a layout 180 evolving around the degree distribution of the network was made. To make the 181 explanation comprehensible the reader should think about a bar diagram 182 representing a power law distribution, the first few bars would be 183 very big while the last would be very small. The idea is to take all the 184 nodes in one bar and map them uniformly on a circle starting from the first 185 bar to the last bar. The disks would be constructed in such a way that they 186 will visually fit all the nodes inside, i.e. the radius of the disk would 187 depend on the node size. Have now $n$ circles/disks corresponding to $n$ 188 bars. To make a three dimensional visualization possible, a stacking of these 189 disks based on a distance function is made. The most visually compelling is 190 the $\sqrt{\cdot}$ distance function between the disks, so the $n$-th layer 191 would be $\sqrt{n}$-away from the first layer where the first layer is the 192 layer with most nodes. The network would look something like the figure above 193 \begin{figure}[H] 194 \centering 195 \includegraphics[width=0.8\textwidth]{../pres/pics/plot_sqrt.png} 196 \caption{Scale-Free network with layout based on degree distribution} 197 \end{figure} 198 199 In the Virtual-Reality setting, a cut-of of the network is made, removing all 200 nodes, that have an in-degree below $10$ and painting the predicted nodes in 201 \ref{fig:findings} red. 202 203 \section{Discussion} 204 In summary, based on an idea, data from the Python Package Index \cite{pypi} 205 was mapped to a complex network. With the help of release information of each 206 package, a time evolving network was created. For the sake of dealing with 207 such graphs a very practical python child class called \texttt{TDiGraph} was 208 created. It is verified that the network is a scale-free network following a 209 power-law distribution. Additionally both in the static case and in the 210 dynamic case of the network, a node degree analysis was made. In the static 211 case, based on centrality measures. In the dynamic case based on the degree 212 growth exponent and a release of the node not before 2016. Lastly for the 213 VR-setting a specific type of layout was constructed based on the networks 214 degree distribution. 215 216 The initial idea for the VR-setting was to really visualize the time 217 evolution, in a sense make a ``gif'' in VR. This ``gif'' would start with 218 a graph $t=T_{2005}$ add nodes and links until $t=T_{2022}$ last step. While 219 for each step the networks layout needs to be recalculated, because the 220 setting of the degrees of each time step is different. This kind of structure 221 would visualize the ``coming up'' of the predicted nodes in \ref{fig:findings} 222 labeled in red, as they rise through the layers of the network. However this 223 could not be realized since the unreal engine of the VR-framework did not 224 allow for node-link removal nor appendage while changing structure. 225 226 An additional tweak one could go through is to include user downloads for 227 each package, this data is also provided by PYPI. The number of user 228 downloads can be incorporated to construct a weighted graph, which would open 229 doors to further analysis. 230 231 \nocite{barabasi} 232 \printbibliography 233 \end{document}