My workplace has an isolated network, and developers write applications that run inside it. These developers often write Python code, which often requires modules from the Python Package Index (PyPI), downloaded with something like pip. To let these developers run their software on the isolated network, I need to provide them with a PyPI mirror so they can get the modules their software depends upon. To accomplish this, I set up a bandersnatch mirror on a VM in my DMZ and configured our VMs to use this mirror instead of the (unreachable from their perspective) internet. So far, so good.
Policy dictates that software coming into our isolated network must be approved first, meaning I cannot simply mirror the entire Python Package Index (even if I had the disk space for such a thing, and I'm not sure I do) and let people have whatever they want. Bandersnatch supports this with no problem; by including this in its config file, I essentially have a list of ONLY the approved packages that will show up on the mirror:
[allowlist]
packages =
    absl-py
    astunparse
    caproto
    confluent_kafka
    ConfigArgParse
    ess-streaming-data-types
    flatbuffers
    gast
    gpytorch
    ...<and so on>...
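Since this section has to be kept in step with whatever list of approved packages is current, it could be generated rather than edited by hand. The following is only a minimal sketch under assumptions: `render_allowlist` is a hypothetical helper (not part of Bandersnatch), the names passed to it are placeholders, and continuation lines are indented so that an INI parser reads them as one multi-line `packages` value.

```perl
use strict;
use warnings;

# Hypothetical helper: render an [allowlist] section from package names.
# Each name goes on its own indented continuation line.
sub render_allowlist {
    my @names = @_;
    my %seen;
    my @lines = ('[allowlist]', 'packages =');
    foreach my $name (sort { lc($a) cmp lc($b) } @names) {
        next if $seen{ lc($name) }++;    # skip case-insensitive duplicates
        push @lines, "    $name";
    }
    return join("\n", @lines) . "\n";
}

# Placeholder input; in practice this would be the approved package list.
print render_allowlist(qw(gast absl-py caproto gast));
```

The output can be pasted into (or redirected over) the relevant part of the config file.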
Unfortunately, Bandersnatch does not do any dependency resolution on the list of packages it is allowed to mirror, and it does not seem likely that this feature will be implemented. No other PyPI mirror server seems to be set up to allow this either, but if I have missed something, please let me know.
So I find myself with a list of Python packages whose dependencies I need to know, recursively, so I can mirror the whole list. I am not a Python programmer, so I wrote the resolver in Perl. I think it works, but I'd really appreciate another set of eyes on it.
My Code: https://pastebin.com/hR6Yru4W
#!/usr/bin/env perl
use strict;
use warnings;
# This script is a (probably naive) attempt at writing a recursive dependency resolver for the Python Package Index (PyPI)
# A list of (space-separated) Python package names provided on the command line is the starting point
# Each package in that list is added to a graph, then every node (vertex) in that graph has edges defined between the node itself
# and each of its dependencies (which become their own nodes as a result)
# Every node in the graph is iterated through, has its dependencies (edges) defined, and the process is repeated until
# the graph stops changing. Once there are no more dependencies to find, it prints out an (alphabetically sorted) list
# of every package that would be required to install every package given as an argument to this script. Note that this list
# WILL INCLUDE the packages that were initially provided as part of the list.
use JSON;
use Graph;
# Our main dependency graph objects
# We need two so we can check if the graph changed between dependency runs
# When the two graphs stay the same after a dependency run, we know our graph is complete
my $Agraph = Graph->new();
my $Bgraph = Graph->new();
# Accepts a list of Python packages to check dependencies for
# Returns a reference to a list of the Python packages the arguments depend upon
# Note that this function is NOT RECURSIVE. It ONLY provides the direct dependencies.
sub get_deps {
    my $ret = [];
    my $json = JSON->new;
    foreach my $package ( @_ ) {
        my $curl = `curl -s "https://pypi.org/pypi/$package/json"`;
        # Skip packages whose lookup did not return valid JSON (e.g. an empty response)
        my $data = eval { $json->decode($curl) };
        next unless ref($data) eq 'HASH';
        my $reqs = $data->{info}->{requires_dist};
        if (defined($reqs)) {
            foreach my $dep (@$reqs) {
                # Keep only the project name; names may contain ".", "_" and "-"
                $dep =~ s/[^a-zA-Z0-9._-].*$//;
                push(@$ret, $dep);
            }
        }
    }
    return $ret;
}
# @ARGV is the list of packages we need to find all dependencies for; they are the roots of our dependency graph
foreach my $package (@ARGV) {
    $Agraph->add_vertex($package);
}
# Packages we have already gotten dependencies for, so we don't check twice
# (a hash lookup avoids treating "." in a package name as a regex wildcard)
my %checked;
# Loop until our graphs stop changing between dependency runs
while ( $Agraph ne $Bgraph ) {
    $Bgraph = $Agraph->deep_copy();
    # Check every single vertex in our graph
    my @vertlist = $Agraph->vertices;
    foreach my $package (@vertlist) {
        # If we haven't checked this package before, add it to the graph with all of its dependencies
        if (! $checked{$package}) {
            my $deplist = get_deps($package);
            foreach my $dep (@$deplist) {
                $Agraph->add_edge($package, $dep);
            }
            # Record this package as already checked so we don't waste time
            $checked{$package} = 1;
        }
    }
}
my @fulldeplist = sort $Agraph->vertices;
foreach my $package (@fulldeplist) {
    print "$package\n";
}
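For comparison, the same fixed point can be reached without copying a graph on every pass, by keeping a queue of names still to look up and a hash of names already seen. This is only a sketch under assumptions, not a drop-in replacement: `resolve_all` and `normalize_name` are hypothetical helpers, the lookup is passed in as a code reference so the example runs against a small made-up table instead of PyPI, and names are normalized the way PEP 503 describes (lowercased, with runs of `-`, `_` and `.` collapsed to a single `-`).

```perl
use strict;
use warnings;

# PEP 503 normalization: lowercase, runs of "-", "_", "." become one "-".
sub normalize_name {
    my ($name) = @_;
    (my $norm = lc $name) =~ s/[-_.]+/-/g;
    return $norm;
}

# Hypothetical helper: return every package reachable from @roots.
# $lookup is a code ref returning the direct dependencies of one package.
sub resolve_all {
    my ($lookup, @roots) = @_;
    my %seen;
    my @queue = map { normalize_name($_) } @roots;
    while (defined(my $package = shift @queue)) {
        next if $seen{$package}++;    # each package is looked up only once
        push @queue, map { normalize_name($_) } $lookup->($package);
    }
    return sort keys %seen;
}

# Made-up dependency table standing in for the PyPI JSON API.
my %fake_index = (
    'app'   => ['Lib_A', 'lib-b'],
    'lib-a' => ['lib-b', 'lib.c'],
    'lib-b' => [],
    'lib-c' => ['app'],               # a cycle, to show that the loop ends
);

my @all = resolve_all(sub { @{ $fake_index{ $_[0] } // [] } }, 'app');
print "$_\n" for @all;
```

In the real script, the code reference would wrap a lookup such as `get_deps`. Normalizing names before recording them keeps spellings like `Lib_A` and `lib-a` from being treated as two different packages.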
pip-compile can translate a collection of required packages (requirements.in) into a full list including their dependencies (requirements.txt). This tool will do all of the graph resolution for you, including resolving specific compatible versions for every package.

The requires_dist field might include more than just the runtime packages you would care about. Looking at pypi.org/pypi/pip-tools/json for example I can see a lot of packages that are only needed for someone testing the package, which would not be necessary for you to simply use the package. If you take that to the extreme I could easily see your package list blowing up as you grab every test and non-prod package for every transitive dependency in the tree.
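One way to act on the point about `requires_dist` is to skip any entry whose environment marker mentions `extra`, since such entries only apply when an optional extra is requested. The sketch below is a rough regex-based filter, not a full PEP 508 parser; `runtime_deps` is a hypothetical helper, and the sample strings are made up in the general shape the PyPI JSON API returns.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only requirements that apply without extras,
# and reduce each one to its bare project name.
sub runtime_deps {
    my @names;
    foreach my $req (@_) {
        # Everything after the first ";" is the environment marker, if any.
        my ($spec, $marker) = split /;/, $req, 2;
        next if defined $marker && $marker =~ /\bextra\s*==/;    # optional extra only
        # A name starts with a letter or digit and may contain "-", "_", ".".
        push @names, $1 if $spec =~ /^\s*([A-Za-z0-9][A-Za-z0-9._-]*)/;
    }
    return @names;
}

# Made-up requirement strings for illustration.
my @sample = (
    'click>=8',
    'typing_extensions; python_version < "3.11"',
    'pytest>=7.2.0; extra == "testing"',
    'zope.interface (>=5.0)',
);
print "$_\n" for runtime_deps(@sample);
```

Entries with other markers (such as a Python version constraint) are kept here, which errs on the side of mirroring slightly too much rather than too little.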