dimatura 8 hours ago

This sounds pretty similar to CDE, which I see they cite in the paper. Back in the pre-docker days I remember using CDE a few times to package some C++ code to run on some servers that didn't have the libraries I needed. Pretty cool tool.

zahlman a day ago

Maybe they're just using "experiment" as some kind of data-scientist jargon that I don't understand, but this reads to me like just a way to package Python code, and from the description I don't understand why or when I would prefer this to making an sdist or wheel with standard tools.

Edit: I guess the idea is that this is automatically discovering non-Python system dependencies and attempting to include them as well? Either way, the developers should probably get in touch with the people behind https://pypackaging-native.github.io/ which has been trying to identify and solve problems with using the standard Python ecosystem tools in the "PyData ecosystem". (This effort has led to proposals such as https://peps.python.org/pep-0725/.)

  • westurner a day ago

    Does manylinux help with this? https://news.ycombinator.com/item?id=43553198 :

    > Manylinux requires tools called auditwheel for Linux, delocate for macOS, and delvewheel for Windows, which do something like ldd to list the shared libraries.
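
    Roughly, that "something like ldd" step can be sketched in a few lines of Python. This is a simplified illustration of the idea, not what auditwheel actually runs, and the extension-module path is a placeholder:

      # List the shared libraries a compiled extension module links against,
      # by parsing ldd output (Linux only).
      import subprocess, sys

      def shared_lib_deps(ext_module_path):
          out = subprocess.run(["ldd", ext_module_path],
                               capture_output=True, text=True, check=True).stdout
          deps = []
          for line in out.splitlines():
              parts = line.split("=>")
              if len(parts) == 2 and parts[1].strip().startswith("/"):
                  # keep the resolved path, drop the trailing "(0x...)" address
                  deps.append(parts[1].split("(")[0].strip())
          return deps

      if __name__ == "__main__":
          for lib in shared_lib_deps(sys.argv[1]):
              print(lib)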

    From the auditwheel readme: https://github.com/pypa/auditwheel :

    > auditwheel show: shows external shared libraries that the wheel depends on (beyond the libraries included in the manylinux policies), and checks the extension modules for the use of versioned symbols that exceed the manylinux ABI.

    > auditwheel repair: copies these external shared libraries into the wheel itself, and automatically modifies the appropriate RPATH entries such that these libraries will be picked up at runtime. This accomplishes a similar result as if the libraries had been statically linked without requiring changes to the build system. Packagers are advised that bundling, like static linking, may implicate copyright concerns

    PyInstaller docs: https://pyinstaller.org/en/stable/ :

    > PyInstaller bundles a Python application and all its dependencies into a single package. The user can run the packaged app without installing a Python interpreter or any modules. PyInstaller supports Python 3.8 and newer, and correctly bundles many major Python packages such as numpy, matplotlib, PyQt, wxPython, and others.
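
    As a hedged usage sketch, PyInstaller can also be driven from Python rather than the command line; the entry-point script name here is a placeholder:

      # Equivalent to running `pyinstaller --onefile my_script.py`.
      import PyInstaller.__main__

      PyInstaller.__main__.run([
          "--onefile",      # emit a single self-contained executable
          "my_script.py",   # placeholder entry-point script
      ])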

    conda/constructor is a tool for creating installers from conda packages: https://github.com/conda/constructor

    Grayskull creates conda-forge recipes from PyPI and other packages: https://github.com/conda/grayskull

    conda-forge builds packages for Windows, macOS, and Linux on amd64 and arm64, and emscripten-forge builds conda packages for WASM (WebAssembly).

    SBOM tools attempt to discover package metadata, which should include a manifest with per-file checksums. Can dependency auto-discovery discover package metadata relevant to software supply chain security?
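
    For the per-file checksum part, a minimal sketch (not any particular SBOM or wheel RECORD format; the package directory is a placeholder) might look like:

      # Walk a package tree and record a SHA-256 checksum for every file.
      import hashlib, json, pathlib

      def checksum_manifest(root):
          root = pathlib.Path(root)
          manifest = {}
          for path in sorted(root.rglob("*")):
              if path.is_file():
                  digest = hashlib.sha256(path.read_bytes()).hexdigest()
                  manifest[str(path.relative_to(root))] = digest
          return manifest

      if __name__ == "__main__":
          print(json.dumps(checksum_manifest("my_package"), indent=2))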

    dvc is a workflow tool layered on git that supports Experiments: https://dvc.org/doc/start/experiments/experiment-tracking :

    > Experiment: A versioned iteration of ML model development. DVC tracks experiments as Git commits that DVC can find but that don't clutter your Git history or branches. Experiments may include code, metrics, parameters, plots, and data and model artifacts.
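
    A minimal experiment-logging sketch with dvclive (assuming `pip install dvclive` inside a DVC-initialized Git repo; the parameter and metric values are placeholders):

      from dvclive import Live

      with Live() as live:
          live.log_param("lr", 0.01)
          for epoch in range(3):
              acc = 0.5 + 0.1 * epoch   # stand-in for a real metric
              live.log_metric("acc", acc)
              live.next_step()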

    A sufficient packaging format must have per-file checksums and signatures. https://SLSA.dev/ says any of TUF, Sigstore.dev, and/or OCI containers with signatures suffice.

    • zahlman a day ago

      All of these tools definitely help for the people who use them. In particular, the manylinux standard and associated tools are why I can reliably `pip install numpy` without even thinking about whether it will work, and regardless of whether (on Linux) there is a system package for OpenBLAS (which will be disregarded, unless of course you use a system-packaged version of Numpy instead). But there are also definitely still unmet needs.

gnat a day ago

> It tracks operating system calls and creates a package that contains all the binaries, files and dependencies required to run a given command on the author's computational environment (packing step). A reviewer can then extract the experiment in his environment to reproduce the results (unpacking step).

Vagrant and Docker behind the scenes. Very cool, and a welcome step up from a tarball.
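
The packing step can be roughly approximated in user space with strace; this is a sketch of the idea only, assuming Linux with strace installed, and real tools (CDE, the paper's tool) trace far more than openat and handle the edge cases:

  # Run a command under strace, collect the files it opened, tar them up.
  import os, re, subprocess, sys, tarfile

  def pack(cmd, archive="experiment.tar.gz"):
      trace = subprocess.run(["strace", "-f", "-e", "trace=openat"] + cmd,
                             capture_output=True, text=True).stderr
      opened = set(re.findall(r'openat\([^,]+, "([^"]+)"', trace))
      with tarfile.open(archive, "w:gz") as tar:
          for path in sorted(opened):
              if not os.path.isfile(path):
                  continue  # skip directories, sockets, missing files
              try:
                  tar.add(path, recursive=False)
              except PermissionError:
                  pass  # skip files we cannot read
      print(f"wrote {archive} with {len(opened)} traced paths")

  if __name__ == "__main__":
      pack(sys.argv[1:])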

lorenzohess a day ago

Would be cool to have a native integration with Git:

- preserve archive integrity

- signed archives for security

- metadata (commit messages, tags) can associate each experiment with e.g. procedure, methodology, technicians

- branches for modified experiments

- easy cloud storage
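
A hypothetical sketch of that integration, using signed annotated tags plus git archive (tag name, message, and GPG signing setup are all placeholders):

  import subprocess

  def archive_experiment(tag, message):
      # A signed tag ties the experiment to a commit plus metadata
      # (verify later with `git tag -v <tag>`), and `git archive`
      # exports that exact tree for cloud storage or review.
      subprocess.run(["git", "tag", "-s", tag, "-m", message], check=True)
      subprocess.run(["git", "archive", "-o", f"{tag}.tar.gz", tag], check=True)

  archive_experiment("exp-001", "baseline run; procedure: see protocol notes")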