This is the second article in the “reproducible builds” series. Previously: Improvements in testing and building: GitLab CI and reproducible builds.

In the previous article, Improvements in testing and building: GitLab CI and reproducible builds, we discussed reproducible builds and our current short-term goals for them in Qubes OS. Notably, we aimed to start by building our Debian templates such that packages can be installed only when configured rebuilders confirm that they really came from the source code we publish. Today, we go beyond this expectation.

Reproducible builds: retrieve the past

The challenge in reproducible builds lies in rebuilding a package in the same environment as the one used for the officially published build. This means that we need to retrieve every single package version that was used as a dependency in order to rebuild a given package. For Debian, some packages in the current release were built several releases ago, with dependency versions that may differ from those available today. To retrieve them, there is only one solution: a Debian service called snapshot.debian.org, an archive that acts as a Wayback Machine and gives access to old packages based on dates and version numbers. It contains all past and present packages that the Debian archive provides.

Unfortunately, this service is known to suffer from significant usability issues that block this kind of work. For example, watch the DebConf 2021 talk Making use of snapshot.debian.org for fun and profit and have a look at some related Debian issues like #977653, #960304, #969906, #969603, and #782857. To summarize: there are throttling limits and availability issues, such as connections being cut off repeatedly and partial content being returned.

As announced in our previous article, we developed our own rebuilder tool, debrebuild, which rebuilds a single Debian package, together with a rebuilder orchestrator, PackageRebuilder. We started to put them into production in order to actively rebuild Qubes OS and Debian packages, but the setup quickly ceased to function, as the snapshot.debian.org service was unable to sustain the load of rebuilding even a single Debian package. The question then was: how should we proceed in order to make it work? Clearly, those issues are critical, and they make the snapshot.debian.org service impractical, if not unusable, for reproducible builds.

Is rebuilding Debian really possible?

The service has existed for more than a decade, yet the snapshot.debian.org issues described above have still not been addressed after several years. Whether the cause is a design problem or a lack of resources, we had to do something.

That’s why we decided to create our own snapshot service. That is easier said than done. First, the original Debian snapshot service holds roughly 90 TB of repository data. Second, files cannot be fetched easily: only HTTP(S) is available, so downloading many files exposes us to the same availability issues that affect snapshot.debian.org.

To keep the volume of data manageable, we decided to mirror only repositories from 2017 (approximately when development of Debian “Buster” began) to today, and only the architectures we need: amd64, source, and all. (all indicates no specific architecture in the Debian world.)

For the download itself, we needed to parse the metadata of each Debian repository to get the list of files to download for every timestamp at which a snapshot had been made. We then developed resume and retry download functions, which unfortunately amount to brute force.

For storage, we took a simple approach: each file is stored under its SHA-256 hash as its name, and symlinks reconstruct the repository layout. Getting file information (package and repository metadata) then only requires reading a symlink. It took 3-4 months to get 4.2 TB of data, covering 2017 to the present. Most of the information about the downloaded files and their source repositories is stored in a database.

In parallel, we added an API, snapshot-api, to expose information about repositories, as the original snapshot.debian.org does. Unlike the original, ours exposes much more of the information that rebuilder software such as debrebuild needs when requesting package information, namely the exact location of a given package in terms of Debian archive, timestamp, suite, architecture, and component. The service is now publicly available at https://snapshot.notset.fr, with the API endpoints at https://snapshot.notset.fr/mr. The service is home-hosted by the author.
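The storage scheme described above (files named by their SHA-256 hash, with symlinks rebuilding the repository layout) can be illustrated with a minimal sketch. This is not the code behind snapshot.notset.fr: the function names, the `by-hash` and `archive` directory names, and the two-character sharding are all assumptions made for the example.

```python
import hashlib
import os
from pathlib import Path


def store_file(store_root: Path, repo_path: str, data: bytes) -> Path:
    """Store `data` under its SHA-256 name and symlink it at `repo_path`.

    Hypothetical layout (an assumption for this sketch):
      <store_root>/by-hash/<first two hex digits>/<sha256>   the content
      <store_root>/archive/<repo_path>                       a symlink to it
    """
    digest = hashlib.sha256(data).hexdigest()
    blob = store_root / "by-hash" / digest[:2] / digest
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        # Identical content seen in many snapshots is written only once.
        blob.write_bytes(data)

    link = store_root / "archive" / repo_path
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()
    # A relative target keeps the tree valid if store_root is moved.
    link.symlink_to(os.path.relpath(blob, link.parent))
    return link


def sha256_of(link: Path) -> str:
    """Recover a file's SHA-256 by reading the symlink, without rehashing."""
    return Path(os.readlink(link)).name
```

With this layout, the same package appearing under two snapshot timestamps costs one stored file plus two symlinks, and looking up its hash is a single `readlink` call rather than a read of the whole file.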

This is exactly where the dream of rebuilding Debian packages in the same environment in which they were officially published became a reality. Thanks to our standalone orchestrator and our rebuilder software debrebuild, the results of the rebuilding process, links to reproducibility attestations (in-toto metadata), and even the reasons why a package is not reproducible can all be found at https://rebuild.notset.fr.

As of this writing, we have successfully rebuilt more than 80% of the latest Debian packages for the unstable release while doing tests. Since the rebuilds started, several adjustments have been made, and we have finally reached a stable rebuilding process. That is why, after a few late improvements made during this nearly complete first full rebuild, we flushed it all and started again for the latest Debian stable release, Bullseye. We will rebuild unstable again once the full rebuild of Bullseye is complete. The number of pending tasks will keep shrinking over time; a couple thousand package rebuilds remain.

Please note that rebuilding a package involves more than the build itself: it means querying the snapshot.notset.fr API multiple times to get package information and locations, setting up the same environment as the one originally used, and finally, actually building the package. All of this is possible thanks to several servers, home-hosted by the author, that have been intensively building packages non-stop for more than a month.
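Before any snapshot API query can be made, a rebuilder has to know which exact package versions made up the original build environment. Debian records them in the Installed-Build-Depends field of a package's .buildinfo file. The sketch below extracts those pins; it is an illustration rather than debrebuild's actual code. The function name is hypothetical, the sample file content is made up for the example, and the parser assumes an unsigned file whose field value starts on the line after the field name.

```python
import re

# Illustrative sample only, not taken from a real build.
SAMPLE_BUILDINFO = """\
Format: 1.0
Source: hello
Binary: hello
Architecture: amd64
Version: 2.10-2
Build-Architecture: amd64
Installed-Build-Depends:
 autoconf (= 2.69-11),
 base-files (= 11.1),
 gcc-10 (= 10.2.1-6)
Environment:
 DEB_BUILD_OPTIONS="parallel=4"
"""

# One entry looks like "name (= version)", optionally "name:arch (= version)".
_DEP = re.compile(
    r"^(?P<name>[a-z0-9][a-z0-9+.\-]+)(?::\S+)?\s+\(=\s*(?P<version>[^)\s]+)\)$"
)


def installed_build_depends(buildinfo_text):
    """Return [(package, version), ...] from Installed-Build-Depends."""
    pins = []
    in_field = False
    for line in buildinfo_text.splitlines():
        if line.startswith("Installed-Build-Depends:"):
            in_field = True
            continue
        if in_field:
            # Continuation lines are indented; anything else starts a new field.
            if not line.startswith((" ", "\t")):
                break
            match = _DEP.match(line.strip().rstrip(",").strip())
            if match:
                pins.append((match["name"], match["version"]))
    return pins
```

Each returned pair is one exact version that must be located in a snapshot, which is why a single rebuild translates into many API queries.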

What’s next?

For Qubes OS, we already track reproducibility status in our continuous integration (CI) tests (see the previous article for details). Qubes OS packages are also rebuilt independently, like Debian packages, in the same PackageRebuilder instance. We already have most of the reproducibility attestations for our specific Debian packages (see https://rebuild.notset.fr/qubesos.html), and we will soon have all the needed ones for Debian. As a result, we are happy to announce that we have already started integrating the rebuild status check both into the build phase of our Debian templates and into the later installation of packages in the template itself. That’s the reason we restarted the whole process of a full rebuild for Bullseye.

There is preliminary work on integrating Fedora into the orchestrator, but that deserves a separate effort. The rebuilder rpmreproduce can be used to rebuild Fedora packages, but some discussions with RPM upstream are still needed (see https://github.com/rpm-software-management/rpm/pull/1532). We also plan to support input other than a buildinfo file for RPM, such as a build description from Koji (the build infrastructure used by Fedora and CentOS) or any other description that makes it clear how an RPM package was built. Finally, we expect that adding other distributions will be fairly quick and easy, starting with Arch Linux, which we are going to ship officially soon.

Conclusion

Improved documentation for the orchestrator is in progress to make it easier for others who want to rebuild Qubes OS or Debian in the same way that we are currently doing it. Having more independent rebuilders publishing reproducibility attestations would be especially good for the community.

Throughout these efforts, we have been very pleased that the Reproducible Builds Project decided to use our work and results as an example of what it has been advocating for years, notably for Debian. The official website https://beta.tests.reproducible-builds.org currently mirrors our results website https://rebuild.notset.fr.

The author warmly thanks Marta Marczykowska-Górecka and Marek Marczykowski-Górecki for their moral support and technical discussions throughout this rough and intensive journey while juggling other projects.