Improved Gradle Dependency Cache


(Hans Dockter) #1

summary: A new dependency cache that solves a couple of fundamental problems with current repository-based enterprise builds based on Maven, Ant + Ivy or Gradle.
status: Fixing the last remaining issues. Available in milestone 5.
code: completed

This is a copy of an announcement posting (http://gsfn.us/t/2gb0i). The content is identical. As we are working on deeper integration with our roadmap status board, we needed to re-post it as an idea.

Introduction

A key requirement of an enterprise build system is build reproducibility. Current local dependency caches, such as those implemented by Ivy or Maven, create many problems in this respect for repository-based enterprise builds. This has been the case regardless of the build system in use, be it Ant + Ivy, Gradle, Leiningen, SBT, or Maven.

With our new cache implementation, Gradle addresses many of the caching challenges. We are very excited about our investment in this segment of the build system. In this document we want to share the reasons why we have implemented a new cache. First we would like to say a big thank you to Fred Simon from JFrog, co-founder of Artifactory, for starting the new cache project and implementing the initial version.

Problems with current caches

Status Quo

So far, local dependency caches don’t properly take the artifact origin (specifically, the URL or other source address) into account. The artifact is simply stored together with its metadata (e.g. pom.xml or ivy.xml). Regardless of the build and its repository configuration, the same artifacts are always returned, based on a simple name-matching pattern. In the following sections, we describe scenarios where this behavior leads to problems and how we are solving them in Gradle. We also describe other common problems of current dependency caches and how Gradle aims to improve on those solutions.

Hiding problems due to repository changes

Imagine a new repository is introduced with a different URL, and not all artifacts from the original repository have been properly migrated. For people who have successfully built the projects against the original repository, the build will still work, since the local cache provides all the artifacts. When a new developer checks out the projects, however, the builds fail on unresolved dependencies. The cache is hiding a configuration problem.

Jars with the same name might be different

Imagine a developer who has worked on project Foo and is now also working on project Bar. Foo and Bar use different repositories. Both use an in-house library with the artifact name superlib. Both the Foo and the Bar repositories contain a superlib-1.0.jar, but in this case the jars are not the same. This is a messy situation, but it is also a frequent reality in the enterprise. The developer now builds Bar, which uses the superlib-1.0.jar from Foo because the local cache returns it. The build fails compilation or tests and nobody knows why. The other Bar developers can’t reproduce the problem because they are not working on Foo. The cache is creating special behaviour which is hard to debug.
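The superlib scenario can be reproduced with a toy model. This is only a sketch of a cache keyed by artifact name, not the actual Ivy or Maven code; the class name NameKeyedCache, the example URLs, and the byte strings standing in for jar contents are all made up for illustration.

```python
# Hypothetical model of a local cache that is keyed by artifact name only.
class NameKeyedCache:
    def __init__(self):
        self._entries = {}  # artifact name -> content

    def resolve(self, repo_url, name, remote_repos):
        # The repository URL is only used on a cache miss.
        # On a hit, the origin of the cached file is never checked.
        if name not in self._entries:
            self._entries[name] = remote_repos[repo_url][name]
        return self._entries[name]


# Two repositories that publish different jars under the same name.
FOO_REPO = "https://repo.foo.example/"
BAR_REPO = "https://repo.bar.example/"
remote_repos = {
    FOO_REPO: {"superlib-1.0.jar": b"foo build of superlib"},
    BAR_REPO: {"superlib-1.0.jar": b"bar build of superlib"},
}

cache = NameKeyedCache()
foo_jar = cache.resolve(FOO_REPO, "superlib-1.0.jar", remote_repos)
# The Bar build now silently receives the jar that was cached for Foo.
bar_jar = cache.resolve(BAR_REPO, "superlib-1.0.jar", remote_repos)
```

A developer who only ever builds Bar starts with an empty cache and gets the Bar jar, which is why colleagues cannot reproduce the failure.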

Multiple latest snapshots

Another scenario for the Foo and Bar projects from above is that they both use snapshots of the latest version of Lucene. Their respective repositories contain different Lucene snapshots. The build master of Bar has uploaded the latest snapshot from yesterday because it has a new feature the Bar team desperately needs. Foo stays with a snapshot that is two weeks old because the latest Lucene snapshots don’t work for them: they cause an out-of-memory exception. The Foo team also doesn’t need any of the newest features. The developer now builds Bar, which pulls the latest snapshot of Lucene into the local cache. The next time she builds Foo, the tests fail with the out-of-memory exception from above. Her colleagues on the Foo team can’t reproduce the problem. The cache creates incorrect and difficult-to-debug behaviour. The strategies you can apply with dynamic revision numbers are severely affected by such cache behaviour.

Local builds are polluting the cache

Another scenario for the Foo and Bar projects using Lucene is the following: A developer is also working on the code base of the latest version of Lucene. He makes some changes to the code base and builds it with Maven. He has a sample project that consumes his latest build of Lucene. To make the sample project work, he installs the necessary JARs into the local cache. Now the local Foo and Bar builds will also pick up the locally built Lucene snapshot. Again, the cache creates incorrect special behavior.

Concurrency behaviour

Common dependency caches are easily corrupted when multiple builds run in parallel. They are not concurrency safe.

The new Gradle Dependency Cache

The objectives for our new cache are:

  • Optimize local disk usage
  • Minimize bandwidth consumption and download time
  • Identify valid artifacts
  • Prevent the creation of corrupted jars
  • Enable concurrent access to the artifact cache
  • Identify locally built artifacts
  • Identify and maintain metadata on each artifact’s origin
  • Support resolver configuration changes

Cache Structure

The new dependency cache has a per-user store for artifacts (e.g. binaries like jars). In that store, exactly one artifact is stored per checksum. The metadata (e.g. pom.xml or ivy.xml) is stored in a per-repository cache which links to the corresponding artifacts. The name of the link is based on the artifact name described in the metadata. The actual file it links to (e.g. the jar) is identified solely by its checksum, much like how Git points to a blob in its object database.
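To make the structure concrete, here is a minimal in-memory sketch. It is not Gradle's implementation: the class name DependencyCache, the use of SHA-1, and the dictionaries standing in for on-disk directories and links are assumptions chosen for illustration.

```python
import hashlib


class DependencyCache:
    """Hypothetical model: one artifact store, one link table per repository."""

    def __init__(self):
        self.artifacts = {}  # checksum -> content (the per-user artifact store)
        self.links = {}      # repository id -> {artifact name -> checksum}

    def store(self, repo_id, name, content):
        checksum = hashlib.sha1(content).hexdigest()
        # Identical content is stored only once, whatever its name or origin.
        self.artifacts.setdefault(checksum, content)
        # The per-repository link maps the artifact name to the checksum.
        self.links.setdefault(repo_id, {})[name] = checksum
        return checksum

    def lookup(self, repo_id, name):
        # Only the links of the requested repository are consulted.
        checksum = self.links.get(repo_id, {}).get(name)
        return None if checksum is None else self.artifacts[checksum]


cache = DependencyCache()
first = cache.store("repo-a", "lucene-3.0.jar", b"lucene bytes")
second = cache.store("repo-b", "lucene-3.0.jar", b"lucene bytes")
stored_copies = len(cache.artifacts)  # both links share one stored artifact
```

Because lookups go through the per-repository links, an artifact that was cached for one repository is invisible to a build that uses another repository until that repository has provided it too.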

Bandwidth Efficiency

Before downloading an artifact, Gradle tries to determine the checksum of the artifact to be downloaded, for example by downloading the sha file or, if Artifactory is used, by asking the repository manager. If the checksum can be retrieved, the artifact is only downloaded if no artifact with that checksum already exists in the local cache. If the checksum can’t be retrieved, the artifact is always downloaded, and the downloaded copy is discarded if an identical artifact already exists in the cache.
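The decision described above can be sketched as follows. This is an illustration under assumptions, not Gradle's code: FakeRepository is a hypothetical stand-in for a remote repository, its checksum method models fetching a sha file, and a plain dictionary plays the role of the artifact store.

```python
import hashlib


class FakeRepository:
    """Hypothetical remote repository that counts downloads."""

    def __init__(self, files, publishes_checksums=True):
        self.files = files
        self.publishes_checksums = publishes_checksums
        self.downloads = 0

    def checksum(self, name):
        # Models fetching the sha file; None means it is not available.
        if not self.publishes_checksums:
            return None
        return hashlib.sha1(self.files[name]).hexdigest()

    def download(self, name):
        self.downloads += 1
        return self.files[name]


def fetch(repo, name, store):
    """Return the artifact's checksum, downloading only when necessary."""
    checksum = repo.checksum(name)
    if checksum is not None and checksum in store:
        return checksum  # already in the store: skip the download
    content = repo.download(name)
    checksum = hashlib.sha1(content).hexdigest()
    # If the content is already stored, the fresh download is discarded.
    store.setdefault(checksum, content)
    return checksum


store = {}  # checksum -> content
repo = FakeRepository({"lucene-3.0.jar": b"lucene bytes"})
fetch(repo, "lucene-3.0.jar", store)  # cache miss: downloads the jar
fetch(repo, "lucene-3.0.jar", store)  # checksum matches: no download
```

With a repository that does not publish checksums, every call to fetch downloads the file, but the store still keeps a single copy.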

Origin Validity

As described above, there is a separate metadata cache for each repository. The repository is identified by its URL, type and layout. A build will fail if the required artifacts are not in the repository specified by the build, regardless of whether the local cache has retrieved those artifacts from a different repository. For example, if you have changed the primary repository for your project, Gradle will check whether the new repository contains all the necessary artifacts. If not, the build will fail. If the new repository does contain them, Gradle will not download the artifacts again when they are already in the cache.
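The following sketch models this check. It is a simplification under assumptions and not Gradle's resolver: RepositoryId, UnresolvedDependency, and resolve are hypothetical names, the checksums are placeholder strings, and remote repositories are represented by dictionaries that list which artifacts they offer.

```python
from collections import namedtuple

# A repository is identified by its URL, type and layout.
RepositoryId = namedtuple("RepositoryId", ["url", "type", "layout"])


class UnresolvedDependency(Exception):
    pass


def resolve(repo_id, name, listings, metadata_caches, artifact_store):
    """Return (checksum, needs_download) for an artifact of one repository."""
    links = metadata_caches.setdefault(repo_id, {})
    if name not in links:
        # Ask the configured repository itself, never another
        # repository's metadata cache.
        listing = listings.get(repo_id, {})
        if name not in listing:
            raise UnresolvedDependency(f"{name} not found in {repo_id.url}")
        links[name] = listing[name]
    checksum = links[name]
    # A matching checksum in the artifact store avoids a second download.
    return checksum, checksum not in artifact_store


old_repo = RepositoryId("https://old.example/repo", "maven", "default")
new_repo = RepositoryId("https://new.example/repo", "maven", "default")
listings = {
    old_repo: {"superlib-1.0.jar": "aaa111", "lucene-3.0.jar": "bbb222"},
    new_repo: {"lucene-3.0.jar": "bbb222"},  # superlib was not migrated
}
# State after earlier builds against the old repository.
metadata_caches = {old_repo: dict(listings[old_repo])}
artifact_store = {"aaa111": b"superlib", "bbb222": b"lucene"}

lucene_result = resolve(new_repo, "lucene-3.0.jar", listings,
                        metadata_caches, artifact_store)
try:
    resolve(new_repo, "superlib-1.0.jar", listings,
            metadata_caches, artifact_store)
    migration_problem_detected = False
except UnresolvedDependency:
    migration_problem_detected = True
```

Lucene resolves against the new repository without a download because its checksum is already stored, while the missing superlib migration surfaces on the first build instead of being hidden by the cache.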

Origin Validity will isolate builds from each other in an advanced way that no build tool has done before. It is a key feature to avoid incorrect and surprising behavior of a local build.

Checksum Validity

It is possible that you have links with different names in different repository caches that point to the same artifact. The reverse is also possible: links with the same name in different repository caches can point to different artifacts. The job of the cache is to reflect the state of the repositories exactly, thus enabling reproducible builds independent of which projects are checked out and of the history of the cache usage.

Checksum Validity will isolate builds from each other. It is a key feature to avoid incorrect and surprising behaviour of a local build.

Concurrency

The cache is concurrency safe.

Conclusions

The new Gradle cache prevents the local cache from hiding problems and from creating the mysterious and difficult-to-debug behavior that has been a challenge with many build tools. This new behavior is implemented in a bandwidth- and storage-efficient way. It enables reliable and reproducible enterprise builds, which is exactly what you should, and now can, expect of an advanced build tool such as Gradle.