"Push only" mode for Build cache

Hi. I’m playing with the build cache to speed up pull request builds in a fairly big monolithic project. To improve the hit rate I would like to also enable the cache in builds from the master branch on the CI server (in the deployment pipeline), since master is the base branch PRs are created from. However, there are some limitations and potential issues mentioned in the documentation. While it’s OK for me to take that risk in the PR builds, I would prefer production builds to be truly clean (a clean build even with the build cache enabled). Therefore I have 3 questions.

  1. Do you think my concerns regarding using the build cache in production builds are sane?
  2. Is it possible to enable a “push only” mode for the build cache, i.e. the Gradle build does not read from the build cache to speed itself up (“clean” is called anyway, so the first-level cache is out of the picture in that build), but at the same time it publishes the task execution results to the remote cache so they can be reused in the PR builds? (See the configuration sketch below the questions.)
  3. If not, do you see any obstacles to writing my own cache provider (probably subclassing the default HTTP one provided in the Gradle distribution) to achieve that?
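For reference, the kind of remote cache configuration I have in mind looks roughly like this (just a sketch; the URL is a placeholder). As far as I can tell there is a push flag, but no corresponding switch to disable pulls:

```kotlin
// settings.gradle.kts (sketch; the Groovy DSL equivalent is analogous)
buildCache {
    remote<HttpBuildCache> {
        setUrl("https://ci.example.com/build-cache/")  // placeholder URL
        isPush = true   // let the master/CI build upload entries...
        // ...but there seems to be no equivalent flag to say "never pull"
    }
}
```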

This has come up before, but I think it’s a bad idea because it pushes problems onto the client and can mask problems.

  1. As a data point, all Gradle CI builds use the build cache, but our promotion builds (which build the distribution and upload it) do not. I think it’s sane to want to leave this feature off until it’s enabled by default.
  2. You can accomplish the same thing by running the build with --rerun-tasks. It’ll ignore the build cache for retrievals and push after executing the tasks.
  3. I think this could be done at the build cache level, but it would appear to be a cache that always misses vs a separate “push only” mode.
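To illustrate what “done at the build cache level” could look like: the sketch below (Kotlin, assuming you register it via registerBuildCacheService with your own BuildCacheServiceFactory) wraps whatever backing service you already use in something that pretends every lookup is a miss. The class and the delegate wiring are hypothetical, not something Gradle ships.

```kotlin
import org.gradle.caching.BuildCacheEntryReader
import org.gradle.caching.BuildCacheEntryWriter
import org.gradle.caching.BuildCacheKey
import org.gradle.caching.BuildCacheService

// Hypothetical "store only" wrapper around an existing BuildCacheService.
class PushOnlyBuildCacheService(
    private val delegate: BuildCacheService // e.g. an HTTP-backed service you construct yourself
) : BuildCacheService {

    // Always report a miss, so nothing is ever pulled from the cache.
    override fun load(key: BuildCacheKey, reader: BuildCacheEntryReader): Boolean = false

    // Uploads still go through to the real backend.
    override fun store(key: BuildCacheKey, writer: BuildCacheEntryWriter) =
        delegate.store(key, writer)

    override fun close() = delegate.close()
}
```

From Gradle’s point of view it would just be a cache with a 0% hit rate.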

IIUC, you want to set your pipeline up so that your master branch builds will only push to the build cache and your PR builds will only pull from the build cache?

If that’s the case, I think you should reconsider. I think it’s fine to not want to use the build cache for master/production builds for now, but only pulling for the PR builds will have some problems:

  1. You’ll get fewer hits overall, because PRs often build on each other, change iteratively, or contain changes from each other (e.g., abandoned/reopened PRs). This isn’t the worst problem.
  2. When things go wrong (e.g., a custom cacheable task has a problem with cacheability), you may not notice it in the PR builds or on CI because the master builds would constantly be overwriting the build cache. Only local builds will suffer from problems where things should be/shouldn’t be cached.
  3. Soon we’ll allow caching of remote cache entries locally, so the problems from #2 above may cause problems across your build agents (depending on which cache result they use) and make a “push only” cache less useful.

Some of this behavior is also made worse depending on how the build cache backend is implemented. If it allows overwrites, then a push only cache causes a certain kind of weirdness (cache entries change when there’s a misbehaving task). If it doesn’t, the problem will be masked.

If you decide to also allow PR builds to push, you don’t need to treat master as special and have it prime the cache, so I don’t think you’ll need a “push only” option. In that case, only PR builds would populate the cache. That’s similar to what we do for Gradle: every branch can build with the build cache, the CI pipeline uses it, and the final promotion build does not (we have a small set of sanity checks that happen after we build a distribution).

And as a counterpoint to what I said above:

Internally, we have a team that runs the first stage of their pipeline with --rerun-tasks to get a “push only” sort of thing. They were an early adopter (3.3 or so?) and suffered from some of the things I described.

When we had problems with the Groovy compiler randomly producing slightly different bytecode for the same sources, we would see the hit rate jump around and had some hard-to-explain cache misses. It all depended on whether the last build that ran with --rerun-tasks happened to produce the same combination of classes as the runs before it, and whether you caught the right combination of artifacts (when builds ran in parallel). It was frustrating.

They’re still running with --rerun-tasks and seem to be happy.

Thanks @sterling for the comprehensive response!

--rerun-tasks seems very sensible to me, as there is a clean task executed in the master builds anyway.
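So the master job would be invoked with something along these lines (assuming the remote cache is configured in the settings script):

```sh
# Master/deployment job: every task is re-executed, but the results
# are still pushed to the remote cache for the PR builds to reuse.
./gradlew clean build --rerun-tasks --build-cache
```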

In fact, in my case I would like the master build to push to the cache (PR builds should benefit from a higher hit rate for fresh changes coming straight from the current HEAD) and the PR builds to run in pull/push mode (with the local build cache disabled so as not to clutter the Jenkins executors; the remote cache is easier to clean up periodically). Very often a build fails on a broken test or a checkstyle-like violation, and the second build from that PR branch is then pretty fast (thanks to the data cached in the first run).
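For the CI executors that would translate to something like this (again just a sketch; the URL is a placeholder):

```kotlin
// settings.gradle.kts on the CI executors (sketch)
buildCache {
    local {
        isEnabled = false  // keep the Jenkins executors clean; only the remote cache is used
    }
    remote<HttpBuildCache> {
        setUrl("https://ci.example.com/build-cache/")  // placeholder URL
        isPush = true  // PR builds both pull and push
    }
}
```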

I haven’t checked how often Gradle fetches/sends cache entries under the hood (before and after every task execution?) and how much slower it would be to use just the remote cache vs local+remote cache on the executors. Do you have any experience with that? Can Gradle leverage a multiplexed HTTP/2 connection to reduce the connection setup overhead (assuming the remote backend supports it)?

I wonder about overwriting cache entries. Would it occur in “normal” circumstances? A change of Gradle version should result in different inputs. A minor Java version upgrade could be a case, but I would expect the same bytecode to be generated in most cases. Maybe some issues with Gradle itself. Therefore, would it be good to have cache overwriting disabled (plus metrics/alerting to let you know about it, so you can explain and react)?

Depending on how you have your deployment pipeline set up, you might consider what we’re doing (which sounds like what you want, except we allow master to pull too).

  1. All branches (PR + master) use the build cache (push and pull)
  2. Deployment jobs don’t use the build cache (so everything is built from scratch).
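In command-line terms, that split is roughly (assuming the cache configuration lives in the settings script or gradle.properties):

```sh
# Branch builds (PRs and master): pull from and push to the build cache.
./gradlew build --build-cache

# Deployment/promotion job: build everything from scratch.
./gradlew clean build --no-build-cache
```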

It’s basically after up-to-date checking. If we know we need to execute the task and the build cache is enabled and the task is cacheable, we check if there’s a cache hit. If not, we execute the task and then upload the outputs.
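A self-contained toy model of that decision (this is a sketch of the flow, not Gradle’s real API or internals):

```kotlin
// Toy model of the per-task decision; names and types are made up for illustration.
data class TaskModel(val upToDate: Boolean, val cacheable: Boolean, val cacheKey: String)

class FakeRemoteCache {
    private val entries = mutableMapOf<String, String>()
    fun load(key: String): String? = entries[key]                       // "pull"
    fun store(key: String, outputs: String) { entries[key] = outputs }  // "push"
}

fun runTask(task: TaskModel, cache: FakeRemoteCache, cachingEnabled: Boolean): String {
    if (task.upToDate) return "UP-TO-DATE"          // incremental build is checked first

    if (cachingEnabled && task.cacheable) {
        if (cache.load(task.cacheKey) != null) {
            return "FROM-CACHE"                     // hit: outputs restored, task not executed
        }
        val outputs = "outputs of ${task.cacheKey}" // miss: execute the task...
        cache.store(task.cacheKey, outputs)         // ...then upload its outputs
        return "EXECUTED (and pushed)"
    }
    return "EXECUTED (cache not used)"
}
```

With --rerun-tasks the up-to-date and cache-hit short-circuits are skipped, which is why it behaves like the “push only” mode discussed above.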

I don’t think our HTTP build cache connector does anything fancy at the moment. It’s just HTTPClient get + put, but I don’t see why it couldn’t be made to do that. I think HTTPClient does do some connection sharing.

Yes, overwrites can happen if two builds start at about the same time and both produce the same cache key for a task.

It does (a Gradle version change does result in different inputs).

Not for Groovy, unfortunately (the same bytecode is not guaranteed across Java versions there). :slight_smile:

Yes, a plug for the build cache node we maintain (it doesn’t require GE): there’s currently a log that keeps track of overwrites.

And you can get more through build scan features, like linking back to the build that produced a particular cache entry: 2017.4 | Gradle

I think we’re thinking along the same lines. There just seem to be two differences in the build approach. In my case:

  1. Changes can get into master only via a PR, in which the code is built, tests are executed, etc., to keep master as green as possible (with a lot of people working on the codebase).
  2. Every commit to master (a merge, in fact) triggers a Jenkins job (the first in the deployment pipeline) which builds the artifacts that are later deployed to different environments for extra verification (potentially including deployment to production). Therefore, there is no difference between the master build and the deployment job: the master build generates the deployable (production) artifacts.

As a result I don’t want the master (deployment) build to use the cache, but I still want it to fill the cache for new PRs (and --rerun-tasks with the remote cache enabled seems OK for that). Am I getting it right?

Good point. It can happen in our case. Luckily, disabled updates should be harmless in that case.

Nice. I will take a look at it.

The build cache in general can be really useful. I have to admit that with it, Gradle Enterprise seems much more useful, with all the extra diagnostics it provides. Manually working out what broke caching is tedious… Maybe it’s a good time to request a price quote :).

Btw, do you think having Groovy in production code can also cause a different classpath order (somewhat random, but with a limited number of combinations) in dependent modules (which effectively reduces the cache’s usefulness)?

Yep, that sounds OK.

Groovy alone shouldn’t cause a difference in classpath ordering. Groovy compilation is, however, really sensitive to JDK versions in ways that affect the generated bytecode (e.g., the order of methods/fields).

My suspicion is that there’s a HashSet lurking somewhere, or that we rely on filesystem ordering in some place.