Project-level build cache for JavaScript and Yarn projects

I have a very large, complicated multi-project build, and I am attempting to speed it up by applying the build cache. However, I am finding that task-level caching is a little too granular for my needs, and I am wondering how you would recommend caching a subproject’s outputs.

For a concrete example, the subproject in question is a JavaScript project with a yarn install step, and the output is some resulting JavaScript files.

My build task declares inputs and outputs, and I can cache it, but it depends on a yarn install task (install_dep below).

```groovy
gulp_build {
    dependsOn "install_dep"

    inputs.dir 'assets'
    inputs.dir 'languages'
    inputs.dir 'node_modules'
    inputs.dir 'src'
    inputs.dir 'type-definitions'
    inputs.file 'gulpfile.js'
    inputs.file 'tsconfig.json'

    outputs.dir 'codegen'
    outputs.dir 'public'

    outputs.cacheIf { true }
}
```

My first thought was not to declare the node_modules directory as an input; since it is itself a function of package.json and yarn.lock, I could declare those as inputs instead.
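Concretely, that swap would look something like this in the build script above (an untested sketch of the idea):

```groovy
gulp_build {
    // Instead of inputs.dir 'node_modules', track the files that
    // determine its contents.
    inputs.file 'package.json'
    inputs.file 'yarn.lock'
}
```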

Either way, the install_dep task is always executed, so I would need to add the node_modules directory to the build cache (and it is very large, so it would still take a long time to download, just to do nothing and be consumed by nothing).

I’m trying to find a way to have a ‘project’-level concept of inputs/outputs, or otherwise force the install_dep step to be skipped if we’re going to get a cache hit on the build step.

I was imagining a ‘parent task’, which would have inputs and outputs and trigger the ‘child’ tasks if it was executed. This seems to go against the Gradle model of controlling task execution flow through declared dependencies, where all of a task’s dependencies are evaluated first.

What is the ‘Gradle-y’ way to achieve this? Cache the node_modules output even though I am not using it at all except at build time?

Thanks.

@danroberts, the issue you are running into is basically the question of whether we should skip pulling intermediate outputs when we know all downstream tasks will be cached. This gets tricky, of course, because we can’t know that ahead of time: we calculate task inputs just before the task is to be run, and those outputs might be necessary for other reasons.

I do get your use case though; we’ve run into it ourselves. On the input bit, I agree with your decision not to track the node_modules directory as an input. There are a few reasons:

  1. This is going to be different on different platforms due to native libraries, which means any Node-based task output will not be usable across different operating systems.
  2. It’s generally huge and has a ton of files, which adds overhead to tracking it as an input, since we have to hash all of its contents.
  3. If you’re using Yarn (I have less faith in NPM), the yarn.lock file should be the source of truth anyhow.

So yes, you’re on the right path there. As for avoiding running yarn install, it’s a tough call without knowing more about what the performance numbers look like. In our experience, we actually found that it was faster to cache it in CI (since we blow away the directory on every build). Pulling it from the cache still took a few seconds, but it was much faster (about 10x) than running the task. In the end it’s still a net win, as it’s still faster than without the cache.
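If you do decide to cache it, the declaration mirrors the build task from earlier in the thread (a sketch; adjust to however install_dep is actually defined in your build):

```groovy
install_dep {
    inputs.file 'package.json'
    inputs.file 'yarn.lock'

    outputs.dir 'node_modules'

    outputs.cacheIf { true }
}
```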

But still, I get your point that we are effectively doing work we don’t need to. There really isn’t a good workaround here. Anything to skip execution would be a nasty hack that would have to make all sorts of assumptions about the build, mainly whether the downstream tasks will be pulled from the cache. As mentioned before, there’s no good way to know this; we’d have to predict it instead, which has the potential to be incorrect and cause build errors when things do change, or when there is a cache miss and downstream tasks do have to run. Personally, I choose reliability over saving a few seconds in this case. As I mentioned above, it’s still going to be way faster than without the cache, so I guess my tl;dr is “don’t be greedy” :wink:

Hi @mark_vieira,

Thanks for the quick response. Everything you’re saying about the difficulty producing reliable builds with that kind of semantic skipping makes total sense.

However, I have some questions about the node_modules caching. It seems that caching the node_modules directory is tricky due to platform specificity, and it’s also unclear to me how the build step would behave when the node_modules directory is a cache hit but the build step has changed.

Are all the symlinks and other node black magic that yarn install does going to survive the caching? Could we be introducing new subtle reliability issues by caching the yarn install step? You mentioned that this is something you’re already doing internally: have you noticed any problems along these lines?

Right now, I still think it’s an improvement to cache the build step even if the yarn step runs needlessly every time, but that seems wasteful, and there must be a better way to avoid it.

Also paging @eriwen since he has been giving me some advice through other channels as well. He seemed to be anti-node_modules caching.

Thanks.

For the reasons Mark mentions, I typically would avoid caching node_modules unless you stay on top of evicting old entries from your build cache and you have a homogeneous build environment between CI and developers.

My understanding is that you want this specific gulp_build task to be faster by making install_dep faster, on average. Correct? … and sometimes install_dep is executed unnecessarily (you say it’s “always executed”), correct?

You’re right that package.json and yarn.lock ought to be inputs rather than node_modules. I feel like there ought to be a way to correctly avoid executing install_dep in more scenarios. Imagine a task that wraps install_dep but is executed onlyIf {} node_modules actually needs installing, and have gulp_build depend on that; something like the sketch below. I’m not sure whether Mark would consider that “hacky”, but I feel there is a reasonable one-liner of scripting that reduces average build time. I cannot give specific solutions because I’m not familiar with the project.
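As a rough illustration only (the stamp file and task name are made up, since I don’t know how install_dep is defined):

```groovy
// Re-run yarn install only when yarn.lock has changed since the last
// successful install on this machine. The stamp is a copy of yarn.lock
// written after a successful install.
task installDepsIfNeeded(type: Exec) {
    def lockFile = file('yarn.lock')
    def stamp = file("$buildDir/yarn.lock.installed")

    onlyIf { !stamp.exists() || stamp.text != lockFile.text }

    commandLine 'yarn', 'install'

    doLast {
        stamp.parentFile.mkdirs()
        stamp.text = lockFile.text
    }
}

gulp_build.dependsOn installDepsIfNeeded
```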

Hope that helps. Let me know if I misinterpreted your questions.

Cheers,
Eric

Hi @eriwen,

I’d like to dig a bit more into a ‘task that wraps install_dep’. That is essentially exactly what I want, but I’m not sure how to express it in Gradle.

install_dep is basically just the yarn install task. The real question would be: if I can create such a wrapper task, why couldn’t I make the yarn install task itself know whether it should be run or not across different build machines?

I’m not clear on what the mechanism here would be: how would the yarn install, or whatever wrapped it, know that the inputs have not changed? Would the wrapper task produce some kind of cacheable ‘sentinel’ file, and then we could run the yarn install task onlyIf the wrapper didWork?

I’ve run into problems with these kinds of sentinel files in the past, so I’m a little bit suspicious of them. Would it be possible to ‘cache’ a task that had inputs only and no declared outputs? To make sure we’re talking about the same thing, I’ve sketched my reading of the sentinel idea below.
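(All names here are made up; this is just how I picture the didWork wiring:)

```groovy
// A tiny task whose only job is to go out-of-date when yarn.lock changes.
task checkDeps {
    inputs.file 'yarn.lock'
    def sentinel = file("$buildDir/deps-checked.txt")
    outputs.file sentinel
    doLast {
        sentinel.parentFile.mkdirs()
        sentinel.text = 'checked'
    }
}

// Run the real install only when the sentinel task actually did work,
// i.e. when yarn.lock changed since the last build on this machine.
install_dep.dependsOn checkDeps
install_dep.onlyIf { checkDeps.didWork }
```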

Thanks so much for all the help,
Daniel

We track the current platform as an input, so a macOS build would not use the cached result from a Linux build. In fact, we simply track the yarn executable itself, since the node_modules structure might change between versions (Yarn only guarantees yarn.lock semantics with the same version) and the executable is different on every platform.
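In build-script terms that amounts to something like this (the path below is a placeholder for wherever your build provisions Yarn):

```groovy
// Hypothetical location; point this at the binary your build actually uses.
def yarnExecutable = file('.gradle/yarn/bin/yarn')

install_dep {
    // Hashing the Yarn binary makes both the Yarn version and the
    // platform part of the task's cache key.
    inputs.file yarnExecutable
}
```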

Basically, when packing/unpacking cache entries we follow symlinks, so what was a symlink will be replaced with a copy of the actual linked file contents. This can cause issues. Some of this can be mitigated by running yarn with the --no-bin-links option.
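Assuming install_dep is (or wraps) an Exec task, the flag is just an extra argument:

```groovy
task install_dep(type: Exec) {
    // Skip creating bin symlinks so cache pack/unpack has nothing to mangle.
    commandLine 'yarn', 'install', '--no-bin-links'
}
```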

The trick here is reliably determining when the install task needs to run. This isn’t as simple as incremental build. What we are talking about is determining whether we need the node_modules directory at all, i.e. whether all downstream tasks can be pulled from the cache. There’s just no good way to do this right now. Basically, we’d have to predict that all tasks that execute node will be cached.


I’ve just bumped into this problem too. I’ve got website assets, and they are f(node_version, package.json, input_files) = output_files. It’s quite expensive to install node and all of its node_modules when the input and output files are quite small. This project changes very infrequently, and there’s a very small group of people who ever change it.

It’s become a somewhat common pattern for us to have a subproject where half of the tasks are just about configuring the tool/environment: Node.js in this case, but we have a similar case with Docker. It’s great that we can use Gradle to combine these disparate toolchains in a repeatable way. The downside is that anybody (or any CI runner) working on the project has no choice but to get all of the tools running, even if there’s guaranteed to be a build cache result for them.

What if we had projects.capsuleTasks = [ 'gulp_build' ]?

capsuleTasks would be a List<String>, similar to defaultTasks. Before attempting task execution, each capsule task would be checked to see whether the build cache has a result for it. If and only if every capsule task has a cached result, none of the tasks in the project would be executed (regardless of any task dependencies), and the build cache results would be placed into the appropriate locations. This allows a project with complex internal dependencies to elide all internal state and reduce the entire project to a single f(inputs) = outputs. If any task misses the cache, the project reverts back to normal execution.

This would be very useful to us, and it seems relatively simple to implement, but I’m sure there are tricky edge cases for cross-project task dependencies…

This all comes down to the ability to “ask” a task whether it will be cached before it executes. This should be “technically” possible if its inputs are available, but it has safety concerns, because asking that question at the beginning of the build and asking it at task execution time might return different results.

I do think there needs to be a solution here, though, for upstream tasks that are expensive and need not run if the downstream task is going to be avoided. I’m running into the same thing now where we start up expensive test fixtures, only to have the tests themselves pulled from the cache. The workaround is to embed that expensive process as an additional task action of the tests themselves, as sketched below. This works, but has its own issues.
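Roughly (startFixtures() is a hypothetical stand-in for the expensive setup):

```groovy
// Prepend fixture startup to the test task itself. As a task action
// rather than a dependency, it only runs when 'test' actually executes,
// i.e. on a cache miss.
test.doFirst {
    startFixtures()   // hypothetical helper for the expensive process
}
```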

A first-class solution here would definitely be appreciated. I suspect it would be some special kind of task dependency, similar to finalizedBy, where I “depend on” a task but it doesn’t actually influence my inputs, so it doesn’t need to execute for me to start snapshotting my inputs.

Granularity down to the task level would be great!

To me it’s about pure functions versus state. An up-to-dateable task pretends to be a pure function from its inputs to its outputs. In practice, we often use these tasks to mutate the environment and write out little status files (e.g. install node, start a docker container and write out a status file as a marker, etc.). When a task dependsOn another task but doesn’t actually include that dependency’s outputs as its inputs, that’s a hint that the dependency is really a state dependency rather than a functional input. e.g.

```
FUNCTIONAL (have to cache everything)
npmInstall(node_version, package.json) => [node.exe, node_modules]
gulp_build(node.exe, node_modules, styles.sass) => [styles.css]
    dependsOn(npmInstall)

STATE (gulp_build is cacheable on its own, but needs state from npmInstall to run)
npmInstall(node_version, package.json) => [node.exe, node_modules]
gulp_build(node_version, package.json, styles.sass) => [styles.css]
    setupBy(npmInstall)
```

I guess the logic would be something like: “if all of a task’s dependencies are setupBy dependencies, then it is eligible for restoration from the cache without triggering execution of those dependencies.”

This task-level approach is definitely more granular, but it pushes complexity into the fundamentals of the task model and task execution graph. The nice thing about the project-level capsuleTasks idea is that it keeps the task model simple while accommodating the same use cases we’ve described in this thread so far (except maybe your test harnesses; I’m not sure how they work).

If anything like either approach gets implemented, I’ll be a very happy camper :slight_smile: I’m a happy camper already :smiley:

This is becoming a bigger pain point for me. I’ve dug around the Gradle issues a bit and couldn’t find anything regarding a task-level or project-level solution. What’s the most helpful way to file a feature request?

I built a prototype plugin for this, cache-horizon. Could use some feedback / help if anyone else is interested.