Project-level build cache for JavaScript and Yarn projects


(Daniel Roberts) #1

I have a very large, complicated multi-project build, and I am attempting to speed it up by applying the build cache. However, I am finding that task-level caching is a bit too granular for my needs, and I am wondering how you would recommend caching a subproject's outputs.

For a concrete example, the subproject in question is a JavaScript project with a yarn install step, and the output is a set of generated JavaScript files.

My build task declares its inputs and outputs and I can cache it, but it depends on a yarn install task (install_dep below):

gulp_build {
    dependsOn "install_dep"

    inputs.dir 'assets'
    inputs.dir 'languages'
    inputs.dir 'node_modules'
    inputs.dir 'src'
    inputs.dir 'type-definitions'
    inputs.file 'gulpfile.js'
    inputs.file 'tsconfig.json'

    outputs.dir 'codegen'
    outputs.dir 'public'

    outputs.cacheIf { true }
}

My first thought was not to declare the node_modules directory as an input: since it is itself a function of package.json and yarn.lock, I could declare those two files as inputs instead.
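
Concretely, I imagine the declaration changing to something like this (untested):

gulp_build {
    // Replace inputs.dir 'node_modules' with the manifests it is derived from
    inputs.file 'package.json'
    inputs.file 'yarn.lock'
}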

Either way, the install_dep task is always executed, so to avoid that I would need to add the node_modules directory to the build cache too; it is very large and would still take a long time to download, only to sit there doing nothing and be consumed by nothing when the build step is itself a cache hit.

I'm trying to find a way to have a 'project'-level concept of inputs/outputs, or otherwise force the install_dep step to be skipped if we are going to get a cache hit on the build step.

I was imagining a 'parent task', which would have inputs and outputs and would trigger the 'child' tasks if it were executed. This seems to go against the Gradle model of controlling task execution flow by declaring dependencies and then evaluating all the dependent tasks first.

What is the 'Gradle-y' way to achieve this? Cache the node_modules output even though I am not using it at all except at build time?

Thanks.


(Mark Vieira) #2

@danroberts, the issue you are running into is basically the question of whether we should skip pulling intermediate outputs when we know all downstream tasks will be cache hits. This gets tricky, of course, because we can't know that ahead of time: we calculate a task's inputs just before the task is about to run, and those outputs might be needed for other reasons.

I do get your use case, though; we've run into it ourselves. On the input side, I agree with your decision not to track the node_modules directory as an input, for a few reasons:

  1. It is going to be different on different platforms due to native libraries, which means any Node-based task output will not be usable across different operating systems.
  2. It's generally huge and contains a ton of files, which adds overhead to tracking it as an input, since we have to hash all of its contents.
  3. If you're using Yarn (I have less faith in NPM), the yarn.lock file should be the source of truth anyhow.

So yes, you're on the right path there. As for avoiding running yarn install, it's a tough call without knowing more about what the performance numbers look like. In our experience, we actually found it faster to cache it in CI (since we blow away the directory on every build). Unpacking still took a few seconds, but it was much faster (about 10x) than running the task, so in the end it's still a net win over not caching at all.
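
For reference, making the install task itself cacheable is just the same input/output declaration pattern as your gulp_build above; roughly (illustrative, not our exact setup):

install_dep {
    // The manifests fully determine the install result
    inputs.file 'package.json'
    inputs.file 'yarn.lock'

    outputs.dir 'node_modules'

    outputs.cacheIf { true }
}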

But still, I get your point that we are effectively doing work we don't need to. There really isn't a good workaround here: anything that skips execution would be a nasty hack that has to make all sorts of assumptions about the build, mainly that the downstream tasks will be pulled from the cache. As mentioned before, there's no good way to know this; we'd have to predict it instead, which has the potential to be incorrect and cause build errors when things do change, or when there is a cache miss and downstream tasks do have to run. Personally, I choose reliability over saving a few seconds in this case. As I mentioned above, it's still going to be way faster than without the cache, so I guess my tl;dr is "don't be greedy" :wink:


(Daniel Roberts) #3

Hi @mark_vieira,

Thanks for the quick response. Everything you're saying about the difficulty of producing reliable builds with that kind of semantic skipping makes total sense.

However, I have some questions about the node_modules caching. It seems that caching the node_modules directory is tricky due to platform specificity; also, it's unclear to me how the build step would behave when the node_modules directory is a cache hit but the build step has changed.

Will all the symlinks and other Node black magic that yarn install does survive the caching? Could we be introducing new subtle reliability issues by caching the yarn install step? You mentioned that this may be something you're already doing internally – have you noticed any problems along these lines?

Right now, I still think it's an improvement to cache the build step and run the yarn step needlessly every time, but that feels wasteful, and there must be a better way to avoid it.

Also paging @eriwen since he has been giving me some advice through other channels as well. He seemed to be anti-node_modules caching.

Thanks.


(Eric Wendelin) #4

For the reasons Mark mentions, I typically would avoid caching node_modules unless you stay on top of evicting old entries from your build cache and you have a homogeneous build environment between CI and developers.

My understanding is that you want this specific gulp_build task to be faster by making install_dep faster, on average. Correct? … and sometimes install_dep is executed unnecessarily (you say it's "always executed"), correct?

You're right that package.json and yarn.lock ought to be inputs rather than node_modules. I feel like there ought to be a way to correctly avoid executing install_dep in more scenarios. Imagine a task that wraps install_dep but is executed onlyIf {} node_modules needs installing, and have gulp_build depend on that. I'm not sure whether Mark would consider that "hacky", but I feel there is a reasonable one-liner of scripting that reduces average build time. I can't give a specific solution because I'm not familiar with the project.
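
A rough, untested sketch of the shape I mean; the staleness check here is deliberately naive, and is the part that would need real thought:

install_dep {
    onlyIf {
        // Naive check: run only when node_modules is missing or
        // older than either manifest
        def modules = file('node_modules')
        !modules.directory ||
            [file('package.json'), file('yarn.lock')].any {
                it.lastModified() > modules.lastModified()
            }
    }
}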

Hope that helps. Let me know if I misinterpreted your questions.

Cheers,
Eric


(Daniel Roberts) #5

Hi @eriwen,

I'd like to dig a bit more into the 'task that wraps install_dep' idea. That is essentially exactly what I want, but I'm not sure how to express it in Gradle.

install_dep is basically just the yarn install task. The real question is: if I can create such a wrapper task, why couldn't I make the yarn install task itself know whether it should be run across different build machines?

I'm not clear on what the mechanism here would be. How would yarn install, or whatever wraps it, know that the inputs have not changed? Would the wrapper task produce some kind of cacheable 'sentinel' file, so that we could run the yarn install task onlyIf the wrapper didWork?
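
Something like this, I suppose (entirely hypothetical, names made up, and I can already see holes in it):

// Hypothetical cacheable sentinel: its only job is to be a cache hit
// whenever the manifests are unchanged
task yarn_sentinel {
    inputs.file 'package.json'
    inputs.file 'yarn.lock'
    outputs.file "$buildDir/yarn-sentinel.txt"
    outputs.cacheIf { true }
    doLast {
        // The file content doesn't matter; only the cache hit does
        file("$buildDir/yarn-sentinel.txt").text = 'installed'
    }
}

install_dep {
    dependsOn yarn_sentinel
    // Skip the install when the sentinel was satisfied from the cache
    // (or was simply up to date)
    onlyIf { yarn_sentinel.didWork }
}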

I've run into problems with these kinds of sentinel files in the past, so I'm a little suspicious of them. Would it be possible to 'cache' a task that has inputs only and no declared outputs?

Thanks so much for all the help,
Daniel


(Mark Vieira) #6

We track the current platform as an input, so a macOS build would not use the cached result from a Linux build. In fact, we simply track the yarn executable itself, since the node_modules structure might change between Yarn versions (Yarn only guarantees yarn.lock semantics with the same version) and the executable is different for every platform.
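
In build-script terms, that's just another input declaration, along these lines (the path is illustrative):

install_dep {
    // Make the yarn binary itself an input, so each platform and
    // Yarn version gets its own cache key
    inputs.file '/usr/local/bin/yarn'
}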

Basically, when packing/unpacking cache entries we follow symlinks, so what was a symlink will be replaced with a copy of the actual linked file contents. This can cause issues. Some of this can be mitigated by running yarn with the --no-bin-links option.
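
If install_dep were a plain Exec task, the flag could be passed like this (a sketch; your real task definition may differ):

task install_dep(type: Exec) {
    // --no-bin-links skips the .bin symlinks that don't survive
    // cache packing intact
    commandLine 'yarn', 'install', '--no-bin-links'
}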

The trick here is reliably determining when the install task needs to run, and that isn't as simple as incremental build. What we are talking about is being able to determine whether we need the node_modules directory at all, which requires knowing whether all downstream tasks can be pulled from the cache. There's just no good way to do that right now: basically, we'd have to predict that every task that executes Node will be a cache hit.