I was just reading about the Gradle Build Cache planned for Gradle 3.5. I’ve found Gradle’s support of S3 as a Maven wagon EXTREMELY convenient for releasing libraries private to my company. Supporting S3 as a distributed HTTP build cache would be extremely convenient as well. I think it would lower the bar for setup, make the feature a lot easier to configure, and drive adoption.
Hi @Scott_Pierce,
We designed the interfaces required to add a new build cache connector to be simple to implement.
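Roughly, a connector comes down to a single class with load/store/close methods. Here’s a sketch of what an S3-backed one might look like; the AWS SDK client calls, the bucket wiring, and the class name are illustrative rather than a published implementation, and depending on the Gradle version there may be an extra method or two (e.g. a description) to fill in.

```groovy
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.model.ObjectMetadata
import org.gradle.caching.BuildCacheEntryReader
import org.gradle.caching.BuildCacheEntryWriter
import org.gradle.caching.BuildCacheKey
import org.gradle.caching.BuildCacheService

// Sketch only: the S3-specific parts are placeholders, not a published connector.
class S3BuildCacheService implements BuildCacheService {
    private final AmazonS3 s3
    private final String bucket

    S3BuildCacheService(AmazonS3 s3, String bucket) {
        this.s3 = s3
        this.bucket = bucket
    }

    @Override
    boolean load(BuildCacheKey key, BuildCacheEntryReader reader) {
        if (!s3.doesObjectExist(bucket, key.getHashCode())) {
            return false // miss: Gradle runs the task and may call store() afterwards
        }
        s3.getObject(bucket, key.getHashCode()).withCloseable { s3Object ->
            reader.readFrom(s3Object.getObjectContent())
        }
        return true
    }

    @Override
    void store(BuildCacheKey key, BuildCacheEntryWriter writer) {
        // Buffer the entry so the upload can be given an explicit content length
        def buffer = new ByteArrayOutputStream()
        writer.writeTo(buffer)
        def metadata = new ObjectMetadata()
        metadata.setContentLength(buffer.size())
        s3.putObject(bucket, key.getHashCode(),
                new ByteArrayInputStream(buffer.toByteArray()), metadata)
    }

    @Override
    void close() {
        // Nothing to release here; the S3 client is owned by whoever created it
    }
}
```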
One of the major differences between publishing your libraries to a repository and using the build cache is that with a repository you typically have very few artifacts and we can reliably cache resolved copies. For the build cache, Gradle will produce a cache entry for every cacheable task (for a single Java project, that’s at least 3 tasks) and will look for a cache entry for each of those cacheable tasks on every build. We’ll also download from the build cache any time we need a cache entry (there isn’t any local caching of remote cache entries yet).
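For reference, the remote side is wired up in settings.gradle roughly like this; the URL here is just a placeholder, and S3 support would slot in as another remote connector type:

```groovy
// settings.gradle — illustrative only; the URL is a placeholder.
// The cache itself is switched on with --build-cache or org.gradle.caching=true.
buildCache {
    local {
        enabled = true              // keep the local directory cache as well
    }
    remote(HttpBuildCache) {
        url = 'https://cache.example.com/cache/'
        push = true                 // typically only CI pushes; developers pull
    }
}
```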
I think you can get by with something simple like S3 for repositories, but for the build cache it’s useful to have a little more. It’s important to get information about cache hit rate and other statistics, which S3 won’t provide on its own (you might be able to work it out from logs/metrics). You’ll also need some sort of eviction policy that doesn’t just drop the oldest entries.
For the Gradle project, the stats for the last 30 days have been:
Stored entries: 453,372
Evicted entries: 253,292
Hits: 1,680,475
Misses: 529,150
Hit rate: 76%
Data received: 307.84 GB
Data sent: 386.21 GB
The last 24 hours have been:
Stored entries: 13,062
Evicted entries: 0
Hits: 61,167
Misses: 16,236
Hit rate: 79%
Data received: 6.40 GB
Data sent: 20.49 GB
I also think it’s super important to have some kind of build diagnostics that goes along with the build cache, so you can figure out why you’re getting cache hits/misses for a particular build. We use the build-scan plugin to collect custom values that are useful for this. We then use a dogfooding instance of Gradle Enterprise as our build cache backend, which gives us an eviction policy and some simple statistics.
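For example, something along these lines in the root build script; the value names and environment variables here are just examples, and the build-scan plugin has to be applied first:

```groovy
// Assumes the build-scan plugin is applied; names and env vars are examples.
buildScan {
    tag 'CI'
    value 'Git branch', System.getenv('GIT_BRANCH') ?: 'local'
    value 'Git commit', System.getenv('GIT_COMMIT') ?: 'unknown'
}
```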
Thanks, you are probably right!
@sterling Do you think you would be able to acquire a cache access trace? That could be a text file with the key’s hash, one line per cache read. We could then simulate it to see how efficient the cache can be and which eviction policy works best. That could be helpful for optimizing your build cache implementations.
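For a rough first pass, even a tiny replay like this would give an estimate of the achievable hit rate for a given capacity (a toy sketch; the trace file name and the capacity are placeholders):

```groovy
// Toy replay of an access trace (one key hash per line) against an LRU cache
// with a fixed number of entries; file name and capacity are placeholders.
final int capacity = 2500
def lru = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) // access-ordered
long hits = 0
long misses = 0
new File('cache-access-trace.txt').eachLine { hash ->
    if (lru.get(hash) != null) {
        hits++                            // a hit also refreshes the entry's recency
    } else {
        misses++
        lru.put(hash, Boolean.TRUE)
        if (lru.size() > capacity) {
            // evict the least recently used key
            def eldest = lru.keySet().iterator().next()
            lru.remove(eldest)
        }
    }
}
printf('Hit rate: %.1f%% (%d hits, %d misses)%n', 100.0 * hits / (hits + misses), hits, misses)
```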
I know we keep track of that in some way.
@luke_daley @mark_vieira Do we keep a complete cache access log? Is this something we could export to have Ben take a look?
@Scott_Pierce, do you think this would work for you? https://github.com/myniva/gradle-s3-build-cache
Potentially! This looks really cool. Once Gradle 3.5 is out I’ll give it a shot. Thanks!
Yes, this is definitely something we could work up from our logs. I expect it would depend a lot on the project, etc. Right now it’s hard to separate that information from our logs as we use a single cache server for a bunch of different builds.
FWIW, we also just use a simple LRU strategy for eviction, and for our purposes the cache size is much larger than what we push in a given day, so we don’t actually evict a lot. For example, with a 50 GB cache we’ve had 0 evictions in the last week. I’m sure organizations with a larger build farm would encounter more evictions.
@Benjamin_Manes you can find this here: http://ge.tt/3pHzFXj2
That’s about 30 days worth of data.
Thanks @luke_daley and @mark_vieira!
The trace has 2 million requests, and an unbounded cache tops out at a 92.5% hit rate (the rest are compulsory misses). The number of entries needed is quite small, in the range of hundreds to the low thousands. The optimal policy reaches that ceiling at about 2,500 entries, but practically gets there at around 1,250.
LRU does great, with FIFO and Random as strong runners-up. Frequency has a negative impact, causing LFU to have a very poor hit rate. When the cache size is reduced, e.g. to 500 entries, policies that try to account for both recency and frequency can squeeze out a few extra points. But overall the simple classic policies are nearly perfect because history has little benefit.
If the local cache is absorbing hits and this trace only shows those from the perspective of the remote requests, then this could explain the results. In that case the unchanged artifacts are local and the up-to-date check quickly skips over them.
For now LRU is an excellent strategy, and until we get more data there is no reason to switch.
This is now part of the simulator so you can periodically check to see if the above stays true.
@Benjamin_Manes thanks for doing that.
Note that this is from our internal server, which has a reasonably homogeneous workload. Other setups where a wider variety of projects is being built may yield different results.
I’ll be sure to run the simulator on other data sets that I can get.
This is awesome feedback @Benjamin_Manes. Thanks!