I apologize in advance for how huge this post is.
I’m running into problems as I increase the number of tasks in my Geb test suite: Gradle can’t seem to handle several builds running concurrently. The suite has run successfully for 4 years, but adding new tasks seems to have pushed my system past some threshold. At first this looked like connection errors from the Selenoid grid, but after I upgraded versions, new error messages appeared that make the problem a little clearer. If anyone can help I would greatly appreciate it; I’ve been at this for a month and can’t get to the bottom of it. Ideally the solution would leave room to scale up further, since this really seems to be a problem with concurrent tasks.
I will note that reducing the number of tasks scheduled starts to fix this problem, but even at this point I’m not running all of the tasks I’d like to.
Tests run just fine for a few hours, and then the first sign of a problem comes through this message, showing up at the end of a test:
Couldn't flush user prefs: java.util.prefs.BackingStoreException: Couldn't get file lock.
Although my tests are set to time out after 7 minutes (4 minutes is the expected runtime), the builds start taking longer and longer to complete, starting at 25 minutes and lasting as long as 4 hours (increasing as the container gets closer to its memory limit).
After a few of these messages, the following start to appear (sensitive info redacted):
[redacted]TestReport: Timeout waiting to lock daemon addresses registry. It is currently in use by another Gradle instance.
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Owner PID: 21092
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Our PID: 21030
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Owner Operation:
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Our operation:
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Lock file: /root/.gradle/daemon/6.8.2/registry.bin.lock
I believe this comes from tasks competing for the locked file: each task takes increasingly longer to complete, more tasks start while the previous ones are still running, and eventually the docker container runs out of memory altogether.
Each gradle task takes up to 400 MB, but even so there aren’t enough tasks scheduled for memory to run out solely because too many tasks are running at once.
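The "more tasks start while the previous ones are still running" part can be illustrated with a small guard around each cron command. This is only a minimal sketch, assuming the `flock` utility from util-linux is available in the container; `run_exclusive` and the `LOCK_DIR` location are hypothetical names and not part of the setup described here.

```shell
#!/bin/sh
# Hypothetical helper: skip a run when the previous run of the same job
# is still holding its lock, instead of piling up overlapping builds.
LOCK_DIR="${LOCK_DIR:-/tmp/test-job-locks}"   # assumed location

run_exclusive() {
    rx_job="$1"
    shift
    mkdir -p "$LOCK_DIR"
    # -n makes flock fail immediately (non-zero exit) rather than wait
    # when another process already holds the lock file for this job.
    flock -n "$LOCK_DIR/$rx_job.lock" "$@"
}
```

A cron entry could then call something like `run_exclusive test1 ./gradlew clean test ...`, so a run is skipped while the previous run of the same job is still going. This would not stop different jobs from contending for the shared daemon registry lock; it only prevents a job from overlapping with itself.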
I don’t think there is much code I can show to help with this question, apart from the scheduling of cron tasks and including what gradle settings I have in place.
My attempt to solve this problem started with upgrading to the most recent compatible versions, which resulted in 2 weeks of dependency hell, so the version numbers below reflect this.
ways I tried to fix this:
- upgrade dependency versions
- increase jvm heap size, perm size
- split tasks into two docker containers, each running half of the 17 tasks scheduled
- disable gradle build cache (--no-build-cache)
- enable, disable parallel execution (maxParallelForks in build.gradle and org.gradle.parallel in gradle.properties)
- org.gradle.unsafe.configuration-cache=true in gradle.properties
- increase maximum number of browser containers in selenoid to 18 (reduce connection pool throttling/timeout waiting for a connection)
- adding timeouts per build
- improve docker cleanups to make sure ecs containers aren’t swamped with dead containers (docker system prune on a cron schedule)
- separate build directories per task (named after task name)
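For reference, the "timeouts per build" item could look roughly like this at the shell level. It is a minimal sketch, assuming GNU coreutils `timeout` is installed; `run_with_limit` and `JOB_TIMEOUT` are hypothetical names, and the 10-minute default is only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: put a hard wall-clock limit on one build command.
JOB_TIMEOUT="${JOB_TIMEOUT:-10m}"   # illustrative default, not a measured value

run_with_limit() {
    # Send SIGTERM once JOB_TIMEOUT has elapsed, then SIGKILL 30 seconds
    # later if the command is still running.
    timeout --kill-after=30s "$JOB_TIMEOUT" "$@"
}
```

With GNU `timeout`, the exit status is 124 when the limit is hit and the command stops on SIGTERM, which makes timed-out runs easy to spot in the logs. Whether stopping the `gradlew` client also stops work already handed to a Gradle daemon is not verified here.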
things I think would help
- a way to designate separate caches per task so they don’t fight over locks; I can’t find a way to implement this
- I was recommended to try running multiple tasks in the same JVM by a java/maven user, but this doesn’t seem possible in gradle (each daemon has its own jvm as I understand) and these tasks don’t play well sharing build spaces.
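One option that might cover the "separate caches per task" idea is Gradle's `--gradle-user-home` command-line option (also settable through the `GRADLE_USER_HOME` environment variable), together with `--project-cache-dir`. The lock file in the log above lives under `/root/.gradle`, which is the default Gradle user home, so a different home per job should mean a different `registry.bin.lock` per job. The sketch below is untested against this setup; `job_gradle_home`, `run_isolated`, `GRADLE_HOMES_ROOT` and `GRADLEW` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: give every scheduled job its own Gradle user home,
# so each job has its own daemon registry and caches.
GRADLE_HOMES_ROOT="${GRADLE_HOMES_ROOT:-/var/gradle-homes}"   # assumed location
GRADLEW="${GRADLEW:-./gradlew}"

job_gradle_home() {
    # One directory per job name, e.g. /var/gradle-homes/test1
    printf '%s/%s\n' "$GRADLE_HOMES_ROOT" "$1"
}

run_isolated() {
    ri_job="$1"
    shift
    ri_home="$(job_gradle_home "$ri_job")"
    mkdir -p "$ri_home"
    # --gradle-user-home relocates the daemon registry and dependency caches;
    # --project-cache-dir relocates the per-project .gradle directory.
    "$GRADLEW" --gradle-user-home "$ri_home" \
        --project-cache-dir "$ri_home/project-cache" "$@"
}
```

The trade-off is that every home downloads its own copy of the dependencies and starts its own daemon, so disk and memory use grow with the number of jobs.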
code and infrastructure
stack:
- geb (3.0.1)
- groovy
- gradle (4.3.1 → 6.8.2)
- docker (running ubuntu)
- java (jre 1.8 → jre 11/java 8 → java 11)
- selenium (3.141.59 → 4.0.0-alpha-7)
- junit (4.12)
- spock (1.1-groovy-2.4)
- selenoid grid (remote)
- docker distributing to ECS containers (16gb)
general idea of how system works
- test (type: Test) runs test, generates and stores test report and relevant info
- report (JavaExec task) runs immediately after, sending stored info to appropriate places (firestore and slack)
cron schedule
*/5 * * * * root cd /tests ; ./gradlew clean test --tests [test1] -Dtest.single=[test1] -DenableVideo=true -Denvironment=http://[hub address]/wd/hub -Dbrowser=chrome --no-build-cache --stacktrace --info 2>&1 | logger -t [test1] ; ./gradlew report --no-build-cache --stacktrace --info 2>&1 | logger -t [test1]Report
*/11 * * * * root cd /tests ; ./gradlew clean test --tests [test2] -Dtest.single=[test2] -DenableVideo=true -Denvironment=http://[hub address]/wd/hub -Dbrowser=chrome --no-build-cache --stacktrace --info 2>&1 | logger -t [test2] ; ./gradlew report -DclassName=[test2] --no-build-cache --stacktrace --info 2>&1 | logger -t [test2]Report
*/8 * * * * root cd /tests ; ./gradlew clean test --tests [test3] -Dtest.single=[test3] -Denvironment=http://[hub address]/wd/hub -DenableVideo=true -Dbrowser=chrome --no-build-cache --stacktrace --info 2>&1 | logger -t [test3] ; ./gradlew report -DclassName=[test3] --no-build-cache --stacktrace --info 2>&1 | logger -t [test3]Report
3 * * * * root cd /tests ; ./gradlew clean test --tests [test4] -Dtest.single=[test4] -DenableVideo=true -Denvironment=http://[redacted]/wd/hub -Dbrowser=chrome --no-build-cache --stacktrace --info 2>&1 | logger -t [test4] ; ./gradlew report -DclassName=[test4] --no-build-cache --stacktrace --info 2>&1 | logger -t [test4]Report
8 * * * * root cd /tests ; ./gradlew clean test --tests [test5] -Dtest.single=[test5] -DenableVideo=true -Denvironment=http://[redacted]/wd/hub -Dbrowser=chrome --stacktrace --info 2>&1 | logger -t [test5] ; ./gradlew report -DclassName=[test5] --no-build-cache --stacktrace --info 2>&1 | logger -t [test5]Report
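Since the entries above differ only in the test name and schedule, the shared part could live in one small shell library so every job gets the same flags. This is a sketch only: `run_test`, `run_report`, `run_job`, `HUB_URL` and `GRADLEW` are hypothetical names, the flags are copied from the cron lines above, and `HUB_URL` has to be set to the real hub address.

```shell
#!/bin/sh
# Hypothetical shared library for the cron entries.
GRADLEW="${GRADLEW:-./gradlew}"
COMMON_FLAGS="--no-build-cache --stacktrace --info"

run_test() {
    # $COMMON_FLAGS is left unquoted on purpose so it splits into separate flags.
    "$GRADLEW" clean test --tests "$1" -Dtest.single="$1" -DenableVideo=true \
        -Denvironment="${HUB_URL:?HUB_URL must be set}" -Dbrowser=chrome $COMMON_FLAGS
}

run_report() {
    "$GRADLEW" report -DclassName="$1" $COMMON_FLAGS
}

run_job() {
    # Same order as the cron lines: test first, then the report task,
    # each piped to syslog under its own tag.
    run_test "$1" 2>&1 | logger -t "$1"
    run_report "$1" 2>&1 | logger -t "${1}Report"
}
```

Each cron line would then shrink to sourcing this file from `/tests` and calling `run_job` with the test name, which also keeps flags such as `--no-build-cache` and `-DclassName` from drifting between entries.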
gradle infrastructure (from build scan)
- Background build scan publication: On
- Build Cache: On
- Daemon: On
- Configuration Cache: Off
- Configure on demand: Off
- Continue on failure: Off
- Continuous: Off
- Dry run: Off
- File system watching: Off
- Offline: Off
- Parallel: Off
- Re-run tasks: Off
- Refresh dependencies: Off
- Task inputs file capturing: Off
ecs/docker infrastructure
- Operating system: Linux 4.14.33-51.37.amzn1.x86_64
- CPU cores: 1 core
- Max Gradle workers: 1 worker
- Java runtime: Ubuntu OpenJDK Runtime Environment 11.0.10+9-Ubuntu-0ubuntu1.20.04
- Java VM: Ubuntu OpenJDK 64-Bit Server VM 11.0.10+9-Ubuntu-0ubuntu1.20.04 (mixed mode, sharing)
- Max JVM memory heap size: 1.9 GiB
- Locale: English (United States)
- Default charset: US-ASCII
- Username: root
here is an example project that my project is based on:
thanks so much for your time if you’re able to help with this.