Multiple Geb tests fighting over locks / causing out-of-memory issues

I apologize in advance for how huge this post is.

I’m running into problems while increasing the number of tasks in my Geb test suite: Gradle can’t seem to handle multiple concurrent builds at the same time. This setup has run successfully for 4 years, but it seems the addition of new tasks has hit some sort of threshold in my system. Initially this looked like connection errors from the Selenoid grid, but after upgrading versions, new error messages started appearing that make the problem a little clearer. If anyone can help with this I would greatly appreciate it; I’ve been at it a month and can’t seem to get to the bottom of it. Ideally the solution would leave room for further scaling up, since this really seems to be a problem with concurrent tasks.

I will note that reducing the number of scheduled tasks starts to fix this problem, but even at that reduced count I’m not running all of the tasks I’d like to.

Tests run just fine for a few hours, and then the first sign of trouble is this message, showing up at the end of a test:

Couldn't flush user prefs: java.util.prefs.BackingStoreException: Couldn't get file lock.

Although my tests are set to time out after 7 minutes (4 minutes is the expected runtime), the builds start taking longer and longer to complete, starting at 25 minutes and stretching to as long as 4 hours (growing as the container approaches the memory limit).

After a few of these messages, the following starts to appear (sensitive info redacted):

[redacted]TestReport: Timeout waiting to lock daemon addresses registry. It is currently in use by another Gradle instance.
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Owner PID: 21092
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Our PID: 21030
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Owner Operation: 
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Our operation: 
Feb 18 15:10:35 ip-172-30-3-121 [redacted]TestReport: Lock file: /root/.gradle/daemon/6.8.2/registry.bin.lock

I believe what’s happening is: tasks compete for the locked file, each task takes increasingly longer to complete, more tasks start while previous ones are still running, and eventually the Docker container runs out of memory altogether.

Each Gradle task takes up to 400 MB, but even then there aren’t enough tasks scheduled for that alone to cause a memory failure: with 17 tasks scheduled, all of them running at once would peak around 6.8 GB, well under the 16 GB available.

I don’t think there is much code I can show to help with this question, apart from the cron scheduling and the Gradle settings I have in place.

My attempt to solve this problem started with upgrading to the most recent compatible versions, which resulted in 2 weeks of dependency hell, so the version numbers below reflect this.

ways I tried to fix this:

  • upgrade dependency versions
  • increase JVM heap size and perm size
  • split tasks into two Docker containers, each running half of the 17 scheduled tasks
  • disable the Gradle build cache (--no-build-cache)
  • enable/disable parallel execution (maxParallelForks in build.gradle and org.gradle.parallel in gradle.properties; see the sketch after this list)
  • org.gradle.unsafe.configuration-cache=true in gradle.properties
  • increase the maximum number of browser containers in Selenoid to 18 (to reduce connection-pool throttling/timeouts waiting for a connection)
  • add timeouts per build
  • improve Docker cleanup to make sure ECS containers aren’t swamped with dead containers (docker system prune on a cron schedule)
  • separate build directories per task (named after the task name)
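For reference, the parallel and timeout settings I’ve been toggling look roughly like this (a simplified sketch of my actual files):

gradle.properties:

org.gradle.parallel=false
org.gradle.unsafe.configuration-cache=true

build.gradle:

test {
    maxParallelForks = 1
    // per-task timeout: tests are expected to finish in 4 minutes, so 7 gives slack
    timeout = java.time.Duration.ofMinutes(7)
}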

things I think would help

  • a way to designate separate caches per task, to avoid fighting for access to locks; I can’t find a way to implement this
  • a Java/Maven user recommended running multiple tasks in the same JVM, but this doesn’t seem possible in Gradle (each daemon has its own JVM, as I understand it), and these tasks don’t play well sharing build spaces

code and infrastructure

stack:

  • Geb (3.0.1)
  • Groovy
  • Gradle (4.3.1 → 6.8.2)
  • Docker (running Ubuntu)
  • Java (JRE 1.8 → JRE 11 / Java 8 → Java 11)
  • Selenium (3.141.59 → 4.0.0-alpha-7)
  • JUnit (4.12)
  • Spock (1.1-groovy-2.4)
  • Selenoid grid (remote)
  • Docker distributing to ECS containers (16 GB)

general idea of how system works

  • test (type: Test) runs the test, then generates and stores the test report and relevant info
  • report (JavaExec task) runs immediately after, sending the stored info to the appropriate places (Firestore and Slack); a simplified sketch of the wiring follows
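Roughly, the wiring in build.gradle looks like this (a trimmed sketch with names changed; the report main class here is hypothetical):

test {
    useJUnit()
    // pass run configuration from the -D flags through to the forked test JVM
    systemProperty 'environment', System.getProperty('environment')
    systemProperty 'browser', System.getProperty('browser')
    systemProperty 'enableVideo', System.getProperty('enableVideo')
}

task report(type: JavaExec) {
    classpath = sourceSets.test.runtimeClasspath
    main = 'reporting.ReportSender' // hypothetical class; pushes stored results to Firestore and Slack
    systemProperty 'className', System.getProperty('className')
}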

cron schedule

*/5 * * * * root cd /tests ; ./gradlew clean test --tests [test1] -Dtest.single=[test1] -DenableVideo=true -Denvironment=http://[hub address]/wd/hub -Dbrowser=chrome   --no-build-cache  --stacktrace --info 2>&1 | logger -t [test1] ; ./gradlew report --no-build-cache --stacktrace --info  2>&1  | logger -t [test1]Report

*/11 * * * * root cd /tests ; ./gradlew clean test --tests [test2] -Dtest.single=[test2] -DenableVideo=true -Denvironment=http://[hub address]/wd/hub -Dbrowser=chrome --no-build-cache  --stacktrace --info  2>&1 | logger -t [test2] ; ./gradlew report  -DclassName=[test2] --no-build-cache --stacktrace --info  2>&1  | logger -t [test2]Report

*/8 * * * * root cd /tests ; ./gradlew clean test --tests [test3] -Dtest.single=[test3] -Denvironment=http://[hub address]/wd/hub -DenableVideo=true -Dbrowser=chrome --no-build-cache  --stacktrace --info  2>&1 | logger -t [test3] ; ./gradlew report -DclassName=[test3]  --no-build-cache --stacktrace --info  2>&1 | logger -t [test3]Report

3 * * * * root cd /tests ; ./gradlew clean test --tests [test4] -Dtest.single=[test4] -DenableVideo=true -Denvironment=http://[redacted]/wd/hub -Dbrowser=chrome --no-build-cache  --stacktrace --info  2>&1 | logger -t [test4] ; ./gradlew report -DclassName=[test4] --no-build-cache --stacktrace --info   2>&1  | logger -t [test4]Report

8 * * * * root cd /tests ; ./gradlew clean test --tests [test5] -Dtest.single=[test5] -DenableVideo=true -Denvironment=http://[redacted]/wd/hub -Dbrowser=chrome --stacktrace --info  2>&1 | logger -t [test5] ; ./gradlew report -DclassName=[test5] --no-build-cache --stacktrace --info   2>&1  | logger -t [test5]Report

gradle infrastructure (from build scan)

Background build scan publication	On	
Build Cache	On	
Daemon	On	
Configuration Cache	Off	
Configure on demand	Off	
Continue on failure	Off	
Continuous	Off	
Dry run	Off	
File system watching	Off	
Offline	Off	
Parallel	Off	
Re-run tasks	Off	
Refresh dependencies	Off	
Task inputs file capturing	Off

ecs/docker infrastructure

Operating system	Linux 4.14.33-51.37.amzn1.x86_64	
CPU cores	1 core	
Max Gradle workers	1 worker	
Java runtime	Ubuntu OpenJDK Runtime Environment 11.0.10+9-Ubuntu-0ubuntu1.20.04	
Java VM	Ubuntu OpenJDK 64-Bit Server VM 11.0.10+9-Ubuntu-0ubuntu1.20.04 (mixed mode, sharing)	
Max JVM memory heap size	1.9 GiB	
Locale	English (United States)	
Default charset	US-ASCII	
Username	root

here is an example project that my project is based on:

thanks so much for your time if you’re able to help with this.

As you said, this post is pretty huge, and there are a number of details mentioned that could use some feedback. However, I’ll focus on the ones that stand out most to start with.

First, you seem to be mixing up a few terms. A Gradle build executes tasks in a daemon process. You have each cron job running two Gradle builds. The timeouts you’re seeing are about having multiple build processes (not tasks) competing for and locking the same resources on disk.

The relevant resources are created per GRADLE_USER_HOME. You can specify a different GRADLE_USER_HOME per cron job by passing it with the -g / --gradle-user-home option to ./gradlew.
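For example, adapting your first cron entry (the gradle-homes path here is just an illustration; the elided flags stay as they are):

*/5 * * * * root cd /tests ; ./gradlew -g /root/gradle-homes/test1 clean test --tests [test1] ... 2>&1 | logger -t [test1] ; ./gradlew -g /root/gradle-homes/test1 report ... 2>&1 | logger -t [test1]Report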

Running multiple tasks in the same JVM is the usual case, but your cron jobs seem to be designed specifically to avoid that. Generally, if you just ran ./gradlew clean test report, that’s exactly what would happen, but you’re splitting each test out and invoking it as a separate build. I can’t tell exactly what’s driving this, but it is possible to invoke one build and configure it so that each test forks a separate JVM to execute in. That keeps the isolation in test execution without the competition between different builds.
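For instance, a minimal sketch in build.gradle (the values are illustrative):

test {
    forkEvery = 1        // fork a fresh test JVM after every test class
    maxParallelForks = 2 // how many forked test JVMs may run at once
}

That way each cron job can stay a single build (./gradlew clean test report --tests [testN]) while the test itself still runs in its own JVM.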

There still seems like more to unpack here, but start with that and see if there’s anything that you want to talk about specifically.

James, thanks so much for your response. This actually clears up a lot about what I was misunderstanding here, and I also really appreciate you untangling the terminology. I will try to reconfigure my setup with this in mind and report back!

Having separate builds per cron job is driven by the need to run specific test cases, each on their own schedule throughout the day. I might need to do a really deep dive into best practices here to configure this the right way. From what I’m understanding, though, I could keep the same cron schedule going without the cron jobs interfering with one another if I designate a unique GRADLE_USER_HOME per cron job.

Again thanks for the helpful response and I’ll let you know if these solutions solved my problem!

Hello, I’ve made the adjustments you recommended and they helped somewhat: everything stays up a bit longer.

A call to “ps aux” while the container was in a locked-up state showed 54 active JVM instances (started by the Gradle builds). There are only 19 cron jobs, so it seems to me like there’s an issue with JVMs hanging around after task execution.

Attached is the memory usage graph from AWS, which suggests some sort of memory leak to me. I can’t seem to get to the bottom of what is causing memory to just continually increase. I’ve checked to make sure the problem isn’t in the file system, so it does seem to have to do with the running processes.

One solution I’m looking at is killing processes that have been running for longer than a certain period of time, but I don’t think that is an ideal or sustainable solution.

This number of JVMs is not really unexpected. Your 19 cron jobs would create about 57 JVMs if they were all running at the same time. Every executing Gradle build involves a minimum of 2 JVMs: a client JVM started from the command line, which connects to the daemon process that does the actual work of the build. However, test execution is forked into yet another separate JVM, so each of your cron jobs would have at least 3 JVMs, and possibly more depending on the test configuration in your build.

Additionally, the daemon process that starts will continue running even after the build completes, to be reused by subsequent builds. Essentially this means that as builds stack up you’ll accumulate additional long-lived JVMs whenever there isn’t already an idle daemon available when the next cron job starts; only the client and test JVMs are stopped. I would expect this to look very close to your graph. For this environment, you might be better off running with the --no-daemon option, which stops the daemon after every build. This won’t reduce your peak memory needs, but it may give you additional breathing room if there is any waste from what might be cached for one cron job vs. another.
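Concretely, that’s just adding the flag to each gradlew invocation in your crontab:

./gradlew --no-daemon clean test --tests [test1] ... ; ./gradlew --no-daemon report ...

or, equivalently, setting it once in gradle.properties:

org.gradle.daemon=false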

Hi James, I figured out the issue. I think this is a bug with Gradle running on Java 11: even when running with --no-daemon, some JVMs still linger. For each build, 2 of them stuck around while the rest were cleaned up. I solved it by running a cron command to kill all Gradle processes that have been running for longer than 10 minutes. That’s not ideal, but it gets the job done for now. Thanks for your help!
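In case it helps anyone else, the cleanup cron entry looks something like this (a sketch; the 600-second threshold and the process match pattern may need tuning for your setup):

*/10 * * * * root ps -eo pid,etimes,args | awk '/[Gg]radle/ && $2 > 600 { print $1 }' | xargs -r kill -9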
