Environment (Required on creation):
Camunda Platform running as a cluster of 2 or more nodes.
Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):
As described in SUPPORT-12684, with:
- Multiple nodes
- The first node in the BPMN after the start event is an async-before delegate task that takes too long to execute, e.g.:
- The job executor property lock-time-in-millis is 5 minutes (300,000 ms); see the configuration sketch after this list.
- Sometimes the delegate takes 15+ minutes to execute, e.g. due to an unexpected slowdown in the SOAP response from an external system.
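For context, a minimal sketch of how lock-time-in-millis could be raised above the worst-case delegate runtime; the plugin class name and the 20-minute value are illustrative assumptions, not the actual configuration in use (in a Spring setup the same property can also be set directly on the job executor bean):

import org.camunda.bpm.engine.impl.cfg.AbstractProcessEnginePlugin;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

// Illustrative plugin: raises the job lock timeout above the worst-case SOAP delay
// so the lock does not expire while the delegate is still running.
public class RaiseLockTimePlugin extends AbstractProcessEnginePlugin {

  @Override
  public void postInit(ProcessEngineConfigurationImpl configuration) {
    // Default lock-time-in-millis is 300000 ms (5 minutes); 20 minutes is an assumed worst case.
    configuration.getJobExecutor().setLockTimeInMillis(20 * 60 * 1000);
  }
}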
It is possible for the nodes to become locked in a cycle of constantly processing the same jobs, hitting OptimisticLockingExceptions (OLEs), and producing debug logging like:
2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [32480db8-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.850 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [334d21ec-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.850 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [359d2bb0-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.850 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14011 Job acquisition thread sleeping for 99 millis
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [32480db8-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [334d21ec-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [359d2bb0-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14011 Job acquisition thread sleeping for 100 millis
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [32480db8-8029-11ec-acaf-005056bf79af]
2022-01-28 11:59:00.050 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [334d21ec-8029-11ec-acaf-005056bf79af]
2022-01-28 11:59:00.050 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14023 Execute jobs for process engine 'default': [359d2bb0-8029-11ec-acaf-005056bf79af]
2022-01-28 11:59:00.050 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor : ENGINE-14011 Job acquisition thread sleeping for 99 millis
Steps to reproduce (Required on creation):
Analysis from this comment regarding the scenario that causes the above logging:
- As the transaction boundary / asynchronous continuation is reached, a job is created and inserted into ACT_RU_JOB
- Job acquisition on both nodes polls the database
- Job acquisition on both nodes competes and tries to lock the job in ACT_RU_JOB
- Job acquisition on one of the process engines succeeds and locks the job in ACT_RU_JOB
- Job acquisition submits the job to the execution thread pool's queue
- An execution thread picks up the job and executes it
- The job takes more than 5 minutes to finish (because the SOAP request took more than 5 minutes to return a response); a delegate sketch reproducing this follows the list
- The job gets unlocked in the database (because the lock-time-in-millis of 5 minutes has expired)
- The same job is now available to be picked up by the job executor again
- Another node picks up the same job
- Now both nodes are executing the same job
- This leads to a situation where the same entity is updated by both executions at the same time, causing OptimisticLockingExceptions
- On an OptimisticLockingException the job keeps retrying until it succeeds; in this case, however, the OLE keeps happening, so the job keeps retrying
- As the thread pool queue fills up with such long-running jobs, new jobs are rejected and eventually no new jobs are acquired, leading to the blocked-jobs situation (see the queue-sizing sketch after this list)
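To reproduce without a real SOAP endpoint, something along the lines of the following sketch should trigger the same cycle on a 2-node cluster sharing one database. The process key, delegate class, and sleep duration are illustrative assumptions, not taken from the original deployment:

import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;
import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class SlowSoapRepro {

  // Stand-in for the slow SOAP call: sleeps longer than lock-time-in-millis,
  // so the job lock expires while the delegate is still executing.
  public static class SlowSoapDelegate implements JavaDelegate {
    @Override
    public void execute(DelegateExecution execution) throws Exception {
      Thread.sleep(6 * 60 * 1000L); // 6 minutes > the 5-minute lock time
    }
  }

  // Deploys an async-before service task (the asynchronous continuation is what
  // creates the job in ACT_RU_JOB) and starts one instance. Start instances via
  // both cluster nodes so their job executors compete for the same jobs.
  public static void deployAndStart(ProcessEngine engine) {
    BpmnModelInstance model = Bpmn.createExecutableProcess("slowSoapProcess")
        .startEvent()
        .serviceTask("callSoap")
          .camundaAsyncBefore()
          .camundaClass(SlowSoapDelegate.class.getName())
        .endEvent()
        .done();

    engine.getRepositoryService()
        .createDeployment()
        .addModelInstance("slowSoapProcess.bpmn", model)
        .deploy();

    engine.getRuntimeService().startProcessInstanceByKey("slowSoapProcess");
  }
}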
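Because the final step depends on the execution queue filling up, deliberately small pool/queue sizes make the "queue full, jobs rejected, nothing new acquired" state reachable within a few lock expirations. A hedged sketch, assuming Spring-based wiring similar to the SpringJobExecutor seen in the logs; bean names and sizes are assumptions, and the resulting jobExecutor bean still has to be set on the engine configuration:

import org.camunda.bpm.engine.impl.jobexecutor.CallerRunsRejectedJobsHandler;
import org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Hypothetical wiring with small pool/queue sizes for a test cluster, so the
// queue fills quickly once the same jobs start being re-acquired and re-submitted.
@Configuration
public class SmallQueueJobExecutorConfig {

  @Bean
  public ThreadPoolTaskExecutor jobTaskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(3);
    taskExecutor.setMaxPoolSize(3);
    taskExecutor.setQueueCapacity(3); // fills quickly once jobs start repeating
    return taskExecutor;
  }

  @Bean
  public SpringJobExecutor jobExecutor(ThreadPoolTaskExecutor jobTaskExecutor) {
    SpringJobExecutor jobExecutor = new SpringJobExecutor();
    jobExecutor.setTaskExecutor(jobTaskExecutor);
    jobExecutor.setRejectedJobsHandler(new CallerRunsRejectedJobsHandler());
    return jobExecutor;
  }
}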
Observed Behavior (Required on creation):
The DEBUG logs are unhelpful and a bit confusing:
- All that is logged is:
- Acquired 0 jobs for process engine 'default'
- Then both nodes process 3 jobs.
- The above repeats for the exact same 3 jobs, ad infinitum.
If 0 jobs were acquired, how were there 3 jobs to execute?
Where are the OptimisticLockingException errors?
Expected behavior (Required on creation):
Helpful DEBUG logs (see the handler sketch after this list), including:
- available job execution threads
- jobs rejected
- number of jobs to acquire
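As an interim workaround for the missing visibility, a decorating RejectedJobsHandler could surface at least the rejected job IDs. This is a sketch against the engine's internal RejectedJobsHandler SPI, so the exact signature should be verified against the engine version in use; the class name and log format are assumptions:

import java.util.List;

import org.camunda.bpm.engine.impl.ProcessEngineImpl;
import org.camunda.bpm.engine.impl.jobexecutor.JobExecutor;
import org.camunda.bpm.engine.impl.jobexecutor.RejectedJobsHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical decorator: logs which job IDs were rejected (and how many)
// before handing off to whatever handler is already configured.
public class LoggingRejectedJobsHandler implements RejectedJobsHandler {

  private static final Logger LOG = LoggerFactory.getLogger(LoggingRejectedJobsHandler.class);

  private final RejectedJobsHandler delegate;

  public LoggingRejectedJobsHandler(RejectedJobsHandler delegate) {
    this.delegate = delegate;
  }

  @Override
  public void jobsRejected(List<String> jobIds, ProcessEngineImpl processEngine, JobExecutor jobExecutor) {
    LOG.debug("Execution queue full: {} job(s) rejected for engine '{}': {}",
        jobIds.size(), processEngine.getName(), jobIds);
    delegate.jobsRejected(jobIds, processEngine, jobExecutor);
  }
}

It would be registered before the executor starts, e.g. via jobExecutor.setRejectedJobsHandler(new LoggingRejectedJobsHandler(existingHandler)).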