[CAM-14334] Improve job acquisition job execution logging

Type: Bug Report
Resolution: Unresolved
Priority: L3 - Default
Fix Version/s: None
Affects Version/s: None
Component/s: engine
Labels:
- logging

Environment (Required on creation):

Camunda Platform, 2 or more node cluster.

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):

As described in SUPPORT-12684, with:

Multiple nodes
The first node in the BPMN after start is an async before delegate task that takes too long long to execute, e.g.:
- Property lock-time-in-millis is 5 minutes.
- Sometimes the delegate takes 15 + minutes to execute, e.g: an unexpected slowdown in SOAP response from external system.

It is possible for the nodes to become locked in a cycle of constantly processing the same jobs, getting OLEs, and produce debug logging like:

2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [32480db8-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.850 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [334d21ec-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.850 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [359d2bb0-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.850 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 99 millis
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [32480db8-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [334d21ec-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [359d2bb0-8029-11ec-acaf-005056bf79af]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 100 millis
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [32480db8-8029-11ec-acaf-005056bf79af]
2022-01-28 11:59:00.050 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [334d21ec-8029-11ec-acaf-005056bf79af]
2022-01-28 11:59:00.050 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [359d2bb0-8029-11ec-acaf-005056bf79af]
2022-01-28 11:59:00.050 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 99 millis

Steps to reproduce (Required on creation):

Analysis from this comment regarding the scenario that causes the above logging:

As transaction boundary / asynchronous continuation is reached, job is created and inserted to ACT_RU_JOB
Job acquisition of both the nodes polls the database
Job acquisition of both the nodes competes and tries to lock the job in ACT_RU_JOB
But job acquisition of one of the process engine succeeds and locks the job in ACT_RU_JOB
Job acquisition submits the job to execution thread pool's queue
Execution thread picks up job and executes it
Job takes more than 5 minutes to finish (because SOAP requests took more than 5 minutes to return the response)
Job gets unlocked in the database (as lock-time-in-millis which is 5 minutes got expired)
The same job is now available to be picked up by Job executor
Another node picks the same Job
Now, both the nodes are executing same Job
This leads to the situation where in same entity is updated by both the jobs at the same time - causing OptimisticLockingExceptions
In case of OptimisticLockingException, job keeps on retrying until it succeeds. However it you case because of this situation, OLE keeps on happening and hence job keeps on retrying.
As Thread pool queue gets full with such long running Jobs, new jobs are rejected and eventually no new Jobs are acquired, leading to the job blocked situation.

Observed Behavior (Required on creation):

The DEBUG logs are unhelpful and a bit confusing:

All that is logged is:
- Acquired 0 jobs for process engine 'default'
- Then both nodes process 3 jobs.
  - Repeat above for the same exact 3 jobs, ad infinitum.

If 0 jobs were acquired, how were there 3 jobs to execute?
Where are the OLE exception errors?

Expected behavior (Required on creation):

Helpful DEBUG logs, including:

available job execution threads
jobs rejected
number of jobs to acquire

Root Cause (Required on prioritization):

Solution Ideas (Optional):

Hints (optional):

This is the controller panel for Smart Panels app

Thorben Lindhauer added a comment - 12/Oct/22 11:47 AM

This ticket was migrated to github: https://github.com/camunda/camunda-bpm-platform/issues/2666. Please use this link for any future references and continue any discussion there.

Thorben Lindhauer added a comment - 12/Oct/22 11:47 AM This ticket was migrated to github: https://github.com/camunda/camunda-bpm-platform/issues/2666 . Please use this link for any future references and continue any discussion there.

Details

Description

Environment (Required on creation):

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):

Steps to reproduce (Required on creation):

Observed Behavior (Required on creation):

Expected behavior (Required on creation):

Root Cause (Required on prioritization):

Solution Ideas (Optional):

Hints (optional):

mgm-controller-panel

This is the controller panel for Smart Panels app

Attachments

Activity

Collapse comment: Thorben Lindhauer added a comment - 12/Oct/22 11:47 AM

Expand comment: Thorben Lindhauer added a comment - 12/Oct/22 11:47 AM

People

Dates

Salesforce