[CAM-9964] The same history cleanup job is acquired and executed concurrently

Type: Bug Report
Resolution: Fixed
Priority: L3 - Default
Fix Version/s: 7.11.0, 7.10.5, 7.11.0-alpha4
Affects Version/s: 7.11.0-alpha3
Component/s: engine
Labels:
- SUPPORT

To keep it simple, the following three transactions are involved whenever a history cleanup job is being executed:

Job Acquisition: tx1 acquires the history cleanup job j1 by setting a lock owner and lock time.
Job Execution: tx2 executes the actual history cleanup job j1 by deleting historical data; in the same transaction, the history cleanup job is unlocked.
Job Scheduling: tx3 updates the due date of the history cleanup job j1 - cf. [1]. When the due date is updated, any existing job lock is removed as well. This transaction is executed whenever tx2 commits (i.e. a transaction listener is registered).

The transactions are being executed in the following order:

1. tx1 acquires j1.
2. tx1 commits.
3. tx2 unlocks j1 (and executes the actual work).
4. tx2 commits.
5. tx1 acquires j1 (because j1 is unlocked - see 3. and 4.).
6. tx1 commits.
7. tx3 updates the due date and unlocks j1.
8. tx3 commits.
9. tx1 acquires j1 (because j1 is unlocked - see 7. and 8.).
10. tx1 commits.
...

As a result of this execution sequence, the job j1 is acquired more than once, because right after a successful acquisition the job gets unlocked again (without even being executed). Because each of these acquisitions succeeds, the history cleanup job is published to the job queue multiple times.
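
The interleaving above can be replayed with a small in-memory model. This is only a sketch: the classes `Job` and `Executor`, the method `runInterleaving`, and the lock owner name are hypothetical stand-ins and not engine code, and due dates and lock expiry are not modeled. It merely counts how often j1 would be published to the job queue, depending on whether tx2 unlocks the job.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of the interleaving; all names here are hypothetical
// and none of this is Camunda engine code.
class CleanupJobRaceSketch {

    /** Stand-in for the job row: a null lock owner means "unlocked". */
    static final class Job {
        String lockOwner;

        void unlock() {
            lockOwner = null;
        }
    }

    /** Stand-in for job acquisition plus the executor's job queue. */
    static final class Executor {
        final List<String> queue = new ArrayList<>();

        // tx1: succeeds only while the job is unlocked, then publishes it.
        boolean acquire(Job job) {
            if (job.lockOwner != null) {
                return false;
            }
            job.lockOwner = "executor-1";
            queue.add("j1");
            return true;
        }
    }

    /** Replays the sequence and returns how often j1 was published. */
    static int runInterleaving(boolean tx2Unlocks) {
        Job j1 = new Job();
        Executor executor = new Executor();

        executor.acquire(j1);      // tx1 acquires j1 and commits
        if (tx2Unlocks) {
            j1.unlock();           // tx2 unlocks j1 and commits
        }
        executor.acquire(j1);      // tx1 runs again before tx3
        j1.unlock();               // tx3 updates the due date and unlocks j1
        executor.acquire(j1);      // tx1 runs once more

        return executor.queue.size();
    }

    public static void main(String[] args) {
        System.out.println(runInterleaving(true));   // prints 3
        System.out.println(runInterleaving(false));  // prints 2
    }
}
```

In this model, the acquisition between tx2 and tx3 is rejected as soon as tx2 no longer unlocks the job, so j1 is published once per unlock by tx3 only.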

Possible solution:

Make sure that the job is unlocked in only one place.
Unlocking could happen, for example, in HistoryCleanupSchedulerCmd.

Hint:

OptimisticLockingException: As tx2 unlocks the job, the job can be acquired again. So whenever tx1 and tx3 are executed concurrently afterwards, one of them will most likely fail with an OptimisticLockingException.
Lock cannot be acquired: Within tx2, unlocking j1 results in an UPDATE statement that acquires a database lock (at least) on the row to update. If tx2 is still active and the actual deletion of historical data takes longer than the job's lock time, the job executor will try to acquire job j1 again. Since tx2 holds the database lock, the job acquisition will fail because the database lock required to update j1 cannot be acquired within a specific timeout. As long as tx2 holds the database lock, the job acquisition of j1 will fail. The impact is that whenever the job executor tries to acquire several jobs and one of them is j1, the whole acquisition fails and none of those jobs are executed.
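
The first hint can be illustrated with a minimal sketch of a revision-checked update. It assumes the usual optimistic locking scheme in which every write is conditional on the revision the writer read earlier; `JobRow`, `update`, and `StaleRevisionException` are hypothetical names, not the engine's classes, and the database row lock from the second hint is not modeled.

```java
// Toy model of a revision-checked update; all names are hypothetical.
class OptimisticLockSketch {

    /** Stand-in for the failure raised when a writer's revision is stale. */
    static final class StaleRevisionException extends RuntimeException {
        StaleRevisionException(String message) {
            super(message);
        }
    }

    /** Stand-in for the job row with a revision counter. */
    static final class JobRow {
        int revision = 1;
        String lockOwner;
    }

    // Models an update that only applies if the row still has the
    // revision the writer saw when it read the row.
    static void update(JobRow row, int expectedRevision, String newLockOwner) {
        if (row.revision != expectedRevision) {
            throw new StaleRevisionException(
                "expected revision " + expectedRevision + " but found " + row.revision);
        }
        row.lockOwner = newLockOwner;
        row.revision = expectedRevision + 1;
    }

    /** tx1 and tx3 both read the unlocked row, then both try to write it. */
    static boolean secondWriterFails() {
        JobRow j1 = new JobRow();
        int seenByTx1 = j1.revision;
        int seenByTx3 = j1.revision;

        update(j1, seenByTx1, "executor-1");   // tx1 acquires j1
        try {
            update(j1, seenByTx3, null);       // tx3 unlocks j1 with a stale revision
            return false;
        } catch (StaleRevisionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondWriterFails());   // prints true
    }
}
```

Which of the two transactions fails depends only on which one writes second; the sketch fixes the order for readability.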

[1]: https://github.com/camunda/camunda-bpm-platform/blob/22e3e464cebce957e74917dcaf2d89731a590c54/engine/src/main/java/org/camunda/bpm/engine/impl/jobexecutor/historycleanup/HistoryCleanupSchedulerCmd.java

Thorben Lindhauer added a comment - 20/Mar/19 11:59 AM

roman.smirnov, where's the code in which tx2 unlocks the job?

Roman Smirnov added a comment - 20/Mar/19 5:18 PM

thorben.lindhauer, here https://github.com/camunda/camunda-bpm-platform/blob/680a518a3c8b3defd48b123b5d2fa0d18bee014c/engine/src/main/java/org/camunda/bpm/engine/impl/persistence/entity/EverLivingJobEntity.java#L66-L67

Assignee: Thorben Lindhauer
Reporter: Roman Smirnov
Votes: 0
Watchers: 2
Created: 19/Mar/19 9:16 PM
Updated: 27/Sep/19 12:34 PM
Resolved: 10/Apr/19 6:46 PM
