CAM-9964 (camunda BPM): The same history cleanup job is acquired and executed concurrently




      Simplified, the following three transactions are involved whenever a history cleanup job is executed:

      1. Job Acquisition: tx1 acquires the history cleanup job j1 by setting a lock owner and lock time.
      2. Job Execution: tx2 executes the actual history cleanup job j1 by deleting historical data; in the same transaction, j1 is also unlocked.
      3. Job Scheduling: tx3 updates the due date of the history cleanup job j1 (cf. [1]). Updating the due date also removes any existing job lock. This transaction is executed whenever tx2 commits (i.e., it is triggered by a registered transaction listener).
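To make the dependency between tx2 and tx3 concrete, here is a minimal Java sketch in which tx2 registers tx3 as an after-commit listener. `CommitListenerSketch`, `Job`, `Tx` and `addCommitListener` are hypothetical stand-ins, not Camunda's transaction API, and the listener runs synchronously in the same thread for simplicity. Note that both tx2 and tx3 clear the lock, which is what makes the interleaving described next possible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of tx2 triggering tx3 via a commit listener (not engine code).
class CommitListenerSketch {

    static class Job {
        String lockOwner;   // null means the job is unlocked
        long dueDate;
    }

    // Minimal transaction that runs its listeners after it has committed.
    static class Tx {
        private final List<Runnable> onCommit = new ArrayList<>();
        private final List<String> log;
        private final String name;

        Tx(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void addCommitListener(Runnable listener) {
            onCommit.add(listener);
        }

        void commit() {
            log.add(name + " committed");
            for (Runnable listener : onCommit) {
                listener.run();   // runs only after this transaction has committed
            }
        }
    }

    // tx3 (job scheduling): update the due date; this also removes any lock.
    static void schedule(Job job, long nextDueDate, List<String> log) {
        Tx tx3 = new Tx("tx3", log);
        job.dueDate = nextDueDate;
        job.lockOwner = null;
        tx3.commit();
    }

    // tx2 (job execution): delete history, unlock j1, register tx3 for after commit.
    static void execute(Job job, long nextDueDate, List<String> log) {
        Tx tx2 = new Tx("tx2", log);
        // ... delete historical data (omitted) ...
        job.lockOwner = null;
        tx2.addCommitListener(() -> schedule(job, nextDueDate, log));
        tx2.commit();
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Job j1 = new Job();
        j1.lockOwner = "executor-1";   // state right after tx1 acquired j1
        execute(j1, 60_000L, log);
        System.out.println(log);       // [tx2 committed, tx3 committed]
    }
}
```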

      The transactions are executed in the following order:

      1. tx1 acquires j1.
      2. tx1 commits.
      3. tx2 unlocks j1 (and executes the actual work).
      4. tx2 commits.
      5. tx1 acquires j1 (because j1 is unlocked - see 3. and 4.).
      6. tx1 commits.
      7. tx3 updates the due date and unlocks j1.
      8. tx3 commits.
      9. tx1 acquires j1 (because j1 is unlocked - see 7. and 8.).
      10. tx1 commits.
      11. ...
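The interleaving can be replayed single-threaded with a simplified job model. In this hypothetical sketch (`InterleavingReplay` and its fields are illustrative, not engine code), an acquisition succeeds whenever the lock owner is null, and an unlock that hits a job which was acquired but not yet executed is counted as a lost acquisition. Due dates are not modeled.

```java
// Hypothetical single-threaded replay of steps 1-10; "Job" is a simplified
// stand-in for a job row, not Camunda's job entity.
class InterleavingReplay {

    static class Job {
        String lockOwner;          // null means the job is unlocked
        boolean acquiredNotRun;    // acquired, but not yet executed
        int acquisitions;          // successful acquisitions (= publications to the job queue)
        int lostAcquisitions;      // acquisitions undone by an unlock before execution
    }

    // tx1: acquisition succeeds only if the job is currently unlocked.
    static boolean acquire(Job job, String owner) {
        if (job.lockOwner != null) {
            return false;
        }
        job.lockOwner = owner;
        job.acquiredNotRun = true;
        job.acquisitions++;
        return true;
    }

    // Removes the lock; counts it as lost if the acquired job never ran.
    static void unlock(Job job) {
        if (job.acquiredNotRun) {
            job.lostAcquisitions++;
        }
        job.lockOwner = null;
        job.acquiredNotRun = false;
    }

    // tx2: does the cleanup work and, in the current design, unlocks the job.
    static void execute(Job job) {
        job.acquiredNotRun = false;   // the acquired job has now been run
        unlock(job);
    }

    // tx3: updates the due date (omitted here) and unlocks the job.
    static void reschedule(Job job) {
        unlock(job);
    }

    static Job replay() {
        Job j1 = new Job();
        acquire(j1, "executor-1");   // steps 1-2
        execute(j1);                 // steps 3-4
        acquire(j1, "executor-1");   // steps 5-6: succeeds, j1 is unlocked
        reschedule(j1);              // steps 7-8: unlocks j1 before it was executed
        acquire(j1, "executor-1");   // steps 9-10: succeeds again
        return j1;
    }

    public static void main(String[] args) {
        Job j1 = replay();
        System.out.println("acquisitions=" + j1.acquisitions
                + ", lost=" + j1.lostAcquisitions);   // acquisitions=3, lost=1
    }
}
```

For a single execution, the replay records three successful acquisitions, one of which (step 5) is thrown away by tx3 before the job ever runs.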

      As a result of this execution sequence, job j1 is acquired twice: right after a successful acquisition (step 5), the job gets unlocked again (step 7) without even having been executed. Because each of these acquisitions succeeds, the history cleanup job is published to the job queue multiple times.

      Possible solution:

      • Make sure that the job is unlocked in only one place
      • Unlocking could happen, for example, in HistoryCleanupSchedulerCmd
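A sketch of the proposed change, using a simplified, hypothetical job model: tx2 keeps the lock, and the rescheduling step (standing in for HistoryCleanupSchedulerCmd) is the only place that unlocks. Due dates and lock expiry are not modeled, so the final acquisition simply stands for the next scheduled run.

```java
// Hypothetical sketch of the proposed fix: tx2 no longer unlocks the job;
// the rescheduling step (tx3) is the only place that removes the lock.
class SingleUnlockSketch {

    static class Job {
        String lockOwner;          // null means the job is unlocked
        boolean acquiredNotRun;    // acquired, but not yet executed
        int acquisitions;          // successful acquisitions
        int rejectedAcquisitions;  // attempts that found the job locked
        int lostAcquisitions;      // acquisitions undone by an unlock before execution
    }

    static boolean acquire(Job job, String owner) {
        if (job.lockOwner != null) {
            job.rejectedAcquisitions++;
            return false;
        }
        job.lockOwner = owner;
        job.acquiredNotRun = true;
        job.acquisitions++;
        return true;
    }

    // tx2: performs the cleanup but leaves the lock in place.
    static void execute(Job job) {
        // ... delete historical data (omitted) ...
        job.acquiredNotRun = false;
    }

    // tx3: updates the due date (omitted) and is the only place that unlocks.
    static void reschedule(Job job) {
        if (job.acquiredNotRun) {
            job.lostAcquisitions++;
        }
        job.lockOwner = null;
        job.acquiredNotRun = false;
    }

    static Job replay() {
        Job j1 = new Job();
        acquire(j1, "executor-1");   // tx1 acquires j1
        execute(j1);                 // tx2: j1 stays locked
        acquire(j1, "executor-1");   // rejected: j1 is still locked
        reschedule(j1);              // tx3: new due date, lock removed
        acquire(j1, "executor-1");   // next scheduled run
        return j1;
    }

    public static void main(String[] args) {
        Job j1 = replay();
        System.out.println("acquisitions=" + j1.acquisitions
                + ", rejected=" + j1.rejectedAcquisitions
                + ", lost=" + j1.lostAcquisitions);   // acquisitions=2, rejected=1, lost=0
    }
}
```

Under these assumptions, the acquisition attempt between tx2 and tx3 is rejected instead of succeeding, and no acquisition is discarded before execution.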


      • OptimisticLockingException: Because tx2 unlocks the job, the job can be acquired again. Whenever tx1 and tx3 subsequently run concurrently, one of them will most likely fail with an OptimisticLockingException.
      • Lock cannot be acquired: Within tx2, unlocking j1 results in an UPDATE statement that acquires a database lock on (at least) the updated row. If tx2 is still active because the actual deletion of historical data takes longer than the job's lock time, the job executor tries to acquire j1 again. Since tx2 holds the database lock, this acquisition fails because the database lock required to update j1 cannot be obtained within a certain timeout. As long as tx2 holds the database lock, every acquisition of j1 fails. As a consequence, whenever the job executor tries to acquire several jobs and one of them is j1, the whole acquisition fails and none of those jobs are executed.
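The OptimisticLockingException case follows from revision-checked updates: a transaction may only write a row if its revision is still the one it read. The following sketch assumes this compare-and-increment behaviour and uses a hypothetical `StaleRevisionException` in place of Camunda's OptimisticLockingException; `RevisionCheckSketch` and `JobRow` are illustrative names. Whichever of tx1 and tx3 writes second loses.

```java
// Hypothetical sketch of a revision-checked update (the idea behind
// "UPDATE ... WHERE ID_ = ? AND REV_ = ?"); not Camunda's implementation.
class RevisionCheckSketch {

    // Stand-in for Camunda's OptimisticLockingException.
    static class StaleRevisionException extends RuntimeException {
        StaleRevisionException(String message) {
            super(message);
        }
    }

    static class JobRow {
        int revision = 1;
        String lockOwner;   // null means unlocked
        long dueDate;

        // Writes only if the caller saw the current revision, then bumps it.
        synchronized void update(int expectedRevision, String newLockOwner, long newDueDate) {
            if (revision != expectedRevision) {
                throw new StaleRevisionException(
                        "expected revision " + expectedRevision + " but found " + revision);
            }
            lockOwner = newLockOwner;
            dueDate = newDueDate;
            revision++;
        }
    }

    static boolean tryUpdate(JobRow row, int seenRevision, String lockOwner, long dueDate) {
        try {
            row.update(seenRevision, lockOwner, dueDate);
            return true;
        } catch (StaleRevisionException e) {
            return false;
        }
    }

    // tx1 and tx3 both read j1 after tx2 committed, then write one after the other.
    static boolean[] race() {
        JobRow j1 = new JobRow();
        int revSeenByTx1 = j1.revision;   // acquisition reads the unlocked job
        long dueSeenByTx1 = j1.dueDate;
        int revSeenByTx3 = j1.revision;   // scheduling reads the same revision
        boolean tx1Ok = tryUpdate(j1, revSeenByTx1, "executor-1", dueSeenByTx1); // set lock
        boolean tx3Ok = tryUpdate(j1, revSeenByTx3, null, 60_000L);              // new due date, no lock
        return new boolean[] {tx1Ok, tx3Ok};
    }

    public static void main(String[] args) {
        boolean[] result = race();
        System.out.println("tx1 succeeded=" + result[0] + ", tx3 succeeded=" + result[1]);
    }
}
```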

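The lock-wait failure can be modeled with one single-permit `Semaphore` per job row: tx2 holds j1's permit for the duration of the cleanup, and a batch acquisition that needs j1 times out and fails as a whole. `RowLockTimeoutSketch`, the job ids and the 50 ms timeout are assumptions for illustration; a real database enforces its own lock wait timeout, and the engine's acquisition logic is more involved than this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model: a single-permit Semaphore per job row stands in for the
// database row lock (a Semaphore is not tied to a thread, so one thread suffices).
class RowLockTimeoutSketch {

    // Batch acquisition: every row in the batch must be locked, otherwise the
    // whole acquisition fails and none of the jobs is acquired.
    static boolean acquireBatch(Map<String, Semaphore> rowLocks, List<String> jobIds, long timeoutMillis) {
        List<Semaphore> held = new ArrayList<>();
        try {
            for (String id : jobIds) {
                Semaphore rowLock = rowLocks.get(id);
                if (!rowLock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return false;   // lock wait timed out: the whole acquisition fails
                }
                held.add(rowLock);
            }
            // All row locks obtained: lock owner and lock time would be written here.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Commit or rollback releases every row lock this transaction held.
            for (Semaphore rowLock : held) {
                rowLock.release();
            }
        }
    }

    static boolean[] scenario() {
        Map<String, Semaphore> rowLocks = new HashMap<>();
        for (String id : List.of("j1", "j2", "j3")) {
            rowLocks.put(id, new Semaphore(1));
        }
        Semaphore j1Lock = rowLocks.get("j1");
        j1Lock.acquireUninterruptibly();   // tx2: UPDATE on j1 holds the row lock during cleanup
        boolean batchWithJ1 = acquireBatch(rowLocks, List.of("j2", "j1", "j3"), 50);
        boolean batchWithoutJ1 = acquireBatch(rowLocks, List.of("j2", "j3"), 50);
        j1Lock.release();                  // tx2 commits and releases the row lock
        boolean batchAfterCommit = acquireBatch(rowLocks, List.of("j2", "j1", "j3"), 50);
        return new boolean[] {batchWithJ1, batchWithoutJ1, batchAfterCommit};
    }

    public static void main(String[] args) {
        boolean[] r = scenario();
        System.out.println("with j1 during tx2=" + r[0]
                + ", without j1=" + r[1] + ", after tx2 commit=" + r[2]);
    }
}
```

In this model, a batch containing j1 fails while tx2 is open even though j2 and j3 are free, a batch without j1 succeeds, and after tx2 commits the full batch succeeds again.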
      [1]: https://github.com/camunda/camunda-bpm-platform/blob/22e3e464cebce957e74917dcaf2d89731a590c54/engine/src/main/java/org/camunda/bpm/engine/impl/jobexecutor/historycleanup/HistoryCleanupSchedulerCmd.java






              Thorben Lindhauer (thorben.lindhauer)
              Roman Smirnov (roman.smirnov)