  1. camunda BPM
  2. CAM-13883

When using a cluster of job executors with different priority ranges, exclusive execution is not guaranteed

    • Type: Bug Report
    • Resolution: Unresolved
    • Priority: L3 - Default
    • Component: engine

      Environment (Required on creation):

      • A cluster of two engines
      • Active job executors
      • Job executors of the nodes are acquiring jobs for different priority ranges (using feature CAM-13486)

      Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):

      • When the two job acquisitions run in parallel, a race condition allows each of them to successfully select a job (with a different priority) from the same process instance, even if the jobs are configured to be exclusive

      Steps to reproduce (Required on creation):

      • There are two jobs from the same process instance (e.g. a parallel gateway followed by asynchronous continuations):
        • Job A has priority 50
        • Job B has priority 100
      • Job acquisition 1 is configured for priority range 30-70
      • Job acquisition 2 is configured for priority range 71-120
      • Both job acquisition commands run in parallel, each selecting its respective job before the other one commits (i.e. the in-query check that ensures exclusiveness does not trigger)
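
      The interleaving described in these steps can be illustrated with a small, deterministic in-memory model. This is only a sketch under simplifying assumptions: the `Job` class, the `select` and `lock` methods, and the node names are hypothetical stand-ins and not the engine's actual classes or queries. The "table" is a plain list, and the exclusiveness check only sees locks that have already been written, which mimics a query that only sees committed data.

```java
import java.util.ArrayList;
import java.util.List;

class ExclusiveAcquisitionRace {

  /** Simplified job row; field names are illustrative, not the engine's schema. */
  static final class Job {
    final String id;
    final String processInstanceId;
    final long priority;
    String lockOwner; // null while the job is unlocked

    Job(String id, String processInstanceId, long priority) {
      this.id = id;
      this.processInstanceId = processInstanceId;
      this.priority = priority;
    }
  }

  /**
   * Select phase: returns unlocked jobs in the priority range, skipping any
   * process instance that already has a locked (i.e. committed) job.
   */
  static List<Job> select(List<Job> table, long min, long max) {
    List<Job> result = new ArrayList<>();
    for (Job candidate : table) {
      boolean inRange = candidate.priority >= min && candidate.priority <= max;
      boolean instanceBusy = false;
      for (Job other : table) {
        if (other.lockOwner != null
            && other.processInstanceId.equals(candidate.processInstanceId)) {
          instanceBusy = true;
        }
      }
      if (inRange && candidate.lockOwner == null && !instanceBusy) {
        result.add(candidate);
      }
    }
    return result;
  }

  /** Commit phase: locks each selected job; the check is per job only. */
  static int lock(List<Job> selected, String owner) {
    int locked = 0;
    for (Job job : selected) {
      if (job.lockOwner == null) {
        job.lockOwner = owner;
        locked++;
      }
    }
    return locked;
  }

  static List<Job> newTable() {
    List<Job> table = new ArrayList<>();
    table.add(new Job("A", "pi-1", 50));
    table.add(new Job("B", "pi-1", 100));
    return table;
  }

  /** Both acquisitions select before either one commits. */
  static int lockedWhenInterleaved() {
    List<Job> table = newTable();
    List<Job> selectedByNode1 = select(table, 30, 70);
    List<Job> selectedByNode2 = select(table, 71, 120);
    return lock(selectedByNode1, "node-1") + lock(selectedByNode2, "node-2");
  }

  /** Acquisition 1 commits before acquisition 2 selects. */
  static int lockedWhenSerialized() {
    List<Job> table = newTable();
    int first = lock(select(table, 30, 70), "node-1");
    int second = lock(select(table, 71, 120), "node-2");
    return first + second;
  }

  public static void main(String[] args) {
    System.out.println("interleaved: " + lockedWhenInterleaved());
    System.out.println("serialized:  " + lockedWhenSerialized());
  }
}
```

      In this model the interleaved run locks both jobs of the same process instance, while the serialized run locks only one, because the second selection then sees the first lock.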

      Observed Behavior (Required on creation):

      • Both job acquisition commands commit successfully (they can each lock their respective job)
      • They then submit the jobs for execution, which means that the exclusive principle is broken

      Expected behavior (Required on creation):

      • One of the job acquisition commands should not return a result until the other job has finished (or has been unlocked in some other way)

      Root Cause (Required on prioritization):

      • Exclusiveness is only checked as part of the acquisition query, which only detects locks from job acquisition attempts that have already been committed; if two job acquisition commands both run their query before either of them commits, the conflict is not detected
      • This situation is already possible without the feature from CAM-13486 (e.g. there could be 10 jobs from the same process instance and for some reason the job acquisition commands may receive different sets of jobs from the database); however, it seems that these cases don't occur in practice. The feature CAM-13486 is the first one that restricts job acquisition to certain jobs within the same process instance (other features like deployment-aware acquisition work on a higher level)

      Solution Ideas (Optional):

      • Not sure there is a solution that doesn't make acquisition more complex or less performant
      • One idea could be to synchronize on the process instance. We could consider creating a job acquisition revision attribute that is incremented whenever a job from a process instance is acquired and that raises a failure similar to an OptimisticLockingException (OLE) when this happens in parallel. The existing revision attribute could also be used, but that would increase the overall likelihood of OLEs. In addition, this fix could be applied only when priority ranges are used, since the problem doesn't seem to occur in practice in other cases
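
      The revision idea can be sketched as a compare-and-increment on a per-process-instance counter. This is a rough illustration only, not a proposal for the actual implementation: the class `AcquisitionRevisionSketch`, the exception `AcquisitionConflictException`, and both method names are hypothetical, and the in-memory map stands in for a database column that would be updated with a conditional statement along the lines of "increment the revision where the revision still equals the value read earlier".

```java
import java.util.HashMap;
import java.util.Map;

class AcquisitionRevisionSketch {

  /** Hypothetical stand-in for an optimistic-locking style failure. */
  static final class AcquisitionConflictException extends RuntimeException {
    AcquisitionConflictException(String message) {
      super(message);
    }
  }

  // Acquisition revision per process instance id (assumed new attribute).
  private final Map<String, Integer> revisions = new HashMap<>();

  /** Select phase: remember the revision seen for the process instance. */
  synchronized int readRevision(String processInstanceId) {
    return revisions.getOrDefault(processInstanceId, 0);
  }

  /**
   * Commit phase: succeeds only if nobody else acquired a job of the same
   * process instance since the revision was read; otherwise it fails.
   */
  synchronized void commitAcquisition(String processInstanceId, int expectedRevision) {
    int current = revisions.getOrDefault(processInstanceId, 0);
    if (current != expectedRevision) {
      throw new AcquisitionConflictException(
          "Concurrent acquisition for process instance " + processInstanceId);
    }
    revisions.put(processInstanceId, current + 1);
  }

  /** Replays the interleaving from the ticket: both read, then both commit. */
  static boolean secondAcquisitionRejected() {
    AcquisitionRevisionSketch store = new AcquisitionRevisionSketch();
    int seenByNode1 = store.readRevision("pi-1");
    int seenByNode2 = store.readRevision("pi-1");
    store.commitAcquisition("pi-1", seenByNode1); // first commit wins
    try {
      store.commitAcquisition("pi-1", seenByNode2); // stale revision
      return false;
    } catch (AcquisitionConflictException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("second acquisition rejected: " + secondAcquisitionRejected());
  }
}
```

      Under these assumptions the second acquisition fails instead of locking its job, so it would have to retry later. The sketch also shows the cost mentioned in the idea: every acquisition touches shared per-instance state, which adds contention.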

      Hints (Optional):

              Assignee: Unassigned
              Reporter: Thorben Lindhauer (thorben.lindhauer)