- Bug Report
- Resolution: Unresolved
- L3 - Default
Environment (Required on creation):
- A cluster of two engines
- Active job executors
- Job executors of the nodes are acquiring jobs for different priority ranges (using feature CAM-13486)
Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):
- When the two job acquisitions run in parallel, a race condition allows both of them to successfully select jobs (with different priorities) from the same process instance, even if the jobs are configured to be exclusive
Steps to reproduce (Required on creation):
- There are two jobs from the same process instance (e.g. parallel gateway with following async continuations):
- Job A has prio 50
- Job B has prio 100
- Job acquisition 1 is configured for priority range 30-70
- Job acquisition 2 is configured for priority range 71-120
- Both job acquisition commands run in parallel, each selecting its respective job before the other one commits (i.e. the in-query check to ensure exclusiveness does not trigger)
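
The interleaving in the steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the engine's acquisition code: the `Job` class, the `query` and `commit` helpers, and the process instance id `pi-1` are all hypothetical, and the "table" is a plain list. The exclusiveness check lives inside `query`, so it only sees locks that were committed before the query ran.

```java
import java.util.ArrayList;
import java.util.List;

public class ExclusiveAcquisitionRace {

    /** Simplified job row; the fields are illustrative, not the engine's schema. */
    static final class Job {
        final String id;
        final String processInstanceId;
        final int priority;
        boolean locked; // stands in for a committed lock on this job

        Job(String id, String processInstanceId, int priority) {
            this.id = id;
            this.processInstanceId = processInstanceId;
            this.priority = priority;
        }
    }

    /** Two exclusive jobs of the same process instance, as in the steps above. */
    static List<Job> newTable() {
        List<Job> table = new ArrayList<>();
        table.add(new Job("A", "pi-1", 50));
        table.add(new Job("B", "pi-1", 100));
        return table;
    }

    /**
     * Hypothetical acquisition query: unlocked jobs in the priority range,
     * skipping any job whose process instance already has a locked job.
     * The check only sees locks that are already committed.
     */
    static List<Job> query(List<Job> table, int minPriority, int maxPriority) {
        List<Job> result = new ArrayList<>();
        for (Job candidate : table) {
            boolean inRange = candidate.priority >= minPriority
                    && candidate.priority <= maxPriority;
            boolean siblingLocked = false;
            for (Job other : table) {
                if (other != candidate && other.locked
                        && other.processInstanceId.equals(candidate.processInstanceId)) {
                    siblingLocked = true;
                }
            }
            if (inRange && !candidate.locked && !siblingLocked) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Commit: each command locks only its own rows, so the two never conflict. */
    static void commit(List<Job> acquired) {
        for (Job job : acquired) {
            job.locked = true;
        }
    }

    static int countLocked(List<Job> table) {
        int count = 0;
        for (Job job : table) {
            if (job.locked) {
                count++;
            }
        }
        return count;
    }

    /** Both queries run before either commit: the in-query check never triggers. */
    static int interleaved() {
        List<Job> table = newTable();
        List<Job> first = query(table, 30, 70);
        List<Job> second = query(table, 71, 120);
        commit(first);
        commit(second);
        return countLocked(table);
    }

    /** The second query runs after the first commit: job B is filtered out. */
    static int serial() {
        List<Job> table = newTable();
        commit(query(table, 30, 70));
        commit(query(table, 71, 120));
        return countLocked(table);
    }

    public static void main(String[] args) {
        System.out.println("interleaved: " + interleaved() + " job(s) locked");
        System.out.println("serial: " + serial() + " job(s) locked");
    }
}
```

In this model the interleaved run locks both jobs of the same process instance, while the serial run locks only job A, which matches the difference between the observed and the expected behavior.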
Observed Behavior (Required on creation):
- Both job acquisition commands commit successfully (they can each lock their respective job)
- They then submit the jobs for execution, which means that the exclusive principle is broken
Expected behavior (Required on creation):
- One of the job acquisition commands should not return a result until the other job has finished (or is unlocked in some other way)
Root Cause (Required on prioritization):
- Exclusiveness is only checked as part of the acquisition query, which requires that any previous job acquisition attempts have already been committed; if the job acquisition commands run in parallel before they commit, this is not detected
- This situation is already possible without the feature from CAM-13486 (e.g. there could be 10 jobs from the same process instance and for some reason the job acquisition commands may receive different sets of jobs from the database); however, it seems like these cases don't occur in practice. The feature CAM-13486 is the first one that restricts job acquisition to certain jobs within the same process instance (other features like deployment-aware acquisition work on a higher level)
Solution Ideas (Optional):
- Not sure there is a solution that doesn't make acquisition more complex or less performant
- One idea could be to synchronize on the process instance. We could consider creating a job acquisition revision attribute that is incremented whenever a job from a process instance is acquired and that raises a failure like an OptimisticLockingException (OLE) when this happens in parallel. The existing revision attribute could also be used, but that would increase the overall likelihood of OLEs. In addition, this fix could be applied only when priority ranges are used (since in other cases the problem doesn't seem to occur in practice)
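
A minimal sketch of the revision idea, assuming a hypothetical per-process-instance acquisition revision that each acquisition reads during its query and checks again on commit. Nothing here is existing engine API: the class, the `readRevision` and `tryCommit` methods, and the in-memory map are illustrative stand-ins for a database column with a compare-and-increment update.

```java
import java.util.HashMap;
import java.util.Map;

public class AcquisitionRevisionSketch {

    /** Hypothetical acquisition revision per process instance id. */
    private final Map<String, Integer> acquisitionRevisions = new HashMap<>();

    /** Read phase: remember the revision seen while selecting jobs. */
    synchronized int readRevision(String processInstanceId) {
        return acquisitionRevisions.getOrDefault(processInstanceId, 0);
    }

    /**
     * Commit phase: succeeds only if nobody else acquired a job of this
     * process instance since the read. A false result is where the engine
     * would raise an optimistic-locking style failure.
     */
    synchronized boolean tryCommit(String processInstanceId, int expectedRevision) {
        int current = acquisitionRevisions.getOrDefault(processInstanceId, 0);
        if (current != expectedRevision) {
            return false;
        }
        acquisitionRevisions.put(processInstanceId, current + 1);
        return true;
    }

    /** Replays the parallel case: both acquisitions read before either commits. */
    static boolean[] parallelAttempt() {
        AcquisitionRevisionSketch store = new AcquisitionRevisionSketch();
        int seenByAcquisition1 = store.readRevision("pi-1");
        int seenByAcquisition2 = store.readRevision("pi-1");
        boolean first = store.tryCommit("pi-1", seenByAcquisition1);
        boolean second = store.tryCommit("pi-1", seenByAcquisition2);
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] outcome = parallelAttempt();
        System.out.println("acquisition 1 committed: " + outcome[0]);
        System.out.println("acquisition 2 committed: " + outcome[1]);
    }
}
```

In this sketch the first commit succeeds and the second is rejected, so only one job of the process instance is handed to an executor. The cost is the one noted above: every acquisition touches an extra shared value per process instance, which makes acquisition more complex and adds a new source of optimistic-locking failures.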
Hints (Optional):
- is related to CAM-13486 "I can assign the job executor to a job priority range" (Closed)