Job executors of the nodes are acquiring jobs for different priority ranges (using feature CAM-13486)
Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):
When the two job acquisitions run in parallel, a race condition allows each of them to successfully select a job (with a different priority) from the same process instance, even if the jobs are configured to be exclusive
Steps to reproduce (Required on creation):
There are two jobs from the same process instance (e.g. parallel gateway with following async continuations):
Job A has prio 50
Job B has prio 100
Job acquisition 1 is configured for priority range 30-70
Job acquisition 2 is configured for priority range 71-120
Both job acquisition commands run in parallel, each selecting its respective job before the other one commits (i.e. the in-query check to ensure exclusiveness does not trigger)
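The interleaving described in these steps can be sketched as a small in-memory simulation. This is not engine code: the `Job` class, the `select` and `lock` helpers and the list that stands in for the job table are all hypothetical and only model the two phases of an acquisition (query first, lock afterwards). The sketch replays the schedule without threads so that the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExclusiveAcquisitionRace {

    /** Hypothetical in-memory job row; not the engine's job entity. */
    public static final class Job {
        final String id;
        final String processInstanceId;
        final int priority;
        String lockOwner; // null while the job is unlocked

        Job(String id, String processInstanceId, int priority) {
            this.id = id;
            this.processInstanceId = processInstanceId;
            this.priority = priority;
        }
    }

    /** Query phase: unlocked jobs in the priority range whose process instance has no locked job. */
    public static List<Job> select(List<Job> table, int minPriority, int maxPriority) {
        List<Job> result = new ArrayList<>();
        for (Job candidate : table) {
            boolean inRange = candidate.priority >= minPriority && candidate.priority <= maxPriority;
            boolean siblingLocked = false;
            for (Job other : table) {
                if (other != candidate
                        && other.processInstanceId.equals(candidate.processInstanceId)
                        && other.lockOwner != null) {
                    siblingLocked = true;
                }
            }
            if (inRange && candidate.lockOwner == null && !siblingLocked) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Lock phase: only looks at the job row itself, not at sibling jobs. */
    public static boolean lock(Job job, String owner) {
        if (job.lockOwner != null) {
            return false;
        }
        job.lockOwner = owner;
        return true;
    }

    private static List<Job> newTable() {
        return Arrays.asList(new Job("A", "pi-1", 50), new Job("B", "pi-1", 100));
    }

    /** Both queries run before either acquisition locks its job. */
    public static int runInterleaved() {
        List<Job> table = newTable();
        List<Job> first = select(table, 30, 70);
        List<Job> second = select(table, 71, 120);
        int locked = 0;
        for (Job job : first) if (lock(job, "acquisition-1")) locked++;
        for (Job job : second) if (lock(job, "acquisition-2")) locked++;
        return locked;
    }

    /** Acquisition 1 finishes locking before acquisition 2 runs its query. */
    public static int runSequential() {
        List<Job> table = newTable();
        int locked = 0;
        for (Job job : select(table, 30, 70)) if (lock(job, "acquisition-1")) locked++;
        for (Job job : select(table, 71, 120)) if (lock(job, "acquisition-2")) locked++;
        return locked;
    }

    public static void main(String[] args) {
        System.out.println("interleaved: " + runInterleaved() + " job(s) locked"); // 2
        System.out.println("sequential: " + runSequential() + " job(s) locked");   // 1
    }
}
```

In the interleaved schedule both jobs end up locked, which matches the observed behavior; in the sequential schedule the in-query check excludes Job B because Job A is already locked.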
Observed Behavior (Required on creation):
Both job acquisition commands commit successfully (they can each lock their respective job)
They then submit the jobs for execution, which means that the exclusive principle is broken
Expected behavior (Required on creation):
One of the job acquisition commands should not return a result until the other job has finished (or is unlocked in some other way)
Root Cause (Required on prioritization):
Exclusiveness is only checked as part of the acquisition query, which requires that any previous job acquisition attempts have already been committed; if the job acquisition commands run in parallel before they commit, this is not detected
This situation is already possible without the feature from CAM-13486 (e.g. there could be 10 jobs from the same process instance and for some reason the job acquisition commands may receive different sets of jobs from the database); however, it seems like these cases don't occur in practice. The feature CAM-13486 is the first one that restricts job acquisition to certain jobs within the same process instance (other features like deployment-aware acquisition work on a higher level)
Solution Ideas (Optional):
Not sure there is a solution that doesn't make acquisition more complex or less performant
One idea could be to synchronize on the process instance. We could consider creating a job acquisition revision attribute that is incremented whenever a job from a process instance is acquired and that raises a failure like an OptimisticLockingException (OLE) when this happens in parallel. The existing revision attribute could also be used, but that would increase the overall likelihood of OLEs. In addition, this fix could be applied only when priority ranges are used, since in other cases the problem doesn't seem to occur in practice
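A minimal sketch of the acquisition revision idea, assuming one counter per process instance that every acquisition reads during its query and must bump at commit time. The class and method names are hypothetical, and the in-memory compare-and-set only stands in for what would presumably be a conditional database update on a revision column; a `false` return models the OLE-like failure.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class AcquisitionRevisionSketch {

    // Hypothetical store: one acquisition revision per process instance id.
    private final ConcurrentHashMap<String, AtomicInteger> revisions = new ConcurrentHashMap<>();

    private AtomicInteger revisionOf(String processInstanceId) {
        return revisions.computeIfAbsent(processInstanceId, id -> new AtomicInteger());
    }

    /** Query phase: remember the revision the acquisition saw. */
    public int readRevision(String processInstanceId) {
        return revisionOf(processInstanceId).get();
    }

    /**
     * Commit phase: succeeds only if nobody else acquired a job of this
     * process instance since the revision was read.
     */
    public boolean commitAcquisition(String processInstanceId, int expectedRevision) {
        return revisionOf(processInstanceId).compareAndSet(expectedRevision, expectedRevision + 1);
    }

    public static void main(String[] args) {
        AcquisitionRevisionSketch sketch = new AcquisitionRevisionSketch();

        // Both acquisitions query before either one commits.
        int seenByFirst = sketch.readRevision("pi-1");
        int seenBySecond = sketch.readRevision("pi-1");

        System.out.println(sketch.commitAcquisition("pi-1", seenByFirst));  // true
        System.out.println(sketch.commitAcquisition("pi-1", seenBySecond)); // false
    }
}
```

The revision check only detects acquisitions that overlap in time. The existing in-query check would still be needed to keep a later acquisition from picking up a sibling job while the first one remains locked.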
Hints (Optional):
is related to
CAM-13486 I can assign the job executor to a job priority range