A cluster of two engines
Active job executors
Job executors of the nodes are acquiring jobs for different priority ranges (using feature ~~CAM-13486~~)

When the two job acquisitions run in parallel, a race condition allows each of them to successfully select a job (with different priorities) from the same process instance, even if the jobs are configured to be exclusive

There are two jobs from the same process instance (e.g. parallel gateway with following async continuations):
- Job A has prio 50
- Job B has prio 100
Job acquisition 1 is configured for priority range 30-70
Job acquisition 2 is configured for priority range 71-120
Both job acquisition commands run in parallel, each selecting its respective job before the other one commits (i.e. the in-query check that ensures exclusiveness does not trigger)

Both job acquisition commands commit successfully (they can each lock their respective job)
They then submit the jobs for execution, which means that the exclusive principle is broken
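
The interleaving above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `Job` class, `select`, `commit` and the node names are hypothetical stand-ins, not the engine's entities or acquisition command, and the two phases are interleaved deterministically instead of using real threads and transactions.

```java
import java.util.ArrayList;
import java.util.List;

class ExclusiveAcquisitionRace {

    // Hypothetical, simplified stand-in for a job row (not the engine's job entity).
    static final class Job {
        final String id;
        final String processInstanceId;
        final int priority;
        String lockOwner; // null while the job is unlocked

        Job(String id, String processInstanceId, int priority) {
            this.id = id;
            this.processInstanceId = processInstanceId;
            this.priority = priority;
        }
    }

    // Select phase: unlocked jobs in the priority range whose process instance
    // has no other locked job. This mimics the in-query exclusiveness check,
    // which only sees locks that are already committed.
    static List<Job> select(List<Job> table, int minPrio, int maxPrio) {
        List<Job> result = new ArrayList<>();
        for (Job candidate : table) {
            if (candidate.lockOwner != null) continue;
            if (candidate.priority < minPrio || candidate.priority > maxPrio) continue;
            boolean siblingLocked = false;
            for (Job other : table) {
                if (other != candidate
                        && other.processInstanceId.equals(candidate.processInstanceId)
                        && other.lockOwner != null) {
                    siblingLocked = true;
                }
            }
            if (!siblingLocked) result.add(candidate);
        }
        return result;
    }

    // Commit phase: the locks become visible to other acquisitions.
    static void commit(List<Job> selected, String owner) {
        for (Job job : selected) job.lockOwner = owner;
    }

    static List<Job> newTable() {
        List<Job> table = new ArrayList<>();
        table.add(new Job("A", "pi-1", 50));
        table.add(new Job("B", "pi-1", 100));
        return table;
    }

    static int lockedCount(List<Job> table) {
        int count = 0;
        for (Job job : table) if (job.lockOwner != null) count++;
        return count;
    }

    // Both acquisitions select before either commits: both jobs get locked.
    static int runInterleaved() {
        List<Job> table = newTable();
        List<Job> first = select(table, 30, 70);
        List<Job> second = select(table, 71, 120);
        commit(first, "node-1");
        commit(second, "node-2");
        return lockedCount(table);
    }

    // The second acquisition selects after the first commit: only one job is locked.
    static int runSequential() {
        List<Job> table = newTable();
        commit(select(table, 30, 70), "node-1");
        commit(select(table, 71, 120), "node-2");
        return lockedCount(table);
    }

    public static void main(String[] args) {
        System.out.println("interleaved: " + runInterleaved() + " job(s) locked");
        System.out.println("sequential: " + runSequential() + " job(s) locked");
    }
}
```

In the interleaved run both jobs of `pi-1` end up locked, which is the broken case; in the sequential run the second acquisition sees the committed lock and selects nothing.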

One of the job acquisition commands should not return a result until the other job has finished (or is unlocked in some other way)

Exclusiveness is only checked as part of the acquisition query, which requires that any previous job acquisition attempts have already been committed; if the job acquisition commands run in parallel before they commit, the conflict is not detected
This situation is already possible without the feature from ~~CAM-13486~~ (e.g. there could be 10 jobs from the same process instance and, for some reason, the job acquisition commands may receive different sets of jobs from the database); however, it seems like these cases don't occur in practice. The feature ~~CAM-13486~~ is the first one that restricts job acquisition to certain jobs within the same process instance (other features like deployment-aware acquisition work on a higher level)

Not sure there is a solution that doesn't make acquisition more complex or less performant
One idea could be to synchronize on the process instance. We could consider creating a job acquisition revision attribute that is incremented whenever a job from a process instance is acquired and that raises a failure like an OptimisticLockingException (OLE) when this happens in parallel. The existing revision attribute could also be used, but that would increase the overall likelihood of OLEs. In addition, this fix could be applied only when priority ranges are used (since in other cases the problem doesn't seem to occur in practice)
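
A minimal sketch of the revision idea, assuming an in-memory map in place of a database column: each acquisition remembers the revision it saw when selecting and must bump it with a compare-and-set at commit time. All names here (`AcquisitionRevisionSketch`, `AcquisitionConflictException`, `readRevision`, `commitAcquisition`) are hypothetical and only illustrate the mechanism; a real fix would presumably use a conditional update on the process instance row instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class AcquisitionRevisionSketch {

    // Hypothetical failure type, standing in for an optimistic locking failure (OLE).
    static final class AcquisitionConflictException extends RuntimeException {
        AcquisitionConflictException(String message) {
            super(message);
        }
    }

    // Assumption: one acquisition revision per process instance id.
    private final Map<String, AtomicInteger> revisions = new ConcurrentHashMap<>();

    private AtomicInteger revisionOf(String processInstanceId) {
        return revisions.computeIfAbsent(processInstanceId, id -> new AtomicInteger());
    }

    // Select phase: remember the revision that was visible when the job was selected.
    int readRevision(String processInstanceId) {
        return revisionOf(processInstanceId).get();
    }

    // Commit phase: succeeds only if no other acquisition of the same process
    // instance has committed since the revision was read.
    void commitAcquisition(String processInstanceId, int expectedRevision) {
        boolean updated = revisionOf(processInstanceId)
                .compareAndSet(expectedRevision, expectedRevision + 1);
        if (!updated) {
            throw new AcquisitionConflictException(
                    "Concurrent acquisition for process instance " + processInstanceId);
        }
    }

    // Replays the scenario: both acquisitions select before either one commits.
    static boolean secondCommitRejected() {
        AcquisitionRevisionSketch store = new AcquisitionRevisionSketch();
        int seenByFirst = store.readRevision("pi-1");
        int seenBySecond = store.readRevision("pi-1");
        store.commitAcquisition("pi-1", seenByFirst);
        try {
            store.commitAcquisition("pi-1", seenBySecond);
            return false;
        } catch (AcquisitionConflictException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second commit rejected: " + secondCommitRejected());
    }
}
```

The sketch only catches acquisitions that overlap between select and commit; an acquisition that starts after the first commit reads the new revision and would still rely on the existing in-query check to skip the locked process instance.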

is related to

CAM-13486 I can assign the job executor to a job priority range