-
Task
-
Resolution: Unresolved
-
L3 - Default
Reasoning
The webhook mechanism established with CAM-12032 is used to get updates on bots' status.
This mechanism can miss bot status updates.
This can happen when update events are sent while the bridge is not running.
As a result, an external task is not updated even though the status of the bot it started has changed.
The external task is unlocked once its lock duration has expired, and another bot is started until, eventually, a bot marks the task as finished or failed.
With regard to the "executes the worker at least once" promise, this is acceptable.
To improve the "executes the worker only once" use case, a recovery mechanism in the bridge could help recover some of those missed updates.
Technical Outline
- persist the id of the started bot execution
  - ideally this would be done outside of the bridge, so that the bridge itself does not need a persistence layer with everything that comes with it (e.g. configuration, migration)
  - one possibility would be to store the id on the external task as a local variable
- on bridge startup, recreate the list of already known but unhandled bot executions and poll for their status
  - if the ids are persisted in the engine as local variables on the external tasks, the tasks still locked for the worker could be fetched together with their bot execution ids in order to recreate the list of bots the bridge handles
  - an initial poll of the bot vendor's API requesting updates for those bot executions could catch up on missed updates
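
The startup recovery outlined above could look roughly like the following sketch. It is only an illustration under assumptions, not the bridge's actual implementation: the variable name `botExecutionId`, the `LockedTask` type, the status values `SUCCEEDED` / `FAILED` / `RUNNING`, and the `fetch_status`, `complete_task` and `fail_task` callbacks are all hypothetical stand-ins for the engine and bot vendor APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# Hypothetical name of the local variable that stores the bot execution id.
BOT_EXECUTION_VARIABLE = "botExecutionId"


@dataclass
class LockedTask:
    """Stand-in for an external task still locked by the bridge's worker."""
    task_id: str
    local_variables: Dict[str, str] = field(default_factory=dict)


def recover_bot_executions(
    locked_tasks: Iterable[LockedTask],
    fetch_status: Callable[[str], str],
    complete_task: Callable[[str], None],
    fail_task: Callable[[str], None],
) -> Dict[str, str]:
    """Rebuild the bot execution list on startup and catch up on missed updates.

    Returns the executions that are still running, mapping bot execution id
    to external task id.
    """
    still_running: Dict[str, str] = {}
    for task in locked_tasks:
        execution_id = task.local_variables.get(BOT_EXECUTION_VARIABLE)
        if execution_id is None:
            continue  # no bot execution id stored, nothing to recover
        status = fetch_status(execution_id)  # one initial poll per execution
        if status == "SUCCEEDED":
            complete_task(task.task_id)
        elif status == "FAILED":
            fail_task(task.task_id)
        else:
            # Assumed to be still running: keep tracking it for later updates.
            still_running[execution_id] = task.task_id
    return still_running


# Usage with in-memory fakes instead of the engine and vendor APIs.
completed: List[str] = []
failed: List[str] = []
vendor_status = {"bot-1": "SUCCEEDED", "bot-2": "RUNNING", "bot-3": "FAILED"}
tasks = [
    LockedTask("task-a", {BOT_EXECUTION_VARIABLE: "bot-1"}),
    LockedTask("task-b", {BOT_EXECUTION_VARIABLE: "bot-2"}),
    LockedTask("task-c", {BOT_EXECUTION_VARIABLE: "bot-3"}),
    LockedTask("task-d"),  # locked, but no bot execution id was stored
]
pending = recover_bot_executions(
    tasks, vendor_status.__getitem__, completed.append, failed.append
)
```

Passing the engine and vendor calls in as callbacks keeps the sketch free of any persistence in the bridge itself: the only state it needs is what it reads back from the locked tasks.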