[OPT-7047] Optimize importer can get stuck permanently after temporary import blockage

Type: Bug Report
Resolution: Fixed
Priority: L3 - Default
Fix Version/s: 3.10.4, 3.11.0-alpha5, 3.11.0
Affects Version/s: None
Component/s: backend
Labels: None
Effort: Not defined

Brief summary of the bug. What is it? Where is it?

Since Optimize 3.10.0, it has been possible for Optimize to import based on the sequence of the Zeebe record. With this, we build a range query based on the sequence of the previously seen record and the size of the next batch.

However, we had a scenario where the Optimize importer was stuck on a batch of records and repeatedly attempted the same batch for five days. After this time, the unprocessable Zeebe records were deleted, but Optimize still could not move on to the next importable batch of records. The range query uses the last imported record sequence as the lower boundary and this sequence plus the batch size as the upper boundary. Because the newer, importable records had sequences higher than this upper boundary, the Optimize importer could never catch up without manual intervention.

In short, Optimize needs to handle both the case where there are no records to import in the given range and the case where there could still be records to process beyond the empty result set (in which case the import indexes should still be updated).
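
The stuck window described above can be illustrated with a small simulation. This is only a sketch, not Optimize's actual query code: the `Record` type, the `fetch_by_sequence` helper, the sample sequences, and the choice of an exclusive lower and inclusive upper boundary are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Zeebe record on a single partition."""
    sequence: int
    position: int


def fetch_by_sequence(records, last_sequence, batch_size):
    """Return records with last_sequence < sequence <= last_sequence + batch_size.

    The boundary inclusivity is an assumption; the issue only states that the
    lower boundary is the last imported sequence and the upper boundary is
    that sequence plus the batch size.
    """
    upper = last_sequence + batch_size
    window = [r for r in records if last_sequence < r.sequence <= upper]
    return sorted(window, key=lambda r: r.sequence)


# Sequences 1..3 were imported. The records that followed were deleted,
# so the next surviving record has sequence 10.
records = [Record(sequence=s, position=s) for s in (1, 2, 3, 10, 11)]
last_sequence = 3
batch_size = 2

# Every import round queries the same window (3, 5] and finds nothing,
# so last_sequence never advances and records 10 and 11 are never reached.
rounds = [fetch_by_sequence(records, last_sequence, batch_size) for _ in range(3)]
```

In this sketch every round returns an empty batch, which matches the reported behaviour: the importer keeps running but never imports the newer records.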

Steps to reproduce:

Set the batch size to 1
Block the import with a record that has an unimportable value. In the real scenario, this was a null value for the bpmnElementType field.
Observe that the blocked record is retried repeatedly without being imported
Delete the broken record
Observe the importer not catching up

Actual result:

Optimize remains stuck and doesn't import newer records

Expected result:

Optimize can always get the next records, no matter how much higher their sequence is than that of the previously imported record

Notes:

This should be fixed on master and also backported to the maintenance/3.10 branch
We might be able to fall back to the 'position' query if we can detect that the Optimize importer is stuck in this way
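
One way to picture the fallback idea from the note above is sketched here. It is a hedged illustration under stated assumptions rather than the fix that was shipped: `Record`, `next_batch`, `run_import`, and the sample data are hypothetical, and the sketch assumes that on a single partition both sequence and position increase from record to record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Zeebe record on a single partition."""
    sequence: int
    position: int


def next_batch(records, last_sequence, last_position, batch_size):
    """Try the sequence window first; fall back to a position query if it is empty."""
    upper = last_sequence + batch_size
    by_sequence = sorted(
        (r for r in records if last_sequence < r.sequence <= upper),
        key=lambda r: r.sequence,
    )
    if by_sequence:
        return "sequence", by_sequence
    # Assumed fallback: the next records after the last imported position,
    # regardless of how far ahead their sequence is.
    by_position = sorted(
        (r for r in records if r.position > last_position),
        key=lambda r: r.position,
    )
    return "position", by_position[:batch_size]


def run_import(records, batch_size):
    """Import until both queries return nothing; record which query was used."""
    last_sequence = 0
    last_position = 0
    imported, modes = [], []
    while True:
        mode, batch = next_batch(records, last_sequence, last_position, batch_size)
        if not batch:
            break
        imported.extend(batch)
        modes.append(mode)
        # Update both import indexes so either query can continue from here.
        last_sequence = batch[-1].sequence
        last_position = batch[-1].position
    return imported, modes


# A gap in sequences between 3 and 10 simulates the deleted broken records.
records = [
    Record(1, 100), Record(2, 101), Record(3, 102),
    Record(10, 250), Record(11, 251),
]
imported, modes = run_import(records, batch_size=2)
```

In this sketch the position query is only used when the sequence window is empty, so the normal import path is unchanged and an empty position result still means there is nothing new to import.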

Testing Notes:

Behaviour to verify before:

Do all of this in the context of a single partition

Deploy Zeebe process instance data and complete an instance
Modify the sequence of the last data point(s) so that it exceeds the sequence of the previous records by more than the page size. In other words, make it so that only part of the process can be imported and the sequence query misses the last data points. This can also be achieved by deleting data
Observe in the logs that the importer is working but not importing any data
Check in ES that the instance state is still running

Behaviour to verify with fix:

Do all of this in the context of a single partition

Deploy Zeebe process instance data and complete an instance
Modify the sequence of the last data point(s) so that it exceeds the sequence of the previous records by more than the page size. In other words, make it so that only part of the process can be imported and the sequence query misses the last data points. This can also be achieved by deleting data
Observe in the logs that the importer uses the position-based query to "catch up"
Check in ES that the instance state is completed



Assignee: Unassigned

Reporter: Joshua Windels

Votes: 0

Watchers: 2

Created: 05/Jun/23 6:54 PM

Updated: 18/Jan/24 3:44 PM

Resolved: 22/Aug/23 3:28 PM
