Camunda Optimize / OPT-6843

Unassigned Indices/Shards might cause data loss during import


Details

    • Type: Bug Report
    • Resolution: Fixed
    • Severity: L3 - Default
    • Fix Version: 3.10.0

    Description

      Whenever Elasticsearch restarts, it recovers the indices/shards and assigns them to an Elasticsearch node (if more than one node is available). The indices are recovered in a non-deterministic order, i.e. it cannot be expected that they are recovered in the order in which they were created, for example.
      In the meantime, the importer keeps polling Elasticsearch for the next batch of Zeebe events to import, performing a search request via the index alias. While the search request as a whole might succeed, the search on some of the backing indices might fail because they have not been recovered yet (i.e., the search on such an index fails with `all shards failed`). As a result, the importer might skip some Zeebe events, resulting in potential data loss.
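      To make the polling step concrete, below is a minimal sketch of such a batch search via the index alias, using the Elasticsearch high-level REST client. The alias name zeebe-record_variable, the position field, the batch size and the class/method names are illustrative assumptions, not the actual Optimize implementation.

      // Minimal sketch of the importer's polling step (names are illustrative).
      import java.io.IOException;

      import org.elasticsearch.action.search.SearchRequest;
      import org.elasticsearch.action.search.SearchResponse;
      import org.elasticsearch.client.RequestOptions;
      import org.elasticsearch.client.RestHighLevelClient;
      import org.elasticsearch.index.query.QueryBuilders;
      import org.elasticsearch.search.builder.SearchSourceBuilder;
      import org.elasticsearch.search.sort.SortOrder;

      public class ZeebeRecordFetcher {

        private final RestHighLevelClient client;

        public ZeebeRecordFetcher(final RestHighLevelClient client) {
          this.client = client;
        }

        // Fetches the next batch of Zeebe records after the given position. The
        // request goes through the index alias, so it fans out to all backing
        // indices (e.g. the dated zeebe-record_variable_* indices).
        public SearchResponse fetchNextBatch(final long lastImportedPosition) throws IOException {
          final SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.rangeQuery("position").gt(lastImportedPosition))
            .sort("position", SortOrder.ASC)
            .size(200); // illustrative batch size
          return client.search(
            new SearchRequest("zeebe-record_variable").source(source),
            RequestOptions.DEFAULT);
        }
      }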

      Steps to reproduce:

      1. Two indices exist
        1. zeebe-record_variable_8.2.0_2023-03-30: contains events from position 100 to 199
        2. zeebe-record_variable_8.2.0_2023-03-31: contains events from position 200 to 300
      2. The importer is still importing Zeebe events from zeebe-record_variable_8.2.0_2023-03-30 (because the importer lags behind or there was a rollover at midnight). The last imported event is at position 150.
      3. Elasticsearch restarts and recovers first the index zeebe-record_variable_8.2.0_2023-03-31 but not yet the index zeebe-record_variable_8.2.0_2023-03-30.
      4. The importer submits a search request via an index alias to get the next batch of Zeebe events.
      5. Elasticsearch executes a search request on both indices
        1. zeebe-record_variable_8.2.0_2023-03-30: Search fails because the index isn't recovered yet.
        2. zeebe-record_variable_8.2.0_2023-03-31: Search succeeds and a batch of events is returned starting at position 200.
      6. Elasticsearch responds with the Zeebe events from zeebe-record_variable_8.2.0_2023-03-31, starting at position 200.
      7. The importer acknowledges the new batch of Zeebe events and continues to import Zeebe events from zeebe-record_variable_8.2.0_2023-03-31.

      Actual Result:

      1. The importer never imports the events from position 151 to 199 (everything after the last imported event up to the end of the not-yet-recovered index).

      In such a case, a sample Elasticsearch response looks like this:

      {
        "took" : 1,
        "timed_out" : false,
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "skipped" : 0,
          "failed" : 1,
          "failures" : [
            {
              "shard" : 0,
              "index" : "zeebe-record_variable_8.2.0_2023-03-30",
              "node" : null,
              "reason" : {
                "type" : "no_shard_available_action_exception",
                "reason" : null,
                "index_uuid" : "BZVEuXemQn6O1Otwd_PN-w",
                "shard" : "0",
                "index" : "zeebe-record_variable_8.2.0_2023-03-30"
              }
            }
          ]
        },
        "hits" : {
          "total" : {
            "value" : 2,
            "relation" : "eq"
          },
          "max_score" : 1.0,
          "hits" : [
            {
              "_index" : "zeebe-record_variable_8.2.0_2023-03-31",
              "_type" : "_doc",
              "_id" : "1",
              "_score" : 1.0,
              "_source" : {
                "field1" : "index1_value1"
              }
            },
            {
              "_index" : "zeebe-record_variable_8.2.0_2023-03-31",
              "_type" : "_doc",
              "_id" : "2",
              "_score" : 1.0,
              "_source" : {
                "field1" : "index1_value2"
              }
            }
          ]
        }
      }
      

      Expected result:

      • The importer checks the Elasticsearch response for shard failures, e.g., by inspecting the `failed` count in the `_shards` section of the response or by checking whether `total > failed + successful`.
      • If there is at least one failure, or the total shard count is greater than `failed + successful`, the importer needs to retry the search at the same position (see the sketch after this list).
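      A minimal sketch of such a check, assuming the Elasticsearch high-level REST client is used; the class and method names are illustrative, not the actual Optimize code:

      // Minimal sketch of the shard-failure check suggested above.
      import org.elasticsearch.action.search.SearchResponse;

      public class ShardFailureCheck {

        // Returns true only if every shard behind the alias was searched
        // successfully, i.e. the returned batch can safely be acknowledged.
        public static boolean allShardsSearched(final SearchResponse response) {
          final int total = response.getTotalShards();
          final int successful = response.getSuccessfulShards();
          final int failed = response.getFailedShards();
          final boolean hasFailures = failed > 0 || response.getShardFailures().length > 0;
          // Unallocated shards count neither as successful nor as failed, so
          // successful + failed < total also signals an incomplete result.
          final boolean allShardsAccountedFor = successful + failed == total;
          return !hasFailures && allShardsAccountedFor;
        }
      }

      If allShardsSearched returns false, the importer would keep its current position and repeat the search instead of acknowledging the batch.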

      Hint

      • Note on failed+successful: Elasticsearch's documentation says the following:

        Note that shards that are not allocated will be considered neither successful nor failed. Having failed+successful less than total is thus an indication that some of the shards were not allocated.
