Camunda Optimize / OPT-6843

Unassigned Indices/Shards might cause data loss during import


Details

    • Type: Bug Report
    • Resolution: Fixed
    • Severity: L3 - Default
    • Fix Version: 3.10.0

    Description

      Whenever Elasticsearch restarts, it recovers the indices/shards and assigns them to an Elasticsearch node (if more than one node is available). The indices are recovered in a non-deterministic order, i.e. it cannot be expected that they are recovered in the order in which they were created, for example.
      In the meantime, the importer keeps polling Elasticsearch for the next batch of Zeebe events to import, performing a search request via the index alias. While the search request as a whole might succeed, the search on some of the backing indices might fail because they have not been recovered yet (i.e., the search on such an index fails with `all shards failed`). As a result, the importer might skip some Zeebe events, resulting in potential data loss.
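      To make the polling step concrete, below is a minimal sketch of such a batch search via the index alias, using the Elasticsearch high-level REST client. The alias name zeebe-record_variable, the position field, the batch size and the class/method names are illustrative assumptions, not the actual Optimize implementation.

      // Minimal sketch of the importer's polling step (names are illustrative).
      import java.io.IOException;

      import org.elasticsearch.action.search.SearchRequest;
      import org.elasticsearch.action.search.SearchResponse;
      import org.elasticsearch.client.RequestOptions;
      import org.elasticsearch.client.RestHighLevelClient;
      import org.elasticsearch.index.query.QueryBuilders;
      import org.elasticsearch.search.builder.SearchSourceBuilder;
      import org.elasticsearch.search.sort.SortOrder;

      public class ZeebeRecordFetcher {

        private final RestHighLevelClient client;

        public ZeebeRecordFetcher(final RestHighLevelClient client) {
          this.client = client;
        }

        // Fetches the next batch of Zeebe records after the given position. The
        // request goes through the index alias, so it fans out to all backing
        // indices (e.g. the dated zeebe-record_variable_* indices).
        public SearchResponse fetchNextBatch(final long lastImportedPosition) throws IOException {
          final SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.rangeQuery("position").gt(lastImportedPosition))
            .sort("position", SortOrder.ASC)
            .size(200); // illustrative batch size
          return client.search(
            new SearchRequest("zeebe-record_variable").source(source),
            RequestOptions.DEFAULT);
        }
      }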

      Steps to reproduce:

      1. Two indices exist
        1. zeebe-record_variable_8.2.0_2023-03-30: contains events from position 100 to 199
        2. zeebe-record_variable_8.2.0_2023-03-31: contains events from position 200 to 300
      2. The importer is still importing Zeebe events from zeebe-record_variable_8.2.0_2023-03-30 (because the importer lags behind or there was a rollover at midnight). The last imported event is at position 150.
      3. Elasticsearch restarts and recovers first the index zeebe-record_variable_8.2.0_2023-03-31 but not yet the index zeebe-record_variable_8.2.0_2023-03-30.
      4. The importer submits a search request via an index alias to get the next batch of Zeebe events.
      5. Elasticsearch executes a search request on both indices
        1. zeebe-record_variable_8.2.0_2023-03-30: Search fails because the index isn't recovered yet.
        2. zeebe-record_variable_8.2.0_2023-03-31: Search succeeds and a batch of events is returned starting at position 200.
      6. Elasticsearch responds with the Zeebe events from zeebe-record_variable_8.2.0_2023-03-31, starting at position 200.
      7. The importer acknowledges the new batch of Zeebe events and continues to import Zeebe events from zeebe-record_variable_8.2.0_2023-03-31.

      Actual Result:

      1. The importer never imports the events from position 151 to 199 (everything after the last imported event up to the end of the not-yet-recovered index).

      In such a case, a sample Elasticsearch response looks like this:

      {
        "took" : 1,
        "timed_out" : false,
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "skipped" : 0,
          "failed" : 1,
          "failures" : [
            {
              "shard" : 0,
              "index" : "zeebe-record_variable_8.2.0_2023-03-30",
              "node" : null,
              "reason" : {
                "type" : "no_shard_available_action_exception",
                "reason" : null,
                "index_uuid" : "BZVEuXemQn6O1Otwd_PN-w",
                "shard" : "0",
                "index" : "zeebe-record_variable_8.2.0_2023-03-30"
              }
            }
          ]
        },
        "hits" : {
          "total" : {
            "value" : 2,
            "relation" : "eq"
          },
          "max_score" : 1.0,
          "hits" : [
            {
              "_index" : "zeebe-record_variable_8.2.0_2023-03-31",
              "_type" : "_doc",
              "_id" : "1",
              "_score" : 1.0,
              "_source" : {
                "field1" : "index1_value1"
              }
            },
            {
              "_index" : "zeebe-record_variable_8.2.0_2023-03-31",
              "_type" : "_doc",
              "_id" : "2",
              "_score" : 1.0,
              "_source" : {
                "field1" : "index1_value2"
              }
            }
          ]
        }
      }
      

      Expected result:

      • The importer checks the Elasticsearch response for shard failures, e.g., by inspecting the `failed` count in the `_shards` section of the response or by checking whether `total > failed + successful`.
      • If there is at least one failure, or the total shard count is greater than `failed + successful`, the importer needs to retry the search at the same position (see the sketch after this list).
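      A minimal sketch of such a check, assuming the Elasticsearch high-level REST client is used; the class and method names are illustrative, not the actual Optimize code:

      // Minimal sketch of the shard-failure check suggested above.
      import org.elasticsearch.action.search.SearchResponse;

      public class ShardFailureCheck {

        // Returns true only if every shard behind the alias was searched
        // successfully, i.e. the returned batch can safely be acknowledged.
        public static boolean allShardsSearched(final SearchResponse response) {
          final int total = response.getTotalShards();
          final int successful = response.getSuccessfulShards();
          final int failed = response.getFailedShards();
          final boolean hasFailures = failed > 0 || response.getShardFailures().length > 0;
          // Unallocated shards count neither as successful nor as failed, so
          // successful + failed < total also signals an incomplete result.
          final boolean allShardsAccountedFor = successful + failed == total;
          return !hasFailures && allShardsAccountedFor;
        }
      }

      If allShardsSearched returns false, the importer would keep its current position and repeat the search instead of acknowledging the batch.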

      Hint

      • Note on failed+successful: Elasticsearch's documentation says the following:

        Note that shards that are not allocated will be considered neither successful nor failed. Having failed+successful less than total is thus an indication that some of the shards were not allocated.
