Camunda Optimize / OPT-6591

Optimize is blocking Threads in Elasticsearch for C8 SaaS


Details

    • Type: Bug Report
    • Resolution: Unresolved
    • L3 - Default
    • Component: backend

    Description


      Optimize generates too many requests to Elasticsearch, which blocks other applications from performing their requests.

      Steps to reproduce:

      Actual result:

      Starting from Support Case https://jira.camunda.com/browse/SUPPORT-15115, the logs show the following:

      • Operate and Tasklist failed to import (and archive) data due to timeout exceptions. This state lasted for at least 6 hours.
      • Elasticsearch threw many exceptions during those 6 hours, mostly of two types:
        • The response could not be sent because the connection was already closed.
        • Many search failures because the search scroll ID had become invalid in the meantime.
      • According to the Optimize logs, Optimize looks okay, as no import or timeout exceptions are logged. So far, so good.
      • But according to the Elasticsearch logs, indexing requests to the Optimize indices were throttled for most of that 6-hour window. When an index is throttled in Elasticsearch, only one write thread is allowed to write to that index at a time; any other writer thread cannot acquire the lock and is blocked while the first thread holds it. In other words, it seems that Optimize blocked all available writer threads in Elasticsearch (in that setup, there are only two writer threads available).
      • Why were the Optimize indices being throttled? There may be several reasons:
        • Due to the nested documents in Optimize (and depending on the usage of multi-instance), the documents in Optimize can get quite big.
        • Optimize imports a maximum of 10_000 Zeebe events at once, which results in one "big" bulk request to Elasticsearch (see the sketch after this list).
        • The write buffer in Elasticsearch filled up, and Elasticsearch came under memory pressure. The write buffer filled up because Optimize tried to index many "big" documents within a bulk request; therefore, Elasticsearch tried to flush the write buffers to disk to free up the write buffer's memory.
        • But Elasticsearch could not flush to disk at the same pace as Optimize was indexing data, meaning that even after flushing, the write buffer was immediately full again.
        • At that point, Elasticsearch started to throttle the Optimize indices.
      • According to Grafana, the refresh queue contained >250 pending refreshes (i.e., Elasticsearch could not catch up with refreshing/flushing the data).
      • Why did the throttling impact the other applications?
        • As mentioned, with the given resource settings, there are just 2 writer threads available in Elasticsearch.
        • Optimize sends (many) index requests to the same index in Elasticsearch in parallel.
        • Due to throttling, Elasticsearch needs to acquire a lock, and only one thread at a time wins it; the others are blocked during that time.
        • According to the logs, the throttling phases lasted from a few seconds up to ~40s.
        • This slowed down the entire system.
        • According to Grafana, the indexing queue contained up to >400 bulk requests at peak due to the throttling.
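
      To make the failure mode concrete, the following is a minimal, simplified sketch of the pattern described above (illustrative only; the class and method names are assumptions, not the actual Optimize importer code): every fetched page of up to 10_000 Zeebe records becomes one large bulk request that is sent asynchronously, with nothing limiting how many such requests are in flight at once.

      import java.util.List;

      import org.elasticsearch.action.ActionListener;
      import org.elasticsearch.action.bulk.BulkRequest;
      import org.elasticsearch.action.bulk.BulkResponse;
      import org.elasticsearch.action.index.IndexRequest;
      import org.elasticsearch.client.RequestOptions;
      import org.elasticsearch.client.RestHighLevelClient;
      import org.elasticsearch.common.xcontent.XContentType;

      // Simplified illustration of the problematic pattern: one big bulk request per
      // imported page, submitted asynchronously without any cap on in-flight requests.
      public class UnboundedBulkImport {

        private final RestHighLevelClient client;

        public UnboundedBulkImport(RestHighLevelClient client) {
          this.client = client;
        }

        public void importPage(String index, List<String> recordsAsJson) {
          BulkRequest bulk = new BulkRequest();
          for (String json : recordsAsJson) {
            // Potentially large nested documents, all targeting the same index.
            bulk.add(new IndexRequest(index).source(json, XContentType.JSON));
          }
          // Fire-and-forget: nothing stops the importer from submitting the next page
          // immediately, so bulk requests pile up in Elasticsearch's write queue.
          client.bulkAsync(bulk, RequestOptions.DEFAULT, new ActionListener<BulkResponse>() {
            @Override public void onResponse(BulkResponse response) { /* continue with next page */ }
            @Override public void onFailure(Exception e) { /* log / retry */ }
          });
        }
      }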

      Expected result:

      The importer needs to be adjusted to avoid sending a burst of requests. For example, in Operate, a dedicated thread pool exists to limit the number of index requests sent to Elasticsearch.
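
      A minimal sketch of such a limit, assuming a Semaphore-based back-pressure approach (the class name, permit count, and wiring are assumptions for illustration, not the actual Operate or Optimize implementation):

      import java.util.concurrent.Semaphore;

      import org.elasticsearch.action.ActionListener;
      import org.elasticsearch.action.bulk.BulkRequest;
      import org.elasticsearch.action.bulk.BulkResponse;
      import org.elasticsearch.client.RequestOptions;
      import org.elasticsearch.client.RestHighLevelClient;

      // Sketch: cap the number of in-flight bulk requests so the importer cannot
      // flood the (few) Elasticsearch write threads.
      public class ThrottledBulkIndexer {

        private final RestHighLevelClient client;
        private final Semaphore inFlight;

        public ThrottledBulkIndexer(RestHighLevelClient client, int maxInFlightBulks) {
          this.client = client;
          this.inFlight = new Semaphore(maxInFlightBulks);
        }

        public void submit(BulkRequest bulk) throws InterruptedException {
          // Blocks once maxInFlightBulks requests are outstanding, instead of
          // letting requests queue up inside Elasticsearch.
          inFlight.acquire();
          client.bulkAsync(bulk, RequestOptions.DEFAULT, new ActionListener<BulkResponse>() {
            @Override public void onResponse(BulkResponse response) { inFlight.release(); }
            @Override public void onFailure(Exception e) { inFlight.release(); /* log / retry */ }
          });
        }
      }

      Whether the limit is implemented with a semaphore, a fixed-size thread pool (as in Operate), or a bounded number of concurrent bulk requests in the client, the key point is that the importer applies back-pressure itself instead of leaving it to Elasticsearch's throttling.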


            People

              Assignee: Unassigned
              Reporter: Giuliano Rodrigues Lima
