  1. Camunda Optimize
  2. OPT-6591

Optimize is blocking Threads in Elasticsearch for C8 SaaS



    • Bug Report
    • Status: Open
    • L3 - Default
    • Resolution: Unresolved
    • backend


      Brief summary of the bug. What is it? Where is it?

      Optimize sends too many indexing requests to Elasticsearch, which blocks other applications (Operate, Tasklist) from performing their own requests.

      Steps to reproduce:

      Actual result:

      In the logs from support case https://jira.camunda.com/browse/SUPPORT-15115, I can see the following:

      • Operate and Tasklist failed to import (and archive) data, with timeout exceptions. This state lasted for at least 6 hours.
      • Elasticsearch threw many exceptions during those 6 hours, mostly of two types:
          • The response could not be sent because the connection was already closed
          • Many search failures because the search scroll ID had become invalid in the meantime
      • According to the Optimize logs, Optimize looks okay: no import or timeout exceptions are logged. So far, so good.
      • But according to the Elasticsearch logs, indexing requests to the Optimize indices were throttled for most of those 6 hours. When an index is throttled in Elasticsearch, only one writer thread at a time is allowed to write to that index. Any other writer thread targeting the same index cannot acquire the lock and is blocked while the first thread holds it. In other words, it seems that Optimize blocked all available writer threads in Elasticsearch (in that setup, only two writer threads are available).
      • Why were the Optimize indices being throttled? There may be several reasons:
          • Due to the nested documents in Optimize (and depending on the usage of multi-instance), the documents in Optimize might get quite big.
          • Optimize tries to import a maximum of 10,000 Zeebe events at once. This results in a "big" bulk request to Elasticsearch.
          • The write buffer in Elasticsearch filled up, and Elasticsearch came under memory pressure. The write buffer filled up because Optimize tried to index many "big" documents within a bulk request. Therefore, Elasticsearch tried to flush the write buffer to disk to free up its memory.
          • But Elasticsearch could not flush to disk at the same pace as Optimize was indexing data, meaning that even after a flush, the write buffer was immediately full again.
          • At that point, Elasticsearch started to throttle the Optimize indices.
          • According to Grafana, the refresh queue contained >250 pending refreshes (i.e., Elasticsearch could not keep up with refreshing/flushing the data).
      • Why did the throttling impact the other applications?
          • As mentioned, with the given resource settings, there are just 2 writer threads available in Elasticsearch.
          • Optimize sends (many) index requests to the same index in Elasticsearch in parallel.
          • Due to throttling, each write has to acquire a lock, and only one thread succeeds. The other is blocked during that time.
          • According to the logs, each throttling period lasted from a few seconds to ~40s.
          • This slowed down the entire system.
          • According to Grafana, the indexing queue contained more than 400 bulk requests at peak time due to the throttling.
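
      To make the blocking effect described above easier to picture, here is a toy model in Java. It is not Elasticsearch code and does not reproduce its internals: the class name, the method name, and the single `ReentrantLock` standing in for the throttled index are all assumptions made for illustration. It models a small fixed pool of writer threads, a burst of "Optimize" writes that all need the same lock, and one "Operate" write queued behind them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only (hypothetical names, NOT Elasticsearch internals).
class ThrottledWritePoolModel {

    // Returns how many "Optimize" writes had already finished when the
    // single "Operate" write finally got a writer thread.
    static int optimizeWritesFinishedBeforeOperateRuns(
            int writerThreads, int optimizeWrites, long writeMillis) {
        ExecutorService writePool = Executors.newFixedThreadPool(writerThreads);
        ReentrantLock throttledIndexLock = new ReentrantLock(); // stands in for the throttled index
        AtomicInteger finishedOptimizeWrites = new AtomicInteger();
        try {
            List<Future<?>> optimizeFutures = new ArrayList<>();
            for (int i = 0; i < optimizeWrites; i++) {
                Runnable optimizeWrite = () -> {
                    // Only one writer thread at a time may write to the throttled index.
                    throttledIndexLock.lock();
                    try {
                        Thread.sleep(writeMillis); // simulated slow write
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        throttledIndexLock.unlock();
                    }
                    finishedOptimizeWrites.incrementAndGet();
                };
                optimizeFutures.add(writePool.submit(optimizeWrite));
            }
            // The "Operate" write needs no lock, but it still needs a free writer thread.
            Callable<Integer> operateWrite = finishedOptimizeWrites::get;
            int seenByOperate = writePool.submit(operateWrite).get();
            for (Future<?> f : optimizeFutures) {
                f.get(); // wait for the remaining Optimize writes
            }
            return seenByOperate;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            writePool.shutdown();
        }
    }
}
```

      With 2 writer threads and 4 queued Optimize writes, the Operate write in this model is only picked up after at least 3 of the Optimize writes have finished, even though it never touches the throttled index itself.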

      Expected result:

      The Optimize importer needs to be adjusted so that it does not send bursts of requests. For example, Operate uses a dedicated thread pool to limit the number of index requests sent to Elasticsearch.
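
      A minimal sketch of that idea, assuming a fixed-size thread pool is an acceptable way to cap the number of in-flight bulk requests. This is neither Operate's nor Optimize's actual code: `BoundedBulkSender`, `sendAll`, and the `sendBulk` callback (a stand-in for the real Elasticsearch bulk call) are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: caps how many bulk requests are in flight at once.
class BoundedBulkSender<T> implements AutoCloseable {
    private final ExecutorService pool;
    private final Consumer<List<T>> sendBulk; // stand-in for the real bulk call

    BoundedBulkSender(int maxConcurrentRequests, Consumer<List<T>> sendBulk) {
        // The pool size is the upper bound on concurrently running bulk requests.
        this.pool = Executors.newFixedThreadPool(maxConcurrentRequests);
        this.sendBulk = sendBulk;
    }

    // Queues every batch; at most maxConcurrentRequests are sent concurrently.
    // Blocks until all batches have been sent and rethrows the first failure.
    void sendAll(List<List<T>> batches) {
        List<Future<?>> futures = new ArrayList<>();
        for (List<T> batch : batches) {
            Runnable request = () -> sendBulk.accept(batch);
            futures.add(pool.submit(request));
        }
        try {
            for (Future<?> f : futures) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

      The importer would hand all its batches to `sendAll` instead of firing them in parallel. The pool size would need to be chosen relative to the number of writer threads available in Elasticsearch (two in the setup described above).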






              Assignee: Unassigned
              Reporter: Giuliano Rodrigues Lima