Uploaded image for project: 'Camunda Optimize'
  1. Camunda Optimize
  2. OPT-6223

Optimize's importer app crash loops when the pod restarts

    XMLWordPrintable

Details

    • Bug Report
    • Resolution: Fixed
    • L3 - Default
    • 3.10.0-alpha3, 3.10.0
    • None
    • None
    • None
    • 3
    • Not defined

    Description

      Brief summary of the bug. What is it ? Where is it ?

      In a long-running benchmark, the deployment of Optimize's importer starts to crash loop sometimes for a few minutes and sometimes for many hours. So that, the importer starts to lag behind.

      The first time the crash loop has been observed was when there were 2.4 million (mostly completed) process instances in the system. There, the pod was crash looping for ~10h, and then suddenly the situation got resolved on its own magically. Typically, the following is logged when the pod is crash looping:

      2022-05-30 11:48:03.014 CEST wait-for-it.sh: waiting 60 seconds for elasticsearch:9200
      2022-05-30 11:48:03.019 CEST wait-for-it.sh: elasticsearch:9200 is available after 0 seconds
      2022-05-30 11:48:03.024 CEST {}
      2022-05-30 11:48:03.024 CEST Starting Camunda Optimize 3.8.1...
      2022-05-30 11:48:03.024 CEST {}
      2022-05-30 11:48:10.339 CEST [main] INFO o.c.o.s.e.OptimizeElasticsearchClientFactory - Initializing Elasticsearch rest client...
      2022-05-30 11:48:11.734 CEST [main] INFO o.c.o.s.e.OptimizeElasticsearchClientFactory - Elasticsearch client has successfully been started
      2022-05-30 11:48:11.745 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-metadata?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
      2022-05-30 11:48:12.039 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-alert,optimize-business-key,optimize-collection,optimize-dashboard,optimize-dashboard-share,optimize-decision-definition,optimize-license,optimize-metadata,optimize-onboarding-state,optimize-process-definition?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
      2022-05-30 11:48:12.046 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-report-share,optimize-settings,optimize-tenant,optimize-terminated-user-session,optimize-variable-update-instance,optimize-event,optimize-event-process-definition,optimize-event-process-mapping,optimize-event-process-publish-state,optimize-import-index?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
      2022-05-30 11:48:12.050 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-timestamp-based-import-index,optimize-position-based-import-index,optimize-combined-report,optimize-single-decision-report,optimize-single-process-report,optimize-external-process-variable,optimize-variable-label,optimize-process-goals?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
      2022-05-30 11:48:12.050 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Updating Optimize schema...
      2022-05-30 11:48:12.971 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Creating or updating template with name optimize-variable-update-instance_v2
      2022-05-30 11:48:13.047 CEST [main] WARN org.elasticsearch.client.RestClient - request [PUT http://elasticsearch:9200/_template/optimize-variable-update-instance_v2?master_timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "Legacy index templates are deprecated in favor of composable templates."]
      2022-05-30 11:48:13.060 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Creating or updating template with name optimize-event_v4
      2022-05-30 11:48:13.149 CEST [main] WARN org.elasticsearch.client.RestClient - request [PUT http://elasticsearch:9200/_template/optimize-event_v4?master_timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "Legacy index templates are deprecated in favor of composable templates."]
      2022-05-30 11:48:13.651 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Creating or updating template with name optimize-external-process-variable_v2
      2022-05-30 11:48:13.661 CEST [main] WARN org.elasticsearch.client.RestClient - request [PUT http://elasticsearch:9200/_template/optimize-external-process-variable_v2?master_timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "Legacy index templates are deprecated in favor of composable templates."]
      2022-05-30 11:48:14.161 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Finished updating Optimize schema.
      2022-05-30 11:48:14.994 CEST [main] INFO o.c.o.s.i.CCSaaSUserIdentityCache - Initializing Cloud user identity sync.
      2022-05-30 11:48:15.037 CEST [main] INFO o.c.o.s.i.CCSaaSUserIdentityCache - Scheduling Cloud user identity sync
      2022-05-30 11:48:15.049 CEST [CCSaaSUserIdentityCache-1] INFO o.c.o.rest.cloud.CCSaaSUserClient - Fetching Cloud users.
      2022-05-30 11:48:15.051 CEST [main] INFO o.c.o.s.telemetry.TelemetryScheduler - Initializing TelemetryScheduler
      2022-05-30 11:48:15.056 CEST [CCSaaSUserIdentityCache-1] INFO o.c.o.rest.cloud.CCSaaSUserClient - Fetching access token for Cloud user retrieval.
      2022-05-30 11:48:15.057 CEST [main] INFO o.c.o.s.telemetry.TelemetryScheduler - Starting telemetry scheduling
      2022-05-30 11:48:15.058 CEST [TelemetryScheduler-1] INFO o.c.o.s.telemetry.TelemetryScheduler - Checking whether telemetry data can be sent.
      2022-05-30 11:48:15.063 CEST [TelemetryScheduler-1] INFO o.c.o.s.telemetry.TelemetryScheduler - Telemetry disabled.
      2022-05-30 11:48:15.132 CEST [CCSaaSUserIdentityCache-1] ERROR o.c.o.s.i.CCSaaSUserIdentityCache - Could not sync Cloud user identities. Will stop retrying as next scheduled sync is approaching
      2022-05-30 11:48:15.132 CEST org.camunda.optimize.service.exceptions.OptimizeRuntimeException: There was a problem fetching access token. at org.camunda.optimize.rest.cloud.CCSaaSUserClient.refreshAccessTokenIfRequired(CCSaaSUserClient.java:111) at org.camunda.optimize.rest.cloud.CCSaaSUserClient.performRequest(CCSaaSUserClient.java:87) at org.camunda.optimize.rest.cloud.CCSaaSUserClient.fetchAllCloudUsers(CCSaaSUserClient.java:71) at org.camunda.optimize.service.identity.CCSaaSUserIdentityCache.populateCache(CCSaaSUserIdentityCache.java:56) at org.camunda.optimize.service.identity.AbstractIdentityCache.synchronizeIdentities(AbstractIdentityCache.java:148) at org.camunda.optimize.service.identity.AbstractIdentityCache.syncIdentitiesWithRetry(AbstractIdentityCache.java:114) at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.http.client.ClientProtocolException: null at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at org.camunda.optimize.rest.cloud.CCSaaSUserClient.refreshAccessTokenIfRequired(CCSaaSUserClient.java:98) ... 12 common frames omitted Caused by: org.apache.http.ProtocolException: Target host is not specified at org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(DefaultRoutePlanner.java:71) at org.apache.http.impl.client.InternalHttpClient.determineRoute(InternalHttpClient.java:125) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ... 15 common frames omitted
      2022-05-30 11:48:15.361 CEST [main] INFO o.c.o.s.i.PlatformUserTaskIdentityCache - Initializing platform assignee/candidateGroup identity sync.
      2022-05-30 11:48:15.363 CEST [main] INFO o.c.o.s.i.PlatformUserTaskIdentityCache - Scheduling platform assignee/candidateGroup identity sync
      2022-05-30 11:48:15.366 CEST [PlatformUserTaskIdentityCache-1] INFO o.c.o.s.i.PlatformUserTaskIdentityCache - platform assignee/candidateGroup identity sync complete
      2022-05-30 11:48:15.734 CEST [main] INFO o.c.o.s.i.i.m.ExternalVariableUpdateEngineImportMediator - Engine 'optimize' could not be found in the configuration
      2022-05-30 11:48:18.937 CEST [TerminatedSessionService-1] INFO o.c.o.s.e.w.ElasticsearchWriterUtil - Progress of task (ID:jUpMpfOySbOnMitPVJ4THA:274890) on terminated user sessions with timestamp older than 2022-05-30T08:48:18.731Z: 0% (total: 0, updated: 0, created: 0, deleted: 0). Completed is false
      2022-05-30 11:48:19.243 CEST [main] INFO o.c.o.s.cleanup.CleanupScheduler - Initializing OptimizeCleanupScheduler
      2022-05-30 11:48:19.253 CEST [main] INFO o.c.o.s.cleanup.CleanupScheduler - Stopping cleanup scheduling
      2022-05-30 11:48:23.855 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-process-instance-customer_onboarding_en?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
      2022-05-30 11:52:07.000 CEST Back-off restarting failed container
      

      Please note: The exception org.camunda.optimize.service.exceptions.OptimizeRuntimeException: There was a problem fetching access token. in the logs also happens when the pod got scheduled without any failures. So, to my understanding, this does not cause the crash loop.

      It impacts the system in the sense that the data in Optimize is not up-to-date and lagging behind depending on how long the crash loop is ongoing. Luckily, Optimize is able to catch up quickly, for example, with 10h lagging behind, Optimize needed 30 minutes to be up-to-date.

      Steps to reproduce:

      1. Start a long-running test and observe the pod scheduling

      Actual result:

      • The importer gets into a crash loop for some time

      Expected result:

      • There is no crash loop anymore.

      mgm-controller-panel

        This is the controller panel for Smart Panels app

        Attachments

          Issue Links

            Activity

              People

                Unassigned Unassigned
                roman.smirnov Roman Smirnov
                Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                  Created:
                  Updated:
                  Resolved:

                  Salesforce