[OPT-6223] Optimize's importer app crash loops when the pod restarts

Type: Bug Report
Resolution: Fixed
Priority: L3 - Default
Fix Version/s: 3.10.0-alpha3, 3.10.0
Affects Version/s: None
Component/s: None
Labels:
None

PM Priority:
3
Effort:
Not defined

Brief summary of the bug. What is it ? Where is it ?

In a long-running benchmark, the deployment of Optimize's importer starts to crash loop sometimes for a few minutes and sometimes for many hours. So that, the importer starts to lag behind.

The first time the crash loop has been observed was when there were 2.4 million (mostly completed) process instances in the system. There, the pod was crash looping for ~10h, and then suddenly the situation got resolved on its own magically. Typically, the following is logged when the pod is crash looping:

2022-05-30 11:48:03.014 CEST wait-for-it.sh: waiting 60 seconds for elasticsearch:9200
2022-05-30 11:48:03.019 CEST wait-for-it.sh: elasticsearch:9200 is available after 0 seconds
2022-05-30 11:48:03.024 CEST {}
2022-05-30 11:48:03.024 CEST Starting Camunda Optimize 3.8.1...
2022-05-30 11:48:03.024 CEST {}
2022-05-30 11:48:10.339 CEST [main] INFO o.c.o.s.e.OptimizeElasticsearchClientFactory - Initializing Elasticsearch rest client...
2022-05-30 11:48:11.734 CEST [main] INFO o.c.o.s.e.OptimizeElasticsearchClientFactory - Elasticsearch client has successfully been started
2022-05-30 11:48:11.745 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-metadata?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
2022-05-30 11:48:12.039 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-alert,optimize-business-key,optimize-collection,optimize-dashboard,optimize-dashboard-share,optimize-decision-definition,optimize-license,optimize-metadata,optimize-onboarding-state,optimize-process-definition?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
2022-05-30 11:48:12.046 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-report-share,optimize-settings,optimize-tenant,optimize-terminated-user-session,optimize-variable-update-instance,optimize-event,optimize-event-process-definition,optimize-event-process-mapping,optimize-event-process-publish-state,optimize-import-index?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
2022-05-30 11:48:12.050 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-timestamp-based-import-index,optimize-position-based-import-index,optimize-combined-report,optimize-single-decision-report,optimize-single-process-report,optimize-external-process-variable,optimize-variable-label,optimize-process-goals?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
2022-05-30 11:48:12.050 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Updating Optimize schema...
2022-05-30 11:48:12.971 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Creating or updating template with name optimize-variable-update-instance_v2
2022-05-30 11:48:13.047 CEST [main] WARN org.elasticsearch.client.RestClient - request [PUT http://elasticsearch:9200/_template/optimize-variable-update-instance_v2?master_timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "Legacy index templates are deprecated in favor of composable templates."]
2022-05-30 11:48:13.060 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Creating or updating template with name optimize-event_v4
2022-05-30 11:48:13.149 CEST [main] WARN org.elasticsearch.client.RestClient - request [PUT http://elasticsearch:9200/_template/optimize-event_v4?master_timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "Legacy index templates are deprecated in favor of composable templates."]
2022-05-30 11:48:13.651 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Creating or updating template with name optimize-external-process-variable_v2
2022-05-30 11:48:13.661 CEST [main] WARN org.elasticsearch.client.RestClient - request [PUT http://elasticsearch:9200/_template/optimize-external-process-variable_v2?master_timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "Legacy index templates are deprecated in favor of composable templates."]
2022-05-30 11:48:14.161 CEST [main] INFO o.c.o.s.e.s.ElasticSearchSchemaManager - Finished updating Optimize schema.
2022-05-30 11:48:14.994 CEST [main] INFO o.c.o.s.i.CCSaaSUserIdentityCache - Initializing Cloud user identity sync.
2022-05-30 11:48:15.037 CEST [main] INFO o.c.o.s.i.CCSaaSUserIdentityCache - Scheduling Cloud user identity sync
2022-05-30 11:48:15.049 CEST [CCSaaSUserIdentityCache-1] INFO o.c.o.rest.cloud.CCSaaSUserClient - Fetching Cloud users.
2022-05-30 11:48:15.051 CEST [main] INFO o.c.o.s.telemetry.TelemetryScheduler - Initializing TelemetryScheduler
2022-05-30 11:48:15.056 CEST [CCSaaSUserIdentityCache-1] INFO o.c.o.rest.cloud.CCSaaSUserClient - Fetching access token for Cloud user retrieval.
2022-05-30 11:48:15.057 CEST [main] INFO o.c.o.s.telemetry.TelemetryScheduler - Starting telemetry scheduling
2022-05-30 11:48:15.058 CEST [TelemetryScheduler-1] INFO o.c.o.s.telemetry.TelemetryScheduler - Checking whether telemetry data can be sent.
2022-05-30 11:48:15.063 CEST [TelemetryScheduler-1] INFO o.c.o.s.telemetry.TelemetryScheduler - Telemetry disabled.
2022-05-30 11:48:15.132 CEST [CCSaaSUserIdentityCache-1] ERROR o.c.o.s.i.CCSaaSUserIdentityCache - Could not sync Cloud user identities. Will stop retrying as next scheduled sync is approaching
2022-05-30 11:48:15.132 CEST org.camunda.optimize.service.exceptions.OptimizeRuntimeException: There was a problem fetching access token. at org.camunda.optimize.rest.cloud.CCSaaSUserClient.refreshAccessTokenIfRequired(CCSaaSUserClient.java:111) at org.camunda.optimize.rest.cloud.CCSaaSUserClient.performRequest(CCSaaSUserClient.java:87) at org.camunda.optimize.rest.cloud.CCSaaSUserClient.fetchAllCloudUsers(CCSaaSUserClient.java:71) at org.camunda.optimize.service.identity.CCSaaSUserIdentityCache.populateCache(CCSaaSUserIdentityCache.java:56) at org.camunda.optimize.service.identity.AbstractIdentityCache.synchronizeIdentities(AbstractIdentityCache.java:148) at org.camunda.optimize.service.identity.AbstractIdentityCache.syncIdentitiesWithRetry(AbstractIdentityCache.java:114) at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.http.client.ClientProtocolException: null at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at org.camunda.optimize.rest.cloud.CCSaaSUserClient.refreshAccessTokenIfRequired(CCSaaSUserClient.java:98) ... 12 common frames omitted Caused by: org.apache.http.ProtocolException: Target host is not specified at org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(DefaultRoutePlanner.java:71) at org.apache.http.impl.client.InternalHttpClient.determineRoute(InternalHttpClient.java:125) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ... 15 common frames omitted
2022-05-30 11:48:15.361 CEST [main] INFO o.c.o.s.i.PlatformUserTaskIdentityCache - Initializing platform assignee/candidateGroup identity sync.
2022-05-30 11:48:15.363 CEST [main] INFO o.c.o.s.i.PlatformUserTaskIdentityCache - Scheduling platform assignee/candidateGroup identity sync
2022-05-30 11:48:15.366 CEST [PlatformUserTaskIdentityCache-1] INFO o.c.o.s.i.PlatformUserTaskIdentityCache - platform assignee/candidateGroup identity sync complete
2022-05-30 11:48:15.734 CEST [main] INFO o.c.o.s.i.i.m.ExternalVariableUpdateEngineImportMediator - Engine 'optimize' could not be found in the configuration
2022-05-30 11:48:18.937 CEST [TerminatedSessionService-1] INFO o.c.o.s.e.w.ElasticsearchWriterUtil - Progress of task (ID:jUpMpfOySbOnMitPVJ4THA:274890) on terminated user sessions with timestamp older than 2022-05-30T08:48:18.731Z: 0% (total: 0, updated: 0, created: 0, deleted: 0). Completed is false
2022-05-30 11:48:19.243 CEST [main] INFO o.c.o.s.cleanup.CleanupScheduler - Initializing OptimizeCleanupScheduler
2022-05-30 11:48:19.253 CEST [main] INFO o.c.o.s.cleanup.CleanupScheduler - Stopping cleanup scheduling
2022-05-30 11:48:23.855 CEST [main] WARN org.elasticsearch.client.RestClient - request [HEAD http://elasticsearch:9200/optimize-process-instance-customer_onboarding_en?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=false] returned 1 warnings: [299 Elasticsearch-7.17.2-de7261de50d90919ae53b0eff9413fd7e5307301 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
2022-05-30 11:52:07.000 CEST Back-off restarting failed container

Please note: The exception org.camunda.optimize.service.exceptions.OptimizeRuntimeException: There was a problem fetching access token. in the logs also happens when the pod got scheduled without any failures. So, to my understanding, this does not cause the crash loop.

It impacts the system in the sense that the data in Optimize is not up-to-date and lagging behind depending on how long the crash loop is ongoing. Luckily, Optimize is able to catch up quickly, for example, with 10h lagging behind, Optimize needed 30 minutes to be up-to-date.

Steps to reproduce:

Start a long-running test and observe the pod scheduling

Actual result:

The importer gets into a crash loop for some time

Expected result:

There is no crash loop anymore.

This is the controller panel for Smart Panels app

Roman Smirnov added a comment - 18/Oct/22 3:20 PM - edited

joshua.windels, I finally understand what causes the crash looping.

Customer Onboarding Process: I see that importing the onboarding process is configured with the SaaS offering. But importing the related data may take too long. As it is done during the bootstrapping phase of the application, the pod may crash because of a timeout (as long as the importing didn't finish). After restarting the pod, it will again try to import the related data, and so on. It comes in a loop that Optimize floods ELS with bulk requests with each restart; only when the pod rescheduling backoff increases over time will ELS succeed in finishing all the submitted bulk requests. Other applications cannot export/import data during that time as all ELS writer threads are busy with Optimize's bulk requests.

I see the following options to mitigate the situation:

Disable to import onboarding process in SaaS
Don't import the onboarding process during the bootstrapping phase. Instead, do it asynchronously to avoid blocking the application bootstrapping phase.
Submit smaller bulk requests to avoid "long-running" actions in ELS.

However, once disabling the import of the onboarding process, the Optimize importer starts in most cases. Only sometimes, the app fails to start when any indexing request is awaiting a refresh before completion. When the refresh queue is full, it may take too long to start executing the refresh so that the request times out. (But this happened to me only a few times so far).

Roman Smirnov added a comment - 18/Oct/22 3:20 PM - edited joshua.windels , I finally understand what causes the crash looping. Customer Onboarding Process: I see that importing the onboarding process is configured with the SaaS offering. But importing the related data may take too long. As it is done during the bootstrapping phase of the application, the pod may crash because of a timeout (as long as the importing didn't finish). After restarting the pod, it will again try to import the related data, and so on. It comes in a loop that Optimize floods ELS with bulk requests with each restart; only when the pod rescheduling backoff increases over time will ELS succeed in finishing all the submitted bulk requests. Other applications cannot export/import data during that time as all ELS writer threads are busy with Optimize's bulk requests. I see the following options to mitigate the situation: Disable to import onboarding process in SaaS Don't import the onboarding process during the bootstrapping phase. Instead, do it asynchronously to avoid blocking the application bootstrapping phase. Submit smaller bulk requests to avoid "long-running" actions in ELS. However, once disabling the import of the onboarding process, the Optimize importer starts in most cases. Only sometimes, the app fails to start when any indexing request is awaiting a refresh before completion. When the refresh queue is full, it may take too long to start executing the refresh so that the request times out. (But this happened to me only a few times so far).

Joshua Windels added a comment - 12/Dec/22 6:18 PM

Customer onboarding data has been disabled for non-internal and non-trial clusters, marking this as fixed

Joshua Windels added a comment - 12/Dec/22 6:18 PM Customer onboarding data has been disabled for non-internal and non-trial clusters, marking this as fixed

Details

Description

Brief summary of the bug. What is it ? Where is it ?

Steps to reproduce:

Actual result:

Expected result:

mgm-controller-panel

This is the controller panel for Smart Panels app

Attachments

Activity

Collapse comment: Roman Smirnov added a comment - 18/Oct/22 3:20 PM, Edited by Roman Smirnov - 18/Oct/22 3:23 PM

Expand comment: Roman Smirnov added a comment - 18/Oct/22 3:20 PM, Edited by Roman Smirnov - 18/Oct/22 3:23 PM

Collapse comment: Joshua Windels added a comment - 12/Dec/22 6:18 PM

Expand comment: Joshua Windels added a comment - 12/Dec/22 6:18 PM

People

Dates

Salesforce