What/Where is the issue?
Issue was discovered during investigation of this incident: https://app.incident.io/camunda/incidents/307
The failure sequence is: Optimize starts; the @PostConstruct method of ManagementDashboardService is called and sends many requests; Elasticsearch is overwhelmed and the requests get queued; the main thread is blocked; the Optimize liveness probe does not answer; Kubernetes kills and restarts the pod; everything starts over, leaving the pod crash looping.
Analysis of a thread dump of the Optimize importer showed that the main thread is blocked by the ManagementDashboardService:
"main" #1 prio=5 os_prio=0 cpu=9652.67ms elapsed=167.25s tid=0x00007f41d1ee1800 nid=0x1c waiting on condition [0x00007f41d207a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(java.base@11.0.18/Native Method)
    at org.camunda.optimize.service.es.writer.ElasticsearchWriterUtil.waitUntilTaskIsFinished(ElasticsearchWriterUtil.java:403)
    at org.camunda.optimize.service.es.writer.ElasticsearchWriterUtil.tryDeleteByQueryRequest(ElasticsearchWriterUtil.java:234)
    at org.camunda.optimize.service.es.writer.DashboardWriter.deleteManagementDashboard(DashboardWriter.java:202)
    at org.camunda.optimize.service.dashboard.ManagementDashboardService.init(ManagementDashboardService.java:70)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.18/Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.18/NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:389)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:333)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:157)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:440)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1796)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:620)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:542)
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335)
    at org.springframework.beans.factory.support.AbstractBeanFactory$$Lambda$311/0x00000001003ac440.getObject(Unknown Source)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
    - locked <0x00000000eb972690> (a java.util.concurrent.ConcurrentHashMap)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:333)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:208)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:955)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:920)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:583)
    - locked <0x00000000eb945570> (a java.lang.Object)
    at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:147)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:731)
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:408)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:307)
    at org.camunda.optimize.Main.main(Main.java:30)
Code analysis shows that the initialization of the management dashboard is annotated with @PostConstruct and fires several requests to Elasticsearch. Because this blocks the main thread, the liveness probe does not respond and Kubernetes kills the pod before it can finish starting.
Solution Proposal
1) Make it configurable whether an instance should be the one creating the management dashboard. This way we can relieve the importer of that task and leave it to the webapp. The default should be "true", i.e. by default instances create the management dashboard.
2) Check whether it is really necessary to delete and re-create all management dashboards at every start-up. A mechanism akin to how the instant preview dashboards are created could be used instead. For now we will leave this as is; we assume the reason for the re-creation is that it is easier than migrating when management entities change. We may evaluate this further in a follow-up.
3) Make sure the management dashboard creation/deletion runs in a thread other than main. Currently it runs via the @PostConstruct annotation. Instead, it should be executed similarly to what is described in OPT-6771.
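As a rough illustration of proposals 1) and 3) together, the following plain-Java sketch gates the dashboard re-creation behind a flag and runs it on a dedicated background thread so the caller (the main thread during start-up) returns immediately. The class name, the thread name, and the Runnable standing in for the delete/create calls are hypothetical and not taken from the Optimize code base; the actual fix would presumably hook into Spring's lifecycle instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for the management dashboard start-up logic; not Optimize's actual class. */
final class ManagementDashboardStartupSketch {

  // Proposal 1: whether this instance should create the management dashboard at all.
  private final boolean createOnStartup;
  // Placeholder for the delete + re-create requests sent to Elasticsearch.
  private final Runnable recreateDashboards;

  ManagementDashboardStartupSketch(boolean createOnStartup, Runnable recreateDashboards) {
    this.createOnStartup = createOnStartup;
    this.recreateDashboards = recreateDashboards;
  }

  /** Returns immediately; if enabled, the work runs on a dedicated daemon thread (proposal 3). */
  CompletableFuture<Void> start() {
    if (!createOnStartup) {
      // Disabled by configuration: nothing is sent to Elasticsearch.
      return CompletableFuture.completedFuture(null);
    }
    ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
      Thread thread = new Thread(runnable, "management-dashboard-init");
      thread.setDaemon(true); // must not keep the JVM alive on shutdown
      return thread;
    });
    CompletableFuture<Void> result = CompletableFuture.runAsync(recreateDashboards, executor);
    // Release the worker thread once the task has finished, successfully or not.
    result.whenComplete((ignored, error) -> executor.shutdown());
    return result;
  }

  public static void main(String[] args) {
    ManagementDashboardStartupSketch sketch = new ManagementDashboardStartupSketch(
        true,
        () -> System.out.println("re-creating dashboards on " + Thread.currentThread().getName()));
    CompletableFuture<Void> pending = sketch.start();
    System.out.println("start-up continues on " + Thread.currentThread().getName());
    pending.join(); // only so the demo does not exit before the daemon thread has run
  }
}
```

With this shape a slow Elasticsearch only delays the dashboard re-creation, not the rest of start-up, so the liveness probe can be served in the meantime.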
Testing Notes
Only solution part 1) can be tested easily:
1) set up a clean ES
2) set the config managementEntities.createOnStartup (env var CAMUNDA_OPTIMIZE_ENTITY_CREATE_ON_STARTUP) to false
3) start Optimize and confirm no management entities were created
and:
1) set up a clean ES
2) set the config managementEntities.createOnStartup (env var CAMUNDA_OPTIMIZE_ENTITY_CREATE_ON_STARTUP) to true
3) start Optimize and confirm management entities were created
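The two scenarios above rely on the flag defaulting to true when it is not set. A minimal plain-Java sketch of that resolution rule is shown below; it is not Optimize's actual configuration loading, and the class and method names are hypothetical. Only the environment variable name comes from this ticket.

```java
import java.util.Map;

/** Hypothetical helper illustrating the default-true rule; not part of Optimize. */
final class CreateOnStartupFlagSketch {

  static final String ENV_VAR = "CAMUNDA_OPTIMIZE_ENTITY_CREATE_ON_STARTUP";

  /** Resolves the flag from the given environment, defaulting to true when unset or blank. */
  static boolean resolve(Map<String, String> environment) {
    String raw = environment.get(ENV_VAR);
    if (raw == null || raw.trim().isEmpty()) {
      return true; // default: instances create the management entities
    }
    // Boolean.parseBoolean yields true only for "true" (case-insensitive).
    return Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    System.out.println("createOnStartup = " + resolve(System.getenv()));
  }
}
```

Note that in this sketch any value other than "true" (for example a typo) disables creation; a stricter variant could reject unknown values instead.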
- is related to: OPT-6771 Replace PostConstruct with ApplicationReadyEvent where appropriate (Open)