Details
Description
Environment (Required on creation):
- Camunda 7.12+
- all distros
- Aurora Postgres version 11.9
Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):
In certain circumstances, a query is needed for retrieving the last failure log id for an incident. This query performs poorly with high cardinalities in the ACT_HI_JOB_LOG table. This can lead to the failed job listener execution running into query transaction timeouts on the last run that is supposed to create an incident. As a result, the failed job listener execution fails and does not decrease retries for the job any further. The job might then run endlessly and never run into an incident.
The stacktrace for the failed job listener execution:
2021-10-07T04:13:04,872+0000-[ERROR]-[cid=,tid=,clu=]-[TransactionTemplate]-[171 ]-Application exception overridden by rollback exceptionorg.camunda.bpm.engine.ProcessEngineException: An exception occurred in the persistence layer. Please check the server logs for a detailed message and the entire exception stack trace. at org.camunda.bpm.engine.impl.util.ExceptionUtil.wrapPersistenceException(ExceptionUtil.java:263) at org.camunda.bpm.engine.impl.util.ExceptionUtil.doWithExceptionWrapper(ExceptionUtil.java:257) at org.camunda.bpm.engine.impl.db.sql.DbSqlSession.executeSelectList(DbSqlSession.java:111) at org.camunda.bpm.engine.impl.db.sql.DbSqlSession.selectList(DbSqlSession.java:103) at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.selectListWithRawParameter(DbEntityManager.java:182) at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.selectList(DbEntityManager.java:174) at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.selectList(DbEntityManager.java:166) at org.camunda.bpm.engine.impl.persistence.entity.HistoricJobLogManager.findHistoricJobLogsByQueryCriteria(HistoricJobLogManager.java:61) at org.camunda.bpm.engine.impl.HistoricJobLogQueryImpl.executeList(HistoricJobLogQueryImpl.java:302) at org.camunda.bpm.engine.impl.AbstractQuery.evaluateExpressionsAndExecuteList(AbstractQuery.java:219) at org.camunda.bpm.engine.impl.AbstractQuery.execute(AbstractQuery.java:193) at org.camunda.bpm.engine.impl.interceptor.CommandExecutorImpl.execute(CommandExecutorImpl.java:28) at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:110) at org.camunda.bpm.engine.spring.SpringTransactionInterceptor$1.doInTransaction(SpringTransactionInterceptor.java:72) at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:70) at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70) at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35) at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33) at org.camunda.bpm.engine.impl.AbstractQuery.executeResult(AbstractQuery.java:160) at org.camunda.bpm.engine.impl.AbstractQuery.list(AbstractQuery.java:142) at org.camunda.bpm.engine.impl.persistence.entity.JobEntity.getLastFailureLogId(JobEntity.java:674) at org.camunda.bpm.engine.impl.persistence.entity.JobEntity.createFailedJobIncident(JobEntity.java:348) at org.camunda.bpm.engine.impl.persistence.entity.JobEntity.setRetries(JobEntity.java:296) at org.camunda.bpm.engine.impl.cmd.JobRetryCmd.decrementRetries(JobRetryCmd.java:60) at org.camunda.bpm.engine.impl.cmd.DefaultJobRetryCmd.executeStandardStrategy(DefaultJobRetryCmd.java:87) at org.camunda.bpm.engine.impl.cmd.DefaultJobRetryCmd.execute(DefaultJobRetryCmd.java:75) at org.camunda.bpm.engine.impl.jobexecutor.FailedJobListener$FailedJobListenerCmd.execute(FailedJobListener.java:111) at org.camunda.bpm.engine.impl.jobexecutor.FailedJobListener$FailedJobListenerCmd.execute(FailedJobListener.java:92) at org.camunda.bpm.engine.impl.interceptor.CommandExecutorImpl.execute(CommandExecutorImpl.java:28) at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:110) at org.camunda.bpm.engine.spring.SpringTransactionInterceptor$1.doInTransaction(SpringTransactionInterceptor.java:72) at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:70) at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70) at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35) at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33) at org.camunda.bpm.engine.impl.jobexecutor.FailedJobListener.execute(FailedJobListener.java:54) at org.camunda.bpm.engine.impl.jobexecutor.FailedJobListener.execute(FailedJobListener.java:32) at org.camunda.bpm.engine.impl.interceptor.CommandExecutorImpl.execute(CommandExecutorImpl.java:28) at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:110) at org.camunda.bpm.engine.spring.SpringTransactionInterceptor$1.doInTransaction(SpringTransactionInterceptor.java:72) at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:70) at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70) at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35) at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33) at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobHelper.callFailedJobListenerWithRetries(ExecuteJobHelper.java:112) at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobHelper.invokeJobListener(ExecuteJobHelper.java:93) at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobHelper.executeJob(ExecuteJobHelper.java:74) at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable.executeJob(ExecuteJobsRunnable.java:110) at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable.run(ExecuteJobsRunnable.java:71) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.ibatis.exceptions.PersistenceException: ### Error querying database. Cause: java.sql.SQLException: Connection is closed ### The error may exist in org/camunda/bpm/engine/impl/mapping/entity/HistoricJobLog.xml ### The error may involve org.camunda.bpm.engine.impl.persistence.entity.HistoricJobLogEntity.selectHistoricJobLogByQueryCriteria ### The error occurred while executing a query ### SQL: select distinct RES.* from ACT_HI_JOB_LOG RES WHERE RES.JOB_ID_ = ? and RES.JOB_STATE_ = ? order by RES.SEQUENCE_COUNTER_ desc LIMIT ? OFFSET ? ### Cause: java.sql.SQLException: Connection is closed at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:149) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:140) at org.camunda.bpm.engine.impl.db.sql.DbSqlSession.lambda$executeSelectList$1(DbSqlSession.java:111) at org.camunda.bpm.engine.impl.util.ExceptionUtil.doWithExceptionWrapper(ExceptionUtil.java:254) ... 53 common frames omitted Caused by: java.sql.SQLException: Connection is closed at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:515) at com.sun.proxy.$Proxy136.prepareStatement(Unknown Source) at com.zaxxer.hikari.pool.ProxyConnection.prepareStatement(ProxyConnection.java:337) at com.zaxxer.hikari.pool.HikariProxyConnection.prepareStatement(HikariProxyConnection.java) at io.opentracing.contrib.jdbc.TracingConnection.prepareStatement(TracingConnection.java:65) at jdk.internal.reflect.GeneratedMethodAccessor60.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at io.opentracing.contrib.common.WrapperProxy$1.invoke(WrapperProxy.java:73) at com.sun.proxy.$Proxy175.prepareStatement(Unknown Source) at jdk.internal.reflect.GeneratedMethodAccessor60.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.springframework.jdbc.datasource.TransactionAwareDataSourceProxy$TransactionAwareInvocationHandler.invoke(TransactionAwareDataSourceProxy.java:241) at com.sun.proxy.$Proxy176.prepareStatement(Unknown Source) at org.apache.ibatis.executor.statement.PreparedStatementHandler.instantiateStatement(PreparedStatementHandler.java:86) at org.apache.ibatis.executor.statement.BaseStatementHandler.prepare(BaseStatementHandler.java:88) at org.apache.ibatis.executor.statement.RoutingStatementHandler.prepare(RoutingStatementHandler.java:59) at org.apache.ibatis.executor.BatchExecutor.doQuery(BatchExecutor.java:90) at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:325) at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:156) at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:109) at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:89) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:147) ... 56 common frames omitted
Steps to reproduce (Required on creation):
- Create a high cardinality in the job log table (min 20 million rows) or define a rather low transaction timeout with a lower cardinality
- Set a history level that does not create events for failing jobs, e.g., ACTIVITY
- Create a job that reliably fails
- Run the job
Observed Behavior (Required on creation):
- The job fails until retries are down to 1
- On the usually last run of the failed job listener, creating an incident leads to the last failure job log id query being executed
- The query times out
- The failed job listener fails and keeps the retries at 1
- The job is executed endlessly with the same procedure
Expected behavior (Required on creation):
- An incident is created
- The job retries are set to 0 and committed to the database
Root Cause (Required on prioritization):
- The method JobEntity#getLastFailureLogId performs a query on the historic job log table. This method is invoked during the execution of the failed job handler, in a separate transaction after a job has failed.
- The query performs a full table scan on ACT_HI_JOB_LOG, filtering for rows with the given job id and status, sorting by the sequence counter
- With increasing cardinalities, this query becomes more costly
Solution Ideas (Optional):
- Option 1: We might be able to cover the query by an index or adjust it so it uses an existing index (e.g., on the ID_ column)
- Option 2: We avoid making this query.
- JobEntity#getLastFailureLogId is used when a job reaches zero retries.
- This happens in two cases:
- Case 1: Job fails during execution by job executor
- Case 2: Job fails is set to 0 retries via API
- In case 1 and if the history level is full, the failure log is created in the same transaction since the job logs are also created by the failed job handler. In this case, no query is needed.
- In case 1 and if the history level is not full, the job log query is executed, but this is not useful because it won't find any
- In case 2, we can question if referencing the last failure log in the historic incident makes sense (since no job failure lead to 0 retries); this seems to be a deliberate design choice (see test case HistoricIncidentTest#testJobLogReferenceWithNewIncidentCreatedOnSetRetriesAfterFailure and the comment in JobEntity#getLastFailureLogId), but we should revise it.
Hints (optional):
- The query was introduced with
CAM-10672 - Option 2 is preferrable if possible