Feature Request
Resolution: Unresolved
L3 - Default
User Story (Required on creation):
As an operator, I can monitor and alert on engine metrics in my operations monitoring tool of choice.
Functional Requirements (Required before implementation):
- Make the following engine metrics available via a Prometheus interface (authenticated and authorized):
- Performance metrics
- Threads active
- Threads idle/available in the pool
- Threads blocked
- Job Backlog (pending jobs that are due but not yet executed)
- Usage metrics
- Document the metrics so that operators unfamiliar with Camunda terminology can monitor the system (similar to what MongoDB does)
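The thread metrics listed above could be read from a standard Java thread pool roughly as follows. This is a minimal sketch that assumes the job executor is backed by a `java.util.concurrent.ThreadPoolExecutor`; the class `ThreadPoolGauges` and the gauge names are hypothetical and not part of the engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the engine API.
final class ThreadPoolGauges {

    /** Takes a point-in-time snapshot; values are approximate while the pool is busy. */
    static Map<String, Long> snapshot(ThreadPoolExecutor pool) {
        long active = pool.getActiveCount(); // threads currently executing a task
        long size = pool.getPoolSize();      // threads currently alive in the pool
        Map<String, Long> gauges = new LinkedHashMap<>();
        gauges.put("threads_active", active);
        gauges.put("threads_idle", Math.max(0, size - active));
        // In-memory queue of the pool only; this is NOT the database job backlog.
        gauges.put("pool_queue_size", (long) pool.getQueue().size());
        return gauges;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>(10));
        System.out.println(snapshot(pool));
        pool.shutdown();
    }
}
```

`ThreadPoolExecutor` does not report blocked threads, so that metric would need a different source, and the job backlog would still have to come from the database.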
Technical Requirements (Required before implementation):
- Create an internal representation of tagged metric samples (counters and gauges; a sample can have multiple tags)
- Enable the job executor implementations to expose thread performance (active threads, available threads, blocked threads) where possible
- Create an internal metrics collector that tracks samples
- Should be pluggable in the engine configuration
- Collects reported usage metrics from ACT_RU_METER_LOG
- Collects performance metrics from the job executor (access via engine configuration)
- Fetches number of jobs in backlog from the ACT_RU_JOB table
- (optional) Makes the metrics collection interval configurable in the engine configuration
- Used by the DbMetricsReporter to fetch the known metrics and to store them in ACT_RU_METER_LOG
- We might optimize the reporter to store only non-zero metrics in the database to avoid unnecessary data
- Makes the list of collected usage and performance metrics configurable in the engine configuration
- Documentation
- What metrics do we collect (usage vs. performance)?
- How do we collect metrics?
- How can you influence this with configuration?
- How can you provide your own metrics collector?
- Create a REST endpoint (preferably under “/metrics/prometheus”) that exposes collected metrics in Prometheus format
- Normalize metric names so they conform to the Prometheus naming format (snake_case)
- Serve the Prometheus format based on the requested result format (see MetricsServlet and Exporter)
- Fetch metrics from the configured engine collector via API
- Transform internal metric samples to Prometheus samples
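The tagged-sample representation, the name normalization, and the transformation to Prometheus samples described above might look like the following sketch. All type and method names (`PrometheusSketch`, `Sample`, `Kind`, `normalize`, `render`) are hypothetical, and the normalization rules are an assumption about what "snake_case" should mean here; the output follows the Prometheus text exposition format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical types for illustration; these names are not engine API.
final class PrometheusSketch {

    enum Kind { COUNTER, GAUGE }

    /** One tagged sample: a name, a kind, a value, and any number of tags. */
    static final class Sample {
        final String name;
        final Kind kind;
        final double value;
        final Map<String, String> tags;

        Sample(String name, Kind kind, double value, Map<String, String> tags) {
            this.name = name;
            this.kind = kind;
            this.value = value;
            this.tags = tags;
        }
    }

    /** Converts names such as "job-successful" or "activeThreads" to snake_case. */
    static String normalize(String name) {
        return name
                .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // split camelCase boundaries
                .replaceAll("[^A-Za-z0-9_]+", "_")        // dashes, dots, spaces become underscores
                .toLowerCase(Locale.ROOT);
    }

    /** Escapes backslash, double quote, and newline inside a label value. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Renders samples as Prometheus text, emitting one TYPE line per metric name. */
    static String render(List<Sample> samples) {
        // Group by normalized name so that all series of one metric stay together.
        Map<String, List<Sample>> byName = new LinkedHashMap<>();
        for (Sample s : samples) {
            byName.computeIfAbsent(normalize(s.name), k -> new ArrayList<>()).add(s);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<Sample>> entry : byName.entrySet()) {
            String name = entry.getKey();
            Kind kind = entry.getValue().get(0).kind;
            out.append("# TYPE ").append(name).append(' ')
               .append(kind.name().toLowerCase(Locale.ROOT)).append('\n');
            for (Sample s : entry.getValue()) {
                out.append(name);
                if (!s.tags.isEmpty()) {
                    StringJoiner labels = new StringJoiner(",", "{", "}");
                    s.tags.forEach((k, v) ->
                            labels.add(normalize(k) + "=\"" + escape(v) + "\""));
                    out.append(labels);
                }
                out.append(' ').append(s.value).append('\n');
            }
        }
        return out.toString();
    }
}
```

For a single gauge sample named `threads-active` with value 3 and the tag `engine=default`, `render` produces a `# TYPE threads_active gauge` line followed by `threads_active{engine="default"} 3.0`. The sketch does not handle names that start with a digit, which Prometheus rejects.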
Limitations of Scope (Optional):
Metrics not in scope
- Job configuration (can be part of the diagnostics interface)
- core-pool-size
- max-pool-size
- queue-capacity
- wait-time-in-millis
- max-wait
- max-jobs-per-acquisition
- Process metrics (should be monitored in Optimize)
- Process Instances Started (by process definition)
- Process Instances Ended (by process definition)
- Process Instance Cycle Time
- Number of incidents
- Troubleshooting metrics
- Collecting them can decrease performance.
- They can be implemented as needed as a custom extension of the metrics collector.
- Health metrics (SUPPORT-10327)
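The idea of adding troubleshooting metrics through a custom extension of the metrics collector could be sketched as follows. The ticket does not define the extension point, so `MetricsCollectorSketch`, `TroubleshootingCollector`, and the gauge name `jvm_threads_blocked` are hypothetical; only the `java.lang.management` calls are standard JDK API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical base collector; the real extension point is not defined in this ticket.
class MetricsCollectorSketch {
    private final Map<String, Supplier<Long>> gauges = new LinkedHashMap<>();

    /** Registers a gauge that is evaluated on every collection cycle. */
    void register(String name, Supplier<Long> supplier) {
        gauges.put(name, supplier);
    }

    /** Evaluates all registered gauges and returns their current values. */
    Map<String, Long> collect() {
        Map<String, Long> result = new LinkedHashMap<>();
        gauges.forEach((name, supplier) -> result.put(name, supplier.get()));
        return result;
    }
}

// Hypothetical custom extension that adds one troubleshooting metric.
class TroubleshootingCollector extends MetricsCollectorSketch {
    TroubleshootingCollector() {
        register("jvm_threads_blocked", TroubleshootingCollector::countBlockedThreads);
    }

    /** Counts JVM threads in state BLOCKED; it scans every live thread, so it is not free. */
    static long countBlockedThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long blocked = 0;
        for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
            // Entries are null for threads that ended between the two calls.
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }
}
```

Keeping such gauges in an opt-in subclass matches the scope decision above: the default collector stays cheap, and operators who need the extra detail accept the added cost.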