Metrics

EventStoreDB collects metrics in Prometheus format, exposed on the /metrics endpoint. Prometheus can be configured to scrape this endpoint directly. The metrics are configured in metricsconfig.json.
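As an illustration, the exposition format served on /metrics can be consumed with a few lines of Python. The sketch below parses a hard-coded sample response rather than scraping a live node (a running node serves this on its HTTP port, e.g. http://localhost:2113/metrics by default):

```python
import re

# Sample of the Prometheus exposition format served on /metrics.
SAMPLE = """\
# TYPE eventstore_proc_thread_count gauge
eventstore_proc_thread_count 15 1688070655500
# TYPE eventstore_io_events counter
eventstore_io_events{activity="written"} 320 1687963622074
"""

# name, optional {labels}, value, optional millisecond timestamp
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                  r'(?:\{(?P<labels>[^}]*)\})?'
                  r'\s+(?P<value>\S+)(?:\s+(?P<ts>\d+))?$')

def parse_metrics(text):
    """Yield (name, labels-dict, value) for each sample line."""
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip # TYPE / # UNIT comment lines
        m = LINE.match(line)
        if not m:
            continue
        pairs = (p.split('=', 1) for p in (m.group('labels') or '').split(',') if p)
        labels = {k: v.strip('"') for k, v in pairs}
        yield m.group('name'), labels, float(m.group('value'))

for sample in parse_metrics(SAMPLE):
    print(sample)
```

This covers only the simple counter/gauge lines shown in this document; a production consumer should use a full exposition-format parser or Prometheus itself.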

NOTE

/metrics does not yet include metrics for Projections and Persistent Subscriptions. To view these in Prometheus, it is still recommended to use the Prometheus exporter.

In addition, EventStoreDB can actively export metrics to a specified endpoint using the OpenTelemetry Protocol (OTLP); this is a commercial feature.

Metrics reference

Caches

Cache hits and misses

EventStoreDB tracks cache hits/misses metrics for stream-info and chunk caches.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_cache_hits_misses{cache=<CACHE_NAME>,kind=<"hits"\|"misses">}` | Counter | Total hits/misses on the CACHE_NAME cache |

Example configuration:

```json
"CacheHitsMisses": {
  "StreamInfo": true,
  "Chunk": false
}
```

Example output:

```
# TYPE eventstore_cache_hits_misses counter
eventstore_cache_hits_misses{cache="stream-info",kind="hits"} 104329 1688157489545
eventstore_cache_hits_misses{cache="stream-info",kind="misses"} 117 1688157489545
```
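Counters like these are most useful as rates or ratios between successive scrapes. As a sketch (the earlier readings below are invented; the later ones come from the example output above), the hit ratio over an interval is:

```python
def hit_ratio(hits_before, misses_before, hits_after, misses_after):
    """Cache hit ratio over the interval between two counter readings."""
    dh = hits_after - hits_before
    dm = misses_after - misses_before
    total = dh + dm
    return dh / total if total else 0.0

# Later readings taken from the stream-info example output above;
# earlier readings are hypothetical.
print(hit_ratio(100000, 100, 104329, 117))
```

In PromQL the same idea is usually expressed with `rate()`, dividing the rate of the `kind="hits"` series by the sum of the hits and misses rates.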

Dynamic cache resources

Certain caches that EventStoreDB uses are dynamic: their capacity scales up and down during their lifetime. EventStoreDB records metrics for the resources used by each such dynamic cache.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_cache_resources_bytes{cache=<CACHE_NAME>,kind=<"capacity"\|"size">}` | Gauge | Current capacity/size of the CACHE_NAME cache in bytes |
| `eventstore_cache_resources_entries{cache=<CACHE_NAME>,kind="count"}` | Gauge | Current number of entries in the CACHE_NAME cache |

Example configuration:

```json
"CacheResources": true
```

Example output:

```
# TYPE eventstore_cache_resources_bytes gauge
# UNIT eventstore_cache_resources_bytes bytes
eventstore_cache_resources_bytes{cache="LastEventNumber",kind="capacity"} 50000000 1688157491029
eventstore_cache_resources_bytes{cache="LastEventNumber",kind="size"} 15804 1688157491029

# TYPE eventstore_cache_resources_entries gauge
# UNIT eventstore_cache_resources_entries entries
eventstore_cache_resources_entries{cache="LastEventNumber",kind="count"} 75 1688157491029
```

Checkpoints

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_checkpoints{name=<CHECKPOINT_NAME>,read="non-flushed"}` | Gauge | Value of the CHECKPOINT_NAME checkpoint |

Example configuration:

```json
"Checkpoints": {
  "Replication": true,
  "Chaser": false,
  "Epoch": false,
  "Index": false,
  "Proposal": false,
  "Truncate": false,
  "Writer": false,
  "StreamExistenceFilter": false
}
```

Example output:

```
# TYPE eventstore_checkpoints gauge
eventstore_checkpoints{name="replication",read="non-flushed"} 613363 1688054162478
```

Events

These metrics track events written to and read from the server, including reads from caches.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_io_bytes{activity="read"}` | Counter | Event bytes read |
| `eventstore_io_events{activity=<"read"\|"written">}` | Counter | Events read/written |

Example configuration:

```json
"Events": {
  "Read": false,
  "Written": true
}
```

Example output:

```
# TYPE eventstore_io_events counter
# UNIT eventstore_io_events events
eventstore_io_events{activity="written"} 320 1687963622074
```

Gossip

Measures the round-trip latency and processing time of gossip. Usually, a node pushes new gossip to other nodes periodically or when its view of the cluster changes. Nodes may also pull gossip from each other if a network problem is suspected.

Gossip latency

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_gossip_latency_seconds_bucket{activity="pull-from-peer",status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of gossips pulled from peers with latency less than or equal to DURATION seconds |
| `eventstore_gossip_latency_seconds_bucket{activity="push-to-peer",status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of gossips pushed to peers with latency less than or equal to DURATION seconds |

Gossip processing

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_gossip_processing_duration_seconds_bucket{activity="push-from-peer",status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of gossips pushed from peers that took less than or equal to DURATION seconds to process |
| `eventstore_gossip_processing_duration_seconds_bucket{activity="request-from-peer",status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of gossip requests from peers that took less than or equal to DURATION seconds to process |
| `eventstore_gossip_processing_duration_seconds_bucket{activity="request-from-grpc-client",status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of gossip requests from gRPC clients that took less than or equal to DURATION seconds to process |
| `eventstore_gossip_processing_duration_seconds_bucket{activity="request-from-http-client",status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of gossip requests from HTTP clients that took less than or equal to DURATION seconds to process |

Example configuration:

```json
"Gossip": {
  "PullFromPeer": false,
  "PushToPeer": true,
  "ProcessingPushFromPeer": false,
  "ProcessingRequestFromPeer": false,
  "ProcessingRequestFromGrpcClient": false,
  "ProcessingRequestFromHttpClient": false
}
```

Example output:

```
# TYPE eventstore_gossip_latency_seconds histogram
# UNIT eventstore_gossip_latency_seconds seconds
eventstore_gossip_latency_seconds_bucket{activity="push-to-peer",status="successful",le="0.005"} 8 1687972306948
```

Incoming gRPC calls

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_current_incoming_grpc_calls` | Gauge | In-flight gRPC calls, i.e. gRPC requests that have started on the server but not yet stopped |
| `eventstore_incoming_grpc_calls{kind="total"}` | Counter | Total gRPC requests served |
| `eventstore_incoming_grpc_calls{kind="failed"}` | Counter | Total gRPC requests failed |
| `eventstore_incoming_grpc_calls{kind="unimplemented"}` | Counter | Total gRPC requests made to unimplemented methods |
| `eventstore_incoming_grpc_calls{kind="deadline-exceeded"}` | Counter | Total gRPC requests whose deadline was exceeded |

Example configuration:

```json
"IncomingGrpcCalls": {
  "Current": true,
  "Total": false,
  "Failed": true,
  "Unimplemented": false,
  "DeadlineExceeded": false
}
```

Example output:

```
# TYPE eventstore_current_incoming_grpc_calls gauge
eventstore_current_incoming_grpc_calls 1 1687963622074

# TYPE eventstore_incoming_grpc_calls counter
eventstore_incoming_grpc_calls{kind="failed"} 1 1687962877623
```

Client protocol gRPC methods

In addition, EventStoreDB records metrics for each of the client protocol gRPC methods: StreamRead, StreamAppend, StreamBatchAppend, StreamDelete, and StreamTombstone. These are grouped together according to the mapping defined in the configuration.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_grpc_method_duration_seconds_bucket{activity=<LABEL>,status=<"successful"\|"failed">,le=<DURATION>}` | Histogram | Number of LABEL gRPC requests that took less than or equal to DURATION seconds to process |

Example configuration:

```json
"GrpcMethods": {
  "StreamAppend": "append",
  "StreamBatchAppend": "append",
  // leaving the label blank disables metric collection
  "StreamRead": "",
  "StreamDelete": "",
  "StreamTombstone": ""
}
```

Example output:

```
# TYPE eventstore_grpc_method_duration_seconds histogram
# UNIT eventstore_grpc_method_duration_seconds seconds
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="1E-06"} 0 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="1E-05"} 0 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="0.0001"} 129 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="0.001"} 143 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="0.01"} 168 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="0.1"} 169 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="1"} 169 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="10"} 169 1688157491029
eventstore_grpc_method_duration_seconds_bucket{activity="append",status="successful",le="+Inf"} 169 1688157491029
```
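Histogram buckets are cumulative, so the share of requests completing within a latency target can be read straight off the bucket counts. A sketch using the append figures from the example output above:

```python
# Cumulative bucket counts from the example output: le -> count
buckets = {
    1e-06: 0, 1e-05: 0, 0.0001: 129, 0.001: 143,
    0.01: 168, 0.1: 169, 1.0: 169, 10.0: 169, float("inf"): 169,
}

def fraction_within(buckets, threshold):
    """Fraction of observations <= threshold (threshold must be a bucket bound)."""
    total = buckets[float("inf")]  # the +Inf bucket counts everything
    return buckets[threshold] / total if total else 0.0

# 143 of the 169 appends completed within 1 ms
print(fraction_within(buckets, 0.001))
```

With Prometheus itself, quantiles over such buckets are typically computed with `histogram_quantile()` over `rate()`d bucket series.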

Kestrel

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_kestrel_connections` | Gauge | Number of open Kestrel connections |

Example configuration:

```json
"Kestrel": {
  "ConnectionCount": true
}
```

Example output:

```
# TYPE eventstore_kestrel_connections gauge
eventstore_kestrel_connections 1 1688070655500
```

Process

EventStoreDB collects key metrics about the running process.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_proc_up_time{pid=<PID>}` | Counter | Time in seconds this process has been running; PID is the process ID of the EventStoreDB process |
| `eventstore_proc_cpu` | Gauge | Process CPU usage |
| `eventstore_proc_thread_count` | Gauge | Current number of threadpool threads (ThreadPool.ThreadCount) |
| `eventstore_proc_thread_pool_pending_work_item_count` | Gauge | Current number of items queued to be processed by threadpool threads (ThreadPool.PendingWorkItemCount) |
| `eventstore_proc_contention_count` | Counter | Total number of times there was contention when trying to take a monitor's lock (Monitor.LockContentionCount) |
| `eventstore_proc_exception_count` | Counter | Total number of exceptions thrown |
| `eventstore_gc_time_in_gc` | Gauge | Percentage of CPU time spent collecting garbage during the last garbage collection |
| `eventstore_gc_heap_size_bytes` | Gauge | Heap size in bytes |
| `eventstore_gc_heap_fragmentation` | Gauge | Percentage of heap fragmentation during the last garbage collection |
| `eventstore_gc_total_allocated` | Counter | Total bytes allocated over the lifetime of this process |
| `eventstore_gc_pause_duration_max_seconds{range=<RANGE>}` | RecentMax | Recent maximum garbage collection pause in seconds, i.e. the time the execution engine was paused for GC |
| `eventstore_gc_generation_size_bytes{generation=<"gen0"\|"gen1"\|"gen2"\|"loh">}` | Gauge | Size of each GC generation in bytes |
| `eventstore_gc_collection_count{generation=<"gen0"\|"gen1"\|"gen2">}` | Counter | Number of garbage collections for each generation |
| `eventstore_proc_mem_bytes{kind=<"working-set"\|"paged-bytes"\|"virtual-bytes">}` | Gauge | Size in bytes of the working set, paged, or virtual memory |
| `eventstore_disk_io_bytes{activity=<"read"\|"written">}` | Counter | Number of bytes read from/written to disk |
| `eventstore_disk_io_operations{activity=<"read"\|"written">}` | Counter | Number of OS read/write operations issued to disk |

Example configuration:

```json
"Process": {
  "UpTime": false,
  "Cpu": false,
  "MemWorkingSet": false,
  "MemPagedBytes": false,
  "MemVirtualBytes": false,
  "ThreadCount": true,
  "ThreadPoolPendingWorkItemCount": false,
  "LockContentionCount": true,
  "ExceptionCount": false,
  "Gen0CollectionCount": false,
  "Gen1CollectionCount": false,
  "Gen2CollectionCount": false,
  "Gen0Size": false,
  "Gen1Size": false,
  "Gen2Size": false,
  "LohSize": false,
  "TimeInGc": false,
  "GcPauseDuration": true,
  "HeapSize": false,
  "HeapFragmentation": false,
  "TotalAllocatedBytes": false,
  "DiskReadBytes": false,
  "DiskReadOps": false,
  "DiskWrittenBytes": false,
  "DiskWrittenOps": false
}
```

Example output:

```
# TYPE eventstore_proc_thread_count gauge
eventstore_proc_thread_count 15 1688070655500

# TYPE eventstore_proc_contention_count counter
eventstore_proc_contention_count 297 1688147136862

# TYPE eventstore_gc_pause_duration_max_seconds gauge
# UNIT eventstore_gc_pause_duration_max_seconds seconds
eventstore_gc_pause_duration_max_seconds{range="16-20 seconds"} 0.0485873 1688147136862
```

Queues

EventStoreDB uses various queues for asynchronous processing, and collects several metrics for them. In addition, EventStoreDB allows users to group queues and monitor them as a unit.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_queue_busy_seconds{queue=<QUEUE_GROUP>}` | Counter | Total time spent processing, in seconds, averaged across the queues in the QUEUE_GROUP. The rate of this metric is therefore the average busyness of the group over the period (0-1 s/s) |
| `eventstore_queue_queueing_duration_max_seconds{name=<QUEUE_GROUP>,range=<RANGE>}` | RecentMax | Recent maximum time in seconds that any item was queued in queues belonging to the QUEUE_GROUP; essentially the length, in seconds, of the longest queue in the group |
| `eventstore_queue_processing_duration_seconds_bucket{message_type=<TYPE>,queue=<QUEUE_GROUP>,le=<DURATION>}` | Histogram | Number of messages of type TYPE processed by the QUEUE_GROUP group in less than or equal to DURATION seconds |
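Because eventstore_queue_busy_seconds counts busy seconds, its rate between two scrapes is the group's average busyness (0 = idle, 1 = fully busy). A minimal sketch with illustrative readings:

```python
def busyness(busy_before, busy_after, interval_seconds):
    """Average busyness of a queue group between two scrapes (0.0-1.0)."""
    return (busy_after - busy_before) / interval_seconds

# e.g. 4.5 busy-seconds accumulated over a 15-second scrape interval
print(busyness(100.0, 104.5, 15.0))  # 0.3 -> the group was busy 30% of the time
```

In PromQL this is simply `rate(eventstore_queue_busy_seconds[5m])`.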

The QueueLabels setting within metricsconfig.json can be used to group queues, based on a regex matched against queue names, and to label them for metrics reporting. Capture groups are also supported. Message types can be grouped in the same way with the MessageTypes setting in metricsconfig.json.
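To illustrate how this kind of regex-based grouping behaves, here is a sketch in Python (assuming rules are tried in order and the first match wins, which is what the Readers/Others example below relies on — the catch-all pattern comes last):

```python
import re

# Mirrors the QueueLabels example configuration in this section.
QUEUE_LABELS = [
    {"Regex": r"StorageReaderQueue #.*", "Label": "Readers"},
    {"Regex": r".*", "Label": "Others"},  # catch-all
]

def label_for(queue_name, rules=QUEUE_LABELS):
    """Return the metrics label for a queue name, first matching rule wins."""
    for rule in rules:
        m = re.fullmatch(rule["Regex"], queue_name)
        if m:
            # Capture groups may be referenced in the label, e.g. r"\1"
            return m.expand(rule["Label"])
    return queue_name

print(label_for("StorageReaderQueue #1"))  # Readers
print(label_for("MainQueue"))              # Others
```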

NOTE

Enabling Queues.Processing can generate many more time series, depending on the QueueLabels and MessageTypes configuration.

Example configuration:

```json
"Queues": {
  "Busy": true,
  "Length": true,
  "Processing": false
}

"QueueLabels": [
  {
    "Regex": "StorageReaderQueue #.*",
    "Label": "Readers"
  },
  {
    "Regex": ".*",
    "Label": "Others"
  }
]
```

Example output:

```
# TYPE eventstore_queue_busy_seconds counter
# UNIT eventstore_queue_busy_seconds seconds
eventstore_queue_busy_seconds{queue="Readers"} 1.04568158125 1688157491029
eventstore_queue_busy_seconds{queue="Others"} 0 1688157491029

# TYPE eventstore_queue_queueing_duration_max_seconds gauge
# UNIT eventstore_queue_queueing_duration_max_seconds seconds
eventstore_queue_queueing_duration_max_seconds{name="Readers",range="16-20 seconds"} 0.06434454 1688157489545
eventstore_queue_queueing_duration_max_seconds{name="Others",range="16-20 seconds"} 0 1688157489545
```

Status

EventStoreDB tracks the current status of the Node role, as well as the progress of the Index and Scavenge processes.

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_statuses{name=<NAME>,status=<STATUS>}` | Gauge | Number of seconds since the 1970 epoch at which NAME most recently had the status STATUS |

For a given NAME, the current status can be determined by taking the max of all the time series with that name.
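A sketch of that rule: given the samples for a single NAME, the current status is the one whose value (seconds since the epoch) is greatest. The sample values below are illustrative:

```python
# (status, seconds-since-epoch value) samples for one NAME, e.g. "Index"
samples = [
    ("Rebuilding", 1688054000),
    ("Idle", 1688054162),
    ("Merging", 1688053900),
]

def current_status(samples):
    """The current status is the one with the greatest (most recent) value."""
    return max(samples, key=lambda s: s[1])[0]

print(current_status(samples))  # Idle
```

In PromQL the equivalent is a `max by (name)` over the `eventstore_statuses` series, then matching which `status` label attains that maximum.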

Index can have one of the following statuses:

  • Opening (loading/verifying the PTables)
  • Rebuilding (indexing previously written records on start up)
  • Initializing (initializing any other parts of the index e.g. StreamExistenceFilter on start up)
  • Merging
  • Scavenging
  • Idle

Scavenge can have one of the following statuses:

  • Accumulation
  • Calculation
  • Chunk Execution
  • Chunk Merging
  • Index Execution
  • Cleaning
  • Idle

Node can be one of the node roles.

Example configuration:

```json
"Statuses": {
  "Index": true,
  "Node": false,
  "Scavenge": false
}
```

Example output:

```
# TYPE eventstore_statuses gauge
eventstore_statuses{name="Index",status="Idle"} 1688054162 1688054162477
```

Storage Writer

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_writer_flush_size_max{range=<RANGE>}` | RecentMax | Recent maximum flush size in bytes |
| `eventstore_writer_flush_duration_max_seconds{range=<RANGE>}` | RecentMax | Recent maximum flush duration in seconds |

Example configuration:

```json
"Writer": {
  "FlushSize": true,
  "FlushDuration": false
}
```

Example output:

```
# TYPE eventstore_writer_flush_size_max gauge
eventstore_writer_flush_size_max{range="16-20 seconds"} 410 1688056823193
```

System

| Time series | Type | Description |
| --- | --- | --- |
| `eventstore_sys_load_avg{period=<"1m"\|"5m"\|"15m">}` | Gauge | Average system load over the last 1, 5, and 15 minutes. Only available on Unix-like systems |
| `eventstore_sys_cpu` | Gauge | Current CPU usage as a percentage. Not available on Unix-like systems |
| `eventstore_sys_mem_bytes{kind=<"free"\|"total">}` | Gauge | Current free/total memory in bytes |
| `eventstore_sys_disk_bytes{disk=<MOUNT_POINT>,kind=<"used"\|"total">}` | Gauge | Current used/total bytes of the disk mounted at MOUNT_POINT |
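For example, the used/total disk gauges combine into a usage percentage. In this sketch the used figure matches this section's example output; the total is invented for illustration:

```python
def disk_used_percent(used_bytes, total_bytes):
    """Percentage of the mounted disk currently in use."""
    return 100.0 * used_bytes / total_bytes if total_bytes else 0.0

# used value from the example output; total is a hypothetical ~250 GB disk
print(disk_used_percent(38947205120, 250790436864))
```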

Example configuration:

```json
"System": {
  "Cpu": false,
  "LoadAverage1m": false,
  "LoadAverage5m": false,
  "LoadAverage15m": false,
  "FreeMem": false,
  "TotalMem": false,
  "DriveTotalBytes": false,
  "DriveUsedBytes": true
}
```

Example output:

```
# TYPE eventstore_sys_disk_bytes gauge
# UNIT eventstore_sys_disk_bytes bytes
eventstore_sys_disk_bytes{disk="/home",kind="used"} 38947205120 1688070655500
```

Metric types

Common types

Please refer to the Prometheus documentation for an explanation of the common metric types (Gauge, Counter, and Histogram).

RecentMax

A gauge whose value represents the maximum of a set of recent measurements. Its purpose is to capture spikes that would otherwise fall between scrapes.

NOTE

The ExpectedScrapeIntervalSeconds setting within metricsconfig.json can be used to control the size of the window that the max is calculated over. It represents the expected interval between scrapes by a consumer such as Prometheus. It can only take specific values: 0, 1, 5, 10 or multiples of 15.

Setting the expected scrape interval correctly ensures that spikes in the time series will be captured by at least one scrape and at most two.
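The mechanism can be sketched as a ring of per-sub-interval buckets whose maximum is reported; this is an illustrative model, not the server's actual implementation:

```python
from collections import deque

class RecentMax:
    """Track the max over the last `buckets` sub-intervals (a sketch)."""
    def __init__(self, buckets=5):
        self.window = deque([float("-inf")] * buckets, maxlen=buckets)

    def record(self, value):
        # Fold the measurement into the current (newest) bucket.
        self.window[-1] = max(self.window[-1], value)

    def advance(self):
        # Called once per sub-interval: start a fresh bucket, dropping the oldest.
        self.window.append(float("-inf"))

    def read(self):
        return max(self.window)

rm = RecentMax(buckets=5)
rm.record(410)
rm.advance()
rm.record(1854)   # a spike in the next sub-interval
for _ in range(3):
    rm.advance()
print(rm.read())  # the spike is still visible a few sub-intervals later
```

Because a reading covers several whole sub-intervals, the reported maximum spans a window slightly larger than the configured scrape interval, which is exactly the "16-20 seconds" range seen in the output below.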

Example output: the following metric is reported when ExpectedScrapeIntervalSeconds is set to 15 seconds:

```
# TYPE eventstore_writer_flush_size_max gauge
eventstore_writer_flush_size_max{range="16-20 seconds"} 1854 1688070655500
```

In the above example, the maximum reported is 1854. This is not the maximum measurement over the last 15 seconds, but over a window of between the last 16 and the last 20 seconds, i.e. the maximum could have been recorded anywhere within the last 16 up to the last 20 seconds.

OpenTelemetry Exporter (commercial)

EventStoreDB passively exposes metrics for scraping on the /metrics endpoint. If you would like EventStoreDB to actively export the metrics, the OpenTelemetry Exporter Plugin can be used.

The OpenTelemetry Exporter plugin allows you to export EventStoreDB metrics to a specified endpoint using the OpenTelemetry Protocol (OTLP). The following instructions will help you set up the exporter and customize its configuration so you can receive, process, export, and monitor metrics as needed.

A number of APM providers natively support ingesting metrics using the OTLP protocol, so you might be able to use the OpenTelemetry Exporter to send metrics directly to your APM provider. Alternatively, you can export metrics to the OpenTelemetry Collector, which can then be configured to send metrics to a variety of backends. You can find out more in the OpenTelemetry Collector documentation.

Configuration

Refer to the general plugins configuration guide to see how to configure plugins with JSON files and environment variables.

Sample JSON configuration:

```json
{
  "OpenTelemetry": {
    "Otlp": {
      "Endpoint": "http://localhost:4317",
      "Headers": ""
    }
  }
}
```

The configuration can specify:

| Name | Description |
| --- | --- |
| `OpenTelemetry__Otlp__Endpoint` | Destination to which the OTLP exporter sends data |
| `OpenTelemetry__Otlp__Headers` | Optional headers for the connection |

Headers are key-value pairs separated by commas. For example:

```json
"Headers": "api-key=value,other-config-value=value"
```
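As a sketch, the header string breaks down into key-value pairs like so:

```python
def parse_headers(header_string):
    """Split 'k1=v1,k2=v2' into a dict (a sketch of the OTLP header format)."""
    if not header_string:
        return {}
    return dict(pair.split("=", 1) for pair in header_string.split(","))

print(parse_headers("api-key=value,other-config-value=value"))
```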

EventStoreDB will log a message on startup confirming the metrics export to your specified endpoint:

```
OtlpExporter: Exporting metrics to http://localhost:4317/ every 15.0 seconds
```

The interval is taken from the ExpectedScrapeIntervalSeconds value in metricsconfig.json in the server installation directory:

```json
"ExpectedScrapeIntervalSeconds": 15
```

Troubleshooting

| Symptom | Solution |
| --- | --- |
| The OpenTelemetry Exporter plugin is not loaded | The OpenTelemetry Exporter plugin is only available in commercial editions. Check that it is present in `<installation-directory>/plugins`. If it is present, on startup the server will log a message similar to: `Loaded SubsystemsPlugin plugin: "otlp-exporter" "24.2.0.0"` |
| EventStoreDB logs a message on startup that it cannot find the configuration | The server logs a message: `OtlpExporter: No OpenTelemetry:Otlp configuration found. Not exporting metrics.` Check the configuration steps above. |