EventStoreDB stores indexes separately from the main data files, accessing records by stream name.
EventStoreDB creates index entries as it processes commit events. It holds these in memory (called memtables) until it reaches the
MaxMemTableSize and then persisted on disk in the index folder along with
.bloomfilter files and an index map file. The index files are uniquely named, and the index map file called indexmap. The index map describes the order and the level of the index file as well as containing the data checkpoint for the last written file, the version of the index map file and a checksum for the index map file. The logs refer to the index files as a PTable.
Indexes are sorted lists based on the hashes of stream names. To speed up seeking the correct location in the file of an entry for a stream, EventStoreDB keeps midpoints to relate the stream hash to the physical offset in the file.
As EventStoreDB saves more files, they are automatically merged together whenever there are more than 2 files at the same level into a single file at the next level. Each index entry is 24 bytes and the index file size is approximately 24Mb per 1M events. The Bloom filter files are approximately 1% of the size of the rest of the index.
Level 0 is the level of the memtable that is kept in memory. Generally there is only 1 level 0 table unless an ongoing merge operation produces multiple level 0 tables.
Assuming the default
MaxMemTableSize of 1M, the index files by level are:
|Level||Number of entries||Size|
|n||2^(n-1) * 1M||2^(n-1) * 24Mb|
Each index entry is 24 bytes and the index file size is approximately 24Mb per M events.
Indexing in depth
For general operation of EventStoreDB the following information is not critical but useful for developers wanting to make changes in the indexing subsystem and for understanding crash recovery and tuning scenarios.
Index map files
Indexmap files are text files made up of line delimited values. The line delimiter varies based on operating system, so while you can consider indexmap files valid when transferred between operating systems, if you make changes to fix an issue (for example, disk corruption) it is best to make them on the same operating system as the cluster.
The indexmap structure is as follows:
hash- an md5 hash of the rest of the file
version- the version of the indexmap file
checkpoint- the maximum prepare/commit position of the persisted ptables
maxAutoMergeLevel- either the value of
int32.MaxValueif it was not set. This is primarily used to detect increases in
MaxAutoMergeLevel, which is not supported.
index- List of all the ptables used by this index map with the level of the ptable and it's order.
EventStoreDB writes indexmap files to a temporary file and then deletes the original and renames the temporary file. EventStoreDB attempts this 5 times before failing. Because of the order, this operation can only fail if there is an issue with the underlying file system or the disk is full. This is a 2 phase process, and in the unlikely event of a crash during this process, EventStoreDB recovers by rebuilding the indexes using the same process used if it detects corrupted files during startup.
Writing and merging of index files
Merging ptables, updating the indexmap and persisting memtable operations happen on a background thread. These operations are performed on a single thread with any additional operations queued and completed later. EventStoreDB runs these operations on a thread pool thread rather than a dedicated thread. Generally there is only one operation queued at a time, but if merging to ptables at one level causes 2 tables to be available at the next level, then the next merge operation is immediately queued. While merge operations are in progress, if EventStoreDB is appending large numbers of events, it may queue 1 or more memtables for persistence. The number of pending operations is logged.
For safety ptables EventStoreDB is currently merging are only deleted after the new ptable has persisted and the indexmap updated. In the event of a crash, EventStoreDB recovers by deleting any files not in the _ indexmap_ and reindexing from the prepare/commit position stored in the indexmap file.
If you have set the maximum level (
MaxAutoMergeIndexLevel) for automatically merging indexes, then you need to trigger merging indexes above this level manually by using the
/admin/mergeindexes endpoint, or the es-cli tool that is available with commercial support.
Triggering a manual merge causes EventStoreDB to merge all tables that have a level equal to the maximum merge level or above into a single table. If there is only 1 table at the maximum level or above, no merge is performed.
Stream existence filter
The Stream Existence Filter is a pair of files in the index directory.
It is made up of a persisted Bloom filter (the
.dat file), and a checkpoint (the
.chk file) that records where in the log the filter has processed up to. As per the documented backup procedure, the checkpoint needs to be backed up before the dat file.
The Stream Existence Filter is used when reading and writing to quickly determine whether a stream might exist or definitely does not exist. If a stream definitely does not exist then EventStoreDB can skip looking through the index to find information about it. This is particularly useful when creating new streams (i.e. appending to stream that did not previously exist).
Additional information can be found in this
The configuration options that effect indexing are:
|Option||What's it for|
|Where the indexes are stored|
|How many entries to have in memory before writing out to disk|
|Sets the minimum number of midpoints to calculate for an index file|
|Tells the server to not verify indexes on startup|
|The maximum level of index file to merge automatically before manual merge|
|Bypasses the checking of file hashes of indexes during startup and after index merges (allows for faster startup and less disk pressure after merges)|
|Size in bytes of the stream existence filter|
|Maximum number of entries in each index LRU cache|
|Feature flag which can be used to disable the index Bloom filters|
Read more below to understand these options better.
Default: data files location
Index effects the location of the index files. We recommend you place index files on a separate drive to avoid competition for IO between the data, index and log files.
MaxMemTableSize effects disk IO when EventStoreDB writes files to disk, index seek time and database startup time. The default size is a good tradeoff between low disk IO and startup time. Increasing the
MaxMemTableSize results in longer database startup time because a node has to read through the data files from the last position in the
indexmap file and rebuild the in memory index table before it starts.
MaxMemTableSize also decreases the number of times EventStoreDB writes index files to disk and how often it merges them together, which increases IO operations. It also reduces the number of seek operations when stream entries span multiple files as EventStoreDB needs to search each file for the stream entries. This affects streams written to over longer periods of time more than streams written to over a shorter time, where time is measured by the number of events created, not time passed. This is because streams written to over longer time periods are more likely to have entries in multiple index files.
Index cache depth
IndexCacheDepth effects how many midpoints EventStoreDB calculates for an index file which effects file size slightly, but can effect lookup times significantly. Looking up a stream entry in a file requires a binary search on the midpoints to find the nearest midpoint, and then a seek through the entries to find the entry or entries that match. Increasing this value decreases the second part of the operation and increase the first for extremely large indexes.
The default value of 16 results in files up to about 1.5 GB in size being fully searchable through midpoints. After that a maximum distance between midpoints of 4096 bytes for the seek, which is buffered from disk, up to a maximum level of 2 TB where the seek distance starts to grow. Reducing this value can relieve a small amount of memory pressure in highly constrained environments. Increasing it causes index files larger than 1.5 GB, and less than 2 TB to have more dense midpoint populations which means the binary search is not used for long before switching back to scanning the entries between. The maximum number of entries scanned in this way is
distance/24b, so with the default setting and a 2TB index file this is approximately 170 entries. Most clusters should not need to change this setting.
Skip index verification
SkipIndexVerify skips reading and verification of index file hashes during startup. Instead of recalculating midpoints when EventStoreDB reads the file, it reads the midpoints directly from the footer of the index file. You can set
true to reduce startup time in exchange for the acceptance of a small risk that the index file becomes corrupted. This corruption could lead to a failure if you read the corrupted entries, and a message saying the index needs to be rebuilt. You can safely disable this setting for ZFS on Linux as the filesystem takes care of file checksums.
In the event of corruption indexes will be rebuilt by reading through all the chunk files and recreating the indexes from scratch.
Auto-merge index level
MaxAutoMergeIndexLevel allows you to specify the maximum index file level to automatically merge. By default EventStoreDB merges all levels. Depending on the specification of the host running EventStoreDB, at some point index merges will use a large amount of disk IO.
Merging 2 level 7 files results in at least 3072 MB reads (2 * 1536 MB), and 3072 MB writes while merging 2 level 8 files together results in at least 6144 MB reads (2 * 3072 MB) and 6144 MB writes. Setting
MaxAutoMergeLevelto 7 allows all levels up to and including level 7 to be automatically merged, but to merge the level 8 files together, you need to trigger a manual merge. This manual merge allows better control over when these larger merges happen and which nodes they happen on. Due to the replication process, all nodes tend to merge at about the same time.
Optimize index merge
OptimizeIndexMerge allows faster merging of indexes when EventStoreDB has scavenged a chunk. This option has no effect on non-scavenged chunks. When EventStoreDB has scavenged a chunk, and this option is set to
true, it uses a bloom filter before reading the chunk to see if the value exists before reading the chunk to make sure that it still exists.
Stream existence filter size
StreamExistenceFilterSize is the amount of memory & disk space, in bytes, to use for the stream existence filter. This should be set to roughly the maximum number of streams you expect to have in your database, i.e if you expect to have a max of 500 million streams, use a value of 500000000. The value you select should also fit entirely in memory to avoid any performance degradation. Use 0 to disable the filter.
Upgrading to a version of EventStoreDB that supports the Stream Existence Filter requires the filter to be built - unless it is disabled. This will take approximately as long as it takes to read through the whole index.
Resizing the filter will also cause a full rebuild of the filter.
Index cache size
IndexCacheSize is the maximum number of entries in each index LRU cache. The cache size is set to 0 (off) by default because it has an associated memory overhead and can be detrimental to workloads that produce a lot of cache misses. The cache is, however, well suited to read-heavy workloads of long-lived streams.
The index LRU cache is only created for index files that have Bloom filters.
Use index Bloom filters
UseIndexBloomFilters is a feature flag which can be used to disable the index Bloom filters. This should not be necessary and this flag will be removed in a future release, but is provided for safety since the index Bloom filters are a new feature. Please contact EventStore if you discover some need to disable this feature.
Unless this flag is set to false, EventStoreDB creates a
.bloomfilter file for each new PTable. The Bloom filter describes which streams are present in the PTable. This speeds up stream reads since EventStoreDB can avoid searching in index files do not contain the stream.
Note that immediately after upgrading to a version of EventStoreDB that produces index Bloom filters, no Bloom filters will yet exist. Either wait for new PTables to be produced with Bloom filters in the natural course of writing/merging/scavenging PTables, or rebuild the index for immediate generation.
For most EventStoreDB clusters, the default settings are enough to give consistent and good performance. For clusters with larger numbers of events, or those that run in constrained environments the configuration options allow for some tuning to meet operational constraints.
The most common optimization needed is to set a
MaxAutoMergeLevel to avoid large merges occurring across all nodes at approximately the same time. Large index merges use a lot of IOPS and in IOPS constrained environments it is often desirable to have better control over when these happen. Because increasing this value requires an index rebuild you should start with a higher value and decrease until the desired balance between triggering manual merges (operational cost) and automatic merges (IOPS) cost. The exact value to set this varies between environments due to IOPS generated by other operations such as read and write load on the cluster.
A cluster with 3000 256b IOPS can read/write about 0.73Gb/sec (This level of IOPS represents a small cloud instance). Assuming sustained read/write throughput of 0.73Gb/s. When an index merge of level 7 or above starts, it consumes as many IOPS up to all on the node until it completes. Because EventStoreDB has a shared nothing architecture for clustering this operation is likely to cause all nodes to appear to stall simultaneously as they all try and perform an index merge at the same time. By setting
MaxAutoMergeLevelto 6 or below you can avoid this, and you can run the merge on each node individually keeping read/write latency in the cluster consistent.