Log Structured Merge Tree - Garren's public notes

_Log-structured merge trees are a data structure I come across often when working with databases. I often read up on how they work and then shortly completely forget it all. I then have to relearn it again because I either want to speak to someone about it or use something LSM-related. This is my LSM overview notes that I can quickly refer to whenever I forget how a specific part works._ Log Structured Merge Tree is a data structure for storing data on disks. It is primarily used as the core key-value store in a database. An LSM is especially good when there is a high amount of writes to the database. The LSM architecture is becoming more and more popular. The most well-known implementation is [RocksDB](https://rocksdb.org/). ## How an LSM works An LSM is broken up into two main components. An in-memory component called a _Memtable_ and String Sorted Tables (SSTables) that store the data on disk. The _Memtable_ stores the most recent key and value updates and the SSTables stores older keys and values. Both components store the values sorted by the key. A type of tree structure like a b+tree, ART or Hashtable is typically used to store the keys in memory. Interestingly [RocksDB uses a skiplist](https://github.com/facebook/rocksdb/wiki/Memtable), but it is configurable to use a hash structure as well. When the Memtable reaches a certain size, the Memtable is marked as read-only. A new Memtable is created to receive new updates and the old Memtable is flushed to disk and stored as a SSTable at Level 0. Each SSTable is immutable and will hold only one copy of each key in it. The below diagram shows the different levels, sizes and how the keys are split up per level. ![[Pasted image 20231226153304.png]] _SSTable levels from [How RocksDB Works](https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works/)_ From the above diagram, Level 0 is newer than Level 1 and so any keys it holds are newer than Level 1. ## SSTables internals Sorted string tables are broken up into two on-disk components. An index file and a data file. The index file is often a b-tree design ordered by the keys and has a pointer value to the position of the data contained in the data file. The data file is also ordered by key, so for range scans, we use the index file to find the first key in the data file and then read the rest of the data straight from the data file. ## Compaction Over time, the combined SSTables for a Level will exceed the allowed size limits for that level. A background compaction process will merge that level with the level below it. The process will compare the keys stored in the lower level with the higher level, if the lower level contains that key, it will replace it with the higher level and newer key. This process is critical to reclaim disk space and reduce the number of files required to read from for a key lookup. ## Writes and redundancy Inserts and updates are written into the active Memtable and also appended to the write ahead log (WAL). The WAL is flushed to disk in a background thread so that if the LSM crashes the updates in the Memtable are not lost and can be recovered when the application restarts. The WAL will be linked to each Memtable, so when a Memtable is flushed to disk, the WAL can be cleaned up and removed. This is done so that the WAL is not continually growing. When a key is deleted, the LSM needs to record that the key is deleted. This is known as a tombstone. If this isn't done, when reading that key it could find an older key in the SSTables and return old data. ## Read Path For a single key lookup, the LSM first looks in the Memtable. If it is not there it will look in the SSTables, starting at the most recent one and searching all SSTables until it finds the key. An optimisation to make the read more efficient is to add a [Bloom Filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) to each SStable. A Bloom filter can be used to determine if the key is definitely not in the SSTable or maybe in the table. This is then used to reduce the number of SSTables to search through to find the key. A Range scan is a fair bit more complicated. An iterator is opened for the _Memtable_ and all the SSTables. It reads from the _Memtable_ and the SSTable making sure to exclude older results and merge the results into a single response back. ## Bitcask [Bitcask](https://github.com/basho/bitcask) is an interesting variation on the standard LSM design. It keeps an in-memory Hashmap of all the keys and a pointer to the position of the value on disk. This makes reads and writes very fast, but requires a lot more memory to use and start-up times can be slower because the Hashmap needs to be built. [Here is a nice dive into the implementation](https://arpitbhayani.me/blogs/bitcask/) ## Extra reading For a deeper dive into the theory of LSM, read chapter 3 of [[Why I own two copies of Designing Data-Intensive Applications|Designing Data-Intensive Applications]] and chapter 7 of [[Database Internals]]. Artem has a very indepth article on [How RocksDB works](https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works/). Then for some history, the [original Paper on LSM](https://paperhub.s3.amazonaws.com/18e91eb4db2114a06ea614f0384f2784.pdf). Finally, [Joran's talk on LSM-forest in Tigerbeetle](https://www.youtube.com/watch?v=yBBpUMR8dHw) is a really fun overview of how they use an LSM in Tigerbeetle.