I wrote something called LogStore based on the general notion of log-structured data (and only later learned about the work of Ousterhout et al. from the 90s). For what it's worth, I avoided the need to lock the metadata and the database during compaction, so the log can keep growing while compaction runs.
Instead of working through the locked metadata (offset per entry) and rewriting the entries to a new file, the compactor works from the end of the log file back toward the start. Each record descriptor is written after its record data, which lets the compactor read the descriptor first and easily skip over older versions of records it has already processed during the compaction pass.
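A minimal sketch of that layout (field names and widths are my assumptions, not LogStore's actual format): the payload is written first, then a fixed-width trailer, so a scan from the end of the file can read the trailer and jump over the payload to the previous record. Clojure here only because that's what the surrounding discussion uses:

    (import '(java.io DataOutputStream RandomAccessFile))

    ;; Append one record: payload first, then a fixed-width trailer that a
    ;; backward scan can read at a known offset from the record's end.
    (defn append-record [^DataOutputStream out ^long key-hash ^bytes payload]
      (.write out payload)                ; record data
      (.writeLong out key-hash)           ; trailer: which key this record is for
      (.writeInt out (alength payload)))  ; trailer: how far back the data starts

    ;; Read the trailer of the record that ends at byte offset `end`, and
    ;; return where the previous record ends so the scan can continue backward.
    (defn read-trailer [^RandomAccessFile log ^long end]
      (.seek log (- end 12))              ; trailer = 8-byte key hash + 4-byte length
      (let [key-hash (.readLong log)
            len      (.readInt log)]
        {:key-hash key-hash
         :payload-start (- end 12 len)
         :prev-end      (- end 12 len)}))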
Once the compactor reaches the start of the file, it checks whether the file grew (more appends, new data) since the compaction began, and if so starts over from the new EOF, stopping at the previous EOF. This repeats until the compactor fully catches up.
Then the database is temporarily locked, the new metadata is swapped in, and the existing log file is overwritten with the compacted log file.
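The catch-up loop might look something like the sketch below, where compact-region! is a hypothetical helper that walks a byte range backward via the trailers and copies the newest version of each key into the compacted file (reconciling keys that show up in more than one pass is elided):

    (defn compact! [^java.io.RandomAccessFile log compact-region!]
      (loop [low 0
             high (.length log)]
        (compact-region! log low high)      ; backward scan of [low, high)
        (let [new-high (.length log)]
          (if (> new-high high)
            (recur high new-high)           ; the log grew during this pass: do only the new tail
            :caught-up))))                  ; no new appends since the last pass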
The database and metadata are not locked during compaction, at least not to the extent you seem to think they are.
When a compaction is requested, the compact query enters the write queue. When the query reaches the head of the queue, an asynchronous compaction is started on a snapshot of the database at that moment, written to a file in /tmp. A buffer is also added to the database metadata, in which subsequent write queries will be recorded. After this essentially instant operation, the database proceeds to process write queries as normal; while the compaction is ongoing, write queries are appended to that buffer. When the compaction thread finishes writing its snapshot, it inserts into the write queue a request to finalize compaction. When that request reaches the head of the queue, it appends all the buffered queries to the compacted database file and then moves the compacted file from /tmp to the regular database path. So writes are blocked once for an instant and once for however long it takes to append the buffered queries, which shouldn't be long either. Note that reads are never blocked by compaction; in fact they are never blocked at all.
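For anyone who wants the shape of that in code, here is a rough sketch of the sequence, not FleetDB's actual implementation: the write queue is reduced to an agent, and write-snapshot-to!, append-queries!, promote-file!, and apply-query are hypothetical helpers.

    (def write-queue (agent {:records {} :path "db.log"}))

    (defn request-compaction []
      (send-off write-queue
        (fn [db]
          (let [snap   db                           ; persistent value = free snapshot
                tmp    "/tmp/db-compacting.log"
                buffer (atom [])]
            (future                                 ; long-running work happens off the queue
              (write-snapshot-to! snap tmp)
              (send-off write-queue                 ; re-enter the queue to finalize
                (fn [db*]
                  (append-queries! tmp @buffer)     ; brief: flush the buffered writes
                  (promote-file! tmp (:path db*))   ; move the /tmp file over the db file
                  (dissoc db* :compaction-buffer))))
            (assoc db :compaction-buffer buffer)))))  ; essentially instant

    (defn write! [query]
      (send-off write-queue
        (fn [db]
          (when-let [buf (:compaction-buffer db)]
            (swap! buf conj query))                 ; also record it for the compacted file
          (apply-query db query))))                 ; normal write path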
Thank you for explaining how compaction works in greater detail. As I understand it, the "buffer" used for write queries (and more) while compaction is active is in-memory, correct? An ArrayList? http://github.com/mmcgrana/fleetdb/blob/master/src/clj/fleet...
Why not just open a new file at compaction start instead of using an in-memory buffer? When compaction ends, append the newly opened file to the compacted file, then swap in the compacted file as the current log file.
I suppose deciding whether to buffer in memory or on disk would depend on several factors:
1) how much compaction is required and thus how long compaction might take to complete
2) historical write-rate average
3) buffer size threshold
4) compaction time threshold
By thresholding I mean: start buffering in memory and then switch to a file on disk if compaction starts taking "too long" to complete or the buffer in memory becomes "too large".
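As a concrete illustration of that switch (the thresholds, file path, and buffer shape below are all made up for the example, not anything FleetDB or LogStore actually does):

    (import '(java.io BufferedWriter FileWriter))

    (def max-buffered-queries 100000)       ; "too large"
    (def max-compaction-millis 60000)       ; "too long"

    ;; Buffer starts as {:mode :memory, :queries [], :started-at <millis>} and
    ;; switches itself to a temp file once either threshold is crossed.
    (defn buffer-query [{:keys [mode queries started-at spill-writer] :as buf} query]
      (let [too-big?  (>= (count queries) max-buffered-queries)
            too-long? (> (- (System/currentTimeMillis) started-at) max-compaction-millis)]
        (cond
          (= mode :disk)                             ; already spilled: keep writing to disk
          (do (.write ^BufferedWriter spill-writer (pr-str query))
              (.newLine ^BufferedWriter spill-writer)
              buf)

          (or too-big? too-long?)                    ; threshold crossed: spill everything
          (let [w (BufferedWriter. (FileWriter. "/tmp/compaction-spill.log"))]
            (doseq [q (conj queries query)]
              (.write w (pr-str q))
              (.newLine w))
            (assoc buf :mode :disk :queries [] :spill-writer w))

          :else                                      ; still cheap: keep it in memory
          (update buf :queries conj query))))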
Right, there are actually two potential problems here: 1) the initial compaction takes a long time and there is a correspondingly large number of buffered writes to append, and 2) the compaction never finishes, because it hung or died, and then we keep adding to the buffer indefinitely.
I don't think 1) will be much of a problem: the number of writes the db can process in the time it takes to compact even a large db should be small enough that appending them to the compacted file is quick.
2) is more problematic; I'll need to add a timeout-like guard to prevent a runaway write buffer.
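One possible shape for that guard, sketched under the assumption of a run-compaction! helper: race the compaction against a deadline and, if the deadline wins, cancel it so the caller can discard the write buffer:

    (defn compact-with-deadline [db run-compaction! timeout-ms]
      (let [worker (future (run-compaction! db))]
        (if (= ::timeout (deref worker timeout-ms ::timeout))
          (do (future-cancel worker)     ; compaction hung or is taking too long
              {:status :aborted})        ; caller drops the buffer and retries later
          {:status :finished :result @worker})))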
This is more or less exactly like the Redis BGREWRITEAOF command. I wonder whether we both happened to arrive at the same solution independently, or whether you knew about Redis in some way and got the idea from it. I would love to know that a Redis algorithm can be of general use for other DB implementations.
Thanks for your work, and welcome to the in-memory database developers crew ;)
Isn't a major purpose of a DB to handle datasets that are too big for RAM? If the items fit in RAM, isn't speed ~1/1000th the issue it was (i.e., the ratio of RAM to HD access time)?
If everything fits into RAM, you can use my proprietary ArrayDB storage system which has bindings for every non-toy programming language known to man. In fact, if you're a programmer, you already know the API for ArrayDB. No downloads, no installation, minimal investment and training required!
That really depends on the query and the data structures used in the database. You can query large amounts of data on the hard drive very quickly if you store your data and indexes in data structures tuned for the problem. Besides, modern operating systems and databases will use your available RAM to cache the disk anyway, so keeping the whole database in RAM (if it fits) won't help that much.
Clojure is very compelling for implementing a main-memory database; if you use the persistent data structures and concurrency primitives in the right way, you get atomicity, consistency, and isolation essentially for free.
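A small illustration of what "for free" means here (a toy example, not FleetDB code): with persistent maps inside refs, a multi-key write is atomic and isolated via the STM, and readers always see a consistent snapshot without locking.

    (def accounts (ref {:a 100 :b 0}))

    (defn transfer [from to amount]
      (dosync                                    ; both alters commit together or not at all
        (alter accounts update from - amount)
        (alter accounts update to + amount)))

    ;; Readers never block: @accounts is an immutable snapshot, so a concurrent
    ;; transfer can't expose a half-applied state.
    (transfer :a :b 25)
    @accounts  ;=> {:a 75, :b 25}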
Do the index maps contain pointers back to the base records? The examples I saw on the implementation page showed a nested copy of the data; I assume this is just a projection?