tech ramblings

Why Databases Write Ahead

If you've ever been curious to go look under the hood of any relational database, you may have come across the Write Ahead Log (WAL). It's known by different names in different databases - transaction log, binlog, redo log, etc. but essentially, it is a log of events happening in the database. But why is this useful? Why do databases maintain such a log? And why do they write-ahead? The short answer is - durability(remember aciD?), without compromising on performance. For the long answer, keep reading :)

Durability

First let's define what this is. Durability is the guarantee that once a transaction has been committed, it will remain committed even in the case of a system failure. Basically, once the database acknowledges the completion of a query, no event should cause that data to be lost.

Life without WAL.

Let's imagine how databases would function without a WAL. We issue an insert/update query, and the table and the corresponding index data structures are updated on disk. Note that it has to be a synchronous update, not an asynchronous one, because well, durability...

The problem: Updating the on-disk data structures is costly, performance-wise. Firstly, finding the OS page where the record/index entry is stored translates to random IO. Secondly, to ensure 100% durability, one would have to ensure that the file changes are actually synced to disk, and not just living in the OS/disk cache, which is again, not free of cost. See fsync!

Enter, WAL.

Ok, so how can we fix the above problem? The bottleneck is writing to disk, but we can't avoid that by just writing to memory That would lead to a potential data loss.

Maybe, we can make the disk operation a little faster? How about when we issue an insert/update query, we make the changes to the in-memory table/index data structures, but as opposed to flushing them to disk, we let them be in-memory. And instead, we write an event containing all the changes made by the query to a "write-ahead log". And this log entry is flushed (and synced) to disk synchronously. The writes to the on-disk table/index data structures, however, happens asynchronously. Why is this better? Because this write operation is both simpler and faster. Simpler because it doesn't involve multiple writes for table, index, etc, it is just a single entry containing all the changes to be made. Faster, because it is sequential IO. Every write is an append to the log file. Therefore, we can afford to sync the writes to disk for every query.

Durability, without compromising on performance.

So, there we have it - durability without compromising on performance! A fast, sequential write of a log entry to disk, and if there is any failure and the in-memory changes are lost, the database can still be brought back up to date by replaying the lost events from the durable WAL!

The WAL also has other benefits. For instance, it enables replication (a replica database instance can continually consume the log file of the primary and apply each event). Furthermore, it allows for recovery of the database to a specific point back in time (we only replay events from the WAL upto a certain point)!

For more:

  • https://www.youtube.com/watch?v=wI4hKwl1Cn4
  • https://www.youtube.com/watch?v=feTihjJJs3g