Database Selection & Design (Part III)
— Internals of ACID implementation —
In the previous part, we saw about different types of databases and their properties. In this section, let us look at what a transaction is, how it is written to the memory, logged, and stored in disk.
What is a Transaction? People often confuse individual database statements to transactions. A transaction is a set of database statements/operations that is deemed as a logical unit of work for any function, also defined as individual indivisible operations.
Moving parts of a Hard Disk: This is an electro-mechanical data storage device that can store data in a stable (non-volatile) storage. This means that the data storage will be permanent and will not be lost even when the power is off. The disk is integrated with a spindle and the head stack assembly. The spindle helps with the disk rotation and the head movement helps with the track movement.
Data storage in disk: Let’s also take a look how the data is stored in the disk in the form of sectors and tracks.
In the diagram above, the triangular part is a sector and the circular part is track. The convergence area of these two is called as a block.
What is a Block? A block is the smallest unit of data that an operating system can either write to a file or read from a file. Offset comes into place when you want to read a particular byte.
Byte = Block Address (Sector # and Track #) + Offset
In a hard disk, by spindle movement, the sectors are changed & by head post movement, your tracks are changed.
The data you will be storing on to your database will be stored in one or more blocks based on the size of your tuple (row) and your disk block. When you have multiple rows in your table, there might be multiple blocks used to store your information based on your block size and row size. For example: You have an order service table defined as below:
This means, each row in your table is going to take 128b. Now, if you have a block size of 512b, each block can store 512/128 = 4 rows. Assume you have 100 rows in your table, you need 100/4 = 25 blocks to store your entire table.
Is block the only unit of work that is maintained across DBMS and data structure? What are the other units we have and why?
Page: Pages are used by some operating systems instead of blocks. A page is basically a virtual block, and they have fixed sizes. Pages are typically 512 to 8192 bytes, with 4096 being a typical value. They are stored in the main memory. Pages are used instead of blocks because they make processing easier when there are many storage devices, because each device may support a different block size. With pages the operating system can deal with just a fixed size page, rather than try to figure out how to deal with blocks that are all different sizes. Pages act as sort of a middleman between operating systems and hardware drivers, which translate the pages to the appropriate blocks. Both pages and blocks are used as a unit of data storage.
Where are these pages stored, how do they interact with the disk blocks, who handles the creation and management of these pages?
Buffer pool: A buffer pool is an area of main memory that has been allocated by the database manager for the purpose of caching table and index data as it is read from disk. It is an intermediate layer between the basic file system and the tuple-oriented file system. Its task is to make the pages addressable in main memory and to coordinate the writing of pages to disk.
“What happens when the buffer pool is full?, How does the data from buffer pool gets moved to disk and what are the policies that govern them?”
Buffer Manager Policies: Stealing & Forcing policies determine what we can assume about the relative consistency of the log and the database with respect to transaction status.
Stealing: When a transaction wants to perform an operation, but the memory pool is full, it needs to clear its memory by kicking some transactions out of memory pool. This can be dangerous because the transaction being kicked out might not been fully committed, thereby moving dirty pages to disk.
Forcing: Means that every time a transaction commits, all the affected pages will be pushed to stable storage. This policy provides durability, but is inefficient, because each page may be written by many transactions and will slow the system down.
There are four variations to this:
- STEAL: All transactions have their own dedicated memory block. When steal happens, pages in the stolen memory block can be written out to disk, even if that page is dirty (having uncommitted operations within a transaction).
- NO-STEAL: In this case, all dirty pages are retained in the buffer pool until the final outcome of the transaction has been determined. In case of memory full situation, the OS level flush kicks in.
- FORCE: Make sure that every update is on disk before commit. This policy has a potential for dirty read between the write-to-disk and commit operations.
- NO-FORCE: Allow transaction to commit to disk as per operating system rules. In this case, since the pages are not forced back to disk every time, other transactions can make use of the pages for their reads and writes.
With the above variations, the combination of STEAL and NO-FORCE is the fastest and efficient, as the transactions are not sent to disk and memory is efficiently used across transactions. The opposite is the NO-STEAL and FORCE.
Example #1: In the example below, we have a transaction that is committed to the buffer pool, with NO-FORCE, and the transaction is never received by the stable storage. This causes durability issues as we have lost a transaction which was committed to the application / customer. In this case, the transaction has to be performed again (REDO) to maintain the durability of the system.
Example #2: In the example below, we have a transaction that is in progress in the buffer pool. With STEAL policy, the transaction could be flushed to the stable storage. If for some reason, the transaction issues a rollback, the dirty flush in the disk needs to be rolled back (UNDO) to maintain the Atomicity of the transaction
But, how will the database know how to UNDO and REDO? Will these buffer management policies take care of the failure and recovery situations? What happens at the time of a crash and what data is used to roll back or roll forward?
Need for logging: The above scenarios causes serious damage to the relational properties, causing atomic and durability issues. To avoid that, log files were introduced in between these actions. This concept is called “Write Ahead Log” (WAL). In WAL, the writes are recorded in the log files, flushed to stable storage before the commit is written to the buffer pool
What Information does a log contain? Log records may contain redo and undo information. A log record containing both is called an undo-redo log record, there may also be undo-only log records and redo-only log records. There are two ways you can do logging within WAL. Undo & Redo logging are the options to recover from any failures between operations within a transaction.
- UNDO record contains the information needed to reverse a change made by a transaction (in the event of rollback). Undo logs are collections of records that capture all actions (operations performed on a tuple) to the transactions BEFORE they are committed.
- REDO record contains the information to redo a change made by a transaction (if they have been lost). Redo logs are collection of records which captures all actions (operations performed on a tuple) to the transactions AFTER they are committed.
How does these recovery options (UNDO/REDO logging) correlate with the buffer management policies (STEAL/FORCE)?
NO-FORCE / NO-STEAL:
- No-Force means: Pages are committed without flushing the data to the disk. No-Steal means: The pages are never stolen and dirty data are never written to the disk
- If failure occurs, the committed transactions needs to be rolled forward (REDO) & UNDO is not required as there are no dirty pages in the disk. Hence this quadrant shows No-UNDO / REDO
STEAL / NO-FORCE:
- No-Force means: Pages are committed without flushing the data to the disk. STEAL means: The pages are stolen and dirty data are written to the disk
- If failure occurs, the committed changes need to be rolled forward (REDO) to the database & the dirty writes needs to be rolled back (UNDO). Hence this quadrant shows UNDO / REDO
NO-STEAL / FORCE:
- Force means: Pages are committed and the data is flushed to the disk. No-Steal means: The pages are never stolen and dirty data are never written to the disk
- If failure occurs, REDO is not required as every commit is flushed into the disk & UNDO is not required as there are no dirty pages in the disk. Hence this quadrant shows No-UNDO / REDO
STEAL / FORCE:
- Force means: Pages are committed and the data is flushed to the disk. Steal means: The pages are stolen and dirty data are written to the disk
- If failure occurs, REDO is not required as every commit is flushed into the disk & UNDO is required as there are dirty pages in the disk. Hence this quadrant shows UNDO / No-REDO
Comparison as shown below:
The concept of how the internals work is very critical to understand the properties of how database work on the data structure and model level. The above details are something you might not be using or discussing with your teams and architects on a regular basis, but is important to keep in mind when making your choice of data storage. So, please understand the internals of any property you review as part of the whole series.
Happy Reading !