Transaction Management - Durability

Overview

  • Data for databases are stored on disk

  • Data is brought to memory to be modified

  • Until the changes made are written to disk, they are not permanent:

    • A power loss will loose any data changed in memory but not yet written to disk

  • Due to atomicity, if a transaction aborts we must UNDO all changes by that transaction

    • A second problem is that if a transaction changed data and this change is written to disk, we must be able to find out what it did and change it back

  • Algorithms for ensuring durability deal with these two problems:

    • How not to loose data changed by committed transactions

    • How not to loose information about changes written to disk by uncommitted transactions

Logging

  • Logs deal with keeping track of changes made by transactions.

  • A log is a system table continuously updated as transactions execute (new tuples/records are appended)

  • Each log entry has a sequential number, LSN: log sequence number

  • Log is created in memory but periodically written to disk completely, i.e. flushed to disk

  • Log entries are created for:

    • Changes to transactions

    • Transaction states: commit, abort

    • Recovery steps from a crash: undo of log entries

    • Checkpoints: a snapshot of the active transactions

  • Each entry has an LSN, continuously increasing number

What happens in a crash when log is used incorrectly?

  • Let’s see a scenario:

    Suppose T1 modified data pages P1 and P2 in memory
    
    We want to write both P1 and P2 to disk, then we will allow
    T1 to commit.
    
    However after writing P1 to disk there is a crash. We loose
    all state data in memory including the portion of the log
    in memory.
    
    T1 did not finish, so we must undo its changes by reading
    T1 back from disk.
    
    If the log containing the changes to page P1 by T1 is not
    on disk, we are in trouble. We cannot UNDO.
    
  • To make sure that such a situation does not cause a problem, we will employ a method called WRITE AHEAD LOGGING (WAL).

Write Ahead Logging (WAL)

  • Write ahead logging (WAL) is a method to ensure that if a data page is modified on disk, we have a log record for it on disk.

  • To accomplish this, the log is always written to disk ahead of the data.

  • Write ahead logging:

    Before writing any data page modified from memory to disk:
    
        First, flush all the log records currently in memory
        including the information about what was changed in this
        data page.
    
        After log is written, the data pages can be written
        to disk.
    
  • Example log entry:

    Log entries (on disk)

    LSN

    Log Entry

    PrevLSN

    1234

    Update: T2 PageY 21 25

    1235

    Update: T1 PageX 12 15

    1236

    Commit: T2

    1234

    1237

    Update: T1 PageY 25 31

    1235

    1238

    Commit: T1

    1237

  • PrevLSN is the LSN number of the last change by the same transaction before the current one (used to trace back a transaction steps).

  • Each update has the previous and next value for the data page:

    T2 PageY  21 25
    
    Page Y: old value: 21, new value: 25
    

    This is a simplification of the actual log entry.

    In reality, we only write the part of the data paged changed.

    T2 PageY startoffset-in-page  oldvalue newvalue
    
  • Data pages contain the LSN of the last change to that page in their header.

    Data pages (on disk)

    Page number

    LSN

    Page Contents

    PageX

    1235

    15

    PageY

    1237

    31

  • This allows us to see whether a change recorded in log is already written to disk or not.

  • All transaction management systems must use WAL to make sure that a change to a data page written to disk can be traced and undone is possible.

  • We will assume WAL is always used in all the discussion below.

Force and Steal

  • Whenever a transaction aborts, we must UNDO all the changes it has already made, by reversing the updates.

  • Whenever there is a crash:

    • First, we must UNDO all the changes by all transactions that have not completed.

    • Second, we must REDO the changes made by committed transactions to make sure that they are not lost.

    • These actions are called recovery.

  • In both cases, what must be done for REDO and UNDO will ultimately depend on the rules we employ in executing transactions, namely force and steal.

    • Force has to do with whether we force data pages to disk when a transaction commits.

    • Steal has to do with whether memory pages are dedicated to a single transaction or can be shared (or stolen) by others.

    • We will see how to do both properly and their impact on recovery.

Force

  • Transaction management systems may employ force:

    Using force means that whenever a transaction
    wants to commit:
    
       1. flush the log to disk
    
       2. all the pages modified by that transaction
          are written to disk
    
       3. write commit record to log, flush the log to disk
    
       4. allow transaction to commit
    
  • Advantage of force is that if a transaction is committed, then we know all the pages by the transaction are written to disk.

    No data by a committed transaction is lost.

  • Disadvantage of force is that writing pages may require many seeks. Other transactions may also use the pages once they are in memory improving performance if force was not employed.

  • In case of crash:

    • If we see a commit record, we know that all changes by the transaction are written to disk already. No need to REDO.

    • If we do not see a commit record, some data pages may actually be written to disk while we were trying to commit. So, we need to UNDO those changes.

No force

  • For comparison, let’s see how NO FORCE will work.

    Using force means that whenever a transaction
    wants to commit:
    
       1. write the commit record and flush the log to disk
    
       2. allow the transaction to commit
    
          pages modified by the transaction will be
          written to disk by the operating system
          as needed by other transactions
    
  • The advantage is that we can improve performance by allowing commits without forcing lots of disk accesses.

  • Log contains all actions by a committed transaction, so we can make sure no data is lost even in case of crash.

    However, recovery may not involve both an UNDO and a REDO. We will see such a recovery algorithm below.

No steal

  • If a system does not use steal, then specific memory pages allocated to a transaction remain allocated to the same transaction until it commits:

    Using no steal means that if a transaction has modified
    a page, it must be kept in memory (not written to disk)
    until the transaction decided to commit.
    
  • Disadvantage of no steal is that memory use is not optimal.

  • Advantage of no steal is the simplicity of recovery. In case of a crash, any transaction that is not yet committed according to the log has not written any data to disk.

    No need to UNDO.

    There may be a need to REDO.

Steal

  • Using steal means the reverse:

    Using steal means that a transaction can steal the memory
    block allocated to a transaction by
    
        First writing the log due to Write Ahead Logging
    
        Then, writing the modified page to disk to free
        up memory for this transaction.
    
  • Advantage of steal is that memory is used more efficiently, allowing transactions to use more or less memory depending on need of different operations.

  • Disadvantage of steal is that if there is a crash, we need to UNDO any pages modified by uncommitted transactions (i.e. dirty pages) that were written to disk due to steal. Hence both UNDO and REDO are needed.

Recovery from a crash

  • ARIES series of algorithms provide safe recovery from a crash.

  • Often recovery occurs after a catastrophic event that causes loss of all state information.

  • To recover, we must find out the state of the database just before crash based on the portion of the log on disk. The first step of recovery is the “analysis step”.

  • The analysis step will read log from the beginning all the way to its end to find all transactions that have ended and all transactions that were still in progress.

    • To simplify analysis, we can take period snapshots of the database state called checkpoints.

    • The analysis starts from the latest checkpoint.

  • Based on the analysis, we find two things:

    • All pages modified by committed transactions that may not have been written to disk. All these changes must be redone.

      If force is used, there is no need to REDO.

    • All transactions that were still executing at the time of crash. The changes by these transactions must be undone.

      If steal is not used, there is no need UNDO.

Checkpointing

  • A checkpoint is a snapshot of the database state written to the log.

  • Checkpoints store two main types of information:

    • Transaction table: all transactions that are still executing at the time of checkpoint.

      For each transaction, we store its id and the LSN of the last action by the transaction.

    • Dirty page table: list of all pages modified in memory and were not written back to disk at the time of checkpoint.

      For each dirty page, we store the page id and the LSN of the earliest change that was not written to disk.

  • When log is flushed to disk, checkpoints are also written to disk.

  • During recovery, we will start from the last checkpoint for analysis.

  • Example:

    Log checkpoint example

    LSN

    Log Entry

    1001

    begin checkpoint

    1002

    Transaction table

    1003

    T1 991, T2 995

    1004

    Dirty page table

    1005

    P1 981, P4 987

    1006

    end checkpoint

  • Note in reality, checkpoints can span many log entries.

    It may take time to write a checkpoint, so it is possible to use a fuzzy checkpoint that will allow transactions to continue while checkpoint is being written.

Data Used During Recovery

  • Let’s recap what data is available during recovery

  • Log records contain data about:

    • Transaction actions:

      update of pages

      commit of transactions

      abort of transactions

      end of transactions

  • We keep track of when a committed or aborted transaction is completely finished with an END record.

    • for abort, all changes are undone (in memory)

    • for commit, log has been flushed.

    Only at this point, the transaction is notified that it has ended.

    • Recovery actions: log records for these are called Compensation Log Record (CLR)

      undo of updates

    • Checkpoints:

      transaction table

      dirty page table

  • Data pages contain information about

    • LSN of the last update that changed that page

ARIES Recovery Algorithm

  • The algorithm consists of three phases:

    • Analysis phase

    • REDO phase

    • UNDO phase

  • Remember that DPT (dirty page table) stores pairs of the form (PX, LX) where PX is a page number and LX is the LSN number of the log entry for the first update to PX that has not been written to disk yet.

  • TT (transaction table) stores pairs of the form (TX,LX) where TX is a transaction that is still active, and LX is the LSN number of the log entry for the last operation performed by LX.

  • I summarize the operations performed at each step below.

Analysis Phase

  • The main point of analysis is to find at the time of crash which pages may be dirty and which transactions may still be executing.

  • We simply trace the log starting with the checkpoint:

    • Find the last LSN for all transactions we find

    • Remove committed transactions

    • Record all new potentially dirty pages (and earliest potentially unrecorded change for each page)

  • Any transaction that is not committed at the end of analysis is assumed to be incomplete and must be aborted. However, UNDO step comes first, then we will first REDO.

  • Analysis algorithm:

    Read last checkpoint entry
    
    Initialize the DPT (dirty page table) and TT (transaction table) to
    the recorded checkpoint entries
    
    set NEXT_LSN to the last LSN for checkpoint
    
    while the end of log is not reached
        read the next log record pointed by NEXT_LSN into LOG_RECORD
        if LOG_RECORD is an update: (TX updates PG)
           put (TX, NEXT_LSN) into TT
           ## or modify the LSN for TX if it is already in TT
           if PG  is not in DPT then
               add (PG, NEXT_LSN) to DPT
        else if LOG_RECORD is a CLR: (CLR: undo [TX update PG])
            put (TX, NEXT_LSN) into TT
            ## or modify the LSN for TX if it is already in TT
            if PG  is not in DPT then
                add (PG, NEXT_LSN)  to DPT
        else if LOG_RECORD is: abort TX
            mark TX  in TT  as aborted
            change the LSN to NEXT_LSN
        else if LOG_RECORD is: commit TX
            mark TX  in TT  as committed
            change the LSN to NEXT_LSN
        else if LOG_RECORD is: end TX
            remove TX  from TT
        else
            ignore the log record
        advance to the next log record  ##set NEXT_LSN to NEXT_LSN+1
    

REDO Phase

  • The point of REDO is to bring the database to the same state at the time of crash. What we really care is making sure the changes by committed transactions are recorded.

    • Depending on the underlying concurrency scheme, we can REDO only the changes by committed transactions.

  • Redo step will read each data page that is potentially dirty and if its pageLSN is smaller than the LSN of the log record, that means this change is not yet recorded to disk and we must REDO.

  • Redo will simply make the NEW value of the change the current value.

    Update TX PY 10 12
    

    means that TX changes page PY from 10 to 12, so we must REDO to change the page to 12 if this change is not yet recorded.

  • As in the transaction management system, UNDO/REDO changes are kept in memory or forced to disk at commit depending on whether force/steal are used.

    The algorithm works even in the case of repeated crashes as long as write ahead logging is used.

  • REDO proceeds in forward log order, from earlier records to later records.

  • Redo algorithm

    assume DPT and TT are computed by the above Analysis Phase
    set NEXT_LSN  to the lowest LSN number in DPT
    ## earliest change to a dirty page that may not have been recorded
    
    while the end of log is not reached
        read the next log record pointed by NEXT_LSN into LOG_RECORD
        if LOG_RECORD is an update: (TX updates PG) for a committed transaction
            call function REDO_RECORD(LOG_RECORD,NEXT_LSN)
        else if LOG_RECORD is a CLR: (CLR: undo TX updates PG) for a committed transaction
            call function  REDO_RECORD(LOG_RECORD,NEXT_LSN)
        else
            ignore the log record
        advance to the next log record ## set NEXT_LSN  to NEXT_LSN+1
    for all transaction TX in TT with status committed
        write an end TX log record
        remove TX from TT
    
    
    Subroutine REDO_RECORD(LOG_RECORD,NEXT_LSN)
    ## the record LOG_RECORD is to be redone at log number NEXT_LSN
    
    if LOG_RECORD is an update: (TX update PG)
       if PG is not in DPT then
           ignore ##this change has already been recorded
       else
           find the record (PG, DPT_LSN) in DPT for this page
           if  NEXT_LSN < DPT_LSN then
              ignore ##this change has already been recorded
           else
              read PG into memory and find its pageLSN
              if  NEXT_LSN <= PG.pageLSN then
                  ignore ## this update has already been recorded
              else
                  REDO the update [TX updates PG]
    
    else if LOG_RECORD is a CLR: (CLR: undo TX updates PG)
        ##do the same as above, except REDO the undo
        if PG is not in DPT then
            ignore  ##this change has already been recorded
        else
            find the record (PG, DPT_LSN)  in DPT for this page
            if  NEXT_LSN < DPT_LSN then
                ignore
            else
                read PG  into memory, find its pageLSN
                if  NEXT_LSN  <= PG.pageLSN  then
                    ignore
                else
                    UNDO the update [TX updates PG])
                    /* hence redo the CLR */
    
    end of subroutine REDO_RECORD
    

UNDO Phase

  • The point of undo is to erase changes made by aborted transactions.

  • Undo will read data pages modified by the transaction to check if the change by a log entry is recorded on disk.

  • Undo will simply make the OLD value of the change the current value.

    Update TX PY 10 12
    

    means that TX changes page PY from 10 to 12, so we must UNDO to change it back to 10.

  • Similar to redo, the undo is made in memory. Pages changed by an UNDO are written back based on the force/steal protocol.

    As long as we follow the write ahead logging, we can recover from repeated crashes.

  • Undo proceeds in backward order, each change must be changed in reverse.

    Example:

    Update TX PY 10 12
    Update TX PY 12 15
    

    to undo we must first change PY from 15 to 12, then from 12 to 10. Hence, we will trace the log in reverse order.

  • As we undo multiple transactions, we will find the largest LSN to undo for each transaction.

    • We will pick the largest to undo and then add the next LSN to undo to a list.

    • We will continue picking the largest until no operations are left.

      Hence we do not trace a single transaction back, but all transactions at the same time.

  • Undo algorithm:

    Assume the analysis and redo phases are completed successfully
    Set the set TO_UNDO  to empty
    For all active transactions (TX, LSNX) in TT,
       add LSNX   to set TO_UNDO
       write a log record (abort TX)
    
    While TO_UNDO  is not empty
       remove the largest LSN number UNDO_LSN  from TO_UNDO
       find LOG_RECORD corresponding to UNDO_LSN
       if LOG_RECORD  is an update record of the form [TX updates PG]
          undo the update to PG  (in memory)
          write a CLR record: CLR: undo of record [TX updates PG])
          find the previous operation for TX (follow previous lsn pointer)
          if prevlsn is not nil
              add it to TO_UNDO
          if prevlsn is nil
              write an end record for this transaction (i.e. TX)
       else if LOG_RECORD  is a CLR record
           find previous undo
           if previous undo is not nil
               add it to TO_UNDO
           if prevlsn is nil
               write an end record for this transaction (i.e. TX)
       else find the prevlsn for the same transaction
           if it is not nil
                add it to TO_UNDO
           if prevlsn is nil
                 write an end record for this transaction (i.e. TX)
    

Example ARIES recovery

  • Suppose after a crash, we find the following information in the log on disk (we only show the relevant part of the log):

    Log entries (on disk)

    LSN

    Log Entry

    PrevLSN

    994

    Update: TA P6 10 15

    995

    Update: TA P5 ZZ H

    994

    996

    begin checkpoint

    997

    TT: TA 995

    998

    DPT: P6 994

    999

    end checkpoint

    1000

    Commit: TA

    996

    1001

    Update: T1 P1 A B

    1002

    Update: T1 P2 C D

    1001

    1003

    Update: T2 P3 E F

    1004

    Update: T2 P4 F G

    1003

    1005

    Update: T3 P5 H I

    1006

    Update: T4 P6 15 22

    1007

    Commit T4

    1006

    1008

    Update: T2 P6 K L

    1004

    1009

    Commit T1

    1002

    1010

    Update T3 P2 D E

    1005

  • Assume also the following is the contents of the data pages at the time crash.

    Data page contents (on disk)

    pageid

    pageLSN

    content

    P1

    1001

    B

    P2

    1010

    E

    P3

    980

    E

    P4

    1004

    G

    P5

    996

    H

    P6

    994

    15

  • Based on this log, we can conclude that there is no force used in this DBMS system.

    T4 is commited, but change at LSN=1006 to P6 is not written to disk.

  • Based on this log, we can conclude that there is steal used in this DBMS system.

    T3 is not yet committed, but its changes at LSN=1010 to P2 are written to disk.

  • We can now trace each step of the algorithm based on this information.

  • Analysis:

    Start at checkpoint (LSN: 996), initialize TT and DPT
    
    LSN      State info
    996      TT: TA 995  DPT: P6 994
    1000     TT:         DPT: P6 994
    1001     TT: T1 1001 DPT: P6 994, P1 1001
    1002     TT: T1 1002 DPT: P6 994, P1 1001, P2 1002
    1003     TT: T1 1002, T2 1003 DPT: P6 994, P1 1001, P2 1002, P3 1003
    1004     TT: T1 1002, T2 1004 DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004
    1005     TT: T1 1002, T2 1004, T3 1005
             DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004, P5 1005
    1006     TT: T1 1002, T2 1004, T3 1005, T4 1006
             DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004, P5 1005
    1007     TT: T1 1002, T2 1004, T3 1005
             DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004, P5 1005
    1008     TT: T1 1002, T2 1008, T3 1005
             DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004, P5 1005
    1009     TT: T2 1008, T3 1005
             DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004, P5 1005
    1010     TT: T2 1008, T3 1010
             DPT: P6 994, P1 1001, P2 1002, P3 1003, P4 1004, P5 1005
    
     Abort transactions: T2 and T3
     write abort log records
    
     LSN    Log Entry
     1011   abort T2
     1012   abort T3
    
  • Now, redo phase, we will go through every single log entry starting with the earliest LSN in the recovered DPT, go forward and redo the actions of all committed transactions.

    We will need to read each potentially dirty page. We will only REDO an entry if it is an update to a page in DPT that was not yet written to disk.

    Start at 994
    
    Skip the following LSNs:
    
    995 -- LSN for P5 in DPT is higher than 995
    1003, 1004, 1005, 1008, 1010 -- aborting transactions
    
    Test for REDO the following LSNs:
    
    test 994: Read P6, pageLSN 994, already written, no need to REDO
    test 1001: Read P1, pageLSN 1001, already written, no need to REDO
    test 1002: Read P2, pageLSN 1010, already written, no need to REDO
    test 1006: P6 already in memory, but pageLSN=994, must REDO this update
    
    REDO 1006
    
  • Next, we will do UNDO all the actions of aborting transactions T2 and T3 in backwards order.

    For each action, we will check if the action has been written to disk. If so, we will UNDO by changing the disk entry.

    Aborting T2 and T3, put the last LSN for each into
    the TO_UNDO set.
    
    TO_UNDO = {1010, 1008}
    
    Write: UNDO 1010 log record
    
       P2 is already in memory with pageLSN = 1010. We must
       undo P2 contents and change it to D.
    
       PrevLSN = 1005 is added to TO_UNDO.
    
    TO_UNDO = {1005, 1008}
    
    Write: UNDO 1008 log record
    
       P6 is already in memory, but pageLSN=1006. So, no need to change
       the data page content.
    
       PrevLSN = 1004 is added to TO_UNDO.
    
    TO_UNDO = {1005, 1004}
    
    Write: UNDO 1005 log record
    
       Read P5 into memory, pageLSN=996. So, no need to change
       the data page content as this update was never written to disk.
    
       PrevLSN = nil.
    
    TO_UNDO = {1004}
    
    Write: UNDO 1004 log record
    
       Read P4 into memory, pageLSN=1004. So, we need to reverse
       this change (from G back to F).
    
       PrevLSN = 1003 is added to TO_UNDO.
    
    TO_UNDO = {1003}
    
    Write: UNDO 1003 log record
    
       Read P3 into memory, pageLSN=980. So, no need to change
       the data page content as this update was never written to disk.
    
       PrevLSN = nil.
    
    TO_UNDO = {}
    
       Recovery is complete.
    
       Write END log records for aborted transactions.