# Query Processing

## Overview

- Databases implement relational algebra operators as their basic
  units of operation. The implemented operators are the bag versions
  of the relational algebra, to stay consistent with SQL, plus a
  group-by operation.

- A logical query plan is represented as a query tree where the
  relations are leaves of the tree and the algebra operators are
  intermediate nodes.

- A single algebra operation can have many different
  implementations. For example, joins can be implemented using
  sort-merge-join, hash-based-join, block-nested-join, etc.

- The physical implementation plan of a query is a query tree
  annotated with additional information: the amount of memory
  reserved for each node and the specific implementation chosen for
  each operation.

  We will use M to denote the number of blocks available to a query
  or operation.

- Given a query, there may be many potential query trees. Each has an
  associated cost that can be estimated before the query is executed.

  The dominant cost is always the total number of disk pages read and
  written to execute the query.

- Note that the CPU complexity of a query (i.e. the complexity of the
  algorithm once the data is in memory) is disregarded for the
  following reasons:

  - The cost of reading/writing data is much higher than any operation
    in memory, so the dominating cost is disk access.
  - We assume each algorithm is implemented as efficiently as possible.
  - It may not even be possible to bring all the data for a query in
    memory at once, which may result in the data being read multiple
    times.

## Disk Access Process (Overly Simplified)

- Remember that to process any data, it must first be brought into memory.

- Some DBMS component indicates it wants to read record R

  - File Manager:

    Does security check

    Uses access structures to determine the page it is on

    Asks the buffer manager to find that page

  - Buffer Manager

    Checks to see if the page is already in the buffer

    If so, gives the buffer address to the requestor

    If not, allocates a buffer frame

    Asks the Disk Manager to get the page

  - Disk Manager

    Determines the physical address(es) of the page

    Asks the disk controller to get the appropriate block of data from
    the physical address

    Disk controller instructs disk driver to do the dirty job


## Example query plans

- Let us look at a simple query:

  ```
  SELECT R.C,S.E
  FROM R,S
  WHERE R.B=S.D and R.A=10 ;
  ```

  and two potential logical query trees:


```{image} query_execution_images/query_plans.png
   :align: center
   :width: 600px
```

- In plan A, we compute the join of R and S first. Then, the selection
  and projection can be computed over the results of the join.

- In plan B, we first select the tuples of R that satisfy the
  condition R.A=10. The resulting tuples are then joined with all the
  tuples in S, and the projection is applied to the resulting tuples.

## Iterator Interface

- Each operator in the database is implemented using three main functions:

  - open() initializes the necessary memory structures (i.e. M
    buffers) and/or streams
  - getNext() reads data input streams and processes the data until a
    block full of output is produced or the input is completely
    processed, puts the output to the output buffer
  - close() frees all the structures used by the operator

- Since each operator works the same way, we can chain up the
  operators by using the input buffer for an operation as the output
  buffer for the operation below.

  - If this is the last operation in the query tree, then the output
    buffer is simply the standard output to the user executing the
    query.

- Certain operations are easy to combine with the previous operation
  without any additional cost.  These are identified as
  **"on-the-fly"**. For example, if the tuples are already in memory,
  any selection and projection operation can be performed in memory.

- Certain operations are **blocking**, i.e. these operations have to
  be completed before any tuples can be processed by the operations
  above. For example, sorting and group by are blocking operations
  unless the relation is already sorted by the relevant attributes.

- If the operations are not blocking, then the output of one operation
  is pipelined into the input of the next operator immediately,
  producing some output tuples before the query has completed.
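The open()/getNext()/close() interface above can be sketched in
Python; the class names, block layout, and driver loop here are
illustrative, not taken from any real DBMS:

```python
# Minimal sketch of the iterator interface: each operator exposes
# open(), get_next() and close(); blocks are modeled as Python lists.

class Scan:
    """Leaf operator: streams the stored blocks of a relation."""
    def __init__(self, blocks):
        self.blocks = blocks

    def open(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.blocks):
            return None                  # end of stream
        block = self.blocks[self.pos]
        self.pos += 1
        return block

    def close(self):
        self.blocks = None


class Select:
    """On-the-fly selection: filters blocks produced by its child."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def get_next(self):
        while True:
            block = self.child.get_next()
            if block is None:
                return None              # propagate end of stream
            out = [t for t in block if self.predicate(t)]
            if out:
                return out               # one block of output

    def close(self):
        self.child.close()


# Chaining: Select consumes Scan's output buffer as its input.
plan = Select(Scan([[1, 5, 12], [7, 3]]), lambda t: t > 4)
plan.open()
result = []
while (block := plan.get_next()) is not None:
    result.extend(block)
plan.close()
print(result)  # [5, 12, 7]
```

Because Select emits a block as soon as its child produces one, the
two operators are pipelined in exactly the sense described above.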

## Example Execution

- Suppose we are processing the above query

  ```
  SELECT R.C,S.E
  FROM R,S
  WHERE R.B=S.D and R.A=10 ;
  ```

  by mapping the logical query trees above to more explicit physical
  implementations:


```{image} query_execution_images/query_plans_physical.png
   :align: center
   :width: 600px
```

- In Plan A 1: To prepare the join operation, we ready R for a scan to
  fill the memory blocks for the block-nested-loop join. Once
  sufficient blocks are filled, we start scanning S to compute the
  join.

  - We add a projection to the sequential scan of R to keep only the
    attributes needed by the query, which reduces the size of the
    relation in memory and may reduce the cost of the join (if we end
    up reading S fewer times). This projection is completed in memory
    as tuples are read and has no additional cost.
    
  - The next() operation for the join will output tuples to the
    selection operation as soon as joining tuples are found. Hence
    the join and the selection are pipelined; neither is blocking.

  - The selection operation reads the tuples output by the join,
    removes tuples that do not satisfy the selection condition and
    projects the requested attributes. These operations are performed
    completely in memory and have no additional disk I/O cost.

- In Plan B 1: Since we decided to do the selection first, we can use
  an index for the selection.

  - As tuples matching the index search condition are found and read
    from the disk, we can project the necessary attributes and fill
    the input buffer for the block-nested-loop join as before.

  - We have the index search operation being pipelined into the join,
    which is combined with the last projection operation. Because of
    this, there is no additional cost of reading the relation R in the
    join. The cost of reading R is included in the index scan.

## Query Execution Costs

- Often the cost of a query is the sum of the costs of its individual
  operations. However, additional costs arise if intermediate results
  must be written to disk.

```{image} query_execution_images/query_plans_physical_2.png
   :align: center
   :width: 600px
```

- Let's assume the following statistics for these relations.

   ```
   PAGES(R) = 1,000
   PAGES(S) = 5,000
   PAGES( PROJECT_(R.A,R.B,R.C) (R) ) = 500
   PAGES( PROJECT_(S.D,S.E) (S) ) = 800

   PAGES( PROJECT_(R.B,R.C) (SELECT_(R.A=10) (R)) ) = 15

   Index on R(A): 200 leaf nodes
   Selectivity (R.A=10) = 1/100
   TUPLES(R) = 20,000
   ```

- Plan A 1: The main cost is the cost of join because the remaining
  selection and projection are performed on the fly.

  - Block-nested-loop-join: We read R once, 1,000 pages.

  - Since we do the projection while reading R, we only need 500 pages
    of space to keep R in memory, which requires us to read S
    500/20 = 25 times (with M = 20 blocks per pass).

  - Total cost: 1000 + 25*5000 = 126,000 pages.

- Plan A 2: We will fully sort both R and S, and then compute the
  join.

  - Sort R, read R fully (1,000 pages), project attributes A,B,C and
    write 500 pages worth of data using M=20 blocks, which is
    500/20=25 partially sorted groups. (Cost = 1,500 pages)
    
  - Merge 25 groups into 2 groups using M=20 pages, and 2 groups into
    1 group. Read/write 500 pages 2 times. (Cost = 500*4=2,000 pages)

  - Sort S: Read S fully (5,000 pages), project attributes D,E and
    write 800 pages worth of data using M=20 blocks into 800/20=40
    partially sorted groups. (Cost = 5,800 pages)

  - Merge 40 groups into 2 using M=20 pages and 2 groups into 1
    group. Read/write 800 pages 2 times. (Cost = 800*4=3,200 pages)

  - Assuming the join involves unique attributes R.B and S.D, the
    sort-merge-join can be done by reading sorted R and S only once,
    a total cost of 500+800 = 1,300 pages.

  - Total cost = 1500+2000+5800+3200+1300=13,800.

  - Note that the actual cost is likely much lower as the last sort
    merge steps and the join can be combined, saving 2,600 pages.

- Plan B 1: The index scan reads 200/100 = 2 leaf nodes. The scan will
  find an estimated 20,000/100 = 200 tuples, which requires reading
  200 pages in the worst case. The cost of the index scan is 2+200
  pages (plus 1 or 2 for internal nodes, which we disregard).

  - Block nested loop does not have any additional cost of reading R
    since the R tuples are already in memory after the index scan. We
    simply need to find out how many times S needs to be read for the
    join.
  
  - Since the resulting 200 tuples from R fit in 15 pages (according
    to the statistics above), we can complete the join by reading S
    only once: the join has 20 blocks of memory allocated and all the
    found tuples of R fit in memory completely. Total additional cost
    for the join: 5,000 pages (read S once).

  - Total cost = 202 + 5,000 = 5,202.

- In this example, plan B is the clear winner because the selection on
  R.A is very selective.
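The cost arithmetic above can be reproduced in a few lines of Python;
all figures come from the statistics given earlier, with M = 20
memory blocks (note that 202 + 5,000 = 5,202):

```python
# Cost estimates for the three physical plans, using the statistics
# given above and M = 20 memory blocks.

M = 20
PAGES_R, PAGES_S = 1_000, 5_000
PROJ_R, PROJ_S = 500, 800   # pages after projecting R and S
INDEX_LEAVES = 2            # leaf pages read by the index scan
MATCHING_TUPLES = 200       # estimated tuples with R.A = 10

# Plan A1: block-nested-loop join, reading S once per M-block chunk of R.
plan_a1 = PAGES_R + (PROJ_R // M) * PAGES_S

# Plan A2: sort both projections, then merge-join.
sort_r  = PAGES_R + PROJ_R      # read R, write projected sorted runs
merge_r = 2 * 2 * PROJ_R        # two merge rounds, each a read + write
sort_s  = PAGES_S + PROJ_S
merge_s = 2 * 2 * PROJ_S
join    = PROJ_R + PROJ_S       # unique join attributes: single pass
plan_a2 = sort_r + merge_r + sort_s + merge_s + join

# Plan B1: index scan on R.A, then a single pass over S.
plan_b1 = (INDEX_LEAVES + MATCHING_TUPLES) + PAGES_S

print(plan_a1, plan_a2, plan_b1)  # 126000 13800 5202
```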

## Query operators

- In the rest of the course notes, we go over some basic types of
  operator implementations. Sort and block-nested-loop-join were
  already covered in the previous section.

- Query operators are classified into classes:

  - One pass
  - Two pass
  - Multi-pass

  depending on the availability of memory, storage method of the
  relation (i.e. sortedness for example) and the number of pages it
  occupies on disk.

## One pass algorithms

- These algorithms require one pass over a given relation.

- One easy way to think about this is as follows. Suppose we have a
  unary operation over relation R. If R fits in the memory allocated
  to the query, then we can easily compute the operation in
  memory. However, many operations do not require all of R to be read
  into memory either. This is discussed below for each operation.

## Duplicate removal 

- Given M pages of memory. Let X be 1 page, and Y be M-1 pages in memory.

  ```
  Read R into X, 1 page at a time.

  For each tuple t, check:
     If tuple t is in Y:
        it is already seen, remove t.
     Else
        insert t into Y.
  ```

- getNext() will read one block from Y and output it. The next time it
  is called, we continue processing the input until another block of
  Y is filled.

- This is a one-pass operation only if after duplicate removal, R fits
  in M-1 blocks.

- Not a blocking operation as we can start outputting tuples as long
  as we can mark them so that we do not return the same tuple more
  than once.
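A minimal sketch of this one-pass algorithm, modeling the Y area as a
Python set (a real system would bound it by M-1 buffer blocks):

```python
# One-pass duplicate removal: read R one block at a time, keep every
# distinct tuple seen so far in Y, and output each tuple exactly once.

def one_pass_distinct(blocks):
    seen = set()                 # the Y area (must fit in M-1 blocks)
    for block in blocks:         # X: one input block at a time
        for t in block:
            if t not in seen:    # first occurrence: remember and emit
                seen.add(t)
                yield t          # not blocking: output immediately

result = list(one_pass_distinct([[1, 2, 1], [3, 2, 4]]))
print(result)  # [1, 2, 3, 4]
```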

## Group by

- Given M pages of memory. Let X be 1 page, and Y be M-1 pages in memory.

  ```
  For each group in Y, we are going to keep:
      the grouping attributes
      the aggregate value
         for min, max, count, sum: keep min, max, count,
         sum of the tuples seen so far

         for avg, keep count and sum.


  Read R into X, 1 page at a time.
  For each tuple t, check:
      If the corresponding group for tuple t is in Y:
           update its aggregates.

      Else
            create a new group for t into Y and initialize its statistics.
  ```

- This is possible only if all the results fit in Y (M-1 blocks).
  This is a blocking operation: we cannot output the tuples until we
  finish processing all of R.
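The steps above can be sketched as follows, computing AVG by keeping a
(count, sum) pair per group; the (group_key, value) tuple layout is
illustrative:

```python
# One-pass grouping: Y holds one (count, sum) entry per group.
# Input tuples are (group_key, value) pairs.

def one_pass_avg(blocks):
    groups = {}                        # the Y area: key -> (count, sum)
    for block in blocks:               # X: one input block at a time
        for key, value in block:
            cnt, total = groups.get(key, (0, 0))
            groups[key] = (cnt + 1, total + value)
    # Blocking: results can only be emitted after all of R is read.
    return {key: total / cnt for key, (cnt, total) in groups.items()}

result = one_pass_avg([[("a", 2), ("b", 4)], [("a", 4)]])
print(result)  # {'a': 3.0, 'b': 4.0}
```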

## Set and bag operators

- Bag Union of R and S:

  ```
  Read R one block at a time and output
  Read S one block at a time and output
  ```

- Set Union of R and S:

  ```
  Read R and remove duplicates
  Read S into the same space and continue to remove duplicates
  ```

- Set intersection:

  ```
  Read R, remove duplicates and store into M-1 blocks (section Y).
  Read S.
  If the tuple is in Y
        output and remove from Y
  Else
        discard the tuple
  ```

- Bag Intersection:

  Bag intersection requires that we keep track of how many copies of
  each tuple there are.

  ```
  Read R, group by all attributes and add a count for R.
  Store into M-1 blocks (Y section).

  Read S, for each tuple:
      If is in the Y section
           increment the count for S
      Else
           disregard
  Output min of count of R and S.
  ```

- All set/bag operations are defined similarly. In most cases the
  algorithm is one pass only if the necessary memory is available.

- In general, the cost of one-pass algorithms is PAGES(R) + PAGES(S)
  if R and S are being queried, again assuming sufficient memory is
  available.
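As an illustration, the bag intersection steps above can be sketched
as follows, with in-memory Counters standing in for the Y section:

```python
# One-pass bag intersection: count copies of each tuple in R, count
# matching copies from S, then output the minimum of the two counts.
from collections import Counter

def one_pass_bag_intersection(r_blocks, s_blocks):
    r_counts = Counter()               # Y section: tuple -> count in R
    for block in r_blocks:
        r_counts.update(block)
    s_counts = Counter()
    for block in s_blocks:
        for t in block:
            if t in r_counts:          # tuples not in R are discarded
                s_counts[t] += 1
    out = []
    for t, rc in r_counts.items():
        out.extend([t] * min(rc, s_counts[t]))
    return out

result = one_pass_bag_intersection([[1, 1, 2]], [[1, 2, 2, 3]])
print(result)  # [1, 2]
```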

## External Sorting

- A large number of operators can be executed by an intermediate
  sorting step:

  - DISTINCT
  - ORDER BY
  - GROUP BY
  - UNION/INTERSECTION/DIFFERENCE

- A limited amount of memory is available to the sort operation

  M: the number of memory pages available for the sort operation

  PAGES(R): total number of disk pages for relation R

- If PAGES(R) \<= M, then the relation can be sorted in one pass: read
  the relation into memory and apply any sorting algorithm. The cost
  is PAGES(R) pages.


## Sort based duplicate removal

- When a "distinct" projection is needed, we can do the following:

  ```
  Sort the relation
  Read the relation in sorted order
  For each tuple:
     If it is already seen
        discard
     Else:
        output
  ```

- Need to keep in memory the last seen tuple only, so 1 page is
  sufficient for the operation.
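Since the input arrives in sorted order, only the last tuple seen has
to be remembered; a minimal sketch:

```python
# Duplicate removal over a sorted stream: one remembered tuple suffices.

def sorted_distinct(sorted_tuples):
    last = object()                # sentinel that matches nothing
    for t in sorted_tuples:
        if t != last:              # value changed: output it
            yield t
            last = t               # keep only the last seen tuple

result = list(sorted_distinct([1, 1, 2, 2, 2, 5]))
print(result)  # [1, 2, 5]
```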

- It is possible to combine sort and duplicate removal

  ```
  Read the relation in groups, remove all unwanted attributes

  Sort the tuples in memory, remove duplicates (distinct) and write
  the sorted group to disk

  Read the sorted group for k-way-merge, during merge remove any
  additional duplicates
  ```

## Sort based projection

- Cost is similar to external sort, but the relation being read in the
  second stage is reduced in size by removing unnecessary attributes
- Tuples are smaller (how many attributes are removed?)
- Duplicates are removed (how many duplicates are there?)

## Hash based projection

- To compute "distinct R.A, R.B,…":

  Read all tuples in R and extract the projected attributes

  Hash the remaining attributes to buckets 1 ... M in memory (and
  continuously remove duplicates from buckets in memory)

  Whenever a bucket in memory is full, write it to disk

  For any bucket that takes up more than one disk page, read it back
  from disk to memory and remove duplicates within the bucket.
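A minimal sketch of the in-memory part, with the bucket count as a
parameter; real buckets are buffer pages that spill to disk when
full, which is omitted here:

```python
# Hash-based distinct projection: duplicates always hash to the same
# bucket, so removing duplicates within each bucket is safe.

def hash_distinct(tuples, num_buckets=2):
    buckets = [set() for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(t) % num_buckets].add(t)
    # Each bucket can now be processed independently.
    return buckets

result = hash_distinct([3, 4, 3, 5])
print(sorted(x for b in result for x in b))  # [3, 4, 5]
```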

- Example: 2 pages for hashing and 1 page for processing

```{image} query_execution_images/Hashing1.png
:width: 600px
```

- Read and put values into the two buckets

```{image} query_execution_images/Hashing2.png
:width: 600px
```

- If a bucket fills up before the input is fully consumed, it needs
  to be written to disk.

```{image} query_execution_images/Hashing3.png
:width: 600px
```

- Once hashing is complete, different operations can be mapped to
  different buckets and applied independently in each bucket.

```{image} query_execution_images/Hashing4.png
:width: 600px
```

## Hash based projection: cost

- The cost:

  The relation has to be read once for hashing

  If any bucket after reduction is too large to fit in a single
  memory block, then the relation will be written once to disk

  If all the disk pages in a single bucket will fit in the M available
  blocks, then the last step can be performed in one read.

## Set operations

- To compute A UNION B (with duplicate removal), we first hash both A
  and B together and remove duplicates in each bucket separately.
- To compute A - B, we hash A and B into the same buckets and then
  compute A - B in each bucket separately.


## Index nested loop join

- Index nested loop join assumes a look-up of matching tuples of S
  using an index.

  ```
  Given R join S on R.A=S.A

  Read R one block at a time
     For each tuple r of R
        Use the index on S(A) to find all matching tuples
        Read the tuples from disk and join with r
  ```

- If R is not sorted on A, then we might end up reading the same
  tuples of S many times

  - What if A is R’s primary key?

    Then, for each A value, we will look up S only once

- Cost:

  - The outer relation R is read once, PAGES(R)

  - Assume that for every tuple r in R, there are about c tuples in S
    that join with r (the ideal case is that c is very small or 0
    most of the time)

  - Then, for each tuple in R:

    - We find the matching tuples in S (the cost is the index
      look-up: in the best case one page for each level, so h+1 for
      an index with h levels)
    - Then, if we need attributes that are not in the index, we read c
      tuples from S (which can be in c pages in the worst case)
    - Total cost is between PAGES(R) + TUPLES(R) * (h+1) (if no tuples
      match or an index-only scan is sufficient) and PAGES(R) +
      TUPLES(R) * (h+1+c)

- Example:

  ```
  SELECT S.B FROM R,S WHERE R.A = S.B

  Index I1 on S(B) with 2 levels (root, internal, leaf)
  PAGES(R)=100, TUPLES(R)= 2000
  PAGES(S)=200, TUPLES(S)= 4000

  Cost = 100 (for reading R) + 2000*3 (index look-up for each tuple of R) = 6,100 pages

  SELECT S.C  FROM R,S WHERE R.A = S.B

  Assume the statistics are the same as above, but we now need to read
  each matching tuple (at most 4000 tuples will match)

  Cost = 100+2000*3+4000 (one page read for each matched tuple) = 10,100 pages
  ```

  How could this be, when there are only 200 pages in S?

  Well, if we do not first determine all the pages we need to read,
  but instead read each page as a match is found, we may end up
  reading the same pages of S multiple times.

  Of course, in reality, you will likely do a lot of reduction of
  duplicate page requests in memory and improve on this. This is the
  worst case scenario.

  Will we ever do this?

  No: we would choose not to use an index join in this case, since it
  is clearly very expensive and some other method would do better.

- Another example:

  ```
  SELECT S.B FROM R,S WHERE R.A = S.B AND R.B=100

  Index I1 on S(B) with 2 levels (root, internal, leaf)
  PAGES(R)=100, TUPLES(R)= 2000
  PAGES(S)=200, TUPLES(S)= 4000
  Suppose only 3 tuples match R.B=100

  So, we can:
     Scan all of R to find these 3 tuples (100 pages)
     Read matching tuples from S (3*3)
     Total cost = 109 (using only 2 pages of memory)

     This would cost a lot more in block-nested loop join
     with M=2 (200 pages). So, index join is for cases
     where the outer relation is very small (normally after
     a selection)
  ```
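The arithmetic in the two examples can be checked against the cost
formula PAGES(R) + TUPLES(R) * (h+1 [+ c]):

```python
# Index nested loop join cost arithmetic from the examples above.

PAGES_R, TUPLES_R = 100, 2_000
H = 2                      # index height: h+1 = 3 page reads per look-up

# Index-only scan: S.B is available in the index itself.
cost_index_only = PAGES_R + TUPLES_R * (H + 1)

# Fetching matching tuples as well: at most 4,000 matches, worst case
# one page read per match.
cost_with_fetch = cost_index_only + 4_000

# Highly selective outer relation: only 3 tuples of R survive R.B=100.
cost_selective = PAGES_R + 3 * (H + 1)

print(cost_index_only, cost_with_fetch, cost_selective)  # 6100 10100 109
```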

## Sort-merge join

- Sort both R and S first

- Read R and S one block at a time, and join the matching tuples.

- Sort merge is similar to Step 2 of the sorting algorithm.

- Example:

  ```
  R: [1, 2] [5, 8]    S:[1,5] [6, 7]
  Read [1,2] and [1,5] first, join 1.
  Read the next block of R, [5,8] [1,5]. Join 5
  Read the next block of S, [5,8] [6,7]. No more tuples, done.
  ```

- If each joining attribute value is unique in R and S, then the join
  can be performed in a single pass, reading each relation only once
  forward: PAGES(R) + PAGES(S) + the cost of sorting

- If there are duplicate values, then we must worry about whether all
  the duplicate values from both relations will fit in memory

- Example:

  ```
  R [1,2] [2,2] [2,2] [2,2] [2,2] [2,2] [2,2] [2,2] [2,2][2,3]
  S [1,1] [1,1] [1,1][2,2] [2,2] [2,2] [2,2] [2,2][3,4]
  ```

- We need a total of 15 buffer pages to be able to compute this join
  at one step.

- In general, a block-nested loop join is performed for duplicate values.
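The merge step for the unique-key case can be sketched as follows,
with both inputs already sorted on the join attribute and tuples
modeled as (key, payload) pairs:

```python
# Merge step of sort-merge join (unique join keys on both sides).

def merge_join(r_sorted, s_sorted):
    i = j = 0
    out = []
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][0], s_sorted[j][0]
        if rk < sk:
            i += 1                 # advance the side with the smaller key
        elif rk > sk:
            j += 1
        else:                      # keys match: emit the joined tuple
            out.append((rk, r_sorted[i][1], s_sorted[j][1]))
            i += 1
            j += 1
    return out

result = merge_join([(1, "r1"), (2, "r2"), (5, "r5"), (8, "r8")],
                    [(1, "s1"), (5, "s5"), (6, "s6"), (7, "s7")])
print(result)  # [(1, 'r1', 's1'), (5, 'r5', 's5')]
```

The key values mirror the [1, 2] [5, 8] / [1, 5] [6, 7] example above:
only 1 and 5 join.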

## Hash join

- Hash both R and S on the joining attribute, read R and S once:

  PAGES(R) + PAGES(S)

- Each bucket will contain both tuples of R and S

- Read each bucket into memory to perform the join within the bucket

- If a bucket cannot fit in memory, then other methods should be used
  to perform the join operation within the bucket
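The per-bucket join can be sketched as a classic build-and-probe hash
join, assuming the bucket fits in memory:

```python
# In-memory hash join within one bucket: build a hash table on the R
# tuples, then probe it with each S tuple.

def hash_join_bucket(r_tuples, s_tuples):
    table = {}                         # build side: key -> R payloads
    for k, v in r_tuples:
        table.setdefault(k, []).append(v)
    out = []
    for k, v in s_tuples:              # probe side
        for rv in table.get(k, []):
            out.append((k, rv, v))
    return out

result = hash_join_bucket([(1, "r1"), (2, "r2")], [(2, "s2"), (3, "s3")])
print(result)  # [(2, 'r2', 's2')]
```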

## Summary

- Sorting and hashing are two main methods that can be used to
  implement other operators.
- In particular, sorting may help reduce the cost of multiple
  operations above it in the query tree.
