# Index Scan

## Disk Access Process (Overly Simplifed)

- Remember that to process any data, it must be first brought to memory.

- Some DBMS component indicates it wants to read record R

  - File Manager:

    Does security check

    Uses access structures to determine the page it is on

    Asks the buffer manager to find that page

  - Buffer Manager

    Checks to see if the page is already in the buffer

    If so, gives the buffer address to the requestor

    If not, allocates a buffer frame

    Asks the Disk Manager to get the page

  - Disk Manager

    Determines the physical address(es) of the page

    Asks the disk controller to get the appropriate block of data from
    the physical address

    Disk controller instructs disk driver to do the dirty job

## Cost and Resources

- An SQL query is translated to a combination of relational algebra
  operations.

- Each operation in the query is given some resources: memory blocks
  to use for the operation.

  We will use M to denote the number of blocks available to a query.

- We are interested in the overall cost of a query. We will only
  consider total number of disk pages read/written to execute the
  query.

  We will use PAGES(R) to denote the total number of pages in
  relation R.

- Note that CPU complexity of a query is going to be disregarded
  (i.e. the complexity of the algorithm once the data is in memory)
  due to the following reasons:

  - The cost of reading/writing data is much higher than any operation
    in memory. So, the donimating cost is disk access.
  - We assume each algorithm is implemented as efficiently as possible.
  - It may not even be possible to bring all the data for a query in
    memory at once, which may result in the data being read multiple
    times.

- The cost of a query (total number of pages read and written) depends
  on how much memory is available (M). We will see how these two are
  related for each operation.

## Access Paths

- Given a query

  ```
  SELECT attrs FROM R WHERE C
  ```

  the following are the possible access paths for this relation

- Table scan: Always possible.

- Hash index: Possible if the hash index is on a subset of attributes
  A1,...,An such that all conditions in C for A1,...,An are for equality
  and are conjunctive (ANDed).

- B-tree index with search key A1,...,Am such that a prefix A1,...,An
  of A1,...,Am have equality conditions in C.

- After an index scan, it is necessary to scan the relation if

  - C contains conditions on attributes that are not in the index, or
  - Projection attributes are not all in the index

- Example:

  ```
  FROM R WHERE R.A=10 AND R.B<10 AND R.C > 20
  ```

- Can we use B-tree indices on

  I1 on R(A,B) ?

  I2 on R(B,C) ?

  I3 on R(D,A) ?

## Selections

- To compute SELECT FROM R WHERE C

  [Sequential Scan] Read all of R one block at a time and check C. Only
  need a single buffer page to read R. Total cost PAGES(R).

  [Index Scan] Read the index to check all or part of C, find
  qualifying tuples. Read those tuples from disk and return. Total
  cost: the cost of index scan + cost of relation scan.

- Example:

  `` `
  SELECT  R.C, R.D    FROM R
  WHERE R.A=10 AND R.B=5 AND R.C<30 ;
  ` ``

  - Given index I1 on R.A,R.B such that

    I1 has depth 3

    I has about k leaf nodes with about k\*c tuples in general for any
    specific R.A,R.B value (I.e. duplicate tuples)

  - Finding the tuple identifier of all tuples with R.A=10 AND R.B=5
    takes 3 + k disk reads in the average.

  - We still need to read the qualifying k\*c tuples from disk to check
    R.C, and return R.C, R.D

  - Given k\*c tuples with R.A=10 and R.B=5, how many disk pages for R
    need to be read to find all these tuples?

    If the relation is not sorted/clustered with respect to R.A, R.B,
    then we might end up reading a different page for each single
    tuple. The worst case then is k\*c disk reads.

    If the relation is clustered with respect to R.A,R.B and suppose
    each disk page stores about n tuples of R, then we need to read
    about k\*c/n disk pages.

  - Given index I2 on R.A,R.B,R.C such that

    I has depth 3

    I has about k leaf nodes with about k\*c tuples in general for any
    specific R.A,R.B value (I.e. duplicate tuples)

    Find the leaf nodes for R.A=10 AND R.B=5. For these, we scan the
    siblings left to right until R.C >= 30. In the worst case 3 + k
    nodes are scanned.

    Return all the R.C values from the index, no need to read the
    relation.

## Access Paths

- Given a query

  `` `
  SELECT attrs FROM R WHERE C
  ` ``

  the following are the possible access paths for this relation

- Table scan: Always possible.

- Hash index: Possible if the hash index is on a subset of attributes
  A1,...,An such that all conditions in C for A1,...,An are for equality
  and are conjunctive (ANDed).

- B-tree index with search key A1,...,Am such that a prefix A1,...,An
  of A1,...,Am have equality conditions in C.

- After an index scan, it is necessary to scan the relation if

  - C contains conditions on attributes that are not in the index, or
  - Projection attributes are not all in the index

- Example:

  `` `
  FROM R WHERE R.A=10 AND R.B<10 AND R.C > 20
  ` ``

- Can we use B-tree indices on

  I1 on R(A,B) ?

  I2 on R(B,C) ?

  I3 on R(D,A) ?

## Complex Conditions

- Given an index can be used to evaluate the selection condition only
  partially, how can we compute complex conditions?

  FROM R WHERE C1 AND C2

- Use index to find tuples that satisfy C1, read tuples from disk and
  check C2 in memory.

- Use two indices to find tuples that satisfy C1 and C2 separately,
  take the intersection of the tuple identifiers.

- Given multiple indices to evaluate part of a complex query condition
  C, which one to choose?

- Choose the most selective one.

- If the resulting tuples can be reduced significantly by another
  index, use the next most selective index.

- Otherwise, do a table scan.
