
# Size Estimation for Cost-based Query Optimization

- Databases use cost-based optimization to find the lowest-cost
  query plan before the query is executed.

- To be able to estimate the cost of query plans, we need to know:

  - Cardinality estimation: How many tuples we expect as the output of
    joins and selections.

  - Estimation theory is complex, and every estimate rests on some
    underlying assumptions. We will only sketch the basic ideas here
    to give a sense of how cardinality estimation works.

  - Space estimation: The size of the tuples on disk to estimate how
    many disk pages are needed to store them.

    Space estimation is easy since we know the schema and the size of
    each attribute. We will disregard it for now.

## Database Statistics

- Databases keep statistics for each table, which need to be updated
  periodically to remain useful.

- The more statistics you keep, the better the size estimation.
  However, the cost of updating them also goes up.

- Depending on the DBMS (and its specific version), different
  statistics are kept. These statistics may not be automatically
  updated.

- Statistics can be updated using the `ANALYZE` command in SQL. (The
  exact command may vary from DBMS to DBMS.)

- Simplest statistics:

  - TUPLES(R): the number of tuples in R
  - PAGES(R): the number of pages R spans
  - N_DISTINCT(R.A): the number of distinct values stored for R.A
  - MINVAL(R.A)/MAXVAL(R.A): min/max values stored for R.A

- You can also query these statistics using PostgreSQL system tables:

  ```
  select
     relname
     , relpages
     , reltuples
  from
     pg_class pc
     , pg_user pu
  where
     pu.usename=user
     and pc.relowner = pu.usesysid ;


  select
     attname
     , inherited
     , n_distinct
     , histogram_bounds
     , array_to_string(most_common_vals, E'\n') as most_common_vals
  from
     pg_stats
  where
     tablename = 'park_hours';
  ```

## Cardinality estimation

- For any condition Cond, we can estimate its selectivity, which is
  the percentage of tuples that will satisfy Cond, denoted
  **sel(Cond)**.

  - More selective conditions have lower selectivity values
    (e.g. 0.01 is more selective than 0.1).

- The expected number of tuples, **exp(X)**, in a relation X depends
  on the operations used to define X.

- Size estimation for a selection condition:

  $$
  X = \sigma_{Cond}\, R
  $$

  ```
  exp(X) = sel(Cond) * TUPLES(R)
  ```

- Size estimation for a join (a selection over the Cartesian product):

  $$
  Y = R \bowtie_{JCond} S
  $$

  ```
  exp(Y) = sel(JCond) * (TUPLES(R) * TUPLES(S))
  ```
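The two formulas above can be sketched as a few lines of Python; the
selectivities and table sizes below are made-up numbers, not from any
real catalog:

```
# Sketch of the two size-estimation formulas above.
# The selectivity and TUPLES values are hypothetical.

def exp_selection(sel_cond, tuples_r):
    """exp(X) for X = sigma_Cond(R): sel(Cond) * TUPLES(R)."""
    return sel_cond * tuples_r

def exp_join(sel_jcond, tuples_r, tuples_s):
    """exp(Y) for Y = R join_JCond S: sel(JCond) * TUPLES(R) * TUPLES(S)."""
    return sel_jcond * tuples_r * tuples_s

print(exp_selection(0.25, 10_000))  # 2500.0
print(exp_join(0.5, 1_000, 2_000))  # 1000000.0
```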

## Size estimation for equality conditions

- Given an equality condition with a constant, if we assume each
  value is equally likely (uniform distribution), then we can use the
  following:

  ```
  sel(R.A=c) = 1/N_DISTINCT(R.A)
  ```

- For example:

  ```
  sel(student.year='freshman') = 1/4
  ```

  Assuming 4 different years and an equal number of students in each
  year.

  ```
  sel(student.major = c) = 1/N_DISTINCT(student.major)
  ```

  Assuming each major is equally likely: the same number of students
  in each major.

- Note that uniform distribution is generally not true. For example,
  there may be a few large majors in the university and a number of
  small ones.

  - So, the sizes will be an underestimate for large majors and an
    overestimate for small majors. How important this is depends on
    the frequency and actual cost of queries involving these majors.

- This is the best we can do with a single statistic. If that is not
  good enough, we need better statistics.
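The effect of the uniformity assumption can be seen with a small
simulation; the enrollment numbers below are made up:

```
from collections import Counter

# Hypothetical enrollment: a few large majors, several small ones.
majors = ["cs"] * 500 + ["bio"] * 300 + ["music"] * 100 + ["math"] * 50 + ["art"] * 50

tuples = len(majors)            # TUPLES(student) = 1000
n_distinct = len(set(majors))   # N_DISTINCT(student.major) = 5

# The uniform assumption gives the same estimate for every major:
est = tuples / n_distinct       # 200.0

actual = Counter(majors)
print(est, actual["cs"], actual["math"])  # 200.0 500 50
```

The single estimate of 200 underestimates the large major (cs, 500
students) and overestimates the small one (math, 50 students).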

## Size estimation for joins

- Simple join conditions compare attributes from two tables:

  ```
  sel(R.A = S.B)
  ```

  You can look at this as one of two conditions:

  ```
  1. sel(R.A = c1) = 1/N_DISTINCT(R.A)
  2. sel(c2 = S.B) = 1/N_DISTINCT(S.B)
  ```

  Of these two, we choose the more selective one (the smaller value)
  as the dominant selection condition:

  ```
  sel(R.A = S.B) = 1/( MAX{ N_DISTINCT(R.A), N_DISTINCT(S.B) } )
  ```
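A minimal sketch of the equijoin estimate, with made-up statistics:

```
def exp_equijoin(tuples_r, tuples_s, nd_ra, nd_sb):
    """exp(R join_{R.A=S.B} S) with sel = 1/max(N_DISTINCT(R.A), N_DISTINCT(S.B))."""
    return tuples_r * tuples_s / max(nd_ra, nd_sb)

# e.g. 10,000 enrollment rows joined with 2,000 students on student id:
print(exp_equijoin(10_000, 2_000, 2_000, 2_000))  # 10000.0
```

If every enrollment row matches exactly one student, 10,000 output
tuples is also the intuitively correct answer, which is why this
estimate works well for key/foreign-key joins.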

## Size estimation for ranges

- You can approach ranges as a collection of individual values, but
  this would be wrong when the values span a large range:

  ```
  sel( 3.0 <= students.gpa <= 3.5 )
  sel( c1 <= employee.salary <= c2 )
  ```

  For GPA, there may be many distinct values as this is a float. So,
  you cannot treat it like a collection of distinct values.

  For salary, there may be very few distinct values of salary, but the
  range requested may be large, larger than even the number of
  distinct values.

- Instead, we can approach ranges as a slice over a continuous range:

  ```
  sel( c1 <= R.A <= c2 ) = (c2-c1)/(MAXVAL(R.A)-MINVAL(R.A))
  ```

  This can be very wrong if the stored value range is large but most
  values are clustered in a narrow band, i.e. there is large skew.

  For example: the average age of college students is around 26
  years, and with graduate students the range of ages can vary
  anywhere between 15 and 60. However, most students will be
  clustered around ages 18-24.

- Some older databases used a simple formula to assume that most range
  searches are not very selective:

  ```
  sel( c1 <= R.A <= c2 ) = 1/3
  ```

  Though we will generally not use this formula, it may yield better
  results than a badly wrong interpolation estimate.
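The interpolation formula can be sketched as follows; the GPA bounds
(MINVAL=0.0, MAXVAL=4.0) are hypothetical:

```
def sel_range(c1, c2, minval, maxval):
    """sel(c1 <= R.A <= c2), assuming values are uniform over [minval, maxval]."""
    # Clip the requested range to the stored value range first.
    lo, hi = max(c1, minval), min(c2, maxval)
    if hi <= lo:
        return 0.0
    return (hi - lo) / (maxval - minval)

print(sel_range(3.0, 3.5, 0.0, 4.0))  # 0.125
print(sel_range(4.5, 5.0, 0.0, 4.0))  # 0.0 (outside the stored range)
```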

- If the ranges are crucial to your query, and some ranges are
  selective while some are not, you need to use a better statistic.

- Note that string match operations:

  ```
    name like 'abc%'
    name like '%abc%'
  ```

  also behave like range searches (a prefix match such as `'abc%'`
  maps to a range of strings; a pattern like `'%abc%'` cannot even be
  bounded by a range). Cardinality estimation is even more complex
  for these, and use of the simple formula is more appropriate in
  these cases.

- Sometimes rewriting a range query as a non-range query (if
  possible) may yield better estimates. For example: `A<3` vs.
  `A=1 OR A=2` for a positive integer attribute A.

## Histograms

- Histograms divide the values an attribute may take into ranges and
  count the number of tuples in each range.

- Equi-depth (equal-height) histograms choose the range boundaries
  such that there are an equal number of tuples in each range. In
  PostgreSQL, the boundaries are stored in `histogram_bounds`:

  ```
  select
     avg_width
     , histogram_bounds
  from
     pg_stats
  where
     tablename = 'imdbratings'
     and attname = 'rating';
  ```

  will give us:

  ```
  {1.6,2.3,2.9,3.5,3.7,3.9,4,4.2,4.3,4.4,4.5,4.6,
   4.6,4.7,4.8,4.8,4.9,4.9,4.9,5,5,5.1,5.1,5.1,5.2,
   5.2,5.2,5.2,5.3,5.3,5.3,8.5,8.5,8.5,8.5,8.7,8.7,
   8.7,8.8,8.8,8.8,8.9,9,9.1,9.2,9.5,10}
  ```

- To estimate the selectivity for a range query:

  ```
  sel( 8.0 <= imdbratings.rating <= 9.0 )
  ```

  Find the histogram buckets that overlap the given range, compute
  the fraction of each bucket covered, and divide the total by the
  number of buckets.

  For example, this gives about 1/4 of the tuples in this case.
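A sketch of this computation over the bounds shown above; zero-width
buckets produced by repeated bound values are counted as fully
covered when they fall inside the query range:

```
bounds = [1.6, 2.3, 2.9, 3.5, 3.7, 3.9, 4, 4.2, 4.3, 4.4, 4.5, 4.6,
          4.6, 4.7, 4.8, 4.8, 4.9, 4.9, 4.9, 5, 5, 5.1, 5.1, 5.1, 5.2,
          5.2, 5.2, 5.2, 5.3, 5.3, 5.3, 8.5, 8.5, 8.5, 8.5, 8.7, 8.7,
          8.7, 8.8, 8.8, 8.8, 8.9, 9, 9.1, 9.2, 9.5, 10]

def sel_histogram(c1, c2, bounds):
    """Each adjacent pair of bounds is one bucket holding ~1/(len(bounds)-1) of the tuples."""
    covered = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if hi == lo:
            # Zero-width bucket from a repeated bound value.
            covered += 1.0 if c1 <= lo <= c2 else 0.0
        else:
            overlap = min(c2, hi) - max(c1, lo)
            covered += max(overlap, 0.0) / (hi - lo)
    return covered / (len(bounds) - 1)

print(round(sel_histogram(8.0, 9.0, bounds), 3))  # 0.243, about 1/4 of the tuples
```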

- Note that PostgreSQL stores the most frequent values and their
  frequencies separately from the histogram (see the jump from 5.3 to
  8.5 in the bounds above, for example) and uses them in estimates
  for such values:

  ```
  select
     most_common_vals
     , most_common_freqs
  from
     pg_stats
  where
     tablename = 'imdbratings'
     and attname = 'rating';
  ```

## Selectivity of Boolean Conditions

- We will use a simple probabilistic interpretation of selectivities:

  sel(Cond): probability that a tuple will be put in the output

- Using this, we can come up with the following rules:

  ```
  sel(not Cond) = 1 - sel(Cond)

  sel(Cond1 and Cond2) = sel(Cond1) * sel(Cond2)

  sel(Cond1 or Cond2) = 1 - ((1 - sel(Cond1)) * (1 - sel(Cond2)))
  ```

- First, we see that the negation of a selective condition is
  generally not selective.

- Second, AND conditions are treated as if two conditions are
  independent of each other.

  This is often not true:

  ```
  sel(voting_district=c1 and vote=c2)
  ```

  In the US, districts are drawn so that most voters in a district
  vote predominantly for one party over the other; the two conditions
  are highly correlated, so multiplying their selectivities will not
  work.

- Third, OR conditions are treated as

  ```
  Cond1 OR Cond2 = NOT ((NOT Cond1) AND (NOT Cond2))
  ```

  This expands to sel(Cond1) + sel(Cond2) - sel(Cond1)*sel(Cond2):
  the subtracted term corrects for tuples that satisfy both
  conditions, but again only under the independence assumption.
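These rules as code, applied to hypothetical selectivities:

```
def sel_not(s):
    return 1 - s                     # sel(not Cond)

def sel_and(s1, s2):
    return s1 * s2                   # assumes Cond1 and Cond2 are independent

def sel_or(s1, s2):
    return 1 - (1 - s1) * (1 - s2)   # = s1 + s2 - s1*s2 under independence

print(sel_not(0.25))       # 0.75: negating a selective condition is not selective
print(sel_and(0.25, 0.5))  # 0.125
print(sel_or(0.25, 0.5))   # 0.625
```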

## Cardinality Estimation Summary

- Cardinality estimation is rough and can make many mistakes.

- Often, we do not care about the errors:

  - Big errors may apply to rarely queried values, in which case they
    have little impact.

  - The actual numbers are not important. Query optimization is about
    comparing the relative costs of different plans.

    As long as the estimates preserve the ordering of plans with
    respect to cost, they are still useful.

- In case we do care about the errors, we can use more sophisticated
  statistics, but they must be kept up to date for most effect.

## Cardinality Estimation Examples

- Let's estimate the size of the following queries:

  ```
  Q1: SELECT * FROM R WHERE R.A = 5 AND R.B < 10

  sel(R.A=5) = 1/N_DISTINCT(R.A)
  sel(R.B<10) = 1/3  (using the simple heuristic)

  sel(Q1) = 1/(3*N_DISTINCT(R.A))

  exp(Q1) = TUPLES(R)/(3*N_DISTINCT(R.A))

  Q2: SELECT * FROM R WHERE R.C=5 AND R.D=10 AND R.E NOT LIKE 'e%';

  exp(Q2) = 2*TUPLES(R) / (3*N_DISTINCT(R.C)*N_DISTINCT(R.D))
  ```

  Note that the selectivity of NOT LIKE is estimated to be 1-1/3=2/3.

- Joins:

  ```
  Q3: SELECT * FROM R,S WHERE R.A=S.B

  exp(Q3) = (TUPLES(R)*TUPLES(S)) / max(N_DISTINCT(R.A), N_DISTINCT(S.B))

  Q4: SELECT * FROM S,T WHERE S.X<T.Y AND T.Z=5

  exp(Q4) = (TUPLES(S)*TUPLES(T)) / (3*N_DISTINCT(T.Z))
  ```

  Note that the join selectivity is assumed to be 1/3 due to the
  inequality.
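Plugging made-up statistics into Q1-Q4 (all TUPLES and N_DISTINCT
values below are hypothetical):

```
TUPLES_R, TUPLES_S, TUPLES_T = 9_000, 5_000, 4_000
ND_RA, ND_RC, ND_RD = 100, 50, 10
ND_SB, ND_TZ = 200, 20

# Q1: sel(R.A=5) * sel(R.B<10) * TUPLES(R)
exp_q1 = TUPLES_R / (3 * ND_RA)                    # 30.0

# Q2: sel(R.C=5) * sel(R.D=10) * sel(NOT LIKE) * TUPLES(R)
exp_q2 = 2 * TUPLES_R / (3 * ND_RC * ND_RD)        # 12.0

# Q3: equijoin on R.A = S.B
exp_q3 = TUPLES_R * TUPLES_S / max(ND_RA, ND_SB)   # 225000.0

# Q4: inequality join (1/3) and equality on T.Z
exp_q4 = TUPLES_S * TUPLES_T / (3 * ND_TZ)         # ~333333.3

print(exp_q1, exp_q2, exp_q3, exp_q4)
```

Note how the join estimates dwarf the selection estimates; it is this
difference in magnitude, not the exact numbers, that the optimizer
acts on.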

## Index Selectivity

- Index selectivity is an important measure of the usefulness of
  indices.

- The index selectivity for a selection query is the portion of the
  query that can be answered by the index.

- Note that the overall cost of using the index must also be taken
  into account when judging how useful an index is.

- Example:

  ```
  SELECT * FROM R WHERE R.A=5 AND R.B=5 AND R.C>3
  ```

  Index I1 on R(A,B) will be used for `R.A=5 AND R.B=5`, which are
  potentially selective conditions (depending on the number of
  distinct values for R.A and R.B).

  Index I2 on R(D,A,B) has the same selectivity as I1, but we would
  need to scan the whole leaf level to find the matching tuples.

  Index I3 on R(C,B) is potentially less selective, as only the
  conditions `R.B=5 AND R.C>3` can be applied and a great deal of the
  index must be searched.

  You can also combine two indices, I1 and I3: look up the tuples
  that satisfy the conditions supported by each index and take the
  intersection of the results in memory, before reading the found
  tuples from the data pages.

- Indices can generally be ordered by their selectivity and
  usefulness:

  - The most useful index for any query is generally a clustered
    index. A clustered index changes the ordering of the tuples on
    disk based on the index. You can create only one clustered index
    per table, and if updates are common you need to reorganize the
    index periodically, which is costly. Consider clustered indices
    for rarely changed relations.

  - An unclustered/secondary index is the most common type of index. A
    secondary index on attributes that are often searched for equality
    and are selective is very useful. A special case of this is the
    primary key. An index is usually automatically created for the
    primary key by most databases.

  - An unclustered index on less selective conditions can still be
    useful if the index contains all the attributes needed for a
    query. In this case, the query can be answered using the index
    alone.

    For this to work, the size of the index leaf level should be
    considerably smaller than the table.

    ```
    SELECT C FROM S WHERE B<5
    ```

    An index on S(B,C) can be very useful for this query even if the
    condition is not very selective.

