# Database Tuning

- Tuning is the act of changing an application and DBMS environment to
  improve system performance

- Performance is usually measured in terms of response time

  - Time to get the first tuple
  - Time to get all the tuples to a query

- A workload consists of various operations performed by the system
  and their frequency

  - It is important to tune to improve workload, not just a single query
  - Some queries are bottlenecks: performed frequently and/or are very
    expensive

- Remember: when you improve the performance of one operation, other
  operations may slow down

  For example, creating an index to speed up a query will slow down
  insert/update operations

## Database Tuning Steps

- Step 1: buy more or faster hardware

  memory is crucial for buffering query operations and caches for
  various operations

  hard disk speed is crucial, buy faster and more disks to improve the
  parallelism

- Step 2: tune the system installation

  databases provide a large number of tunable parameters, read database
  administration books

  learn about making better use of multiple CPUs

### Disk caches

- A cache is a set of buffer pages maintained by the DBMS for a
  specific purpose
- Data cache for reading pages containing the index or the relation
- Procedure cache for storing previously constructed query plans
- Caches are usually shared between concurrent users
- Any requested item must be brought to cache from disk to read/modified
- If it is already in the cache, then the cache has a hit, otherwise
  the cache has a miss
- Since each hit is a savings in time, hit ratio must be maximized
  (some application designers seek 90% hit ratio)

### Cache replacement algorithms

- If a new item has to be inserted into the cache, another item might
  need to be removed.

- Cache replacement algorithm decides what should be removed, LRU
  (least recently used), MRU (most recently used)

  - A recently used page may be used for an update in near future (LRU)
  - A page read in table scan is no longer needed (MRU)
  - Sophisticated caches may take the algorithm using the database into
    account
  - How would you use the cache for an index page?

- A dirty page is a page modified by an uncommitted transaction -if
  this page is moved out of cache, it must be written back to disk

### Tuning the cache

- Divide the cache and bind a specific item to a cache (different
  tables may be cached in different caches)

- Divide the cache into pools of varied size, 2K, 4K, 8K, etc.

  - The query processor can choose the best available cache for a query
    (retrieve large sequences for table scans, even prefetch disk pages
    that are expected to be requested next)

- Procedure cache may use previously optimized query plans

  Hint: use program variables to increase possible reuse

  ```
  SELECT P.name FROM Professor P WHERE P.deptId = :deptid
  ```

  When you use prepared statements such as this, query and parameters
  are sent separately, allowing the same query for different
  parameters to use the same query plan.

## Partitioning

- Step 3: partition your data

- Vertical partitioning divides the attributes in the relation and
  distributes them to different disks or tablespaces

  Frequently queried attributes could be separated from infrequently
  queried attributes.

- Horizontal partitioning divides the tuples in the relation to
  multiple disks

  Allows parallelism in reading data from disk

  Some optimizers are able to concentrate on a single partition given
  a specific query

## Denormalization

- Step 4: change your data model

- Normalization reduces redundancy and null values

  - Lower storage requirements, simple queries and updates will be
    faster

  - Results in more tables, hence complex queries need more joins

    ```
    SELECT
        FAN.Title
    FROM
        Films F
        , FilmAlternateNames FAN
    WHERE
        F.filmid = FAN.filmid;
    ```

- Denormalization stores relations in a non-optimal manner

### Examples of denormalization

- Store alternate names in a string and use application code to update
  and print the alternate names

- Add extra columns for frequently accessed information

  Example: number of movies per actor:

  ```
  SELECT
     a.name
     , a.surname
     , count(distinct mr.movieid)
  FROM
     actors a
     , movieroles mr
  WHERE
     a.id=mr.actorid
  GROUP BY
     a.id
     , a.name
     , a.surname ;
  ```

- Add a column "NumMovies" instead for each actor, no join or
  group by is needed.

  This column must be updated in the application anytime an update
  is made to the casts relation. If updates are not frequent, then
  this could be acceptable.

- Certain attributes might be duplicated if they are used often

  - Example: Store stagename attribute in the casts relation

    Queries involving this attribute are now fully answered from
    casts (avoiding a costly join)

    Attributes other than stagename might be queried rarely but take a lot
    of space causing the stagename to take up a lot of space

    Anytime a new actor is added or stagename is changed, the
    changes must be reflected to the casts relation by updating

    > multiple tuples (this may be rare compared to the queries)

    The CASTS relation now stores redundant information and is
    larger in size

## Query Restructuring

- Step 5: Rewrite queries to improve their speed, avoid nested
  queries, use joins whenever possible

- Nested queries are hard to optimize.

- Inner and outer expressions are optimized separately.

- For correlated expressions, inner query is executed many times for
  each tuple in the outer expression.

- Certain possible optimizations could be missed with a nested query
  (suppose an index for casts on (actorid, filmid) existed)

  ```
  SELECT DISTINCT
       m.title
  FROM
       movies m
       , movieroles mr
  WHERE
       m.id=mr.movieid
       and exists (
           select
             *
           from
             actors a
           where
             a.surname like '%Bacon'
             and  a.id = mr.actorid);
  ```

  - All queries below are equivalent to the above one

    ```
    -- uncorrelated query is better
    SELECT DISTINCT
       m.title
    FROM
       movies m
       , movieroles mr
    WHERE
       m.id=mr.movieid
       and mr.actorid in (
           select
             a.id
           from
             actors a
           where
             a.surname like '%Bacon');

    -- join query is even better
    SELECT DISTINCT
       m.title
    FROM
       movies m
       , movieroles mr
       , actors a
    WHERE
       m.id=mr.movieid
       and mr.actorid=a.id
       and a.surname like '%Bacon';
    ```

### Drops of wisdom

- Avoid sorts (distinct, order by, group by, union, except) if
  possible, they are costly

- Some queries do not need a distinct or can be rewritten to avoid sorts

- Avoid full table scans

  - A search on a condition like A \<> 3 or A like '%Bacon' might
    result in a table scan
  - A search like A in (1,2,4) might be preferable (depending on the
    availability of statistics)

- Beware that combination of queries in procedural code and their call
  will be hard for the optimizer to properly account for. So push
  complexity either into the procedure or to the query, not distribute
  them.

- Avoid retrieving tuples into application code, use stored procedures
  and even complicated queries might be preferable to the added
  communication cost

- Use views wisely

  - Even though views are useful in application development, use a
    view in an application that is useful for the given query

    ```
    CREATE VIEW together(actorid1, stagename1, actorid2, stagename2) AS
    SELECT a1.actorid, a1.stagename, a2.actorid, a2.stagename
    FROM
        actors a1
        , actors a2
        , movieroles mr1
        , movieroles mr2
    WHERE
        a1.id=mr1.actorid
        and mr1.movieid=mr2.movieid
        and a2.id=mr2.actorid
        and a1.id <> a2.id ;

    SELECT
        t.actorid1
    FROM
        together t
    WHERE
        t.name = 'Kevin'
        and t.surname = 'Bacon';
    ```

    - None of the joins are necessary to answer this query. The
      optimizer might miss some faster query plans

### The use of indices

- Indices speed up query, but slow down insert/delete/update operations

- A clustered index allows fast access to a range query

  - There is only one clustered index per relation
  - Databases usually create one for the primary key by default
  - Reconstruction of clustered indices is costly

- Step 6: choose the most useful indices

  - Find the most useful clusters and use them if they are very useful
    for a range of queries and supported by the database

  - Next, find the most selective indices to add

    - Finally, find indices that might help with index only scans

### Clustered Indices

- We can create clustered indices in Postgresql by using an index:

  ```
  create index mr_idx on movieroles(movieid,actorid) ;
  cluster movieroles using mr_idx ;
  ```

- Clusters are generated once and are not modified incrementally

- You need to recluster periodically if there are frequent updates:

  ```
  cluster movieroles;
  ```

- The reorganization may be too costly for very large tables.

- Create a clustered index for attributes frequently queried with a
  range or has multiple matching attributes for a value

  Above cluster is very useful for finding actors in a movie:

  ```
  select
     a.name
     , a.surname
  from
     movies m
     , movieroles mr
     , actors a
  where
     m.name = 'Harry Potter and the Goblet of Fire'
     and m.id = mr.movieid
     and mr.actorid = a.id;
  ```

- Clustered indices also provide a sorted order to the relation

- Create unclustered indices on attributes with high selectivity

  ```
  SELECT A.name, A.surname FROM Actors A WHERE A.gender = 'F';

  SELECT
      A.firstname
      , A.lastname
  FROM
      Actors A
  WHERE
      A.name = 'Kevin'
      and A.surname ='Bacon';
  ```

  - Gender is not a selective condition, but name and surname are.

- Note that indices may not be used by the DBMS unless they
  significantly reduce the number of tuples to be read from disk for a
  query.

- Index nested loop join is also beneficial when there is a highly
  selective index

  ```
  SELECT
     mr.movieid
  FROM
     actors a, movieroles mr
  WHERE
     a.name = 'Kevin'
     and a.surname = 'Bacon'
     and a.id = mr.actorid;
  ```

- For frequently asked queries, indices might be created to allow
  index only searches.

  For example, given (name, surname, id) for actors, answering a
  query like one below now requires only an index search for actors.

  ```
  SELECT
     mr.movieid
  FROM
    actors a
    , movieroles mr
  WHERE
    a.name = 'Kevin'
    and a.surname = 'Bacon'
    and a.id = mr.actorid;
  ```    

- For example, given a query like the one below:

  ```
  SELECT
      a2.name
      , a2.surname
  FROM
      actors a1
      , movieroles mr1
      , movieroles mr2
      , actors a2
  WHERE
      a1.name = 'Kevin'
      and a1.surname = 'Bacon'
      and a1.id = mr1.actorid
      and mr1.movieid = mr2.movieid
      and a2.id = mr2.actorid;
  ```

  for A1, the index is searched in the usual way.

  For A2, the index on(stagename, actorid) can be searched fully
  instead of the relation to speed up the query.

- Indices do not always help reduce the cost of queries. For them to be
  useful, one or more of the following criteria should be true:

  - they must be selective
  - they must be significantly smaller in size than the relation they
    are indexing
  - they must be used often in queries where they make a difference

- Indices are in effect a type of materialized views. Like all
  materialized views, there is a cost to keeping them up
  to date if the relations are changing frequently.

- Foreign keys introduce hidden costs to updates since they must be
  checked for all updates that relate to them

- Count queries can be answered using indices on attributes with a
  "NOT NULL"” constraint (check if the index indices null values)



## Other hints

- Partition data to multiple disks

  - Place data that is accessed sequentially on its own disk

- Invoke parallel query processing when multiple CPUs are available

- Create more detailed statistics (histograms)

- Recompute statistics periodically as needed

- Examine the query plans generated by the system and influence them
  as necessary

## Postgresql Optimizer

- Postgresql's CBO (cost-based-optimizer) relies heavily on table
  statistics being available for all tables used in a query.

  ```
  analyze;
  ```

  will recompute the statistics for all the tables in a database.

  Must be run periodically for updated statistics.

- You can ask the optimizer to give you the query plan for a query.

  ```
  EXPLAIN query ;

  EXPLAIN
  SELECT
     mr.movieid
  FROM
     actors a, movieroles mr
  WHERE
     a.name = 'Kevin'
     and a.surname = 'Bacon'
     and a.id = mr.actorid;

                                            QUERY PLAN
  ----------------------------------------------------------------------------------------------
   Hash Join  (cost=3302.62..8991.86 rows=2 width=4)
     Hash Cond: (mr.actorid = a.id)
     ->  Seq Scan on movieroles mr  (cost=0.00..4695.07 rows=265107 width=8)
     ->  Hash  (cost=3302.61..3302.61 rows=1 width=4)
           ->  Seq Scan on actors a  (cost=0.00..3302.61 rows=1 width=4)
                 Filter: (((name)::text = 'Kevin'::text) AND ((surname)::text = 'Bacon'::text))
  (6 rows)
  ```

## Summary

- There are many methods from changing data model, queries and storage
  methods to improve performance.
- Performance must be improved for the whole workload. Compute the
  total cost of all queries multiplied by their frequency before and
  after tuning.
- Tuning is a complex optimization problem due to dependencies between
  different actions. Concentrate on queries that contribute most to
  the workload.
