CSCI 4150: Introduction to Artificial Intelligence, Fall 2005
I think the original "stubs" file had these names correct.
This will enable the support code to automatically set the utility of terminal states to be the average reward for that state.
The optional second argument util-init may be a number (the initial utility value for all nonterminal states) or a procedure of one argument (which is called for each nonterminal state, is passed the number of that state, and must return the initial utility value for that state). If omitted, all initial nonterminal utilities will be set to 0.0.
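As a small illustration, here is one possible util-init procedure (the name my-util-init is just a placeholder; passing a plain number such as 0.5 instead would start every nonterminal state at that value):
; util-init procedure: nonterminal state s starts with utility 0.01 * s
(define (my-util-init s)
  (* 0.01 s))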
If a procedure is given, this procedure is called before each hand is played (including the first hand). If it returns #t, then the hand is not played and the play-match procedure returns.
You can use this mechanism to keep playing hands until some condition is met, such as the maximum change in any utility value falling below some threshold.
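For example, here is a minimal sketch of a stopping procedure that simply ends the match after a fixed number of hands (this sketch assumes the procedure is called with no arguments; a convergence test would instead compare the current utilities against a saved copy and return #t once the largest change falls below your threshold):
(define (make-hand-limit n)
  ;; Returns a stopping procedure: it lets n hands be played,
  ;; then returns #t so that play-match stops.
  (let ((hands-played 0))
    (lambda ()
      (set! hands-played (+ hands-played 1))
      (> hands-played n))))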
(save-tables fname) — this procedure writes the variables used to store the information in the tables to a file. BE CAREFUL: this procedure will probably OVERWRITE any existing file with the name that you give it.
To load tables that you have saved, you can just load the file like you load any Scheme file. You should not call init-tables after doing so, lest your newly loaded information be erased.
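For example (the filename here is just a placeholder):
(load "my-saved-tables.scm")   ; restores the saved table variables; do not call init-tables after this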
A few general things:
Here are the parts for problem 5:
Give a brief explanation of why you transformed the game state into a reinforcement learning state this way.
Please note that I want you to report the amount won/lost and the total amount wagered for each of these three steps in your writeup, so make sure you record this information!
I suggest you save the tables to a file after this step:
(save-tables "a7p5b-model.scm")so that you can try different things in part B-2 without doing this step again.
(define enable-table-updates #f)
This will keep the transition probabilities and average rewards from changing while you are learning the utilities.
Learn utilities by playing blackjack with the following player:
(define (td-player) (list "TD-player" (create-exploring-rl-strategy R+ Ne) (create-td-learning alpha-fn)))
You will need to decide upon values for R+ and Ne and what your alpha-fn function should be. You will also have to figure out when to stop learning.
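Purely as an illustration (these particular values are not required, and this sketch assumes alpha-fn is passed the number of times the state has been visited), a starting point might look like:
(define R+ 2)     ; optimistic reward used while a state is still underexplored
(define Ne 5)     ; visit each state at least this many times before trusting its utility
(define (alpha-fn n)
  (/ 60.0 (+ 59.0 n)))   ; learning rate that decays as the visit count n grows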
Save the tables after this step and upload this file to the webtester.
(save-tables "a7p5b-utilities.scm")
(define (utility-player) (list "Your name here" basic-rl-strategy non-learning-procedure))
Make sure that you have disabled the table updates (as in the previous step).
Please note that I want you to report the amount won/lost and the total amount wagered for each of these two steps in your writeup, so make sure you record this information!
Here are the steps you should follow:
Make sure you have re-enabled table updates if you disabled them.
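For example:
(define enable-table-updates #t)   ; let the transition probabilities and average rewards change again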
Save the tables after this step and upload this file to the webtester.
(save-tables "a7p5c-utilities.scm")