CSCI 4150: Introduction to Artificial Intelligence, Fall 2004
However, if you're out of town or something, you may email the writeup (in text, PostScript, or PDF format only, i.e., not MS Word) to aistaff, but then of course you have to send it in by midnight Sunday.
For the problem 5b web tester, I will ask you to include your state calculation code (the same procedures as for problem 1) in the file you upload for this problem. This is because the state calculation procedures you use for this problem need not be the same as the procedures you upload for problem 1.
The argument player-hand will be a list of the two cards the player was initially dealt; dealer-hand will be a list of one card, the dealer's face-up card. This procedure must return a valid reinforcement learning state, i.e., an integer from zero (inclusive) up to the number of states declared to create-tables (exclusive).
Needless to say, calc-new-state should also return a valid reinforcement learning state. I've updated the assignment handout on the handouts page.
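To make that contract concrete, here is a deliberately coarse sketch of one possible encoding. It assumes bj-value returns the blackjack total of a hand and soft-hand? tests whether a hand contains an ace counted as 11 (check the support code documentation for the actual behavior), and it ignores the dealer's card entirely, which a real layout should not:

; Sketch only: hard totals 4-21 map to states 0-17, every soft hand
; maps to state 18, and state 19 is reserved as the terminal state,
; so create-tables would be declared with 20 states.
(define (calc-initial-state player-hand dealer-hand)
  (if (soft-hand? player-hand)
      18
      (- (bj-value player-hand) 4)))

calc-new-state could reuse the same mapping for nonterminal hands.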
You don't need to use a lot of states for this assignment; however, the states should represent the different situations of the game.
At one extreme, you could have separate states for each possible combination of cards. However, there's a lot of "duplication" here: for example, the initial hands 2 and 9, 3 and 8, 4 and 7, and 5 and 6 would all be played the same way, so it's fine for them to be represented by the same state.
At the other extreme, you could get away with using only two states: one for "still playing the hand" and one terminal state. This, however, would treat all hands in exactly the same way, whereas you really should play hands differently depending on the value, soft hand or not, dealer's face-up card, etc.
You should use some knowledge to structure the states, but you should leave some room for the program to learn. For example, if you think you should always stand on 14 or higher and always hit on 13 or less, you might use just those two states as initial states. Your program will learn the utilities for those states so it can make an optimal decision about what action to take in each of them. However, you have deprived your program of the opportunity to learn whether it might be better to hit on 9 or less and double down on 10-13, because those two situations fall into the same initial state.
If you structure the layout of your states well, calculating the states shouldn't be much work. For my 400-state implementation, my calc-initial-state and calc-new-state are (together) about 40 lines of code (nicely broken and indented). You can take some ideas from a7example.scm.
(define e-strat (create-exploring-rl-strategy 5.0 10))

; in the example tables I got, state 4 is one of the nonterminal
; states, so I am testing my code on that state
(map (lambda (a) (get-action-transitions 4 a)) '(hit stand double-down))
;Value: (0 0 0)

(do ((i 0 (+ i 1)))
    ((= i 30))
  (let ((a (e-strat 4 '(hit stand double-down))))
    (print a " ")
    (increment-action-transition 4 a)))
; here are all the actions my strategy chose
double-down stand double-down stand double-down double-down double-down hit stand double-down hit stand double-down double-down double-down hit stand stand hit stand double-down stand hit hit stand stand hit hit hit hit

(map (lambda (a) (get-action-transitions 4 a)) '(hit stand double-down))
;Value: (10 10 10)

; now it should pick the action with the maximum expected utility
(e-strat 4 '(hit stand double-down))
;Value: hit

; note that this is what the basic-rl-strategy does all the time
(basic-rl-strategy 4 '(hit stand double-down))
;Value: hit
A few clarifications: the value returned by a learning procedure is ignored. For this problem, the learning procedure must update the utility of state fs.
My utility values did not converge as nicely as I would have liked them to. I don't think I ever saw a maximum utility value change (after a round of maybe 1,000 or 10,000 hands) of less than 0.001. A maximum change of less than 0.05 or even 0.01 occurred after not too many rounds. You may want to try changing your alpha value, e.g., play some number of rounds with an alpha of 0.05 or 0.01, then create a new learning procedure that uses an alpha of, say, 0.001 for the remainder of the rounds. You could let the maximum utility value change be your guide: if you still have max changes of, say, 0.1 or 0.2, then decreasing alpha to 0.001 is probably not appropriate at that point.
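For what it's worth, here is a sketch of a TD learning procedure with a tunable alpha. The (fs a ts r) argument order and the get-utility / set-utility! accessors are assumptions for illustration, not necessarily the support code's actual interface, so check the handout:

; Sketch of a TD(0) learner.  Assumes the simulator calls the returned
; procedure as (learn fs a ts r): from-state, action, to-state, reward.
; get-utility and set-utility! stand in for the real table accessors.
(define (make-td-learner alpha)
  (lambda (fs a ts r)
    (let ((u-fs (get-utility fs)))
      ; TD update: U(fs) <- U(fs) + alpha * (r + U(ts) - U(fs))
      (set-utility! fs
        (+ u-fs (* alpha (- (+ r (get-utility ts)) u-fs)))))))

Annealing alpha then just means constructing a new learner, e.g. (make-td-learner 0.05) for the early rounds and (make-td-learner 0.001) later.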
Note that you can stop learning at any time and test how well your utility values work by playing, I would suggest, 100,000 hands with the basic-rl-strategy. (I'd keep table updates turned off here.) Remember that the policy will converge before the utility values do.
The thing is that a strategy procedure is given only the state number, not the player's cards. Therefore, you have to set up the states that the strategy needs and then encode the actions that should be taken from each state. This is essentially hardcoding a policy in the strategy procedure.
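As a sketch (with made-up state numbers), such a hardcoded-policy strategy might look like this:

; Sketch of a hardcoded policy.  Suppose your layout puts "hard 11 or
; less" in state 0 and "hard 12-16" in state 1; these numbers are
; illustrative only, not part of the assignment.
(define (my-fixed-strategy state actions)
  (case state
    ((0) 'hit)       ; a hard 11 or less can't bust, so always hit
    ((1) 'stand)     ; risky totals: stand
    (else 'stand)))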
Version 1.1.3 (released 12/5) contains two minor bug fixes and some changes we needed for the web testers. The bugs fixed: a printing bug for exact numbers used as utilities or rewards, and the bj-value and soft-hand? procedures now signal an error if you give them an invalid card (actually, they check only the value, not the suit).
Version 1.1.2 (released 12/4) contains several changes: a bug fix (the dealer's face-up card was passed to calc-initial-state as it should be, but the dealer's face-down card was being passed to calc-new-state for transitions to nonterminal states; this was corrected); a fix for the case when the dealer and player both have blackjack (it's a push but wasn't treated as such; this doesn't affect learning but does affect the net winnings); a few new procedures in the support code (documented below), though you probably don't need these; and slightly improved error checking on arguments passed to the support code.
Version 1.1.1 (released 11/29) contains no changes to the functionality of the support code. There was one minor bug fix and a little cleaning up of the namespace.
Version 1.1 (released 11/26) was the initial release of the support code.
(print-rl)
This procedure calls print-transitions, print-rewards, and print-utilities.
This web tester only checks to make sure that you have uploaded a working (and consistent) set of state calculation procedures. I'd advise you to wait until you have done at least an initial implementation of your mapping from game states to reinforcement learning states and have tested it by learning a blackjack player.
The a7example.scm file will pass this test; however, you will (after the fact) receive 0 credit (and still use one of your submissions) if you just submit this file (or trivial variations of it). You should upload something much closer to your final state calculation scheme.
This web tester will have your basic-rl-strategy procedure choose actions in a randomly generated (but still blackjack-like) state space.
This web tester will make sure that a strategy procedure returned by your code actually does explore the different actions from a state but later picks the action with the maximum expected utility.
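In case it helps, here is a sketch of that kind of exploration scheme. This is not necessarily how create-exploring-rl-strategy is implemented; get-action-transitions is the real support-code counter shown in the transcript above, while expected-utility is a hypothetical helper (sketched after get-action-alist below):

; Sketch of an optimistic exploration strategy in the spirit of
; create-exploring-rl-strategy: an action tried fewer than n-e times
; from this state is valued at r-plus (e.g. 5.0); otherwise it is
; valued at its learned expected utility.
(define (make-exploring-strategy r-plus n-e)
  (lambda (state actions)
    (let ((value (lambda (a)
                   (if (< (get-action-transitions state a) n-e)
                       r-plus
                       (expected-utility state a)))))
      ; return the action with the highest value
      (let loop ((best (car actions)) (rest (cdr actions)))
        (cond ((null? rest) best)
              ((> (value (car rest)) (value best))
               (loop (car rest) (cdr rest)))
              (else (loop best (cdr rest))))))))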
This web tester will check that your code does updates using the temporal differencing update rule correctly.
This web tester will simply collect a file; there will be no tests run online.
The file you turn in must have the tables you saved after learning a model of blackjack (e.g., by playing the random player for at least 10,000 (though preferably 100,000 or more) hands).
You must also include your state calculation procedures (i.e., the same procedures as for problem 1) that you used to create these tables.
If your file is larger than (I'm just guessing here) 500 KB or maybe even 100 KB, the web tester may have trouble with it. I'm not sure whether it's a limit set somewhere or whether it's a memory issue. Anyway, please try uploading it to the web tester. If you have problems, then please email a zip (or tar or tar.gz) file to aistaff@cs.
This web tester will simply collect a file; there will be no tests run online.
The file you submit must have your final tables, and you should include any code you wrote for this part, e.g., any code to "automate" the temporal differencing learning.
If your file is larger than (I'm just guessing here) 500 KB or maybe even 100 KB, the web tester may have trouble with it. I'm not sure whether it's a limit set somewhere or whether it's a memory issue. Anyway, please try uploading it to the web tester. If you have problems, then please email a zip (or tar or tar.gz) file to aistaff@cs.
(get-action-alist 0)
;Value: ((hit (1 .583) (0 .208) (6 .209))
;        (stand (2 1.))
;        (double-down (4 .801) (6 .199)))
(get-transition-actions 0)
;Value: (hit stand double-down)
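Together these let you compute the expected utility of an action under the learned model, which is what a greedy strategy needs. A sketch, where get-utility is an assumed accessor for the utility table:

; Expected utility of taking action a from state s: sum, over the
; (next-state probability) pairs from get-action-alist, of
; probability * U(next-state).  get-utility is an assumed accessor.
(define (expected-utility s a)
  (let ((transitions (cdr (assq a (get-action-alist s)))))
    (apply + (map (lambda (entry)
                    (* (cadr entry)                ; probability
                       (get-utility (car entry)))) ; U(next-state)
                  transitions))))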