CSCI 4150: Introduction to Artificial Intelligence, Fall 2004
However, if you're out of town or something, you may email the writeup (in text, PostScript, or PDF format only, i.e., not MS Word) to aistaff, but then of course you have to send it in by midnight Sunday.
For the problem 5b web tester, I will ask you to include your state calculation code (the same procedures as for problem 1) in the file you upload for this problem. This is because the state calculation procedures you use for this problem need not be the same as the procedures you upload for problem 1.
The argument player-hand will be a list of the two cards the player was initially dealt; dealer-hand will be a list of one card, the dealer's face-up card. This procedure must return a valid reinforcement learning state, i.e., an integer from zero (inclusive) up to the number of states declared to create-tables (exclusive).
Needless to say, calc-new-state should also return a valid reinforcement learning state. I've updated the assignment handout on the handouts page.
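To make that contract concrete, here is a deliberately coarse sketch of one possible encoding. It assumes bj-value returns the blackjack total of a hand and soft-hand? tests whether a hand contains an ace counted as 11 (check the support code documentation for the actual behavior), and it ignores the dealer's card entirely, which a real layout should not:

; Sketch only: hard totals 4-21 map to states 0-17, every soft hand
; maps to state 18, and state 19 is reserved as the terminal state,
; so create-tables would be declared with 20 states.
(define (calc-initial-state player-hand dealer-hand)
  (if (soft-hand? player-hand)
      18
      (- (bj-value player-hand) 4)))

calc-new-state could reuse the same mapping for nonterminal hands.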
You don't need to use a lot of states for this assignment; however, the states should represent the different situations of the game.
At one extreme, you could have separate states for each possible combination of cards. However, there's a lot of "duplication" here: for example, the initial hands 2 and 9, 3 and 8, 4 and 7, and 5 and 6 would all be played the same way, so it's fine for them to be represented by the same state.
At the other extreme, you could get away with using only two states: one for "still playing the hand" and one terminal state. This, however, would treat all hands in exactly the same way, whereas you really should play hands differently depending on the value, soft hand or not, dealer's face-up card, etc.
You should use some knowledge to structure the states, but you should leave some room for the program to learn. For example, if you think you should always stand on 14 or higher and always hit on 13 or less, you might use just those two states as initial states. Your program will learn the utilities for those states so it can make an optimal decision about what action to take in each of them. However, you have deprived your program of the opportunity to learn whether it might be better to hit on 9 or less and double down on 10-13, because those two situations fall into the same initial state.
If you structure the layout of your states well, calculating the states shouldn't be much work. For my 400-state implementation, my calc-initial-state and calc-new-state are (together) about 40 lines of code (nicely broken and indented). You can take some ideas from a7example.scm.
(define e-strat (create-exploring-rl-strategy 5.0 10))

; in the example tables I got, state 4 is one of the nonterminal
; states, so I am testing my code on that state
(map (lambda (a) (get-action-transitions 4 a)) '(hit stand double-down))
;Value: (0 0 0)

(do ((i 0 (+ i 1)))
    ((= i 30))
  (let ((a (e-strat 4 '(hit stand double-down))))
    (print a " ")
    (increment-action-transition 4 a)))
; here are all the actions my strategy chose
double-down stand double-down stand double-down double-down double-down hit stand double-down hit stand double-down double-down double-down hit stand stand hit stand double-down stand hit hit stand stand hit hit hit hit

(map (lambda (a) (get-action-transitions 4 a)) '(hit stand double-down))
;Value: (10 10 10)

; now it should pick the action with the maximum expected utility
(e-strat 4 '(hit stand double-down))
;Value: hit

; note that this is what the basic-rl-strategy does all the time
(basic-rl-strategy 4 '(hit stand double-down))
;Value: hit
A few clarifications: the value returned by a learning procedure is ignored. For this problem, the learning procedure must update the utility of state fs.
My utility values did not converge as nicely as I would have liked them to. I don't think I ever saw a maximum utility value change (after a round of maybe 1,000 or 10,000 hands) of less than 0.001. A maximum change of less than 0.05 or even 0.01 occurred after not too many rounds. You may want to try changing your alpha value, e.g., play some number of rounds with an alpha of 0.05 or 0.01, then create a new learning procedure that uses an alpha of, say, 0.001 for the remainder of the rounds. You could let the maximum utility value change be your guide: if you still have max changes of, say, 0.1 or 0.2, then decreasing alpha to 0.001 is probably not appropriate at that point.
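For what it's worth, here is a sketch of a TD learning procedure with a tunable alpha. The (fs a ts r) argument order and the get-utility / set-utility! accessors are assumptions for illustration, not necessarily the support code's actual interface, so check the handout:

; Sketch of a TD(0) learner.  Assumes the simulator calls the returned
; procedure as (learn fs a ts r): from-state, action, to-state, reward.
; get-utility and set-utility! stand in for the real table accessors.
(define (make-td-learner alpha)
  (lambda (fs a ts r)
    (let ((u-fs (get-utility fs)))
      ; TD update: U(fs) <- U(fs) + alpha * (r + U(ts) - U(fs))
      (set-utility! fs
        (+ u-fs (* alpha (- (+ r (get-utility ts)) u-fs)))))))

Annealing alpha then just means constructing a new learner, e.g. (make-td-learner 0.05) for the early rounds and (make-td-learner 0.001) later.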
Note that you can stop learning at any time and test how well your utility values work by playing, I would suggest, 100,000 hands with the basic-rl-strategy. (I'd keep table updates turned off here.) Remember that the policy will converge before the utility values do.
The thing is that a strategy procedure is given only the state number, not the player's cards. Therefore, you have to set up the states that the strategy needs and then encode the actions that should be taken from each state. This is essentially hardcoding a policy in the strategy procedure.
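As a sketch (with made-up state numbers), such a hardcoded-policy strategy might look like this:

; Sketch of a hardcoded policy.  Suppose your layout puts "hard 11 or
; less" in state 0 and "hard 12-16" in state 1; these numbers are
; illustrative only, not part of the assignment.
(define (my-fixed-strategy state actions)
  (case state
    ((0) 'hit)       ; a hard 11 or less can't bust, so always hit
    ((1) 'stand)     ; risky totals: stand
    (else 'stand)))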
Version 1.1.3 (released 12/5) contains two minor bug fixes and some changes we needed for the web testers. The bugs fixed: a printing bug for exact numbers used as utilities or rewards, and the bj-value and soft-hand? procedures now signal an error if you give them an invalid card (actually, they check only the value, not the suit).
Version 1.1.2 (released 12/4) contains several changes: a bug fix (the dealer's face-up card was passed to calc-initial-state as it should be, but the dealer's face-down card was being passed to calc-new-state for transitions to nonterminal states; this was corrected); a fix for the case when the dealer and player both have blackjack (it's a push but wasn't treated as such; this doesn't affect learning but does affect the net winnings); a few new procedures in the support code (documented below), though you probably don't need these; and slightly improved error checking on arguments passed to the support code.
Version 1.1.1 (released 11/29) contains no changes to the functionality of the support code. There was one minor bug fix and a little cleaning up of the namespace.
Version 1.1 (released 11/26) was the initial release of the support code.
(print-rl)
This procedure calls print-transitions, print-rewards, and print-utilities.
This web tester only checks to make sure that you have uploaded a working (and consistent) set of state calculation procedures. I'd advise you to wait until you have done at least an initial implementation of your mapping from game states to reinforcement learning states and have tested it by learning a blackjack player.
The a7example.scm file will pass this test; however, you will (after the fact) receive 0 credit (and still use one of your submissions) if you just submit this file (or trivial variations of it). You should upload something much closer to your final state calculation scheme.
This web tester will have your basic-rl-strategy procedure choose actions in a randomly generated (but still blackjack-like) state space.
This web tester will make sure that a strategy procedure returned by your code actually does explore the different actions from a state but later picks the action with the maximum expected utility.
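In case it helps, here is a sketch of that kind of exploration scheme. This is not necessarily how create-exploring-rl-strategy is implemented; get-action-transitions is the real support-code counter shown in the transcript above, while expected-utility is a hypothetical helper (sketched after get-action-alist below):

; Sketch of an optimistic exploration strategy in the spirit of
; create-exploring-rl-strategy: an action tried fewer than n-e times
; from this state is valued at r-plus (e.g. 5.0); otherwise it is
; valued at its learned expected utility.
(define (make-exploring-strategy r-plus n-e)
  (lambda (state actions)
    (let ((value (lambda (a)
                   (if (< (get-action-transitions state a) n-e)
                       r-plus
                       (expected-utility state a)))))
      ; return the action with the highest value
      (let loop ((best (car actions)) (rest (cdr actions)))
        (cond ((null? rest) best)
              ((> (value (car rest)) (value best))
               (loop (car rest) (cdr rest)))
              (else (loop best (cdr rest))))))))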
This web tester will check that your code does updates using the temporal differencing update rule correctly.
This web tester will simply collect a file; there will be no tests run online.
The file you turn in must have the tables you saved after learning a model of blackjack (e.g., by playing the random player for at least 10,000 (though preferably 100,000 or more) hands).
You must also include your state calculation procedures (i.e., the same procedures as for problem 1) that you used to create these tables.
If your file is larger than (I'm just guessing here) 500 KB or maybe even 100 KB, the web tester may have trouble with it. I'm not sure whether it's a limit set somewhere or whether it's a memory issue. Anyway, please try uploading it to the web tester. If you have problems, then please email a zip (or tar or tar.gz) file to aistaff@cs.
This web tester will simply collect a file; there will be no tests run online.
The file you submit must have your final tables, and you should include any code you wrote for this part, e.g., any code to "automate" the temporal differencing learning.
If your file is larger than (I'm just guessing here) 500 KB or maybe even 100 KB, the web tester may have trouble with it. I'm not sure whether it's a limit set somewhere or whether it's a memory issue. Anyway, please try uploading it to the web tester. If you have problems, then please email a zip (or tar or tar.gz) file to aistaff@cs.
(get-action-alist 0)
;Value: ((hit (1 .583) (0 .208) (6 .209))
;        (stand (2 1.))
;        (double-down (4 .801) (6 .199)))
(get-transition-actions 0)
;Value: (hit stand double-down)
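Together these let you compute the expected utility of an action under the learned model, which is what a greedy strategy needs. A sketch, where get-utility is an assumed accessor for the utility table:

; Expected utility of taking action a from state s: sum, over the
; (next-state probability) pairs from get-action-alist, of
; probability * U(next-state).  get-utility is an assumed accessor.
(define (expected-utility s a)
  (let ((transitions (cdr (assq a (get-action-alist s)))))
    (apply + (map (lambda (entry)
                    (* (cadr entry)                ; probability
                       (get-utility (car entry)))) ; U(next-state)
                  transitions))))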