Assignment 6 information

Assignment 6 web tester. Electronic submission policy is the same as for assignment 4.
Problem 3 notes:
- Here is a transcript of my chi^2-learn-dtree running on the mushroom-data1 training data.
- To keep things simple, your code should select the best attribute in the usual way and then test whether it is statistically significant. If it is not, just pick the majority classification of the examples. (You can do this with a recursive call if you give it the right arguments.)
  I believe the web tester assumes your code does this.
- As has been discussed on WebCT, the support code has the Q and chi-squared notation reversed from the assignment handout. The procedure that returns the probability that you are comparing to 0.001 should have been called q instead of chi^2, and its second argument should have been called chi^2 instead of q. To avoid further confusion, I will not change this in the support code.
  This, by the way, is the result of two conflicting notations: one from our text, and the other from the book "Numerical Recipes in C". I am trying to stick with the "Numerical Recipes in C" formulation.
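The significance test in the notes above can be sketched as follows. This is only an illustration, not the support code: `sum-list`, `deviation`, `chi-squared-statistic`, and `significant?` are hypothetical names, and the layout of `counts` (one positives/negatives pair per attribute value) is an assumption. The probability procedure is passed in as `prob`, assumed to take the degrees of freedom and the chi-squared value, so the sketch does not depend on the support code's actual signature. The statistic is the usual sum of (observed - expected)^2 / expected, and the degrees of freedom are taken to be the number of attribute values minus one.

```scheme
;; counts: one (positives negatives) pair per attribute value,
;; e.g. ((3 1) (1 3)) for a two-valued attribute. (Assumed layout.)

(define (sum-list lst)
  (if (null? lst) 0 (+ (car lst) (sum-list (cdr lst)))))

;; (observed - expected)^2 / expected, treating an expected count of 0 as 0.
(define (deviation observed expected)
  (if (= expected 0)
      0
      (/ (* (- observed expected) (- observed expected)) expected)))

;; Total deviation of the split from what an irrelevant attribute
;; would be expected to produce. Assumes at least one example.
(define (chi-squared-statistic counts)
  (let* ((p (sum-list (map car counts)))     ; total positive examples
         (n (sum-list (map cadr counts)))    ; total negative examples
         (total (+ p n)))
    (sum-list
     (map (lambda (pair)
            (let* ((size (+ (car pair) (cadr pair)))
                   (p-hat (* p (/ size total)))   ; expected positives
                   (n-hat (* n (/ size total))))  ; expected negatives
              (+ (deviation (car pair) p-hat)
                 (deviation (cadr pair) n-hat))))
          counts))))

;; prob: hypothetical procedure (degrees-of-freedom chi-squared-value)
;; returning the probability Q. A small Q means the split is significant.
(define (significant? counts prob threshold)
  (< (prob (- (length counts) 1) (chi-squared-statistic counts))
     threshold))

(chi-squared-statistic '((3 1) (1 3)))   ; => 2
```

With a threshold of 0.001, a node whose best attribute fails `significant?` would become a leaf holding the majority classification.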
Support code
- assign6.scm --- a "stubs" file
- a6code.com --- compiled support code, version 1.3.1 out 11/17
  Changes in version 1.3.1
  - added the majority-attribute-value procedure, described in the problem 4 section below
  - corrected a bug in the pick-random-subset procedure
- a6header.scm --- "headers" for the compiled code (see comments for more details if you're interested)
- a6data.scm, Version 2.0 --- training/testing data.
- a6p4data.scm --- a training data set with missing attributes
  [11/17] NEW VERSION: separates what was voting-data into voting-data1 and voting-data2
One procedure has a different name in the support code than in the assignment handout, and the support code includes some procedures that I forgot to list in the handout:
- The procedure called tally in the assignment handout is called tally-tdata in the support code.
- (log2 x) returns the logarithm to base 2 of x
- (pick-majority tally)
  given a "tally", returns the majority attribute value
- (pick-random-subset Lst size)
  randomly picks "size" elements of "Lst". You can use this to select a training data set from a larger list.
- (print . stuff)
  takes 0 or more arguments and prints each to the screen.
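For experimenting without the compiled support code loaded, stand-ins for two of these procedures might look like the sketch below. These are assumptions about behavior based only on the descriptions above, not the actual support code; in particular, whether the real `print` ends with a newline is a guess. The `entropy` procedure is a hypothetical helper included only to show `log2` in use.

```scheme
;; Stand-in for the support code's log2: logarithm to base 2.
(define (log2 x)
  (/ (log x) (log 2)))

;; Stand-in for the support code's print: displays each argument in turn.
;; The trailing newline is an assumption.
(define (print . stuff)
  (for-each display stuff)
  (newline))

;; Hypothetical helper: entropy in bits of a list of class counts,
;; e.g. (2 2) for two positive and two negative examples.
(define (entropy counts)
  (let ((total (apply + counts)))
    (apply +
           (map (lambda (c)
                  (if (= c 0)
                      0                       ; 0 * log2(0) is taken as 0
                      (let ((q (/ c total)))
                        (- (* q (log2 q))))))
                counts))))

(print "entropy of an even split: " (entropy '(2 2)))
```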
Here is a narrative that my program produced from running the snorkel data.

Problem 4

Since this is out late, I've tried to make this as simple as possible. Here are the details:

[11/18] Note that the decision tree learned by your missing-learn-dtree will have no ? symbols in it. Therefore, when your decision tree is tested on an example with missing attribute values, that example will be given the default answer. (Actually, the test procedure, as it stands now, will count this as an incorrect classification. This, unfortunately, will make the decision tree have a lower percentage of correct examples than it should, but I don't want to make more changes to the support code now.)
Like the chi-squared pruning problem, you should be able to make a copy of your learn-dtree code and make some relatively minor changes to it.
Use the simplest strategy for dealing with missing data: assign the majority value of that attribute to all the examples missing that attribute value.
You don't actually have to change the attribute value of an example. You can implement this simply by "fixing" the split on an attribute. The split-tdata procedure will treat a ? value just like any other value and include a split for it. For example, we might have:
```
((small ((expensive (red  small furry))
	 (cheap     (blue small furry))))
 (large ((expensive (pink large furry))
	 (expensive (blue large furry))
	 (cheap     (blue large smooth))))
 (?     ((cheap     (red  ?     smooth)))))
```
In order to treat the ? example as though it had the majority attribute value large, you can just add it to the list of examples for this value, i.e.
```
((small ((expensive (red  small furry))
	 (cheap     (blue small furry))))
 (large ((expensive (pink large furry))
	 (expensive (blue large furry))
	 (cheap     (blue large smooth))
	 (cheap     (red  ?     smooth)))))
```
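One way to "fix" the split is sketched below. It assumes the split is a list of (value examples) pairs exactly as in the example above; `merge-missing-split` is a hypothetical name, not part of the support code, and the majority value is passed in as an argument.

```scheme
;; splits:   list of (value examples) pairs, as in the example above
;; majority: the attribute value that the ? examples should be treated as
;; Returns the splits with the ? examples appended to the majority split
;; and the ? split removed. If no split has the majority value, the ?
;; examples are simply dropped (the case the assignment says to ignore).
(define (merge-missing-split splits majority)
  (let ((missing (assq '? splits)))
    (if (not missing)
        splits                                   ; no ? split: nothing to fix
        (let loop ((rest splits) (result '()))
          (cond ((null? rest) (reverse result))
                ((eq? (car (car rest)) '?)
                 (loop (cdr rest) result))       ; leave out the ? split
                ((eq? (car (car rest)) majority)
                 (loop (cdr rest)
                       (cons (list majority
                                   (append (cadr (car rest))
                                           (cadr missing)))
                             result)))
                (else
                 (loop (cdr rest) (cons (car rest) result))))))))
```

Calling `(merge-missing-split splits 'large)` on the first split shown above moves the single ? example into the list for large and leaves the list for small unchanged.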
The rest of your code should pretty much be the same as the regular learn-dtree.
I have added another procedure to the support code (version 1.3). It is not necessary to use this procedure, but you may find it helpful. The procedure is:
```
  (majority-attribute-value tdata anames attribute)
```
It will pick out the attribute value with the largest number of examples, excluding the ? attribute value. If the only attribute value in tdata is ?, then it returns '(). You can assume that you don't have to worry about this return value.
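If you would rather work from the split itself than call the support procedure, the same idea can be sketched directly on the list of (value examples) pairs. `majority-split-value` is a hypothetical name, and this is not the support code's implementation; how the real procedure breaks ties is not specified, so the sketch just keeps the first value with the largest count.

```scheme
;; splits: list of (value examples) pairs, possibly including a ? split.
;; Returns the value with the most examples, ignoring the ? split,
;; or '() if ? is the only value present.
(define (majority-split-value splits)
  (let loop ((rest splits) (best '()) (best-count -1))
    (cond ((null? rest) best)
          ((eq? (car (car rest)) '?)
           (loop (cdr rest) best best-count))        ; skip the ? split
          (else
           (let ((count (length (cadr (car rest)))))
             (if (> count best-count)
                 (loop (cdr rest) (car (car rest)) count)
                 (loop (cdr rest) best best-count)))))))
```

On the example split above, small has two examples and large has three, so the result is large.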
You should be able to do this problem with one small change to your learn-dtree code, plus about 15-20 additional lines of code.

Problem 5

Run the following tests with your code:

Run your learn-dtree and chi^2-learn-dtree procedures on the mushroom data sets 1 and 2, and test the resulting decision trees on the other data set. Show the resulting decision trees (leave them in the Scheme representation instead of drawing them). Which procedure is better on your tests?
Run your learn-dtree, chi^2-learn-dtree, and missing-learn-dtree procedures on the voting-data1 and voting-data2 data sets and test the resulting decision trees on the other data set. Show the resulting decision trees, as above. Which procedure is best on your tests?