
Implement the ID3 decision tree learning algorithm that we discussed in class. To simplify the implementation, your system only needs to handle ternary classification tasks (i.e., each instance will have a class value of 0, 1, or 2). In addition, you may assume that all attributes are ternary-valued (i.e., the only possible attribute values are 0, 1, and 2) and that there are no missing values in the training or test data.

Some sample training files (train.dat, train2.dat) and test files (test.dat, test2.dat) are available from the assignment page of the course website. In these files, only lines containing non-space characters are relevant. The first relevant line holds the attribute names. Each following relevant line defines a single example. Each column holds an example's value for the attribute named at the head of the column. The last column (labeled "class") holds the class label for the examples. In all of the following experiments, you should use this last class attribute to train the tree and to determine whether a tree classifies an example correctly.

When building a decision tree, if you reach a leaf node but still have examples that belong to different classes, then choose the most frequent class (among the instances at the leaf node). If you reach a leaf node in the decision tree and have no examples left, then choose the class that is most frequent in the entire training set. If you reach a leaf node in the decision tree and there is a tie for the most frequent class, then break ties among the tied classes by preferring the one that is more frequent in the entire training set. If two or more classes are equally frequent in the entire training set, then break ties by preferring class 0 to class 1 and preferring class 1 to class 2. Do not implement pruning.

A word on tie breaking: when choosing attributes using information gain, if two or more attributes achieve the highest information gain value, break ties by choosing the earliest one in the list of attributes (assuming that the attributes in the first line of a training file are read in a left-to-right manner).

IMPORTANT:

• You must use Python to implement the ID3 algorithm. Do not use any non-standard libraries in your code, and make sure your program can run on UTD machines, not just your own machines.

• Your program should be able to handle any ternary classification task with any number of ternary-valued attributes. Consequently, both the number and names of the attributes, as well as the number of training and test instances, should be determined at runtime. In other words, these values should not be hard-coded in your program.

• Your program should allow only two arguments to be specified in the command-line invocation of your program: a training file and a test file. There should be no graphical user interface (GUI) of any kind. Any program that does not conform to the above specification will receive no credit.

• Use logarithm base 2 when computing entropy, and define 0 log2 0 to be 0 (see the sketch after this list).

• In the input files, only lines containing non-space characters are relevant, as mentioned previously. In particular, empty lines may appear anywhere in an input file, including the beginning and the end of the file. Care should be taken to skip over these empty lines.
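As a concrete starting point, here is a minimal sketch (Python 3, standard library only) of the file parsing and the entropy/information-gain computations under the assumptions above. The names load_data, entropy, info_gain, and best_attribute are illustrative choices, not part of the specification:

import math

def load_data(path):
    # Read a whitespace-separated data file, skipping blank lines.
    # Returns (attribute_names, examples); each example is a list of
    # ints with the class label in the last position.
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    attributes = rows[0][:-1]          # drop the trailing "class" column name
    examples = [[int(v) for v in row] for row in rows[1:]]
    return attributes, examples

def entropy(examples):
    # Entropy of the class distribution, log base 2, with 0 log2 0 = 0.
    n = len(examples)
    if n == 0:
        return 0.0
    h = 0.0
    for c in (0, 1, 2):
        p = sum(1 for ex in examples if ex[-1] == c) / n
        if p > 0:                      # p = 0 terms contribute 0 by definition
            h -= p * math.log2(p)
    return h

def info_gain(examples, attr_index):
    # Information gain of splitting the examples on one attribute.
    n = len(examples)
    remainder = 0.0
    for v in (0, 1, 2):
        subset = [ex for ex in examples if ex[attr_index] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def best_attribute(examples, candidate_indices):
    # Highest-gain attribute; candidate_indices must be in left-to-right
    # file order so that the strict ">" keeps the earliest attribute on ties.
    best, best_gain = None, -1.0
    for i in candidate_indices:
        g = info_gain(examples, i)
        if g > best_gain:
            best, best_gain = i, g
    return best

Note that the strict ">" comparison is what implements the required attribute tie-break: a later attribute with an equal gain never displaces an earlier one.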
Your Tasks

a. Build a decision tree using the training instances and print to stdout the tree in the same format as the example tree shown below. (A sketch covering tree construction, printing, and accuracy reporting follows this task list.)

wesley = 0 :
| honor = 0 :
| | barclay = 0 : 1
| | barclay = 1 : 2
| | barclay = 2 : 1
| honor = 1 :
| | tea = 0 : 0
| | tea = 1 : 1
| | tea = 2 : 2
wesley = 1 :
| barclay = 0 : 2
| barclay = 1 : 1
| barclay = 2 : 0
wesley = 2 : 1

According to this tree, if wesley = 0 and honor = 0 and barclay = 0, then the class value of the corresponding instance should be 1. In other words, the value appearing before a colon is an attribute value, and the value appearing after a colon is a class value.

b. Use the learned decision tree to classify the training instances. Print to stdout the accuracy of the tree. (In this case, the tree has been trained and tested on the same data set.) The accuracy should be computed as the percentage of examples that were correctly classified. For example, if 86 of 90 examples are classified correctly, then the accuracy of the decision tree would be 95.6%. (Note that the accuracy on the training instances will be 100% if and only if the training instances are consistent.)

Accuracy on training set (90 instances): 95.6%

c. Use the learned decision tree to classify the test instances. Print to stdout the accuracy of the tree. (In this case, the decision tree has been trained and tested on different data sets.)

Accuracy on test set (10 instances): 60.0%

d. Now, we want to investigate how the amount of training data affects the accuracy of the resulting decision tree. Plot a learning curve (i.e., a graph of the accuracy of your algorithm on the test set against different training set sizes) by re-training your learning algorithm on train.dat using training set sizes of 100, 200, 300, ..., 800. Briefly comment on the shape of the curve. Does it exhibit the usual properties of a learning curve? (We suggest that you plot the graph using Excel, but if you choose to draw the graph by hand, you need to scan it so that you can submit it online. We will not accept hardcopy submissions. A small driver for generating the curve's data points is sketched at the end of this document.)
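Building on the helpers in the previous sketch, the following is one possible shape for the remaining pieces: tree construction with the leaf and tie-breaking rules above, printing in the required format, and accuracy reporting. Again, every name here (majority_class, build_tree, print_tree, classify, accuracy) is an illustrative choice rather than a required interface:

import sys

def majority_class(examples, global_counts):
    # Most frequent class among the examples; ties broken first by global
    # frequency in the training set, then by preferring 0 over 1 over 2.
    counts = [sum(1 for ex in examples if ex[-1] == c) for c in (0, 1, 2)]
    return max((0, 1, 2), key=lambda c: (counts[c], global_counts[c], -c))

def build_tree(examples, candidates, global_counts):
    # Recursive ID3. A node is either an int (a leaf's class value) or a
    # pair (attribute_index, children) with one child per attribute value.
    if not examples:
        # Empty leaf: fall back to the class most frequent in the entire
        # training set (ties: prefer 0 over 1 over 2).
        return max((0, 1, 2), key=lambda c: (global_counts[c], -c))
    classes = {ex[-1] for ex in examples}
    if len(classes) == 1:                     # pure leaf
        return classes.pop()
    if not candidates:                        # out of attributes: mixed leaf
        return majority_class(examples, global_counts)
    a = best_attribute(examples, candidates)
    rest = [i for i in candidates if i != a]  # preserves left-to-right order
    children = {v: build_tree([ex for ex in examples if ex[a] == v],
                              rest, global_counts)
                for v in (0, 1, 2)}
    return (a, children)

def print_tree(node, names, depth=0):
    # Emit the "attr = value :" format, one branch per line.
    if isinstance(node, int):                 # degenerate case: root is a leaf
        print(node)
        return
    attr, children = node
    for v in (0, 1, 2):
        child = children[v]
        prefix = "| " * depth + "%s = %d :" % (names[attr], v)
        if isinstance(child, int):
            print(prefix, child)              # leaf: class value after the colon
        else:
            print(prefix)
            print_tree(child, names, depth + 1)

def classify(tree, example):
    node = tree
    while not isinstance(node, int):
        attr, children = node
        node = children[example[attr]]
    return node

def accuracy(tree, examples):
    correct = sum(1 for ex in examples if classify(tree, ex) == ex[-1])
    return 100.0 * correct / len(examples)

if __name__ == "__main__":
    attrs, train = load_data(sys.argv[1])
    _, test = load_data(sys.argv[2])
    global_counts = [sum(1 for ex in train if ex[-1] == c) for c in (0, 1, 2)]
    tree = build_tree(train, list(range(len(attrs))), global_counts)
    print_tree(tree, attrs)
    print("Accuracy on training set (%d instances): %.1f%%"
          % (len(train), accuracy(tree, train)))
    print("Accuracy on test set (%d instances): %.1f%%"
          % (len(test), accuracy(tree, test)))

A run would then look like, e.g., python id3.py train.dat test.dat (the script name is arbitrary).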
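For task d, a small driver along the following lines could print the (training size, test accuracy) pairs to paste into Excel. It assumes the load_data, build_tree, and accuracy helpers from the sketches above are in scope, and it takes "training set size k" to mean the first k examples of train.dat, which the assignment does not explicitly specify:

def learning_curve(train_path, test_path, sizes=range(100, 900, 100)):
    # Re-train on progressively larger prefixes of the training file and
    # report test-set accuracy for each size.
    attrs, train = load_data(train_path)
    _, test = load_data(test_path)
    for k in sizes:
        subset = train[:k]                    # first k examples (an assumption)
        counts = [sum(1 for ex in subset if ex[-1] == c) for c in (0, 1, 2)]
        tree = build_tree(subset, list(range(len(attrs))), counts)
        print(k, "%.1f" % accuracy(tree, test))

learning_curve("train.dat", "test.dat")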