Machine Learning Laboratory
How to Run ITI/DMTI
The ITI program has a command line interface, as is customary for Unix
environments.
ITI builds a decision tree in incremental or batch mode. In
incremental mode, ITI incorporates each instance into the tree, and
restructures the tree as needed so that it becomes the same tree that
one would have gotten with the batch algorithm. Incremental induction
is usually much less expensive than rebuilding the tree from scratch.
One would want to use incremental induction in a serial learning task,
such as ongoing knowledge maintenance.
In addition to incremental induction, the ability to restructure a
tree makes it possible in many cases to travel through tree-space
inexpensively. The program includes direct metric tree induction
(non-incremental) in which it tries various tests at a node, and
evaluates the quality of each resulting tree.
You will need to prepare at least two files for any task you
wish to run. These files follow Quinlan's C4.5 format (see chapter 9
of Quinlan's book for more detail). First
create a subdirectory (of your data directory) in which you will
put all files related to the task. Then, in the subdirectory
create a `names' file. The first line of this file is a comma-separated
list of the allowable class names that will appear in the data file. This
first line is terminated by a period. Each successive line of the `names'
file gives the variable name followed by a colon, followed by a period.
This is enough for ITI, but if you want to be able to use c4.5 on your
data, you will need to provide more information between the colon and
the period. For a numeric variable (all its values are numeric),
put the word `continuous'. For a discrete variable, list the possible
values, separated by commas.
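As a concrete sketch, here is how one might create a `names' file for a hypothetical three-variable, two-class task (the task name `mytask' and all class, variable, and value names below are made up for illustration):

```shell
# Create a hypothetical task subdirectory and its `names' file.
# First line: the allowable class names, terminated by a period.
# Each later line: variable name, colon, C4.5 type info, period.
mkdir -p mytask
cat > mytask/names <<'EOF'
yes, no.
outlook: sunny, overcast, rain.
temperature: continuous.
windy: true, false.
EOF
cat mytask/names
```

The type information after each colon (`continuous' or the list of discrete values) is optional for ITI itself, but including it keeps the file usable with c4.5 as described above.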
Second, create a data file of any name you choose, but with
extension `.data'. This file contains one data item (record)
per line. Each line is a comma-separated list of values, one
per variable, followed by a comma, followed by the class label,
followed by a terminating period. The special symbol `?' indicates
that the value of the corresponding variable is unknown. The
code distribution includes a `weather' task, with a `names' file
and an `all.data' file.
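A matching data file for a hypothetical three-variable, two-class task (all values below are made up for illustration) can be sketched the same way:

```shell
# Sketch of a `.data' file: one record per line, comma-separated
# values (one per variable), then the class label, then a
# terminating period; `?' marks an unknown value.
mkdir -p mytask
cat > mytask/all.data <<'EOF'
sunny, 27, false, no.
overcast, ?, true, yes.
rain, 18, true, no.
EOF
wc -l < mytask/all.data   # counts the three records
```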
To run ITI, type:
iti stem [[-d..][-e][-E][-f][-h..][-i][-j][-l..][-m][-M][-L][-p][-P..][-q..][-r..][-R][-s..][-t][-u][-v][-w]]*
where `stem' is the name of the task (subdirectory of your data
directory), and the various options are interpreted as described
below. The actions specified by the options are taken in order, so
one must specify them in the order intended. Repetition of options is
meaningful. Some sequences of options are nonsensical but the idea
has been to provide maximum flexibility.
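Because the options act strictly in order, a plausible invocation (the task name `mytask' and instance file name `all' are hypothetical) loads the instances before building and printing:

```shell
# Hypothetical session: -l loads instances from the file `all',
# -f builds the tree in batch mode, -w prints an ascii version.
# Putting -f before -l would be one of the nonsensical orderings,
# since it would attempt to build before any instances are loaded.
iti mytask -lall -f -w
```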
Options:
- -d..
- Draw the current tree into the specified pst file (`.pst'
will be appended to the specified name). The pst file can be
converted to a postscript file by running the `pst' program, which is
part of this distribution and is described below.
- -E
- Invoke the dmti algorithm to build a tree using the expected
number of tests as the direct metric. The tree is built by
restructuring the current tree.
- -e
- Update the current tree using error correction mode on the
current set of training instances. ITI will repeatedly incorporate a
training instance until no unincorporated training instance would be
misclassified by the current tree.
- -f
- Build a tree in batch (non-incremental) mode from the
current set of training instances. For a given set of instances,
batch mode is faster than building the tree incrementally, and is
the fastest way to build a tree from scratch. (Note that it is
usually faster to incorporate one new instance into an existing
tree than it is to rebuild the tree from scratch in batch mode.)
- -h..
- Include the specified heading on the performance graph. This
option only makes sense when -m has also been specified.
- -i
- Update the current tree incrementally from the current
unincorporated training instances.
- -j
- Shuffle the training set before starting. This makes sense
only prior to training in error correction mode (see -e option).
Otherwise it is largely useless because ITI always builds the same
tree for the same set of incorporated instances.
- -l..
- Load the training instances from the specified file.
These instances define the current training set of unincorporated
instances.
- -L
- Invoke the dmti algorithm to build a tree using the number
of leaves as the direct metric. Because the tree is binary, the total
number of nodes is 2*leaves-1 (a binary tree with k leaves has k-1
internal test nodes). The tree is built by restructuring the
current tree.
- -M
- Invoke the dmti algorithm to build a tree using the minimum
description length as the direct metric. The tree is built by
restructuring the current tree.
- -m
- Do performance measuring. This causes a file (with
`.trace' appended to the name) to be created during training that can
be used for producing postscript performance graphs. You can use the
PLOT program, as in
plot < stem.trace > stem.ps
to produce performance figures for papers or slides. The PLOT
program is not supported, but is included in case it is useful.
You may need to hack PLOT to get what you want.
- -P
- Set the minimum number of instances of the second most
frequently occurring class at a node for the node to be
considered impure. The default value is 1.
- -p
- Virtually prune the current tree now, and virtually prune
it whenever the tree changes hereafter. See -u. Virtual pruning is
done according to the minimum description length principle. This is
generally useful when the instances are known to be noisy or
overfitting is likely.
- -q..
- Load the testing instances from the specified file.
These instances define the current testing set.
- -R
- Print the root node of the tree in a verbose form. This is
not generally useful. It is for debugging, and is mentioned here only
for the sake of completeness.
- -r..
- Restore the tree from the specified file (`.iti' is
appended to the specified name) as the current tree.
- -s..
- Save the current tree to the specified file (`.iti' is
appended to the specified name).
- -t
- Test the testing set on the current tree. Various
measures, including classification accuracy, are reported.
- -u
- Unprune (virtually) the current tree now, and do not
virtually prune it whenever the tree changes. See -p.
- -v
- Toggle (initially off) printing of the instances as they
are loaded. This is often useful for identifying a syntax error in
the names or data files.
- -R
- Print the entire tree in a verbose form. This is not
generally useful. It is for debugging, and is mentioned here only for
the sake of completeness.
- -w
- Print an ASCII version of the current tree.
Here are two examples:
- To load the led7 training instances, then build a tree quickly,
then prune it, then print it, then draw it, one could type:
iti led7 -lcart -f -p -w -dcart
- To load the 6-bit multiplexor training instances, then build a tree
incrementally, then draw it, then restructure the tree via dmti, and
then draw it, one could type:
iti mplex-6 -lall -i -dm6iti -E -dm6E
For running cross-validation tests, xval-prep and a modified xval.sh
are included from Quinlan's C4.5 distribution (with permission of and
thanks to J.R. Quinlan). xval-prep.c requires no modification,
but you may want to edit xval.sh to suit your own purposes.
Last Updated: March 30, 1999
© Copyright 1997, All Rights Reserved, Paul Utgoff, University of Massachusetts