Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages

Data Mining Source Code Newsletter

Announcing The Data Mining Source Code Newsletter!

Subscribe By Email | Subscribe By RSS Feed

DTView - Decision and Regression Tree Visualization (Java)

Last post 08-31-2004, 12:21 by Dr. Who. 0 replies.
  •  08-31-2004, 12:21 2182

    DTView - Decision and Regression Tree Visualization (Java)

    Attachment: dtview.zip
    Java sources, version 1.9, 2004.07.05

    Description

    A visualization program for decision and regression trees, which were created with the command line programs (written in C) of the dtree package. This program provides flexible ways of adapting the layout of the decision/regression tree. It also contains a toolbox dialog, which provides a simple user interface to the decision and regression tree induction programs.

    DTView

    Decision and Regression Tree Visualization

    Introduction

    Decision trees are among the most popular methods for data analysis and classifier construction. With the dtree package I provided a set of command line programs written in C, with which decision and regression trees can be grown and pruned as well as executed on new data. However, these programs lacked a graphical user interface and, in particular, a way to visualize a learned decision or regression tree. This drawback is remedied by this decision and regression tree visualization program written in Java, which also contains a user interface to the command line programs, consisting of a dialog box with an executable tab for each program.

    Enjoy,
    Christian Borgelt

    Invocation

    To start the program from the jar file, type java -jar dtview.jar; to start it from the compiled sources, type java dtview.DTView. A decision tree may then be loaded by selecting File > Load Tree... Alternatively, the name of the decision tree file to display may be given on the command line. Example decision tree files can be found in the source package in the directory dtview/data.

    The tree toolbox alone (i.e., the user interface to the command line decision and regression tree programs, without the visualization part) may be invoked from the compiled sources with java dtview.DTToolbox.

    Visualization

    The following picture explains the layout of a decision tree and the meaning of the different fields of a node. It shows a decision tree for the well-known iris data, which is a data set containing measurements of the petal length and width and the sepal length and width of three types of iris flowers.

    In the root node, shown at the top, the upper label field contains the name of the target attribute of the decision tree, in this case iris_type. For all other inner nodes and for all leaves the upper label field contains the branch value of the attribute tested in the parent node, for example < 2.45. For all inner nodes the lower label field shows the name of the attribute tested in this node (here either petal_length or petal_width, so the condition that corresponds to the value named above is petal_width < 2.45), while for leaves it shows the name of the most frequent class, which is predicted if a case is classified with this leaf node. For this example the class is one of Iris-setosa, Iris-virginica and Iris-versicolor.

    Between the two label fields there are two bar diagrams, the upper of which shows as a dark gray bar the fraction of the training data set that falls to this node, thus indicating the importance of the subtree. The lighter gray bar (which also counts from the left side of the node and thus is, in a way, "behind" the dark gray bar) shows the fraction of sample cases relative to those assigned to the parent node.

    The lower bar diagram shows the class distribution, with one color for each class that is present in the node. The width of each color bar indicates the fraction of cases assigned to this node that have the corresponding class.
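
    The quantities behind the two bar diagrams can be sketched as follows. This is only an illustration of the fractions described above, not code from the DTView sources; the class name NodeBars, its method names and the case counts in main are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of DTView's actual sources. */
final class NodeBars {
    /** Fraction of all training cases that reach this node (dark gray bar). */
    static double fractionOfTotal(double nodeCases, double totalCases) {
        return totalCases > 0 ? nodeCases / totalCases : 0.0;
    }

    /** Fraction of the parent's cases that reach this node (light gray bar). */
    static double fractionOfParent(double nodeCases, double parentCases) {
        return parentCases > 0 ? nodeCases / parentCases : 0.0;
    }

    /** Relative width of each class color in the lower bar diagram. */
    static Map<String, Double> classDistribution(Map<String, Double> classCounts) {
        double sum = 0.0;
        for (double c : classCounts.values()) sum += c;
        Map<String, Double> widths = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : classCounts.entrySet()) {
            // only classes that are present in the node get a color bar
            if (e.getValue() > 0) widths.put(e.getKey(), e.getValue() / sum);
        }
        return widths;
    }

    public static void main(String[] args) {
        Map<String, Double> counts = new LinkedHashMap<>();
        counts.put("Iris-versicolor", 49.0);   // made-up counts for illustration
        counts.put("Iris-virginica", 5.0);
        System.out.println(fractionOfTotal(54, 150));
        System.out.println(fractionOfParent(54, 100));
        System.out.println(classDistribution(counts));
    }
}
```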

    For regression trees the lower bar (which shows the class distribution for a decision tree) is replaced by a bar showing the domain of the target attribute. The picture below shows a regression tree for the iris data as an example, with which the petal width is predicted from the other attributes.

    The target value predicted in a node is indicated by a thin black vertical line. The variation around this value in the data set the tree was induced from is shown as a blue bar, which extends from the predicted value minus one standard deviation to the predicted value plus one standard deviation. The other fields of a node have the same meaning as for a decision tree.
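
    As a rough sketch of how such a bar could be derived: the code below assumes that the predicted value is the mean of the target values assigned to the node and that the population form of the standard deviation is used. Neither assumption is confirmed by this description, and the class RegressionBar and its methods are hypothetical.

```java
/** Hypothetical helper; not part of DTView's actual sources. */
final class RegressionBar {
    /** Mean of the target values in a node (assumed to be the predicted value). */
    static double mean(double[] v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s / v.length;
    }

    /** Standard deviation around the mean (population form, an assumption). */
    static double stdDev(double[] v) {
        double m = mean(v), s = 0.0;
        for (double x : v) s += (x - m) * (x - m);
        return Math.sqrt(s / v.length);
    }

    /** Relative horizontal position of a value within the target domain [min, max]. */
    static double position(double value, double min, double max) {
        double p = (value - min) / (max - min);
        return Math.max(0.0, Math.min(1.0, p));   // clamp to the bar
    }

    public static void main(String[] args) {
        double[] petalWidths = {1.0, 1.3, 1.5, 1.4};   // made-up node contents
        double m = mean(petalWidths), sd = stdDev(petalWidths);
        double min = 0.1, max = 2.5;                   // assumed target domain
        System.out.println("line at " + position(m, min, max));
        System.out.println("bar from " + position(m - sd, min, max)
                           + " to " + position(m + sd, min, max));
    }
}
```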

    Interaction

    For large decision trees it is desirable to be able to fold subtrees, so that they do not obstruct the view of the more important parts of the tree. A subtree may be folded and unfolded by simply clicking on the inner node at which it is rooted. As an example the following picture shows the same decision tree as above with a folded subtree.

    Note that a folded subtree can be identified by the gray lower label, which now shows the class that would be predicted here if this subtree were pruned to a leaf (instead of the test attribute for an unfolded subtree).

    A click with the left mouse button only (un)folds the node clicked on, while a click with the right mouse button opens a popup menu, with which the whole subtree may be (un)folded, that is, all nodes in this subtree are (un)folded. This popup menu also offers the possibility to prune the tree manually, that is, to turn a subtree into a leaf node. Note that folding a node/subtree can be undone, whereas pruning cannot.
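
    The difference between folding and pruning can be illustrated with a small tree structure. This is a sketch under my own assumptions about how such a structure might look; the class TreeNode and its methods are hypothetical and do not reflect the actual DTView classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node; not the data structure DTView actually uses. */
final class TreeNode {
    final String label;
    final List<TreeNode> children = new ArrayList<>();
    boolean folded;   // display flag only, the children are kept

    TreeNode(String label) { this.label = label; }

    TreeNode add(TreeNode child) { children.add(child); return this; }

    boolean isLeaf() { return children.isEmpty(); }

    /** Left click: (un)fold only the node clicked on. */
    void toggleFold() { if (!isLeaf()) folded = !folded; }

    /** Popup menu: (un)fold all inner nodes of the subtree. */
    void setFoldedRecursively(boolean f) {
        if (isLeaf()) return;
        folded = f;
        for (TreeNode c : children) c.setFoldedRecursively(f);
    }

    /** Pruning discards the children, so it cannot be undone. */
    void prune() { children.clear(); folded = false; }

    /** Number of nodes that would currently be drawn. */
    int visibleCount() {
        int n = 1;
        if (!folded) for (TreeNode c : children) n += c.visibleCount();
        return n;
    }

    public static void main(String[] args) {
        TreeNode inner = new TreeNode("petal_width")
            .add(new TreeNode("Iris-versicolor")).add(new TreeNode("Iris-virginica"));
        TreeNode root = new TreeNode("petal_length")
            .add(new TreeNode("Iris-setosa")).add(inner);
        inner.toggleFold();
        System.out.println(root.visibleCount());   // folded subtree is hidden
        inner.toggleFold();
        System.out.println(root.visibleCount());   // and can be shown again
    }
}
```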

    A manually pruned decision tree may be written to a file by selecting File > Save Tree or File > Save Tree as...

    Layout and Font

    The layout of the tree can be adapted in a very flexible way.

    Selecting View > Set Layout... opens the following dialog box:

    In this box the layout mode (whether horizontal or vertical and whether parents should be centered or the layout should proceed from left to right or from top to bottom, respectively), the style of the connecting edges (direct or angled), the size of the decision tree nodes, their horizontal and vertical distance, the size of the shadows they throw, and the width of the frame around the whole decision tree can be specified.

    Selecting View > Set Font... opens the following dialog box:

    In this box the standard Java attributes can be chosen for the font that is used for the node labels. The name of the font determines its look. The style can be modified to bold or italic (or both). The size of the font is stated in points.

    Note that the bars in the middle of a decision tree vanish if the font size is too large to allow for such bars. However, a decision tree node will always have the two label fields, regardless of the specified node height, because the node height is adapted according to the chosen font size.

    Decision and Regression Tree Toolbox

    The tree toolbox may be invoked with File > Open Toolbox. This dialog box contains six tabs, four of which correspond to the command line programs dom (domain determination, contained in the table package), dti (decision tree induction), dtp (decision tree pruning) and dtx (decision tree execution); the latter three are contained in the dtree package.

    On the first tab the format of the data file can be specified:
    (Note that this tab is not executable.)

    It is assumed that a data file consists of several records, each of which contains several fields (but all records have the same number of fields). As their names indicate, record separators separate records and field separators separate fields within a record. Blanks are characters that are used only to fill fields, for example to achieve a certain width, so that fields of different records are aligned in a text editor. They are removed when the file is read for processing. Unknown value characters are used to identify unknown or missing values. A field containing only such characters is assumed to be unknown or missing.

    By default it is assumed that the first record of the data file contains the names of fields. If this is not the case, the corresponding box should be unchecked, so that default names (field numbers, starting with 1) are generated.

    The last field of each record of the data file may contain an occurrence counter. Checking the corresponding box tells the programs about this, so that the last field is not interpreted as a data value, but as a record weight. If no such explicit weight is present, each record is assigned a uniform weight of 1.
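
    To make the roles of these settings concrete, here is a simplified reader. It is a sketch and not the parser used by the command line programs: the class TableSketch is hypothetical, and it assumes that the separator and blank character sets are disjoint, that blanks are stripped only at the start and end of a field, and that an empty field does not count as unknown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader illustrating the format options; not the actual parser. */
final class TableSketch {
    final List<String> names = new ArrayList<>();
    final List<String[]> records = new ArrayList<>();   // null entry = unknown value
    final List<Double> weights = new ArrayList<>();

    static TableSketch read(String text, String recSeps, String fldSeps, String blanks,
                            String unknowns, boolean hasHeader, boolean hasWeight) {
        TableSketch t = new TableSketch();
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean first = true;
        for (int i = 0; i <= text.length(); i++) {
            // treat the end of the text like a record separator
            char c = i < text.length() ? text.charAt(i) : recSeps.charAt(0);
            boolean endRec = recSeps.indexOf(c) >= 0;
            if (!endRec && fldSeps.indexOf(c) < 0) { cur.append(c); continue; }
            fields.add(strip(cur.toString(), blanks));
            cur.setLength(0);
            if (!endRec) continue;
            boolean empty = fields.size() == 1 && fields.get(0).isEmpty();
            if (!empty) { t.addRecord(fields, first, hasHeader, hasWeight, unknowns); first = false; }
            fields = new ArrayList<>();
        }
        return t;
    }

    private void addRecord(List<String> f, boolean first, boolean hasHeader,
                           boolean hasWeight, String unknowns) {
        int n = hasWeight ? f.size() - 1 : f.size();   // last field may be a weight
        if (first) {
            // field names from the first record, or default names 1, 2, ...
            for (int k = 0; k < n; k++) names.add(hasHeader ? f.get(k) : String.valueOf(k + 1));
            if (hasHeader) return;
        }
        String[] rec = new String[n];
        for (int k = 0; k < n; k++) rec[k] = isUnknown(f.get(k), unknowns) ? null : f.get(k);
        records.add(rec);
        weights.add(hasWeight ? Double.parseDouble(f.get(n)) : 1.0);
    }

    /** Removes padding characters from both ends of a field. */
    static String strip(String s, String blanks) {
        int a = 0, b = s.length();
        while (a < b && blanks.indexOf(s.charAt(a)) >= 0) a++;
        while (b > a && blanks.indexOf(s.charAt(b - 1)) >= 0) b--;
        return s.substring(a, b);
    }

    /** A non-empty field consisting only of unknown value characters. */
    static boolean isUnknown(String s, String unknowns) {
        if (s.isEmpty()) return false;
        for (int k = 0; k < s.length(); k++)
            if (unknowns.indexOf(s.charAt(k)) < 0) return false;
        return true;
    }

    public static void main(String[] args) {
        TableSketch t = read("a, b,n\n 1 ,?,2\n", "\n", ",", " ", "?", true, true);
        System.out.println(t.names + " " + t.records.size() + " " + t.weights);
    }
}
```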

    With the second tab attribute domains can be determined:

    The decision tree induction program needs as input a description of the domains of the attributes. This tab serves to generate such a domain description from the data file, which is done when this tab is in front and the "execute" button is pressed. The attribute types are determined automatically. This may fail: if a nominal attribute has numbers as values that have no numeric meaning, the program will still assume that the attribute is numeric. To correct such errors the generated domain file has to be edited. This can be done by pressing the Edit button, which opens a simple text editor.
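
    The failure mode is easy to reproduce with a naive type guess. The following sketch assumes that an attribute is taken to be numeric whenever all of its known values parse as numbers; the real dom program may use different rules, and the class TypeGuess is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical type guesser; the actual rules of the dom program may differ. */
final class TypeGuess {
    enum Type { NOMINAL, NUMERIC }

    /** Numeric if every known value parses as a number, nominal otherwise. */
    static Type guess(List<String> values) {
        boolean sawValue = false;
        for (String v : values) {
            if (v == null) continue;   // unknown values carry no type information
            sawValue = true;
            try {
                Double.parseDouble(v);
            } catch (NumberFormatException e) {
                return Type.NOMINAL;
            }
        }
        return sawValue ? Type.NUMERIC : Type.NOMINAL;
    }

    public static void main(String[] args) {
        // postal codes are nominal in meaning, but look numeric
        System.out.println(guess(Arrays.asList("39106", "10115", "80331")));
        System.out.println(guess(Arrays.asList("Iris-setosa", "Iris-virginica")));
    }
}
```

    In this sketch the postal codes are classified as numeric, which is exactly the kind of result that has to be corrected by editing the domain file.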

    This tab also lets you locate the underlying programs if they cannot be found through the "path" environment variable. If you place the programs into the directory from which you start DTView, locating the programs should not be necessary.

    The third tab contains the parameters that influence the decision tree induction:

    The evaluation measure is used to choose the test attributes and the tests, and it may be weighted with the fraction of known values of the attribute. The upper measure is for decision trees, the lower for regression trees. Which of the two selected measures is used depends on the type of the target attribute.

    For the evaluation measure a minimum value may be specified, which has to be exceeded for a split to be generated. Some measures can make use of specific parameters, the meaning of which is explained in the book Graphical Models - Methods for Data Analysis and Mining by Christian Borgelt and Rudolf Kruse, J. Wiley and Sons, Chichester, United Kingdom 2002. For standard applications these parameters may safely be ignored. Furthermore, a maximum height of the tree to be grown can be set (with 0 meaning that there is no limit) as well as a minimum support, which has to be exceeded by at least two branches for a split to be generated. Finally, the program can be asked to try to form subsets of the values of nominal test attributes.
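
    Read literally, the conditions for generating a split could be checked as in the sketch below. How the dti program combines these conditions internally is not described here, so the class SplitCheck, its parameters and the way depth is counted are assumptions.

```java
/** Hypothetical check of the split conditions; not taken from the dti sources. */
final class SplitCheck {
    /**
     * @param measure        value of the evaluation measure for the candidate split
     * @param minMeasure     minimum value that has to be exceeded
     * @param branchSupports number (or weight) of cases in each branch
     * @param minSupport     support that at least two branches have to exceed
     * @param depth          depth of the node to be split (root = 1, an assumption)
     * @param maxHeight      maximum tree height, 0 meaning no limit
     */
    static boolean splitAllowed(double measure, double minMeasure,
                                double[] branchSupports, double minSupport,
                                int depth, int maxHeight) {
        if (maxHeight > 0 && depth >= maxHeight) return false;
        if (!(measure > minMeasure)) return false;
        int enough = 0;
        for (double s : branchSupports) if (s > minSupport) enough++;
        return enough >= 2;
    }

    public static void main(String[] args) {
        System.out.println(splitAllowed(0.4, 0.0, new double[] {50, 48, 2}, 5, 1, 0));
        System.out.println(splitAllowed(0.4, 0.0, new double[] {97, 2, 1}, 5, 1, 0));
    }
}
```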

    With the fourth tab a decision or regression tree can be induced:

    The induction of a decision or regression tree starts from a definition of attribute domains, which also yields the type of the target attribute and thus the type of the tree (decision tree for a nominal target attribute, regression tree for a numeric target attribute). For the domain file the file generated with the second tab may be used.

    Next the name of the target field may be specified. If this input is left empty, the last attribute in the domain file is used, which, if the domain file was generated automatically, corresponds to the last field of a record in the data file.

    In addition, a data file is needed. If a decision tree is to be learned and the class frequencies are highly uneven in this data file, so that simply predicting the majority class already yields a very low error rate, it can pay to balance the class frequencies. This is done by modifying the record weights so that all classes receive the same total weight. This can help the induction, but it should be kept in mind that such balancing distorts the data statistics.
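
    One way to carry out such a balancing is sketched below. The rescaling keeps the overall sum of the weights unchanged, which is my own choice and not necessarily what the dti program does; the class ClassBalance is hypothetical, and all weights are assumed to be positive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical balancing of record weights; dti may scale differently. */
final class ClassBalance {
    /** Rescales the weights so that every class receives the same total weight. */
    static double[] balance(String[] classes, double[] weights) {
        Map<String, Double> total = new LinkedHashMap<>();
        double sum = 0.0;
        for (int i = 0; i < classes.length; i++) {
            total.merge(classes[i], weights[i], Double::sum);
            sum += weights[i];
        }
        double target = sum / total.size();   // total weight per class after balancing
        double[] out = new double[weights.length];
        for (int i = 0; i < classes.length; i++)
            out[i] = weights[i] * target / total.get(classes[i]);
        return out;
    }

    public static void main(String[] args) {
        String[] cls = {"no", "no", "no", "yes"};
        double[] w = balance(cls, new double[] {1, 1, 1, 1});
        System.out.println(java.util.Arrays.toString(w));
    }
}
```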

    The output decision tree, which is generated when the "execute" button is pressed with this tab in front, is written to the file specified at the bottom of the tab.

    With the fifth tab an induced decision or regression tree can be pruned:

    Pruning takes an input tree and writes an output tree. A data file may be used to prune the tree, but it may also be done without. If you do not want to use a data file, simply leave the corresponding input field empty.

    There are three different pruning methods: none, which corresponds to reduced error pruning if a data file is used that is different from the one with which the tree was induced; pessimistic, which adds a user-specified fixed number of errors to each leaf; and confidence level, which uses the upper bound of a (formal, but not statistically valid) confidence interval for the leaf errors to determine the error rates. The pruning parameter is the number of errors to add to each leaf or the confidence level, respectively.
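
    For the pessimistic method the effect of the pruning parameter can be sketched as follows. The decision rule used here (replace a subtree with a leaf if the leaf's estimated errors are not higher than those of the subtree) is the usual textbook formulation and an assumption on my part; the class PessimisticPruning is hypothetical and the dtp program may decide differently in detail.

```java
/** Hypothetical illustration of pessimistic pruning; not taken from the dtp sources. */
final class PessimisticPruning {
    /** Estimated errors of a single leaf: observed errors plus the fixed addition. */
    static double leafEstimate(double errors, double added) {
        return errors + added;
    }

    /** Estimated errors of a subtree: sum of the estimates of its leaves. */
    static double subtreeEstimate(double[] leafErrors, double added) {
        double sum = 0.0;
        for (double e : leafErrors) sum += leafEstimate(e, added);
        return sum;
    }

    /** Assumed rule: prune if the leaf is estimated to be no worse than the subtree. */
    static boolean shouldPrune(double errorsAsLeaf, double[] leafErrors, double added) {
        return leafEstimate(errorsAsLeaf, added) <= subtreeEstimate(leafErrors, added);
    }

    public static void main(String[] args) {
        double[] leaves = {1, 1};   // made-up error counts of the two leaves
        System.out.println(shouldPrune(3, leaves, 0.5));
        System.out.println(shouldPrune(3, leaves, 1.0));
    }
}
```

    A larger number of added errors penalizes subtrees with many leaves more strongly and therefore leads to more pruning.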

    Furthermore, a maximum tree height may be specified (with 0 meaning that there is no limit) and it can be selected whether a replacement of a subtree with its largest branch (in terms of the number of training cases assigned to it, not in terms of the number of nodes) should be considered. This option requires a data file to take effect. In addition, it should be noted that on bigger trees it can be very time consuming and thus should be applied with care.

    With the sixth and last tab a decision or regression tree can be executed on a data set:

    Here the decision tree to be executed and the input and output data files have to be specified. The name of the field to be added can be stated, as can names for optional fields containing the support (number of cases in the training data set on which the decision is based) and the confidence of the classification (percentage of correct predictions among those that are classified with the same leaf in the training data set). If the corresponding input fields are left empty, no such fields are generated.
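
    Both optional fields follow directly from the class counts stored in the leaf that classifies a case. The sketch below computes them from such counts; the class LeafStats is hypothetical, and the confidence is returned as a fraction between 0 and 1 rather than as a percentage.

```java
/** Hypothetical helper; not part of the dtx sources. */
final class LeafStats {
    /** Support: number of training cases assigned to the leaf. */
    static double support(double[] classCounts) {
        double sum = 0.0;
        for (double c : classCounts) sum += c;
        return sum;
    }

    /** Confidence: fraction of the leaf's training cases in the predicted (most frequent) class. */
    static double confidence(double[] classCounts) {
        double sum = 0.0, max = 0.0;
        for (double c : classCounts) {
            sum += c;
            if (c > max) max = c;
        }
        return sum > 0 ? max / sum : 0.0;
    }

    public static void main(String[] args) {
        double[] counts = {49, 5};   // made-up class counts of a leaf
        System.out.println(support(counts));
        System.out.println(100.0 * confidence(counts) + " %");
    }
}
```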

    By default the field names are written to the first record of the output file. If you do not want this, uncheck the corresponding box.

    Copying

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    Download

    Download page with most recent version.

    Contact

    Snail mail: Christian Borgelt
    Working Group Neural Networks and Fuzzy Systems
    Department of Knowledge Processing and Language Engineering
    School of Computer Science
    Otto-von-Guericke-University of Magdeburg
    Universitätsplatz 2
    D-39106 Magdeburg
    Germany
    E-mail: christian.borgelt@cs.uni-magdeburg.de
    borgelt@iws.cs.uni-magdeburg.de
    Phone: +49 391 67 12700
    Fax: +49 391 67 12018
    Office: 29.015

    © 2004 Christian Borgelt
