Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages

Data Mining Source Code Newsletter

Announcing The Data Mining Source Code Newsletter!


Table Utility Programs

Last post 08-28-2004, 17:00 by Dr. Who. 0 replies.

    Table Utility Programs

    Attachment: table.zip

    C source code, package version 2.17, 2004.08.12

    Some utility programs for working with table files, i.e., text files containing tabular data. The modules of this package are also needed to compile, for instance, the decision tree programs, the Bayes classifier programs, the neural network programs, and the clustering programs.

    Table Utility Programs
    (A Brief Documentation of the Programs dom / tmerge / tsplit / opc / xmat)


    Contents
    Introduction
    Determining Attribute Domains
    Merge Tables
    Split a Table
    Computing One Point Coverages
    Computing a Confusion Matrix
    Copying
    Download
    Contact


    Introduction
    I am sorry that there is no detailed documentation yet. Below you can find a brief explanation of how to determine attribute domains with the program dom, how to merge and project tables with the program tmerge, how to split a table with the program tsplit, how to reduce a table with the program opc, and how to compute a confusion matrix with the program xmat. For a list of options, call the programs without any arguments.

    Enjoy,
    Christian Borgelt

    As a simple example for the explanations below I use the dataset in the file table/ex/drug.tab, which lists 12 records of patient data (sex, age, and blood pressure) together with the drug that proved effective (with respect to some unspecified disease). The contents of this file are:

       Sex    Age Blood_pressure Drug
       male   20  normal         A
       female 73  normal         B
       female 37  high           A
       male   33  low            B
       female 48  high           A
       male   29  normal         A
       female 52  normal         B
       male   42  low            B
       male   61  normal         B
       female 30  normal         A
       female 26  low            B
       male   54  high           A

     

     

    Determining Attribute Domains
    The domains of the columns of the table drug.tab can be determined with the program dom:

      dom -a drug.tab drug.dom
    The program dom assumes that the first line of the table file contains the column names. (This is the case for the example file drug.tab.) If you have a table file without column names, you can let the program read the column names from another file (using the -h option) or you can let the program generate default names (using the -d option), which are simply the column numbers. The -a option tells the program to determine the column data types automatically. Thus the values of the Age column are recognized as integer values.

    After dom has finished, the contents of the file drug.dom should look like this:

      dom(Sex) = { male, female };
      dom(Age) = ZZ;
      dom(Blood_pressure) = { normal, high, low };
      dom(Drug) = { A, B };

    The special domain ZZ represents the set of integer numbers, the special domain IR (not used here) the set of real numbers. (The double Z and the I in front of the R are intended to mimic the bold face or double stroke font used in mathematics to write the set of integers or the set of real numbers. All programs that need to read a domain description also recognize a single Z or a single R.)
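
    The automatic type detection described above can be sketched in a few lines of Python. This is only an illustration and not the dom source code: the function names are made up, the table is embedded as a string, and Python's int() and float() parsing is assumed to be an acceptable stand-in for whatever rules dom actually applies.

```python
# Illustrative sketch of automatic domain detection; NOT the dom source code.
DRUG_TAB = """\
Sex    Age Blood_pressure Drug
male   20  normal         A
female 73  normal         B
female 37  high           A
male   33  low            B
female 48  high           A
male   29  normal         A
female 52  normal         B
male   42  low            B
male   61  normal         B
female 30  normal         A
female 26  low            B
male   54  high           A
"""


def parses_as(convert, text):
    """Return True if convert(text) succeeds (convert is int or float)."""
    try:
        convert(text)
        return True
    except ValueError:
        return False


def infer_domains(table_text):
    """Map each column name to 'ZZ', 'IR', or a list of its distinct values."""
    lines = [line.split() for line in table_text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]   # first line holds the column names
    domains = {}
    for i, name in enumerate(header):
        values = [row[i] for row in rows]
        if all(parses_as(int, v) for v in values):
            domains[name] = "ZZ"         # every value is an integer
        elif all(parses_as(float, v) for v in values):
            domains[name] = "IR"         # every value is a real number
        else:
            # Symbolic column: distinct values in order of first appearance.
            domains[name] = list(dict.fromkeys(values))
    return domains


def format_domains(domains):
    """Render the domains in the notation of the file drug.dom."""
    lines = []
    for name, dom in domains.items():
        body = "{ " + ", ".join(dom) + " }" if isinstance(dom, list) else dom
        lines.append("dom(" + name + ") = " + body + ";")
    return "\n".join(lines)


print(format_domains(infer_domains(DRUG_TAB)))
```

    For the example table this prints the same four lines as the file drug.dom listed above.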

     

     

    Merge Tables
    The program tmerge can be used to merge two tables and to project a table to a subset of its columns. The latter simply consists in merging a table into another table that contains fewer columns. I demonstrate only the projection, by merging the table in the file drug.tab into the empty table in the file drug.hdr:

      tmerge -a drug.hdr drug.tab drug.prj
    This command removes the Age column from the table drug.tab (since this column is missing in the file drug.hdr) and writes the result to the file drug.prj. After the program tmerge has finished, the contents of the file drug.prj should be:

      Sex    Blood_pressure Drug
      male   normal         A
      female normal         B
      female high           A
      male   low            B
      female high           A
      male   normal         A
      female normal         B
      male   low            B
      male   normal         B
      female normal         A
      female low            B
      male   high           A

    Since the option -a is given, the columns of the output file are aligned. If the file drug.hdr contained tuples, these tuples would precede the tuples from the file drug.tab in the output file drug.prj.
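
    As a rough illustration of what the projection does, here is a minimal Python sketch. It is not the tmerge implementation: the function names are hypothetical, and the contents of drug.hdr are assumed to be the single header line "Sex Blood_pressure Drug", which the text above implies but does not show.

```python
# Illustrative sketch of projecting a table onto a header; NOT tmerge itself.
DRUG_HDR = "Sex Blood_pressure Drug\n"   # assumed contents of drug.hdr
DRUG_TAB = """\
Sex    Age Blood_pressure Drug
male   20  normal         A
female 73  normal         B
female 37  high           A
male   33  low            B
female 48  high           A
male   29  normal         A
female 52  normal         B
male   42  low            B
male   61  normal         B
female 30  normal         A
female 26  low            B
male   54  high           A
"""


def project(header_text, table_text):
    """Keep only the columns of the table that are named in the header."""
    keep = header_text.split()
    lines = [line.split() for line in table_text.splitlines() if line.strip()]
    names, rows = lines[0], lines[1:]
    idx = [names.index(col) for col in keep]   # positions of the kept columns
    return [keep] + [[row[i] for i in idx] for row in rows]


def align(table):
    """Pad every column to its widest entry (what the -a option achieves)."""
    widths = [max(len(row[i]) for row in table) for i in range(len(table[0]))]
    return "\n".join(
        " ".join(value.ljust(width) for value, width in zip(row, widths)).rstrip()
        for row in table
    )


print(align(project(DRUG_HDR, DRUG_TAB)))
```

    The Age column is dropped because it does not occur in the header; the order of the remaining tuples is unchanged.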

     

     

    Split a Table
    If you are interested in the sets of patients with low, normal, or high blood pressure, you can split the table into subtables, each of which contains only tuples with a specific value for the column Blood_pressure, with

      tsplit -a -c Blood_pressure drug.tab
    This should result in three files - 0.tab, 1.tab and 2.tab - with the following contents:

      0.tab:   Sex    Age Blood_pressure Drug
               male   20  normal         A
               female 73  normal         B
               male   29  normal         A
               female 52  normal         B
               male   61  normal         B
               female 30  normal         A

      1.tab:   Sex    Age Blood_pressure Drug
               female 37  high           A
               female 48  high           A
               male   54  high           A

      2.tab:   Sex    Age Blood_pressure Drug
               male   33  low            B
               male   42  low            B
               female 26  low            B

    That is, the file 0.tab contains all patients with normal blood pressure, the file 1.tab all patients with high blood pressure, and the file 2.tab all patients with low blood pressure. The tables are aligned since the option -a was given. The -c option specifies the column on which the split is based. The table can also be split in such a way that the relative frequencies of the values of this column are maintained in each subtable (stratified split). For example, calling the program tsplit with

      tsplit -a -t3 -c Blood_pressure drug.tab
    should result in three files (3 because of the -t3 option) - 0.tab, 1.tab and 2.tab - with the following contents:

      0.tab:   Sex    Age Blood_pressure Drug
               male   20  normal         A
               female 52  normal         B
               female 37  high           A
               male   33  low            B

      1.tab:   Sex    Age Blood_pressure Drug
               female 73  normal         B
               male   61  normal         B
               female 48  high           A
               male   42  low            B

      2.tab:   Sex    Age Blood_pressure Drug
               male   29  normal         A
               female 30  normal         A
               male   54  high           A
               female 26  low            B
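
    Both kinds of split can be mimicked with the following Python sketch. It is an assumption, not the tsplit algorithm: dealing the tuples of each class round-robin into the subtables happens to reproduce the output shown above, but the real program may work differently. All names are made up for the example.

```python
# Illustrative sketch of the two splitting modes; NOT the tsplit source code.
ROWS = [
    ("male", 20, "normal", "A"), ("female", 73, "normal", "B"),
    ("female", 37, "high", "A"), ("male", 33, "low", "B"),
    ("female", 48, "high", "A"), ("male", 29, "normal", "A"),
    ("female", 52, "normal", "B"), ("male", 42, "low", "B"),
    ("male", 61, "normal", "B"), ("female", 30, "normal", "A"),
    ("female", 26, "low", "B"), ("male", 54, "high", "A"),
]
BLOOD_PRESSURE = 2   # index of the column the split is based on


def split_by_value(rows, col):
    """One subtable per distinct value of the column, in first-appearance order."""
    parts = {}
    for row in rows:
        parts.setdefault(row[col], []).append(row)
    return list(parts.values())


def stratified_split(rows, col, n):
    """Deal the tuples of each class round-robin into n subtables."""
    parts = [[] for _ in range(n)]
    for group in split_by_value(rows, col):
        for k, row in enumerate(group):
            parts[k % n].append(row)
    return parts


for i, part in enumerate(stratified_split(ROWS, BLOOD_PRESSURE, 3)):
    print("%d.tab: ages %s" % (i, [row[1] for row in part]))
```

    Each of the three subtables receives two tuples with normal, one with high, and one with low blood pressure, so the 6:3:3 ratio of the full table is preserved.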

     

     

    Computing One Point Coverages
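    The program opc can be used to reduce a table, that is, to combine identical tuples into a single tuple with an occurrence counter. For the original table drug.tab this changes nothing, since all of its tuples differ, but it simplifies the table drug.prj that resulted from the application of the program tmerge shown above. This table can be reduced by calling the program opc with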

      opc -a drug.prj drug.red
    After the program opc has finished, the contents of the file drug.red should read like this:

      Sex    Blood_pressure Drug #
      male   normal         A    2
      male   normal         B    1
      male   high           A    1
      male   low            B    2
      female normal         A    1
      female normal         B    2
      female high           A    2
      female low            B    1

    The number in the last column indicates the number of occurrences of the corresponding tuple (table row) in the original table.
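
    The reduction amounts to counting identical tuples. A minimal Python sketch follows; it is not the opc code, the names are hypothetical, and the sort order (each column ordered by first appearance of its values) is an assumption chosen because it matches the listing above.

```python
# Illustrative sketch of table reduction by counting tuples; NOT opc itself.
from collections import Counter

# The tuples of drug.prj (Sex, Blood_pressure, Drug) in file order.
ROWS = [
    ("male", "normal", "A"), ("female", "normal", "B"),
    ("female", "high", "A"), ("male", "low", "B"),
    ("female", "high", "A"), ("male", "normal", "A"),
    ("female", "normal", "B"), ("male", "low", "B"),
    ("male", "normal", "B"), ("female", "normal", "A"),
    ("female", "low", "B"), ("male", "high", "A"),
]


def reduce_table(rows):
    """Return (tuple, count) pairs, one per distinct tuple."""
    counts = Counter(rows)
    # Rank the values of each column by first appearance (assumed ordering).
    ranks = []
    for col in range(len(rows[0])):
        distinct = dict.fromkeys(row[col] for row in rows)
        ranks.append({value: i for i, value in enumerate(distinct)})

    def sort_key(tup):
        return tuple(ranks[col][value] for col, value in enumerate(tup))

    return [(tup, counts[tup]) for tup in sorted(counts, key=sort_key)]


for tup, count in reduce_table(ROWS):
    print(" ".join(tup), count)
```

    The counts always sum to the number of tuples in the input table, here 12.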

    The opc program can also be used to compute one point coverages, either in a fully expanded or in a compressed form. One point coverages are considered in possibility theory, and computing them is important for inducing possibilistic networks from data. However, explaining this in detail would go too far here.

     


    Computing a Confusion Matrix
    The program xmat can be used to evaluate a classification result. It reads a table file and computes a confusion matrix from two columns of this table. By default it uses the last two columns (the last column for the x-direction and the second-to-last for the y-direction). Other columns can be selected via the options -x and -y, each followed by the name of the column that is to be used for the x- or y-direction of the confusion matrix. To demonstrate this program we use the file drug.cls, which simply contains the data from the file drug.tab with an additional classification column:

      Sex    Age Blood_pressure Drug Class
      male   20  normal         A    B
      female 73  normal         B    B
      female 37  high           A    A
      male   33  low            B    B
      female 48  high           A    B
      male   29  normal         A    A
      female 52  normal         B    B
      male   42  low            B    A
      male   61  normal         B    B
      female 30  normal         A    A
      female 26  low            B    B
      male   54  high           A    A

    To determine a confusion matrix for this table, simply call the program xmat with

      xmat drug.cls
    The output, which by default is written to the terminal, should read like this:

      confusion matrix for Drug vs. Class:
       no | value  |      1      2 | errors
      ----+--------+---------------+-------
        1 | A      |      4      2 |      2
        2 | B      |      1      5 |      1
      ----+--------+---------------+-------
          | errors |      1      2 |      3

    In this matrix the x-direction corresponds to the column Class and the y-direction to the column Drug. As you can see, for drug A the classification is wrong in two cases (first line, second column of the matrix), for drug B it is wrong in one case (second line, first column). Overall there are three errors.
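
    The numbers in this matrix can be reproduced with a short Python sketch. It is only an illustration with made-up function names, not the xmat code; the pairs are the (Drug, Class) values of drug.cls in file order.

```python
# Illustrative sketch of a confusion matrix computation; NOT xmat itself.
# (Drug, Class) pairs from drug.cls: true value first, classification second.
PAIRS = [
    ("A", "B"), ("B", "B"), ("A", "A"), ("B", "B"),
    ("A", "B"), ("A", "A"), ("B", "B"), ("B", "A"),
    ("B", "B"), ("A", "A"), ("B", "B"), ("A", "A"),
]


def confusion_matrix(pairs):
    """Rows (y-direction) are true values, columns (x-direction) predictions."""
    seen = [true for true, _ in pairs] + [pred for _, pred in pairs]
    labels = list(dict.fromkeys(seen))          # first-appearance order
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in pairs:
        matrix[index[true]][index[pred]] += 1
    return labels, matrix


def error_counts(matrix):
    """Errors per row, errors per column, and the total number of errors."""
    n = len(matrix)
    row_errors = [sum(matrix[i]) - matrix[i][i] for i in range(n)]
    col_errors = [sum(matrix[i][j] for i in range(n)) - matrix[j][j]
                  for j in range(n)]
    return row_errors, col_errors, sum(row_errors)


labels, matrix = confusion_matrix(PAIRS)
row_errors, col_errors, total = error_counts(matrix)
for label, row, errors in zip(labels, matrix, row_errors):
    print(label, row, errors)
print("errors", col_errors, total)
print("error rate: %.2f" % (total / len(PAIRS)))
```

    With three errors in twelve records the error rate is 3/12 = 0.25, i.e., 75% of the records are classified correctly.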

     

     

    Copying
    dti/dtp/dtx/dtr/rsx - induce, prune, and execute decision and regression trees
    copyright © 1996-2003 Christian Borgelt

    These programs are free software; you can redistribute them and/or modify them under the terms of the GNU Lesser (Library) General Public License as published by the Free Software Foundation.

    These programs are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser (Library) General Public License for more details.

     

     

    Download
    Download page with most recent version.

     

     

    Contact
    Snail mail:  Christian Borgelt
                 Working Group Neural Networks and Fuzzy Systems
                 Department of Knowledge Processing and Language Engineering
                 School of Computer Science
                 Otto-von-Guericke-University of Magdeburg
                 Universitätsplatz 2
                 D-39106 Magdeburg
                 Germany
    E-mail:      christian.borgelt@cs.uni-magdeburg.de
                 borgelt@iws.cs.uni-magdeburg.de
    Phone:       +49 391 67 12700
    Fax:         +49 391 67 12018
    Office:      29.015
     
