Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages

Announcing The Data Mining Source Code Newsletter!

Subscribe By Email | Subscribe By RSS Feed

Decision Trees - Attribute Inclusion?

Last post 07-23-2009, 18:25 by Predictor. 4 replies.
  •  07-09-2009, 3:31 9067

    Decision Trees - Attribute Inclusion?

    Hi. Could someone explain why, after building a C&RT model (Clementine 12) on a dataset with hundreds of potential attributes, the generated model appears to be based on a certain number of attributes, yet a greater number of attributes is listed as Inputs within the Summary tab of the tree's model nugget? For example, viewing a generated model built on 120 potential inputs, 4 attributes are used in creating the classification rules. However, in the Inputs section of the Summary tab, 12 attributes are listed as Inputs.
  •  07-12-2009, 18:10 9077 in reply to 9067

    Re: Decision Trees - Attribute Inclusion?

    Hi, I think that the answer you are looking for lies in the mathematical theory of the model's algorithm.

    I guess you could try to learn that... or you could just read Clementine's help and then start to play around with the model generation options, such as "levels below root".

     

  •  07-13-2009, 15:42 9080 in reply to 9077

    Re: Decision Trees - Attribute Inclusion?

    Today while I was returning from a DM meeting I remembered your case...

    One of the things that probably happens is that the model selects the best classifiers and then has no more data left to keep creating groups. If you have, for instance, 10 cases, and your best classifier splits those 10 into two groups of 5, then the next field will split those groups into maybe 2 or 3 cases each, and then there are no more records to keep on splitting. When the model stops splitting has to do, I believe, with the "splitting rules". You may configure the model to keep on splitting, but at some point those splits won't be solid cases and rules, just isolated instances that are not relevant to the whole dataset and that will most probably vary a lot. Those splits won't survive the evaluation and validation phase. This is known as overfitting. When do you know you have reached this point? I can't answer this question. For me it has always been trial and error, sadly.
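    A minimal sketch of the "running out of records" arithmetic above. It assumes perfectly balanced binary splits and a hypothetical minimum of 2 records per leaf; neither is taken from Clementine's actual settings, and `max_split_depth` is just an illustrative helper name.

```python
def max_split_depth(n_records, min_records_per_leaf=2):
    """Count how many balanced binary splits fit before leaves get too small."""
    depth = 0
    # A split is allowed only if the smaller child keeps enough records.
    while n_records // 2 >= min_records_per_leaf:
        n_records = n_records // 2  # size of the smaller child after an even split
        depth += 1
    return depth


for n in (10, 1000, 100000):
    print(n, "records ->", max_split_depth(n), "levels of splits at most")
```

    With only a handful of levels available along any path, a tree can test only a few fields no matter how many candidate inputs it is given.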

    Maybe Tim or another senior (in terms of experience, not age :P) DMiner can confirm what I'm saying.

    So how can you use those fields? Well, I remember the "drugA, drugB, etc." tutorial making a new field from two fields, Na / K. This derived field was a very good predictor, but Na and K, separately, weren't.
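    A rough illustration of the Na / K idea, using made-up numbers (not the tutorial's data) in which the class depends only on the ratio. The helper `best_stump_accuracy` is hypothetical: it finds the best single-threshold rule on one field, which is the kind of test one tree split can make.

```python
def best_stump_accuracy(values, labels):
    """Best accuracy of a one-threshold rule on a single field (two classes)."""
    classes = sorted(set(labels))
    best = 0.0
    for t in set(values):
        # Try both assignments of the two classes to the "above t" side.
        for high, low in (classes, classes[::-1]):
            hits = sum((high if v > t else low) == y
                       for v, y in zip(values, labels))
            best = max(best, hits / len(labels))
    return best


# Invented records: (Na, K, class). Class "Y" has a high Na/K ratio, "X" a low one.
records = [
    (0.40, 0.02, "Y"), (0.20, 0.02, "X"),
    (0.60, 0.03, "Y"), (0.30, 0.03, "X"),
    (0.80, 0.04, "Y"), (0.40, 0.04, "X"),
    (0.90, 0.05, "Y"), (0.50, 0.05, "X"),
    (0.96, 0.06, "Y"), (0.60, 0.06, "X"),
]
na = [r[0] for r in records]
k = [r[1] for r in records]
labels = [r[2] for r in records]
ratio = [a / b for a, b in zip(na, k)]  # the derived field

acc_na = best_stump_accuracy(na, labels)
acc_k = best_stump_accuracy(k, labels)
acc_ratio = best_stump_accuracy(ratio, labels)
print("Na alone:", acc_na, " K alone:", acc_k, " Na/K:", acc_ratio)
```

    On this toy data a single split on the derived ratio separates the classes, while a single split on Na or on K alone cannot.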

    One intriguing subject for me has always been "what the model can and can't do when it comes to fields". Apparently Decision Trees can't create new fields from the original fields, but I believe other models such as Neural Networks and Discriminant Analysis do this. What you need to know is which relations (such as X / Y) can and cannot be made by the model itself. This is an interesting subject indeed and it belongs to data preparation, one of the most important steps in data mining.

    I'm no expert, so I won't feel bad if someone corrects every word I have just said. Please feel free to do so!

  •  07-14-2009, 18:15 9086 in reply to 9067

    Re: Decision Trees - Attribute Inclusion?

    I'm not certain, but my first *guess* would be that it is the result of pruning. The potential inputs listed as Inputs would probably have been used in creating the model, but the model creation process includes pruning, so the final inputs displayed in the model tree or ruleset might be a small subset of the inputs used prior to pruning.

    By default pruning occurs in both C5 and CART.
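    A toy sketch of how pruning could shrink the set of fields that appear in the final tree. The field names, gain values, and the simple "collapse weak bottom splits" rule are all assumptions for illustration; this is not the cost-complexity pruning that CART actually performs, nor anything specific to Clementine.

```python
def fields_used(node):
    """Collect every field tested anywhere in the tree (None marks a leaf)."""
    if node is None:
        return set()
    used = {node["field"]}
    for child in node["children"]:
        used |= fields_used(child)
    return used


def prune(node, min_gain):
    """Bottom-up: turn a split into a leaf if its gain is weak and it has only leaves below."""
    if node is None:
        return None
    children = [prune(child, min_gain) for child in node["children"]]
    if node["gain"] < min_gain and all(child is None for child in children):
        return None  # collapse this split into a leaf
    return {"field": node["field"], "gain": node["gain"], "children": children}


def split(field, gain, left=None, right=None):
    """Small helper to build a binary split node."""
    return {"field": field, "gain": gain, "children": [left, right]}


# Hypothetical fully grown tree using six fields.
full_tree = split(
    "age", 0.30,
    split("income", 0.20, split("region", 0.01)),
    split("tenure", 0.15, split("gender", 0.02), split("usage", 0.005)),
)
pruned_tree = prune(full_tree, min_gain=0.05)

print("before pruning:", sorted(fields_used(full_tree)))
print("after pruning: ", sorted(fields_used(pruned_tree)))
```

    Here six fields took part in growing the tree, but only three remain in the pruned result, which is the kind of gap between "inputs" and "fields in the rules" described above.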

    Bit swamped with work, so haven't had a *big think* about this though...

    Tim

  •  07-23-2009, 18:25 9105 in reply to 9067

    Re: Decision Trees - Attribute Inclusion?

    brennanro:
    Could someone explain why after building a C&RT model (Clementine 12) on a dataset with hundreds of potential attributes, the generated model appears to be based on a certain number of attributes but there are a greater number of attributes listed as Inputs within the Summary tab of the tree's model nugget. For example, viewing a generated model built on 120 potential inputs, 4 attributes are used in creating the classification rules. However, in the Inputs section of the Summary tab, 12 attributes are listed as Inputs.

    I'm not sure about your software, but I've noticed that RIK (Rule Induction Kit) does this as well, and the "number of variables" in the model refers to the number of elemental conditions included.  In other words, a single input variable will be counted 3 times, if it is included in 3 different conditions.
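    To make the counting difference concrete, here is a small sketch with an invented ruleset (the variable names and thresholds are hypothetical, and this does not reproduce RIK's or Clementine's internal bookkeeping). It counts elemental conditions versus distinct input variables.

```python
from collections import Counter

# Each rule is a list of elemental conditions: (variable, operator, value).
rules = [
    [("age", ">", 30), ("income", "<=", 50000)],
    [("age", "<=", 30), ("region", "==", "north")],
    [("age", ">", 60)],
]

conditions = [cond for rule in rules for cond in rule]
per_variable = Counter(variable for variable, _, _ in conditions)

n_conditions = len(conditions)          # every condition counted once
n_distinct_variables = len(per_variable)  # each variable counted once

print("elemental conditions:", n_conditions)
print("distinct variables:  ", n_distinct_variables)
print("times each variable appears:", dict(per_variable))
```

    The two totals differ because "age" appears in three separate conditions, so a report that counts conditions will show a larger number than one that counts distinct fields.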

     

    -Will Dwinnell

    Data Mining in MATLAB (http://matlabdatamining.blogspot.com/)

     
