Data Mining

K-Mean Clustering Inputs

Last post 12-22-2009, 15:16 by JulesMN. 6 replies.
  •  04-02-2009, 14:55 8777

    K-Mean Clustering Inputs

    Does the K-Means modelling node automatically standardize clustering inputs, or should I standardize them beforehand? The Two-Step node appears to have a standardize checkbox option.

     

    Thanks,

     

    R

  •  04-02-2009, 15:10 8778 in reply to 8777

    Re: K-Mean Clustering Inputs

    I should have added that the dataset contains attributes/fields on different scales (quantiles, percentages, absolute values), all at customer level, plus some categorical and demographic data.
  •  04-02-2009, 22:23 8780 in reply to 8777

    Re: K-Mean Clustering Inputs

    The short answer is no, K-means doesn't normalize automatically. I'd recommend first transforming numeric fields with the Transform Node so that they are more normally distributed, and then z-scoring these and the other unnormalized fields. One way to do this is to compute the global statistics and then create a Derive node for multiple fields with the formula:

     

     ( @FIELD - @GLOBAL_MEAN( @FIELD ) ) / @GLOBAL_SDEV( @FIELD )

     

    Then use these for Kmeans. I've posted a quick experiment on this topic at http://abbottanalytics.blogspot.com
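
To illustrate why z-scoring matters before K-means, here is a minimal Python sketch (not Clementine code). It assumes the sample standard deviation is wanted, and the customer records and field names (`income`, `pct_online`) are hypothetical, chosen only to show two fields on very different scales.

```python
import math
import statistics

def zscore_columns(rows, fields):
    """Return copies of rows with each listed field replaced by its z-score."""
    stats = {}
    for f in fields:
        values = [r[f] for r in rows]
        # Assumption: sample standard deviation (n - 1 denominator).
        stats[f] = (statistics.mean(values), statistics.stdev(values))
    scaled = []
    for r in rows:
        out = dict(r)
        for f in fields:
            mean, sdev = stats[f]
            out[f] = (r[f] - mean) / sdev
        scaled.append(out)
    return scaled

def euclidean(a, b, fields):
    """Euclidean distance over the listed fields, as K-means would use."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in fields))

# Hypothetical records: income in currency units, pct_online as a percentage.
customers = [
    {"income": 20000.0, "pct_online": 10.0},
    {"income": 21000.0, "pct_online": 90.0},
    {"income": 60000.0, "pct_online": 12.0},
]
fields = ["income", "pct_online"]
scaled = zscore_columns(customers, fields)

# Ratio of distances (customer 0 to 2) / (customer 0 to 1), before and after scaling.
raw_ratio = euclidean(customers[0], customers[2], fields) / euclidean(customers[0], customers[1], fields)
scaled_ratio = euclidean(scaled[0], scaled[2], fields) / euclidean(scaled[0], scaled[1], fields)
print(raw_ratio, scaled_ratio)
```

On the raw values the income field dominates the distance, so the large percentage difference between the first two customers barely registers. After z-scoring, both fields contribute on a comparable scale.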

     

  •  04-03-2009, 5:20 8781 in reply to 8780

    Re: K-Mean Clustering Inputs

    Thanks Dean. Interesting point on the effect of correlated variables too - was also something I was thinking about.

     

    R

     

     

  •  04-03-2009, 8:18 8782 in reply to 8781

    Re: K-Mean Clustering Inputs

    Maybe Tim can comment on one other aspect of the formula I put in the previous comment that normalizes the data. I'm generally not a big fan of the GLOBAL node because it is a terminal node, so you either have to run it manually before you use the global stats, or script it up with Clem Script.

     What I usually do is use an Aggregate node without a variable to aggregate on (this aggregates over all the data), compute the stats (like Mean, StdDev) inside the Aggregate node, and then do a merge to create new columns with those statistics, which you can then use to normalize your data (the new variable names by default will be something like LASTGIFT_Mean). It makes for a wider data set, but you can do this in the flow of the data, and then remove the extra columns further downstream with a Filter node.

     Anyone else want to weigh in on this?
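
The aggregate, merge, normalize, filter flow described above can be sketched in plain Python. This is only an analogy to the Clementine stream, not its implementation: the function names are hypothetical, the `_Mean`, `_SDev`, and `_Norm` suffixes follow the naming used in this thread, and the sample standard deviation is an assumption.

```python
import statistics

def aggregate_all(rows, fields):
    """One-row summary over all records, named <field>_Mean and <field>_SDev."""
    summary = {}
    for f in fields:
        values = [r[f] for r in rows]
        summary[f + "_Mean"] = statistics.mean(values)
        # Assumption: sample standard deviation (n - 1 denominator).
        summary[f + "_SDev"] = statistics.stdev(values)
    return summary

def merge_summary(rows, summary):
    """Append the summary columns to every record (the wider data set)."""
    return [{**r, **summary} for r in rows]

def normalize_and_filter(rows, fields):
    """Derive <field>_Norm from the merged columns, then drop the helper columns."""
    out = []
    for r in rows:
        rec = {k: v for k, v in r.items() if not k.endswith(("_Mean", "_SDev"))}
        for f in fields:
            rec[f + "_Norm"] = (r[f] - r[f + "_Mean"]) / r[f + "_SDev"]
        out.append(rec)
    return out

# Hypothetical data using the LASTGIFT field name mentioned in the post.
donors = [{"LASTGIFT": 10.0}, {"LASTGIFT": 20.0}, {"LASTGIFT": 30.0}]
fields = ["LASTGIFT"]

wide = merge_summary(donors, aggregate_all(donors, fields))
result = normalize_and_filter(wide, fields)
for rec in result:
    print(rec)
```

Everything happens in a single pass through the flow, with no separate terminal step that must be run first. That is the advantage being argued for over the global-statistics approach.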

  •  04-05-2009, 16:04 8791 in reply to 8782

    Re: K-Mean Clustering Inputs

    Hi !

    I do exactly the same as Dean mentioned.  Couldn't have said it any better.

    I'd recommend using the Aggregate node where possible to compute the means, std dev etc and then join/merge back to the data.  When reading data from a database this type of data process will be converted into SQL, and is much faster than using a set-globals (and therefore making two passes of the data).

    Do you have nulls/missing data?
     - Remember that Clementine follows the same conventions as databases, so tri-state logic (I've also heard this called tri-value logic) applies.  Be aware that the calculation of means etc. will ignore nulls.  For example, the mean of (2, 2, null) is 2, not 1.33!  I want this outcome, but it's something you should be aware of.  Using a Filler node to replace any nulls with zero will lead to different means etc., and would probably also affect your normalisation and data transformations. If you have nulls, choose how to handle them carefully.

    Tim
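
The effect of nulls on the mean can be checked with a small Python sketch. It uses `None` as a stand-in for a database null, and both helper names are hypothetical.

```python
def mean_ignoring_nulls(values):
    """Mean over non-null values only, as database-style aggregates behave."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def mean_with_zero_fill(values):
    """Mean after replacing every null with zero."""
    filled = [0 if v is None else v for v in values]
    return sum(filled) / len(filled) if filled else None

values = [2, 2, None]
print(mean_ignoring_nulls(values))  # 2.0
print(mean_with_zero_fill(values))  # 1.3333333333333333
```

Because the z-score subtracts the mean and divides by the standard deviation, the choice between these two treatments changes every normalized value, not only those of the records that had nulls.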

  •  12-22-2009, 15:16 9492 in reply to 8782

    Re: K-Mean Clustering Inputs

    I've made it through the Aggregate and Merge Nodes.  Now working on the Derive node to create the normalized variables.  I would like to be able to use the Multiple Mode so that I can create the Normalized variables (i.e., Age_Norm, Income_Norm) in one node instead of in multiple nodes.  I am trying to do something like:

    (@FIELD - @FIELD_Mean) / @FIELD_SDev

     And it isn't liking it.  Is there a way to do this?

     Thanks!
