Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages

Announcing The Data Mining Source Code Newsletter!

Subscribe By Email | Subscribe By RSS Feed

Using Factor Analysis along with K-Means Clustering

Last post 01-20-2010, 9:56 by brennanro. 2 replies.
  •  01-19-2010, 9:22 9574

    Using Factor Analysis along with K-Means Clustering

    Not 100% sure this is the best forum to post the question, but it does relate to a Clementine analysis project.

    I don't have a great recollection of the basic theory behind Factor Analysis/PCA, but I've always regarded it as a data reduction technique (which I haven't had the opportunity/need to use before) useful for identifying a smaller number of underlying factors amongst a large dataset.

    Anyway - in advance of an upcoming segmentation project, a colleague has been pushing the idea of using a Factor Analysis model to reduce the number of inputs that feed into the K-Means clustering model. However, I'm unconvinced of the merits of undertaking this additional step and would normally go straight to the clustering stage, particularly as the data available relates to different customer activities and as such represents fairly distinct sets of behaviours.

    From my own perspective, I feel that careful selection of variables for the clustering dataset would be sufficient, and that the FA stage is not necessary. Does anyone have any opinion on why this would be a good/bad idea?

    Regards,

     R

  •  01-19-2010, 22:09 9576 in reply to 9574

    Re: Using Factor Analysis along with K-Means Clustering

    In a past life as a consultant I used PCA to reduce survey data into underlying components prior to running k-means. It was very beneficial. In my opinion the suitability of PCA largely depends upon a few data-related factors, including:
     - how much data (variables/fields and also records)
     - source of the data (is the data from a survey or another source likely to suffer from multicollinearity?)
     - how will the results of the data analysis be presented?

    If you have a very large number of records (millions), then PCA is probably not a wise choice for performance reasons.

    Having hundreds of variables/fields makes it more likely that some of them correlate with each other, and such data could benefit from PCA. My data is from mobile (cell phone) usage; each variable/field is pretty distinct, and the correlations that do exist (say, between voice call count and SMS count) are actually really important, so PCA doesn't help at all. For survey data I've found PCA a must (but I haven't nearly as much experience with it as I do with large transactional data).

    Segments/clusters built on counts and cleverly manipulated raw data lend themselves very well to explanation and presentation. For example, "customer segment 'A' have high spenders blah blah". Much easier to explain than a segment that has lots of "factor 1" :)

    Hope that helps

    Tim
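
A quick way to act on the multicollinearity point above is to screen the candidate inputs for strongly correlated pairs before deciding whether PCA/FA is worth the extra step. The following is only a minimal sketch in plain Python, not anything from Clementine: the field names, the toy values, and the 0.8 threshold are all made-up assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical toy data: six customers, three usage fields.
usage = {
    "voice_calls": [10, 20, 30, 40, 50, 60],
    "voice_minutes": [31, 58, 95, 118, 155, 176],
    "sms_count": [5, 40, 12, 33, 8, 21],
}

flagged = correlated_pairs(usage)
print(flagged)
```

If few or no pairs are flagged, the fields are already fairly distinct and a reduction step has little to compress. If many are flagged, it is still a judgment call whether to combine them or keep them separate because the relationship itself matters, as with the voice and SMS counts mentioned above.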

     



     

  •  01-20-2010, 9:56 9578 in reply to 9576

    Re: Using Factor Analysis along with K-Means Clustering

    Thanks Tim - very helpful. I guess my reluctance to try both techniques was associated with both a lack of practical experience using PCA and my misgivings as to why its use was being pushed (i.e. without any sound foundations or reasons)!

    I'll be working with a dataset containing over 20k records and between 60 and 100 attributes focused on aggregated customer transactions (unlike the attitudinal/survey data which I've always associated with PCA).

    I'll try both methods and compare results - which will involve refreshing myself on PCA/FA theory and if or how it can cope with re-coded categorical variables in Clementine.

     R
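
For the "try both methods and compare results" step, one possible measure of how far the two segmentations agree is the Rand index: the fraction of record pairs that both runs treat the same way (together in both, or apart in both). The sketch below assumes each record already has a cluster label from a raw-input K-Means run and from a K-Means run on factor scores. The label lists are invented, and the measure itself is a suggestion rather than anything proposed in the thread.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of record pairs on which two clusterings agree (0 to 1).

    Uses cluster-size counts rather than enumerating pairs, so it stays
    cheap for tens of thousands of records.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two records")

    total_pairs = comb(n, 2)
    # Pairs placed in the same cluster by run A, by run B, and by both.
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())

    # Pairs apart in both = total - same_a - same_b + same_both.
    agreements = total_pairs + 2 * same_both - same_a - same_b
    return agreements / total_pairs


# Hypothetical labels for eight records from the two runs.
raw_kmeans = [0, 0, 0, 1, 1, 1, 2, 2]
factor_kmeans = [1, 1, 1, 0, 0, 2, 2, 2]

score = rand_index(raw_kmeans, factor_kmeans)
print(round(score, 3))
```

Cluster numbers are arbitrary, so the measure ignores renaming: two runs that produce the same groups under different labels score 1.0. A high score suggests the factor step changes little; a low score says the segments differ, not which set is better, so the profiles still need inspecting.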
