Data Mining

Building Neural Networks on Unbalanced Data (using Clementine)

Last post 11-15-2009, 14:02 by TimManns. 10 replies.
  •  11-01-2009, 15:23 9347

    Building Neural Networks on Unbalanced Data (using Clementine)

    Attachment: dean abbott.zip

    Attached is an example stream related to one of my personal blog posts.

    I have also pasted the blog post below.

    Cheers

    Tim

    - - - - - - - - -

    I got a ton of ideas whilst attending the Teradata Partners conference and also Predictive Analytics World.  I think my presentations went down well (well, I got good feedback).  There were also a few questions and issues that were posed to me.  One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.

    Dean rightly pointed out that building neural nets can actually work perfectly well on unbalanced data.  The problem is that when the Neural Net determines a categorical outcome it must know the incidence (probability) of that outcome.  By default Clementine simply takes the output neuron value: if the value is above 0.5 the prediction is true, otherwise the prediction is false.  This is why in Clementine you need to balance a categorical outcome to roughly 50%/50% when you build the neural net model.  With multiple categorical values, the category with the highest output neuron value becomes the prediction.
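The default decision rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as the post describes it, not Clementine's actual implementation; the function names and the example scores are made up.

```python
def default_binary_prediction(output_value, threshold=0.5):
    """Fixed-threshold rule: true only if the output neuron exceeds 0.5."""
    return output_value > threshold


def default_multiclass_prediction(outputs):
    """With several output neurons, the highest value wins."""
    return max(outputs, key=outputs.get)


# Hypothetical scores from a model built on unbalanced data: they cluster
# near the low base rate, so a fixed 0.5 threshold labels every record false.
unbalanced_scores = [0.02, 0.05, 0.30]
binary_predictions = [default_binary_prediction(s) for s in unbalanced_scores]

# Hypothetical output neuron values for a three-category target.
multiclass_prediction = default_multiclass_prediction({"A": 0.2, "B": 0.7, "C": 0.1})
```

With scores like these, no record is ever predicted true, which is the symptom that balancing to 50%/50% is meant to avoid.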

    But there is a simple solution!

    It is something I have always done out of habit because it has proved to generate better models, and I find a decimal score more useful. Being a cautious individual (and at the time a bit jet lagged) I wanted to double check first, but simply by converting a categorical outcome into a numeric range you will avoid this problem.

    In situations where you have a binary categorical outcome (say, churn yes/no, or response yes/no etc) then in Clementine you can use a Derive (flag) node to create alternative outcome values.  In a Derive (flag) node simply change the true outcome to 1.0 and the false outcome to 0.0. 

    By changing the categorical outcome values to a decimal range outcome between 0.0 and 1.0, the Neural Network model will instead expose the output neuron values and the Clementine output score will be a decimal range from 0.0 to 1.0.  The distribution of this score should also closely match the probability of the data input into the model during building.  In my analysis I cannot use all the data because I have too many records, but I often build models on fairly unbalanced data and simply use the score sorted / ranked to determine which customers to contact first.  I subsequently use the lift metric and the incidence of actual outcomes in sub-populations of predicted high scoring customers.  I rarely try to create a categorical 'true' or 'false' outcome, so didn't give it much thought until now.
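To make the "rank by score, then check lift" idea concrete, here is a minimal Python sketch. It assumes lift means the incidence of true outcomes in the top-scoring fraction divided by the overall incidence; the function name and the ten example records are invented for illustration.

```python
def lift_at(scores, actuals, fraction):
    """Lift in the top `fraction` of records when ranked by score.

    scores  -- model scores in the range 0.0 to 1.0
    actuals -- 1 for a true outcome, 0 for a false outcome
    """
    overall_rate = sum(actuals) / len(actuals)
    if overall_rate == 0:
        raise ValueError("no true outcomes, lift is undefined")
    # Highest scores first, i.e. the customers to contact first.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Made-up example: 10 customers, 2 of whom actually responded.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
actuals = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
top_20_percent_lift = lift_at(scores, actuals, 0.2)
```

In this toy data both responders sit in the top 20% of scores, so the incidence there is five times the overall 20% incidence.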

    If you want to create an incidence matrix that simply shows how many 'true' or 'false' outcomes the model achieves, then instead of using a Neural Net score of 0.5 to determine the true or false outcome, you simply use the probability of the outcome in the data used to build the model.  For example, if I *build* my neural net using 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be about 0.04 (10,000 / 260,000 ≈ 0.038).  If my neural network score exceeds the cut-off then I predict true, otherwise I predict false.  A simple Derive node can be used to do this.
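The cut-off arithmetic above can be written out as a short Python sketch. The counts come from the example in the post; the function names and the four example scores are assumptions made for illustration, and in Clementine itself this would be a Derive node rather than code.

```python
def base_rate_cutoff(n_true, n_false):
    """Incidence of the true outcome in the data used to build the model."""
    return n_true / (n_true + n_false)


def classify(scores, cutoff):
    """Predict true when the score exceeds the cut-off, false otherwise."""
    return [score > cutoff for score in scores]


# Counts from the example: 10,000 true and 250,000 false outcomes.
cutoff = base_rate_cutoff(10_000, 250_000)  # about 0.038

# Made-up neural network scores for four customers.
example_scores = [0.01, 0.03, 0.05, 0.60]
predictions = classify(example_scores, cutoff)
```

A score of 0.05 would be labelled false under the default 0.5 threshold, but it is above the base rate, so it is labelled true here.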

    If you have a categorical output with multiple values (say, 5 products, or 7 spend bands etc) then you can use a Set-To-Flag node in a similar way to create many new fields, each with a value of either 0.0 or 1.0.  Make *all* new set-to-flag fields outputs and the Neural Network will create a decimal score for each output field.  This is essentially exposing the raw output neuron values, which you can then use in many ways similar to above (or use all output scores in a rough 'fuzzy' logic way as I have in the past :).
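Outside Clementine, what the Set-To-Flag step does to a multi-valued outcome is usually called one-hot encoding. The following Python sketch shows the idea under that assumption; the product names and function name are hypothetical.

```python
def set_to_flags(values, categories):
    """Turn one categorical field into one 0.0/1.0 field per category."""
    return [
        {category: (1.0 if value == category else 0.0) for category in categories}
        for value in values
    ]


# Hypothetical target with three products.
categories = ["product_a", "product_b", "product_c"]
targets = ["product_b", "product_a", "product_c"]
flag_rows = set_to_flags(targets, categories)
```

Each flag field then becomes its own numeric output, so the network produces one decimal score per category instead of a single winning label.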

     

  •  11-04-2009, 0:00 9357 in reply to 9347

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    This is an excellent post. I have one situation where this could be very useful: I categorize the output of a NN as either 1 or 0, and the 1's occur in approx 10% of my total sample, so balancing excludes a large portion of the training data just to even up the classes.

     Thanks

    Os

  •  11-04-2009, 4:39 9359 in reply to 9347

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    I've been thinking about this a bit more and have a few questions... Suppose I have a categorical output True/False which is balanced 50/50 in the training data. For each prediction on real data I often output the propensity, so I have something similar to:

                     $N-NN - T/F    $NC-NN - T/F
    Item 1           0              0.02
    Item 2           0              0.93
    Item 3           1              0.10
    Item 4           1              0.54

    My question is: why does Item 1 (0, 0.02) come out as a prediction of 0 instead of 1 with such a low propensity? Similarly, why does Item 3 (1, 0.10) get predicted as 1 and not 0 with a low propensity?

    If I used your method above and had the output train as a range between 0 and 1 is it effectively training the model to output a form of propensity?

     

  •  11-05-2009, 10:41 9364 in reply to 9359

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    Hi osknows, hi Tim. This reminds me of my early Data Mining days, when I was suffering in pain and Tim suggested building a gain chart instead of just calculating the model's precision.

    This relates to this as well. In some cases, precision doesn't matter at all. In churn, for example, mostly all that matters is the ROC (Clem's gain chart) because your client won't try to retain all of the "YES, HE WILL LEAVE!!!" predictions. He will tell you: hey, monthly we try to retain about 10,000 people. Could you tell me who I should be trying to retain? This is where RANKING and not CLASSIFYING comes into play. A BAD CLASSIFIER could be a GOOD RANKER and the other way round. This can be demonstrated; if anyone needs further proof then please ask and I will try to get you a very clear powerpoint I have just seen at a Data Mining conference.

    osknows: "N" is the output, "NC" is the confidence. The RANKING of your four items would be: Item 2, Item 1, Item 3 and finally Item 4, which is closest to a "1" output. So for example if your client needs a ranked list, you would need to replace NC with -NC when N is 0, discard the N field and finally rename NC to "SCORE". Welcome to fuzzy logic! Items 1 and 2 are predicted to be more 0 than 1, and Item 2 is more 0 than Item 1.

    Did you follow me?
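The signed score described in this post (negate NC when N is 0, then rank) can be sketched in Python using the four items from the question. The function name `signed_score` is an illustrative assumption, not a Clementine field or function.

```python
def signed_score(prediction, confidence):
    """Keep NC when N is 1, replace it with -NC when N is 0."""
    return confidence if prediction == 1 else -confidence


# (N, NC) pairs taken from the table in the question.
items = {
    "Item 1": (0, 0.02),
    "Item 2": (0, 0.93),
    "Item 3": (1, 0.10),
    "Item 4": (1, 0.54),
}

# Sort from "most 0" to "most 1".
ranked = sorted(items, key=lambda name: signed_score(*items[name]))
```

Sorting this way reproduces the order given in the post: Item 2, Item 1, Item 3, Item 4.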

  •  11-05-2009, 10:47 9365 in reply to 9364

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    By the way here you go: http://www-2.dc.uba.ar/materias/mdmkd/jadm/jadm2009/jadm09-4.rar

    It's in Spanish, but you can watch and understand the third slide.

  •  11-06-2009, 10:32 9368 in reply to 9364

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    Thanks for the clarity, especially as it's contrary to my own incorrect interpretation! Previously, I was unsure whether the fuzzy logic aspect meant that a score of 1,0.1 was equal to 0,0.1.

     May I also request your powerpoint info on classifiers and ranking?

     

    Thank you,

    Os

  •  11-06-2009, 11:25 9369 in reply to 9368

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    You're welcome.

    Powerpoint: look above your post.

  •  11-06-2009, 12:38 9370 in reply to 9369

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

     

    Thanks for the PP link.

    I have a situation where I have lots of data (200,000 records, each with 255 fields) where a proportion of the predictions is entirely predictable but a small group, although predictable, is actually impossible to rank. Ok I'll come clean, we're talking horse racing prediction/ranking here :)

    I'm a business analyst careerwise and find it quite ironic that the difference between work and my own interest is this: on the one hand I have entirely accurate data that is difficult to model, and on the other inaccurate data that is easy to model..! I'll leave it to you to guess which way around that is.

    One thing I often find with Clementine which I don't necessarily find with SPSS Statistics or other tools: Clementine algorithms seem to carry the inherent average of the data into the output.... For example, I could have data where 10% of the records are classified as A and the remaining 90% as B. The Binary/Numeric classifier, without any preparation, is normally happy to assign everything as B in most models and report 90% accuracy.
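The 90% figure is easy to reproduce with a toy example. This Python sketch uses made-up labels in the 10%/90% split mentioned above and a "model" that always answers B; it shows why accuracy alone hides the problem.

```python
# Made-up data: 10 records of class A and 90 of class B.
labels = ["A"] * 10 + ["B"] * 90

# A model that assigns everything to the majority class.
predictions = ["B"] * len(labels)

correct = sum(p == actual for p, actual in zip(predictions, labels))
accuracy = correct / len(labels)

# Share of the actual A records that the model found.
found_a = sum(p == "A" and actual == "A" for p, actual in zip(predictions, labels))
recall_a = found_a / labels.count("A")
```

Accuracy comes out at 0.9 while not a single A record is identified, so the headline number says nothing about the class of interest.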

  •  11-06-2009, 14:40 9371 in reply to 9370

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    That is because of the nature of the algorithm: it will give you that output because it perceives a good accuracy. Accuracy, in some cases, shouldn't be much of a success meter. For churn problems, for example, you should check your gain chart.

    What you can do is remove a certain number of records from the majority class and run another model. I remember once I ran about 10 neural networks, number 1 being a 10%-90% balance and number 10 being 50%-50%, or so. I then analyzed them in gain charts and my champion model was around number 5. But this is one case; you should experiment and experiment and experiment. You have nothing to lose. You just need to press "execute", go do something else and then come back. Maybe even go to sleep. It doesn't matter. It's just computational time.
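A rough Python sketch of that experiment is below: it undersamples the majority class to a series of target balances so that one model could be built per dataset. Model building and gain chart comparison are left out, and the function name, record layout and the particular balance steps are all assumptions for illustration.

```python
import random


def undersample(records, label_of, minority_label, minority_share, seed=0):
    """Drop majority-class records until the minority makes up `minority_share`."""
    minority = [r for r in records if label_of(r) == minority_label]
    majority = [r for r in records if label_of(r) != minority_label]
    # Majority rows needed so that minority / (minority + majority) hits the target.
    keep = round(len(minority) * (1 - minority_share) / minority_share)
    keep = min(keep, len(majority))
    rng = random.Random(seed)  # fixed seed so each run is repeatable
    return minority + rng.sample(majority, keep)


# Made-up data: 100 churners and 900 non-churners as (id, label) tuples.
records = [(i, "yes") for i in range(100)] + [(i, "no") for i in range(100, 1000)]

# One training set per balance, from 10%-90% up to 50%-50%.
shares = [0.10, 0.20, 0.30, 0.40, 0.50]
training_sets = {
    share: undersample(records, lambda r: r[1], "yes", share) for share in shares
}
sizes = {share: len(rows) for share, rows in training_sets.items()}
```

Each training set would then get its own model, with the champion picked by comparing gain charts rather than accuracy.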

    This is why I consider very, VERY valuable for a Data Mining tool to have an "experiment" option. I recently saw a presentation by Dan Steinberg (CEO Salford I think) that showed Salford's experimentation capability. It was pretty cool. Here is the powerpoint:

    http://www-2.dc.uba.ar/materias/mdmkd/jadm/jadm2009/jadm09-2.rar

    WEKA also has "The Experimenter", which is also very useful, but doesn't include different data preparation automated experiments. Salford's I think includes some.

    Unfortunately, in Clementine there are not many experimentation options. It will take you some time, but then again, as I have just told you, there is nothing to lose.

    Be careful, however, to have your data very well prepared. Don't get yourself lost in generating endless models with a poorly prepared data base; that will get you nowhere.

    I hope this helps!

    By the way, check my other thread, maybe you can help me? :D

  •  11-15-2009, 14:02 9398 in reply to 9370

    Re: Building Neural Networks on Unbalanced Data (using Clementine)

    Horse racing eh?

    If you are trying to use detailed data for horse racing, maybe look at the problem differently than a simple 'win vs lose' outcome.  I've never been a betting man (never liked my chances :) but I would consider calculating the average speed for each horse, or the time to complete a race rather than the outcome (or the time difference from the fastest time in the race).  This way you might be able to rank or select horses based upon the difference from the winner or 'optimum' outcome.

    Tim
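As a small illustration of the "time difference from the fastest time" target suggested above, here is a Python sketch. The horse names and finishing times are invented, and the function name is hypothetical.

```python
def gap_to_winner(times_by_horse):
    """Seconds behind the fastest finisher, per horse, for one race."""
    fastest = min(times_by_horse.values())
    return {horse: time - fastest for horse, time in times_by_horse.items()}


# Made-up finishing times in seconds for one race.
race_times = {"Horse A": 72.4, "Horse B": 71.9, "Horse C": 73.0}
gaps = gap_to_winner(race_times)

# Rank horses by how close they finished to the winner.
ranking = sorted(gaps, key=gaps.get)
```

A numeric gap like this gives the model a continuous outcome to learn, so horses can be ranked by predicted distance from the winner instead of a win/lose flag.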
