Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages


Announcing The Data Mining Source Code Newsletter!

Subscribe By Email | Subscribe By RSS Feed

cross-validation

Last post 02-03-2010, 14:59 by JeffA. 1 reply.
  •  02-01-2010, 21:09 9645

    cross-validation

    Hi all.

    I am working in PASW 13. I need to split a data set (of 13,000 posts). The partition node does not allow the data set to be partitioned according to the cross-validation methodology (in this procedure you need to split the data set into 5 to 10 parts). Does anybody have an idea how we can do this?

    Thanks, Dan.

  •  02-03-2010, 14:59 9654 in reply to 9645

    Re: cross-validation

    Hi Dan-

    You can do k-fold cross-validation as follows (I built a super node that makes it tidy; it is probably not the most efficient approach, but it works without syntax):

    1) Create a new field that holds a random number, say drawn from N(mu, sigma^2).

    2) Pass the data through a binning node that creates deciles (for 10-fold validation, or 5 equal groups for 5-fold) based on the new field from step 1.

    3) Attach 10 derive nodes to the output of step 2, in parallel and NOT one after another, so that you in effect create 10 separate data streams. The first one says, for example: if decile = 1 then partition = 'Testing' else 'Training'. After that node, add another derive node that creates a field called BY which is equal to '1'. Do this for each of the 10 streams, swapping '1' for 2 through 10 (or 0-9, if I recall how the binning is numbered).

    4) Bring the 10 data streams together with an append node.

    5) Run your model using the 'partition' field from step 3 as the partition and the BY field from step 3 as the split field, so that a separate model is built for each value of BY. Both are set in the model node.
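For anyone who wants to check the logic outside the tool, steps 1 to 4 can be sketched in plain Python. This is only an illustration of the idea, not the super node itself: the function name `stacked_kfold_rows`, the `id` field, and the choice to number folds 1 to k are my assumptions.

```python
import random

def stacked_kfold_rows(rows, k=10, seed=0):
    """Return k stacked copies of rows, each tagged with 'partition' and 'BY'."""
    rng = random.Random(seed)
    # Step 1: one random draw per record. Only the rank of the draw matters,
    # so any continuous distribution would do; a standard normal is used here.
    draws = [rng.gauss(0.0, 1.0) for _ in rows]
    # Step 2: bin the records into k near-equal groups by rank of the draw.
    order = sorted(range(len(rows)), key=lambda i: draws[i])
    fold = [0] * len(rows)
    for rank, i in enumerate(order):
        fold[i] = rank * k // len(rows) + 1  # folds numbered 1..k (assumed)
    # Steps 3-4: one tagged copy of the data per fold, appended together.
    stacked = []
    for by in range(1, k + 1):
        for row, f in zip(rows, fold):
            tagged = dict(row)
            tagged["partition"] = "Testing" if f == by else "Training"
            tagged["BY"] = str(by)
            stacked.append(tagged)
    return stacked

# Hypothetical data: 13,000 records identified only by an id.
rows = [{"id": i} for i in range(13000)]
stacked = stacked_kfold_rows(rows, k=10)
```

Each record ends up in 'Testing' in exactly one of the k copies and in 'Training' in the other k - 1, which is what step 5 relies on.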

    You are in effect creating 9 additional copies of the data set, where in each copy a tenth of the data is for testing and the other 90% is for training. You can add an analysis node after the model to see the accuracy, RMSE, or whatever metric you use on each of the 10 splits, and then average them.
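The per-split scoring and averaging can also be sketched in plain Python. This is a rough illustration under assumptions: the "model" is just the mean of the training targets (a placeholder for whatever model you actually run), the fold assignment uses a shuffle instead of the random field plus binning, and `kfold_rmse` is a made-up name.

```python
import math
import random

def kfold_rmse(ys, k=10, seed=0):
    """Return per-fold test RMSE and its average for a training-mean predictor."""
    rng = random.Random(seed)
    idx = list(range(len(ys)))
    rng.shuffle(idx)                        # random fold assignment
    folds = [idx[j::k] for j in range(k)]   # k near-equal groups
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [ys[i] for i in idx if i not in held_out]
        mean = sum(train) / len(train)      # placeholder model, fit on k-1 folds
        mse = sum((ys[i] - mean) ** 2 for i in test_idx) / len(test_idx)
        scores.append(math.sqrt(mse))
    return scores, sum(scores) / k

# Hypothetical target values, generated only so the sketch runs.
rng = random.Random(1)
ys = [rng.gauss(5.0, 2.0) for _ in range(1000)]
per_fold, avg = kfold_rmse(ys, k=10)
print(per_fold, avg)
```

The average over the k held-out scores is the cross-validated estimate; the spread across folds gives a rough sense of how stable it is.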

    I wrote this quickly from memory, so write back if you are not following.

     

    Jeff 

      

     
