Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages

Data Mining Source Code Newsletter

Announcing The Data Mining Source Code Newsletter!

Subscribe By Email | Subscribe By RSS Feed

Quick way to remove outliers?

Last post 08-19-2008, 16:04 by Arkantos. 2 replies.
  •  07-22-2008, 9:42 8165

    Quick way to remove outliers?

    Hi everyone, I have been using the "Action -> Discard" function in the Data Audit node, but it has to be applied to each field individually. Isn't there a way to do it automatically? I tried selecting several fields and setting the option, but it still changes only one. I looked in the manual for help but found nothing.

    Clem 11.1, btw.

    Thanks

  •  07-22-2008, 15:20 8169 in reply to 8165

    Re: Quick way to remove outliers?

    Attachment: remove_outlier.zip

    Hello,

    In my analysis I usually remove outliers as part of my data processing, and using a data warehouse means my processing has to be converted to SQL.  For this reason, I'd probably not use the Data Audit node or any other output node.

    See the attached example. It's just a simple way to use the Aggregate node to obtain the maximum value for a field (or for any number of fields).  This maximum (and/or minimum) can then be joined back to your data and used in any calculation, for example picking all rows that fall within the top 1% of the maximum (as I have done in this example).
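For readers without the attachment, the aggregate-then-join idea can be sketched outside the tool. This is a minimal Python illustration, not the attached stream: the sample rows, the field names `spend` and `calls`, and the helper names are all hypothetical, and "top 1% of the maximum" is assumed to mean a value of at least 99% of the field maximum (which only makes sense for positive maxima).

```python
# Hypothetical sample data: one dict per customer row.
rows = [
    {"customer": "A", "spend": 120.0, "calls": 10},
    {"customer": "B", "spend": 995.0, "calls": 12},
    {"customer": "C", "spend": 1000.0, "calls": 400},
    {"customer": "D", "spend": 300.0, "calls": 398},
]

def field_maxima(rows, fields):
    """Aggregate step: compute one maximum per field."""
    return {f: max(r[f] for r in rows) for f in fields}

def flag_near_max(rows, fields, fraction=0.01):
    """Join step: attach each field's maximum to every row and flag
    values within `fraction` of it (assumed reading of "top 1%")."""
    maxima = field_maxima(rows, fields)
    flagged = []
    for r in rows:
        out = dict(r)
        for f in fields:
            out[f + "_max"] = maxima[f]
            out[f + "_near_max"] = r[f] >= maxima[f] * (1 - fraction)
        flagged.append(out)
    return flagged

fields = ["spend", "calls"]
flagged = flag_near_max(rows, fields)

# Rows with at least one value close to a field maximum ...
picked = [r for r in flagged if any(r[f + "_near_max"] for f in fields)]
# ... and the remaining rows, if those values are treated as outliers.
kept = [r for r in flagged if not any(r[f + "_near_max"] for f in fields)]
```

Because the maxima are computed once and then attached to every row, the same two steps should translate fairly directly into a SQL aggregate subquery joined back to the base table.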

    This method is also dynamic, since the maximum is recalculated whenever the data changes.  There have been a few related posts in the past few weeks, so I'd suggest reading through them for other examples.  My preference is to use binning by row count, where I have 100 bins (and each row is a customer).
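Binning by row count can be illustrated the same way. The sketch below is plain Python rather than a Clementine stream; the function name and the generated values are hypothetical, and trimming the top bin afterwards is only one possible use of the bins, assumed here for illustration.

```python
def bin_by_row_count(values, n_bins=100):
    """Return a bin number from 1 to n_bins for each value, so that every
    bin holds (as near as possible) the same number of rows."""
    # Indices of the rows, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins

# Hypothetical data: 1000 customers with distinct values 0..999 in mixed order.
values = [(i * 37) % 1000 for i in range(1000)]
bins = bin_by_row_count(values)

# One possible use: drop the customers in the top bin (the highest 1% of rows).
trimmed = [v for v, b in zip(values, bins) if b < 100]
```

Unlike a cut-off relative to the maximum, this always isolates the same share of rows, however extreme the largest values are.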

    Cheers

    Tim

  •  08-19-2008, 16:04 8234 in reply to 8169

    Re: Quick way to remove outliers?

    Hi!

    Thanks Tim.

    I came up with this method. It uses the traditional more-than-two-SD outlier rule and can be applied to any number of fields without loss of records. Please check it out; I'm posting it because it turned out to be a winner...
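The posted stream itself is not reproduced here, so the following is only a rough Python sketch of a two-SD rule applied to several fields at once. It assumes that "without loss of records" means an outlying value is blanked (set to `None`) while its row is kept; the function name, field names, sample data, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def blank_outliers(rows, fields, n_sd=2.0):
    """Replace values more than n_sd standard deviations from their field
    mean with None. Every row is kept; only the offending value is blanked."""
    limits = {}
    for f in fields:
        vals = [r[f] for r in rows if r[f] is not None]
        m, sd = mean(vals), pstdev(vals)  # population SD (an assumption)
        limits[f] = (m - n_sd * sd, m + n_sd * sd)
    cleaned = []
    for r in rows:
        out = dict(r)
        for f in fields:
            lo, hi = limits[f]
            if out[f] is not None and not (lo <= out[f] <= hi):
                out[f] = None
        cleaned.append(out)
    return cleaned

# Hypothetical data: ten rows, one extreme value in each field.
spend = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
calls = [60, 6, 5, 4, 5, 6, 5, 4, 5, 5]
rows = [{"spend": s, "calls": c} for s, c in zip(spend, calls)]

cleaned = blank_outliers(rows, ["spend", "calls"])
```

Each field gets its own limits, so a row with an extreme value in one field still contributes its other values to the analysis.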

    Can this be traded for some data preparation secrets? =D

     
