Data Mining

Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java,
and other programming languages

Data Mining Source Code Newsletter

Announcing The Data Mining Source Code Newsletter!


How to detect DUPLICATED records

Last post 06-30-2008, 16:01 by TimManns. 3 replies.
  •  06-26-2008, 3:09 8071

    How to detect DUPLICATED records

    Urgent question seeking help

    How to detect DUPLICATED records and output them?

    How to find if a variable is fully $null$? (recognized as typeless so can't use Data Audit node)

  •  06-26-2008, 4:22 8074 in reply to 8071

    Re: How to detect DUPLICATED records

    hunterdong:

    How to detect DUPLICATED records and output them?

    Use the Distinct node with the option "Discard": it drops the first record of each group and passes on only the duplicates.
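
    For readers without Clementine at hand, here is a minimal Python sketch of that behaviour. It assumes "Discard" means that the first record for each key is dropped and every later record with the same key is passed on. The function name `discard_first_occurrences` and the sample fields are hypothetical and only serve the illustration.

```python
def discard_first_occurrences(records, key_fields):
    """Return every record whose key was already seen earlier (the duplicates)."""
    seen = set()
    duplicates = []
    for record in records:
        # Build a hashable key from the fields being checked for duplication.
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            duplicates.append(record)   # second or later occurrence
        else:
            seen.add(key)               # first occurrence is discarded
    return duplicates


# Hypothetical sample data: the third row repeats the first.
rows = [
    {"id": 1, "name": "ann"},
    {"id": 2, "name": "bob"},
    {"id": 1, "name": "ann"},
]
dups = discard_first_occurrences(rows, ["id", "name"])
```

    With this sample, `dups` holds only the repeated record, not the original one.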

    hunterdong:

    How to find if a variable is fully $null$? (recognized as typeless so can't use Data Audit node)

    Here you have to test whether the field is null, blank, or undefined:

    Select node, condition: (@NULL(@FIELD) or @BLANK(@FIELD) or @FIELD = undef)

    You'll find all the information about this kind of function in the Clementine help.
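
    The same test can be sketched outside Clementine. The Python below is only an approximation under stated assumptions: it treats `None`, NaN, and empty or whitespace-only strings as missing, whereas blanks in Clementine are whatever the stream defines them to be. The helper names `is_missing` and `field_is_entirely_missing` are hypothetical.

```python
import math


def is_missing(value):
    """Assumed notion of 'null or blank': None, NaN, or an empty/whitespace string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


def field_is_entirely_missing(records, field):
    """True when no record carries a usable value for the field."""
    # Note: all() over an empty list is True, so an empty dataset counts as entirely missing.
    return all(is_missing(record.get(field)) for record in records)


# Hypothetical sample data: 'notes' never has a value, 'age' sometimes does.
rows = [
    {"age": 31, "notes": None},
    {"age": None, "notes": "   "},
    {"age": 45, "notes": float("nan")},
]
notes_empty = field_is_entirely_missing(rows, "notes")
age_empty = field_is_entirely_missing(rows, "age")
```

    Here `notes_empty` is true and `age_empty` is false, which answers the "is this variable fully null" question for each field.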

  •  06-28-2008, 6:20 8081 in reply to 8074

    Re: How to detect DUPLICATED records

    Thank you very much. That day I tried the Distinct node, and it worked, but I did the wrong calculation (the Distinct node produced 500 results, which is correct, but I thought 500 > (380+265) from my Append node, so I wondered why it produced more results).

  •  06-30-2008, 16:01 8091 in reply to 8081

    Re: How to detect DUPLICATED records

    Attachment: hunterdong4.zip
    There is an easy way to identify duplicate rows and perform ranking to get the highest or most recent (or lowest / oldest).  If you use the aggregate node to summarise rows by all your key fields (whatever fields you are looking for duplicates in) then any outputs that have a 'row count' above 1 are duplicated. 

    So, simply use an Aggregate node, followed by a Select node that keeps only rows with a row count above 1.  This returns one row for each key combination that is duplicated.
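
    As a rough illustration of the aggregate-then-select idea, here is a minimal Python sketch that counts rows per key and keeps only keys with a count above 1. The function name `duplicated_keys` and the sample fields are hypothetical; in Clementine the counting is done by the Aggregate node's record count.

```python
from collections import Counter


def duplicated_keys(records, key_fields):
    """Count rows per key (the 'aggregate' step) and keep counts above 1 (the 'select' step)."""
    counts = Counter(tuple(record[field] for field in key_fields) for record in records)
    return {key: count for key, count in counts.items() if count > 1}


# Hypothetical sample data: customer 7 appears three times, customer 9 once.
rows = [
    {"customer": 7, "product": "A"},
    {"customer": 9, "product": "A"},
    {"customer": 7, "product": "A"},
    {"customer": 7, "product": "A"},
]
dup_counts = duplicated_keys(rows, ["customer", "product"])
```

    The result has one entry per duplicated key together with its row count, so the original rows themselves would need a join back on the key if they are required in full.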

    If the rows are not completely duplicated, and some fields have higher values or dates, then you can use aggregate with max to identify the most recent row of any duplicates.  Attached zip file contains a stream and data to illustrate this concept.
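
    The attached stream is not reproduced here, so the following is only a sketch of the "aggregate with max" idea in Python, assuming each row has a date field stored as an ISO `YYYY-MM-DD` string (which sorts correctly as text). The function name `latest_per_key` and the field names are hypothetical.

```python
def latest_per_key(records, key_fields, order_field):
    """Keep, for each key, the record with the largest value in order_field."""
    latest = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        # Replace the stored record only when this one is strictly more recent.
        if key not in latest or record[order_field] > latest[key][order_field]:
            latest[key] = record
    return list(latest.values())


# Hypothetical sample data: customer 7 has two rows with different dates.
rows = [
    {"customer": 7, "updated": "2008-06-01", "balance": 100},
    {"customer": 9, "updated": "2008-05-20", "balance": 50},
    {"customer": 7, "updated": "2008-06-28", "balance": 120},
]
most_recent = latest_per_key(rows, ["customer"], "updated")
```

    Swapping the `>` comparison for `<` would give the oldest or lowest row per key instead.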

    This method will be faster to process, especially against an indexed data warehouse.

    Cheers

    Tim
