Algorithm for fuzzy match of login id / username

I just joined and started working on a project, and I'm wondering whether it has already been done. I have a DB that stores info about users: login ID, first name, last name, employee ID, email, etc. I've been asked to devise an algorithm to do some type of fuzzy match so that whenever we import a new user, we can compare the login ID against the other data elements to see if it's the same person. For example: jdoe has an 80% probability of matching an entry with first name john and last name doe.

So we would have a set of rules and pattern matching based on 5 or 6 data elements.

Does anyone know whether this has been done, and are there any references or open source code that would help?




 atomz4peace  

Your task is very similar to duplicate detection in addresses. A number of good algorithms exist for fuzzy string matching, and luckily most of them are very easy to implement.

A few months ago somebody posted the "Fuzzy String Matching Engine", which actually implements most of them in Visual Basic. See 

From my personal experience I recommend the Dice coefficient (with n = 2, because names are very short). It is very easy to implement and works reliably. Hint: If you have a large number of users, try to store the n-grams in a database table; looking up shared n-grams there is a lot faster than running a string comparison against thousands of names.
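To make that concrete, here is a rough Python sketch of the Dice coefficient on bigrams, plus a small in-memory lookup that stands in for the n-gram database table. The function names (ngrams, dice, build_index, candidates) and the sample users are made up for illustration, and the normalization (lowercasing, dropping non-alphanumeric characters) is my own assumption, so adjust it to your data.

```python
from collections import Counter, defaultdict


def ngrams(text, n=2):
    """Multiset of character n-grams, ignoring case and non-alphanumerics."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def dice(a, b, n=2):
    """Dice coefficient: 2 * shared n-grams / (n-grams in a + n-grams in b)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0  # both strings are shorter than n characters
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def build_index(names, n=2):
    """Map each n-gram to the keys whose name contains it.

    Hypothetical stand-in for a database table with (ngram, user_key) rows.
    """
    index = defaultdict(set)
    for key, name in names.items():
        for gram in ngrams(name, n):
            index[gram].add(key)
    return index


def candidates(query, index, n=2):
    """Keys sharing at least one n-gram with the query."""
    found = set()
    for gram in ngrams(query, n):
        found |= index.get(gram, set())
    return found


# Made-up sample data: user key -> "first last"
users = {1: "John Doe", 2: "Jane Smith"}
index = build_index(users)
for key in candidates("jdoe", index):
    print(key, users[key], round(dice("jdoe", users[key]), 2))
```

Only the candidates that share at least one bigram with the login ID get scored, which is the point of the table: you avoid comparing every new login against every stored name. The score is a similarity between 0 and 1, not a probability, so any cutoff (or mapping to something like "80%") would have to be tuned on your own data.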


Hans_Meier

