Fuzzy Matching in Access

starstarstarstarstarstarstarstarstarstar Rating: 0/5 (0 vote cast) print

Hello All,


For those who are interested in Approximate String Matching or those who could use these algorithms; I have a complete suite of Approximate String Matching algorithms written in Visual Basic in an Access database.


In 2004 I decided to jump into the world of Fuzzy Matching with both feet.


As it is, I am working for a company that deals with names, addresses, etc. very intensely. It is a fair sized company that


uses Access on a grand scale. Since I am an Access programmer, I work in an Access gold mine!


I knew that if I could get a good handle on Fuzzy Matching, that when I hit the right person at the right time, the company could greatly benefit from my research on Fuzzy Matching. The right time and the right person are not here yet.


Nevertheless, since I have reaped much free source code and information from the Web, it is now time to return the favor.


I developed a package that is sort of a demo/tutorial on Approximate String Matching algorithms in Access that is very
robust in Fuzzy Matching. It would overtax the post in this forum for me to include it in a post.


To summarize, it works with the basic name - Last, First, and Middle. It has a user interface that allows a user to type in
what would be a good name and what would be a questionable name to resemble the good name. The weighted results of all the various algorithms can be chosen, or an individual algorithm can be chosen to display how closely the names match.


In addition, it has a table of 17,295 known good names with unique ID numbers as a reference table, and table of 1200
morphed names that are typical of names entered in a database with no input conventions. These morphed names have typos, transpositions, variations on maiden names, etc. 1200 good names were selected for alteration and the unique ID of each original good name was stored in the table with the altered names to determine the accuracy of the matching process.


The morphed names were compared to the known good names in a query with an approximate join using the suite of algorithms to determine match percentage. The altered names, the ID number of the original good name, the ID number of the name it matched to, and the match percentage were stored in a results table to determine the results of the matching run.


These tables were used to test and tweak the algorithms by comparing the morphed names with the known good names. The results of 1322 names were saved to a results table with match scores.


The matching process was executed in a query with an approximate join using the suite of algorithms.



The match results:


Total Approximate Matches: 1188    
(Recall) Precision Pct: 99.00%


Total Unmatched Names:       12           
Unmatched Pct:    1.00%


Total Other Matches:        134      
Other Matches Pct:      .77%



The tables are accessible in the database, so anyone can run their own tests. The interface is set up to accommodate this
as well.


The algorithms used: Dice coefficient as a threshold algorithm, Levenshtein Distance algorithm, Longest Common Subsequence, and the DoubleMetaphone. The names were passed to the algorithms by way of the bigram model.


I will email it to anyone who requests it.


It is in two platforms, Office 97 and Office 2000 as FuzzyMatching97.zip (692 KB) and FuzzyMatching2k.zip (721 KB).
The zip files include ApprxStrMatchingEngine97.pps or ApprxStrMatchingEngine2k.pps respectively, StrMatching97.mde or StrMatching2k.mde respectively, IEEESoundexV5.pdf, and VBAlgorithms.txt.


IEEESoundexV5.pdf is an abstract about Approximate Sting Matching that fired my curiosity about the subject, and pertains to the package.


VBAlgorithms.txt contains the entire suite of algorithms in Visual Basic extracted from the MDB modules.


The PowerPoint presentations describe the workings of the MDE and give a good overview of Fuzzy Matching.

 : OpnSeason     Reply  

Replies (35)

Hi:

I don't mind at all. You are providing a service to the community. Would you be able to write up some documentation describing the uses, benefits, algorithms, computations and anything else that would help people appreciate the work that you did? We would like to make it an announcement and feature it at the data mining source code download at http://www.kdkeys.net/forums/60/ShowForum.aspx

Thanks

: Dr. Who    Reply

You did an excellent job with the powerpoint slides and the zipped files. There are two zippes files FUZZYM~1.ZIP (963KB) and FUZZYM~1_1.ZIP (723KB)?

What are the differences between the 2 files?

Is it possible for you to put the same information in the PowerPoint slides in a text file, so that I can post the text file as an article describing the release?

There were no clear copyright restrictions (Open Source Code, no copyright, etc.) If you can include a copyright statement, it would help

Thanks for your excellent job.

: Dr. Who    Reply

Hi Admin,

I am a newbie to this site and would like to download the source code. Unfortunately no matter how hard I look I can seem to navigate my way to a link that allows this. Is the code no longer available?

Thanks in advance

ColinMaclean


profile
KINGSLEY TAGBO

I apologizefor the confusion. You can download the latest source code from http://www.kdkeys.net/forums/thread/6450.aspx

Thanks


profile
KINGSLEY TAGBO

Please download the Fuzzy Matching files from http://www.kdkeys.net/forums/thread/6450.aspx

The files are present at this location.

Thanks


Dear sir/madam,

 
I can not download the file that you have specify overhere : http://www.kdkeys.net/forums/thread/6450.aspx  It said, the file is corrupted.

Could you please guide me to the right one? 

Thanks  in advance,

 

 

 

: veasnamuch    Reply

Hello veasnamuch,

I just tried it with no problem.

Try it again and if it still is a problem, my guess is that the problem is on your end.

Thanks,

: OpnSeason    Reply

Hello M20070421,

Sorry, I haven't been out here for a while.

Run W. C. F/FTY BLK 1E Tx and WANG CHEONG F/FTY BLK 1C and W. C. F/FTY BLK 1E Tx and WANG CHEONG F/FTY BLK 1E through each individual algorithm in the Algorithm Suite form in the demo database and see which algorithm(s) give(s) the highest scores compared to each match.

I was curious so I tried it.

 

Levenshtein Edit Distance is the culprit.

W. C. F/FTY BLK 1E Tx

To WANG CHEONG F/FTY BLK 1C gets a score of 12

To WANG CHEONG F/FTY BLK 1E gets a score of 11

I will need to run the two comparisons in the debugger through the algorithm to see at what stage in the algorithm this abberration is occurring. Perhaps we can further fine tune this algorithm.

On another point, it is best to prepare most datasets before matching.

The two items above are a perfect example.

By scanning both datasets visually and looking for things like W. C. and Wang Cheong, it is best to create a translation table for acronyms/abbreviations.

Then an update query could be run on both datasets to replace abbreviations/acronyms with the translation.

For example if W. C. F/FTY BLK 1E Tx were changed to WANG CHEONG F/FTY BLK 1E Tx any matches would have much higher scores and there would be many fewer questionable matches to proof.

So in the translation table:

Abbrev   Xlation

W.C.     WANG CHEONG

etc.       etcetera

You would run each dataset with the translation table in an approximate join with the criteria for the field to be cleaned as Like "* " & translationtbl.Abbrev or Like translationtbl.Abbrev & " *".

The Update To would be replace([Data Field], translationtbl.Abbrev, translationtbl.Xlation)

The assumption here is that you have a replace function that would replace in the first parameter(string), search for the second parameter, and replace with the third parameter returning the entire string intact with the replacement.

Thanks,

OpnSeason

 

: OpnSeason    Reply

Hi OpnSeason,

I have recently downloaded your Approx Matching tool and is planning to use it to solve a matching problem.  I was wondering if you would allow user to customize your source code?  If so, would you mind sent to me the VBA source code and / or the MDB file, since the one I have downloaded is locked, I cannot access both the VBA modules and the Form designs.  Your help is much appreciated!

Many thanks

Regards

Casey 

: casey wong    Reply

Hi,

 Sorry, I forgot to attach my email addr:  ksh_wong at msn dot com, pls if anyone have the VBA source codes and / or the original MDB would you mind send it to the above address?  Any help is much appreciated.

 Thanks

Casey 

: casey wong    Reply

Dear OpnSeason,

 Have been read and study you great job for the last couple of weeks. Was trying to figure out the issimilar code and the scoring mechanism but fail. really appreciate if you can send the Issimilar() source code to me in order to work on my project. please send it to wjs121@yahoo.com. Many Many Thanks.

 

JaZ

: jackass621    Reply

Hi Admin,

 

There is somteting wrong:

http://www.kdkeys.net/forums/6450/PostAttachment.aspx

will not give the attachement.

 

Could you look inot it?

 

Thanx!

 

Rutger Krijnen


Very cool process, would love to see what you are up to lately, have you had a chance to add the porter stemmer to the code??

Thanks!


How long do you think it took you to run the Fullname-aprxmatch query? I've tried running your queries without making any changes and it appears like it is locking up.

 

pedoms

: pedoms    Reply

I would love a copy of the original MDB files so I could improve duplicate detection in some of my databases.

Mark Andrews
mandrews___NOSPAM___@rptsoftware.com
take the ___NOSPAM___ 

out of the email address

 

 


Hi:

Your source code and what your experience would be valuable to many people.

If you need a blog on the site to talk about or promote your fuzzy matching source code, let me know and I will create one for you.

In addition, if you want to makeyour fuzzy matchingsource codedownloadable from this web site, let me know also and we will do so.

Good Job!

: Dr. Who    Reply

Hello Admin,

If you don't mind, I can provide the two zip files I mentioned in my original post. The source code that I provide is the text file with the algorithms. The procedures that I developed to reap the best from the algorithms is the only source not included.

I procured the algorithms from the Web. Nearly all were either not complete, bug ridden, missing sub procedures, and in need of tweaking. It took intense research and elbow grease to get the algorithms to the superb state that they are presently in. Not only that, I found them to be the best flavor and combination of algorithms for the job. These algorithms come in many versions and flavors.

Thanks,

: OpnSeason    Reply

Hello Admin,

I don't know how the zip file names got truncated, but the large sized file is for Office 2k and the other one is for Office 97.

As far as the source code goes, I originally retrieved all from the Web. As far as copyright restrictions, there were none. Any acknowledgements or whatever, I left them intact as they were when I retrieved them. I could find no copyritestatements for any of the algorithms.

Yes, I will extract the PowerPoint presentation into a text file with some other comments about the process as well. I will email it to you after I get it together.

Thanks again,

OpnSeason

: OpnSeason    Reply

Please email the contents to simiking AT gmail dot com

We will review them and get back to you.

Thanks

: Dr. Who    Reply

Hello,

I would appreciate a copy of your library; if you could send it to dliuzzi [at] gmail.com. Thank you.

Daniel

: Dani    Reply

Hello,

I cannot find the link to download the source code..!

I dont know what the problem is.. since i am a newbie here..!

Anyway i would appriciate it very much if someone could

provide me with that package ( fotis_routsis NOMORESPAM yahoo.gr ).

I have an school assignment in string similarity algorithms and i believe

it would be a great help..

Thank you P@radox@aueb


Here is how to use the database for matching lists of names you might have:

Purge the FullNameRef table and purge the FullNameTest table. Make copies of them 1st if you want.

If you have names that are in FIRST, MIDDLE, and LAST fields, then you can load each table with your 2 sets of names accordingly.

If you have data where all is in one field, then load the LAST field only in each table with that data - or if you have say PREFIX, FIRST, MIDDLE, LAST, and SUFFIX, then concatenate them into the LAST field.

The function NameSimilarity() in the 'FullName - ApprxMatch' query will check FIRST and MIDDLE to see if they are blank, if they are, it ignores them.

Then make a copy of the 'FullName - ApprxMatch' query.

Edit the query and delete the two columns that do the RefNum updates to each table. They were needed for testing only - for my purposes to determine the accuracy of the match job, and to fine-tune the algorithms.

The FullNameResults table was only appended with the names from the FullNameTest table that had a good match. For your purposes you will want the names/data from the FullNameRef table as well so that you can visually check to see what kind of matches you got. So that means that you will need to modify the FullNameResults table and add fields to accept the other data.

Then you can modify the 'FullName - ApprxMatch' query to append to the new fields as well.

When you are all set up, run the query and the results of the match will be in the FullNameResults table. There you can see the percentages of the matches for each data set.

OpnSeason[H]

: OpnSeason    Reply

Dear OpnSeason,

Thank you so much your great contribution. I have tried the IsSimilar() function on my 288 string data and 99% of them did have good matching. However, 3 data seems to get abnormal result:

Test SampleRef StringTest StringScore = IsSimilar(RefString,TestString)
1W. C. F/FTY BLK 1E TxWANG CHEONG F/FTY BLK 1C  0.73455286
2W. C. F/FTY BLK 1E TxWANG CHEONG F/FTY BLK 1E  0.731951237
3CHUNG HING #2 Tx D2NAN YANG A                0.51954025

Compare sample 1 and sample 2, I don't understand why sample 1 get a little bit higher similarity score even though sample 2's Test String should better match the Ref Sring.

In sample 3, the score seems to be quite high even though the test string appears completely different from the Ref string. Is this behaviour normal ?

: M20070421    Reply

HI Folks,

I m looking for the IsSimilar() function code, so that I cant use them in the application I m developing.

Would appreciate if some one could mail me on ravindra.reddy@penske.com

Thanks in Advance,

Regards,

New bie to this forum. 

 

 


can any one send me the source code of fuzzy matching

my email address is anujgupta131@gmail.com

 

i am in great need of the code

 

regards

 

anuj 


Hi OpnSeason,
Many thanks for your wonderful contribution.
I would like to know where could I download the source for the IsSimiliar algorithm or if there is a
fee, I wouldn't mind paying it as I need them for a research project to find similiarity between words.
Thanks!
Cheers,
Xorion
 

: xorion    Reply

I want to integrate some of the functions you wrote in my applications and in queries to locate duplicate names and addresses on mailing lists.  I downloaded your source code but the modules were locked down.  Is there any chance of getting a version that does not have this locked down so I can modify as needed on my end? 

If so, please email me at jamiesroyce@hotmail.com

Thanks,

Jamie 

 

: jroyce    Reply

Hi Open

Fantastic article, and I really need your help.

I am working on a project at college to do with Social Studies and I am looking at some crime data.  I think that your functions are fantastic but I could really do with the IsSimilar function and Namesimilarity functions as they will help be work out a pattern I've spotted which might help me work out the likely hood of people starting to take drugs.

I can't afford much, but I'd be happy to pay a fee to get the functions.  Any chance you could contact me via email at

english_twit at hotmail dot com

to discuss.

Yours hopefully,

ET


Hello OpnSeason,

I have read you article and downloaded the project/documentation.  Your work is very impressive. Would you mind sending me the MDB file of your project.  I am planning on modifying and converting it to a C# web service.  My email is wwenzl@yahoo.com.

Thanks,

Walter

: wwenzl    Reply

Hello,

Sorry but, where can  I get it??

 Could you send me a this program??

my e-mail : rararara00@gmail.com

 I really need.....sorry..

 

: rarara00    Reply

I'd like to get a hold of the MDB/source code for the NameSimilarity function.

 Please send to ejlegarda@hotmail.com

thanks in advance! 

: ejlegarda    Reply

I am very interested in the Access source code for namesimilarity and issimilar functions. Could you send me this source code to sebastiaan@lionconnect.nlThanks!

: sebastiaan    Reply

Please share a copy with me. My email id is jigyasab@gmail.com

: jigyasa    Reply

is the source code still available?

: Steve    Reply

Please send me the code files. Thanks.

: rose    Reply


Post A Reply

 Questions & Answers