Hello All,
For those who are interested in Approximate String Matching or those who could use these algorithms; I have a complete suite of Approximate String Matching algorithms written in Visual Basic in an Access database.
In 2004 I decided to jump into the world of Fuzzy Matching with both feet.
As it is, I am working for a company that deals with names, addresses, etc. very intensely. It is a fair sized company that
uses Access on a grand scale. Since I am an Access programmer, I work in an Access gold mine!
I knew that if I could get a good handle on Fuzzy Matching, that when I hit the right person at the right time, the company could greatly benefit from my research on Fuzzy Matching. The right time and the right person are not here yet.
Nevertheless, since I have reaped much free source code and information from the Web, it is now time to return the favor.
I developed a package that is sort of a demo/tutorial on Approximate String Matching algorithms in Access that is very
robust in Fuzzy Matching. It would overtax the post in this forum for me to include it in a post.
To summarize, it works with the basic name - Last, First, and Middle. It has a user interface that allows a user to type in
what would be a good name and what would be a questionable name to resemble the good name. The weighted results of all the various algorithms can be chosen, or an individual algorithm can be chosen to display how closely the names match.
In addition, it has a table of 17,295 known good names with unique ID numbers as a reference table, and table of 1200
morphed names that are typical of names entered in a database with no input conventions. These morphed names have typos, transpositions, variations on maiden names, etc. 1200 good names were selected for alteration and the unique ID of each original good name was stored in the table with the altered names to determine the accuracy of the matching process.
The morphed names were compared to the known good names in a query with an approximate join using the suite of algorithms to determine match percentage. The altered names, the ID number of the original good name, the ID number of the name it matched to, and the match percentage were stored in a results table to determine the results of the matching run.
These tables were used to test and tweak the algorithms by comparing the morphed names with the known good names. The results of 1322 names were saved to a results table with match scores.
The matching process was executed in a query with an approximate join using the suite of algorithms.
The match results:
Total Approximate Matches: 1188
(Recall) Precision Pct: 99.00%
Total Unmatched Names: 12
Unmatched Pct: 1.00%
Total Other Matches: 134
Other Matches Pct: .77%
The tables are accessible in the database, so anyone can run their own tests. The interface is set up to accommodate this
as well.
The algorithms used: Dice coefficient as a threshold algorithm, Levenshtein Distance algorithm, Longest Common Subsequence, and the DoubleMetaphone. The names were passed to the algorithms by way of the bigram model.
I will email it to anyone who requests it.
It is in two platforms, Office 97 and Office 2000 as FuzzyMatching97.zip (692 KB) and FuzzyMatching2k.zip (721 KB).
The zip files include ApprxStrMatchingEngine97.pps or ApprxStrMatchingEngine2k.pps respectively, StrMatching97.mde or StrMatching2k.mde respectively, IEEESoundexV5.pdf, and VBAlgorithms.txt.
IEEESoundexV5.pdf is an abstract about Approximate Sting Matching that fired my curiosity about the subject, and pertains to the package.
VBAlgorithms.txt contains the entire suite of algorithms in Visual Basic extracted from the MDB modules.
The PowerPoint presentations describe the workings of the MDE and give a good overview of Fuzzy Matching.
Replies (33)
You did an excellent job with the powerpoint slides and the zipped files. There are two zippes files FUZZYM~1.ZIP (963KB) and FUZZYM~1_1.ZIP (723KB)?
What are the differences between the 2 files?
Is it possible for you to put the same information in the PowerPoint slides in a text file, so that I can post the text file as an article describing the release?
There were no clear copyright restrictions (Open Source Code, no copyright, etc.) If you can include a copyright statement, it would help
Thanks for your excellent job.
I apologizefor the confusion. You can download the latest source code from http://www.kdkeys.net/forums/thread/6450.aspx
Thanks
Please download the Fuzzy Matching files from http://www.kdkeys.net/forums/thread/6450.aspx
The files are present at this location.
Thanks
Dear sir/madam,
I can not download the file that you have specify overhere : http://www.kdkeys.net/forums/thread/6450.aspx It said, the file is corrupted.
Could you please guide me to the right one?
Thanks in advance,
Hello M20070421,
Sorry, I haven't been out here for a while.
Run W. C. F/FTY BLK 1E Tx and WANG CHEONG F/FTY BLK 1C and W. C. F/FTY BLK 1E Tx and WANG CHEONG F/FTY BLK 1E through each individual algorithm in the Algorithm Suite form in the demo database and see which algorithm(s) give(s) the highest scores compared to each match.
I was curious so I tried it.
Levenshtein Edit Distance is the culprit.
W. C. F/FTY BLK 1E Tx
To WANG CHEONG F/FTY BLK 1C gets a score of 12
To WANG CHEONG F/FTY BLK 1E gets a score of 11
I will need to run the two comparisons in the debugger through the algorithm to see at what stage in the algorithm this abberration is occurring. Perhaps we can further fine tune this algorithm.
On another point, it is best to prepare most datasets before matching.
The two items above are a perfect example.
By scanning both datasets visually and looking for things like W. C. and Wang Cheong, it is best to create a translation table for acronyms/abbreviations.
Then an update query could be run on both datasets to replace abbreviations/acronyms with the translation.
For example if W. C. F/FTY BLK 1E Tx were changed to WANG CHEONG F/FTY BLK 1E Tx any matches would have much higher scores and there would be many fewer questionable matches to proof.
So in the translation table:
Abbrev Xlation
W.C. WANG CHEONG
etc. etcetera
You would run each dataset with the translation table in an approximate join with the criteria for the field to be cleaned as Like "* " & translationtbl.Abbrev or Like translationtbl.Abbrev & " *".
The Update To would be replace([Data Field], translationtbl.Abbrev, translationtbl.Xlation)
The assumption here is that you have a replace function that would replace in the first parameter(string), search for the second parameter, and replace with the third parameter returning the entire string intact with the replacement.
Thanks,
OpnSeason
Hi OpnSeason,
I have recently downloaded your Approx Matching tool and is planning to use it to solve a matching problem. I was wondering if you would allow user to customize your source code? If so, would you mind sent to me the VBA source code and / or the MDB file, since the one I have downloaded is locked, I cannot access both the VBA modules and the Form designs. Your help is much appreciated!
Many thanks
Regards
Casey
Dear OpnSeason,
Have been read and study you great job for the last couple of weeks. Was trying to figure out the issimilar code and the scoring mechanism but fail. really appreciate if you can send the Issimilar() source code to me in order to work on my project. please send it to wjs121@yahoo.com. Many Many Thanks.
JaZ
Hi Admin,
There is somteting wrong:
http://www.kdkeys.net/forums/6450/PostAttachment.aspx
will not give the attachement.
Could you look inot it?
Thanx!
Rutger Krijnen
I would love a copy of the original MDB files so I could improve duplicate detection in some of my databases.
Mark Andrews
mandrews___NOSPAM___@rptsoftware.com
take the ___NOSPAM___
out of the email address
Hi:
Your source code and what your experience would be valuable to many people.
If you need a blog on the site to talk about or promote your fuzzy matching source code, let me know and I will create one for you.
In addition, if you want to makeyour fuzzy matchingsource codedownloadable from this web site, let me know also and we will do so.
Good Job!
Hello Admin,
If you don't mind, I can provide the two zip files I mentioned in my original post. The source code that I provide is the text file with the algorithms. The procedures that I developed to reap the best from the algorithms is the only source not included.
I procured the algorithms from the Web. Nearly all were either not complete, bug ridden, missing sub procedures, and in need of tweaking. It took intense research and elbow grease to get the algorithms to the superb state that they are presently in. Not only that, I found them to be the best flavor and combination of algorithms for the job. These algorithms come in many versions and flavors.
Thanks,
Hello Admin,
I don't know how the zip file names got truncated, but the large sized file is for Office 2k and the other one is for Office 97.
As far as the source code goes, I originally retrieved all from the Web. As far as copyright restrictions, there were none. Any acknowledgements or whatever, I left them intact as they were when I retrieved them. I could find no copyritestatements for any of the algorithms.
Yes, I will extract the PowerPoint presentation into a text file with some other comments about the process as well. I will email it to you after I get it together.
Thanks again,
OpnSeason
Hello,
I cannot find the link to download the source code..!
I dont know what the problem is.. since i am a newbie here..!
Anyway i would appriciate it very much if someone could
provide me with that package ( fotis_routsis NOMORESPAM yahoo.gr ).
I have an school assignment in string similarity algorithms and i believe
it would be a great help..
Thank you P@radox@aueb
Here is how to use the database for matching lists of names you might have:
Purge the FullNameRef table and purge the FullNameTest table. Make copies of them 1st if you want.
If you have names that are in FIRST, MIDDLE, and LAST fields, then you can load each table with your 2 sets of names accordingly.
If you have data where all is in one field, then load the LAST field only in each table with that data - or if you have say PREFIX, FIRST, MIDDLE, LAST, and SUFFIX, then concatenate them into the LAST field.
The function NameSimilarity() in the 'FullName - ApprxMatch' query will check FIRST and MIDDLE to see if they are blank, if they are, it ignores them.
Then make a copy of the 'FullName - ApprxMatch' query.
Edit the query and delete the two columns that do the RefNum updates to each table. They were needed for testing only - for my purposes to determine the accuracy of the match job, and to fine-tune the algorithms.
The FullNameResults table was only appended with the names from the FullNameTest table that had a good match. For your purposes you will want the names/data from the FullNameRef table as well so that you can visually check to see what kind of matches you got. So that means that you will need to modify the FullNameResults table and add fields to accept the other data.
Then you can modify the 'FullName - ApprxMatch' query to append to the new fields as well.
When you are all set up, run the query and the results of the match will be in the FullNameResults table. There you can see the percentages of the matches for each data set.
OpnSeason[H]
Dear OpnSeason,
Thank you so much your great contribution. I have tried the IsSimilar() function on my 288 string data and 99% of them did have good matching. However, 3 data seems to get abnormal result:
| Test Sample | Ref String | Test String | Score = IsSimilar(RefString,TestString) |
| 1 | W. C. F/FTY BLK 1E Tx | WANG CHEONG F/FTY BLK 1C | 0.73455286 |
| 2 | W. C. F/FTY BLK 1E Tx | WANG CHEONG F/FTY BLK 1E | 0.731951237 |
| 3 | CHUNG HING #2 Tx D2 | NAN YANG A | 0.51954025 |
Compare sample 1 and sample 2, I don't understand why sample 1 get a little bit higher similarity score even though sample 2's Test String should better match the Ref Sring.
In sample 3, the score seems to be quite high even though the test string appears completely different from the Ref string. Is this behaviour normal ?
HI Folks,
I m looking for the IsSimilar() function code, so that I cant use them in the application I m developing.
Would appreciate if some one could mail me on ravindra.reddy@penske.com
Thanks in Advance,
Regards,
New bie to this forum.
Hi OpnSeason,
Many thanks for your wonderful contribution.
I would like to know where could I download the source for the IsSimiliar algorithm or if there is a
fee, I wouldn't mind paying it as I need them for a research project to find similiarity between words.
Thanks!
Cheers,
Xorion
I want to integrate some of the functions you wrote in my applications and in queries to locate duplicate names and addresses on mailing lists. I downloaded your source code but the modules were locked down. Is there any chance of getting a version that does not have this locked down so I can modify as needed on my end?
If so, please email me at jamiesroyce@hotmail.com
Thanks,
Jamie
Hi Open
Fantastic article, and I really need your help.
I am working on a project at college to do with Social Studies and I am looking at some crime data. I think that your functions are fantastic but I could really do with the IsSimilar function and Namesimilarity functions as they will help be work out a pattern I've spotted which might help me work out the likely hood of people starting to take drugs.
I can't afford much, but I'd be happy to pay a fee to get the functions. Any chance you could contact me via email at
english_twit at hotmail dot com
to discuss.
Yours hopefully,
ET
Hello OpnSeason,
I have read you article and downloaded the project/documentation. Your work is very impressive. Would you mind sending me the MDB file of your project. I am planning on modifying and converting it to a C# web service. My email is wwenzl@yahoo.com.
Thanks,
Walter
Hello,
Sorry but, where can I get it??
Could you send me a this program??
my e-mail : rararara00@gmail.com
I really need.....sorry..
Ask A Data Miner - 75,000+ Members
Follow On Twitter
Request More Information
Hi:
I don't mind at all. You are providing a service to the community. Would you be able to write up some documentation describing the uses, benefits, algorithms, computations and anything else that would help people appreciate the work that you did? We would like to make it an announcement and feature it at the data mining source code download at http://www.kdkeys.net/forums/60/ShowForum.aspx
Thanks