For those who are interested in Approximate String Matching or those who could use these algorithms; I have a complete suite of Approximate String Matching algorithms written in Visual Basic in an Access database.
In 2004 I decided to jump into the world of Fuzzy Matching with both feet.
As it is, I am working for a company that deals with names, addresses, etc. very intensely. It is a fair sized company that
uses Access on a grand scale. Since I am an Access programmer, I work in an Access gold mine!
I knew that if I could get a good handle on Fuzzy Matching, that when I hit the right person at the right time, the company could greatly benefit from my research on Fuzzy Matching. The right time and the right person are not here yet.
Nevertheless, since I have reaped much free source code and information from the Web, it is now time to return the favor.
I developed a package that is sort of a demo/tutorial on Approximate String Matching algorithms in Access that is very
robust in Fuzzy Matching. It would overtax the post in this forum for me to include it in a post.
To summarize, it works with the basic name - Last, First, and Middle. It has a user interface that allows a user to type in
what would be a good name and what would be a questionable name to resemble the good name. The weighted results of all the various algorithms can be chosen, or an individual algorithm can be chosen to display how closely the names match.
In addition, it has a table of 17,295 known good names with unique ID numbers as a reference table, and table of 1200
morphed names that are typical of names entered in a database with no input conventions. These morphed names have typos, transpositions, variations on maiden names, etc. 1200 good names were selected for alteration and the unique ID of each original good name was stored in the table with the altered names to determine the accuracy of the matching process.
The morphed names were compared to the known good names in a query with an approximate join using the suite of algorithms to determine match percentage. The altered names, the ID number of the original good name, the ID number of the name it matched to, and the match percentage were stored in a results table to determine the results of the matching run.
These tables were used to test and tweak the algorithms by comparing the morphed names with the known good names. The results of 1322 names were saved to a results table with match scores.
The matching process was executed in a query with an approximate join using the suite of algorithms.
The match results:
Total Approximate Matches: 1188
(Recall) Precision Pct: 99.00%
Total Unmatched Names: 12
Unmatched Pct: 1.00%
Total Other Matches: 134
Other Matches Pct: .77%
The tables are accessible in the database, so anyone can run their own tests. The interface is set up to accommodate this
The algorithms used: Dice coefficient as a threshold algorithm, Levenshtein Distance algorithm, Longest Common Subsequence, and the DoubleMetaphone. The names were passed to the algorithms by way of the bigram model.
I will email it to anyone who requests it.
It is in two platforms, Office 97 and Office 2000 as FuzzyMatching97.zip (692 KB) and FuzzyMatching2k.zip (721 KB).
The zip files include ApprxStrMatchingEngine97.pps or ApprxStrMatchingEngine2k.pps respectively, StrMatching97.mde or StrMatching2k.mde respectively, IEEESoundexV5.pdf, and VBAlgorithms.txt.
IEEESoundexV5.pdf is an abstract about Approximate Sting Matching that fired my curiosity about the subject, and pertains to the package.
VBAlgorithms.txt contains the entire suite of algorithms in Visual Basic extracted from the MDB modules.
The PowerPoint presentations describe the workings of the MDE and give a good overview of Fuzzy Matching.
Post A Reply