The Approximate String Matching Engine
Approximate string matching (fuzzy matching) is a technology used today to correct the myriad data problems and errors that exist in databases, communications, and text corpora.
Fuzzy matching is a product of artificial-intelligence work in computer science.
At present, mostly large corporations, governments, and sizeable institutions can afford to utilize fuzzy matching, since the expense of implementation is considerable.
The Approximate String Matching Engine demonstrates that fuzzy matching technology is within the reach of others as well.
Wherever large datasets exist, approximate string matching is needed.
The need for approximate string matching arises from the various problems associated with data and text sets.
Approximate string matching is used in data mining, data cleansing, data research, national intelligence, genetic research, communications, etc.
Fuzzy matching is a programmatic process for determining the similarity between two strings, such as names, addresses, drug names, materials (as in engineering), or parts descriptions, when there is knowledge or suspicion that the two strings differ and that they may need to be merged, updated, purged, or simply identified.
Optimally, exact matching should precede fuzzy matching.
A closer look at data
Some of the anomalies between two strings or bodies of text that call for approximate string matching are:
Typing mistakes, abbreviations, different data entry conventions, truncation, inconsistencies in data formatting, and a number of others specific to the type of data.
Match entries with typing mistakes:
Divesh Srivastava vs. Divesh Shrivastava
Match entries with abbreviations:
Euroaft Corporation vs. Euroaft Corp.
Match entries with different conventions:
Comp. Sci. Dept. vs. Dept. of Comp. Sci.
Match entries with inconsistent formatting:
010104 vs. 01-01-04
Sources of problems:
No uniform standards for content and formats
Parallel data entry (duplicates)
Combining data sets (acquisitions, across departments)
Common sources of problems:
Heterogeneous data: no common key, different field formats
Legacy data: IMS, spreadsheets, ad-hoc structures
The MS Access® Database Program
On opening StrMatching.mde, the Algorithm Suite form opens (screen shot in the PowerPoint presentation).
Names can be entered into the two name lines, and clicking the Ok button returns a score from the algorithm selected in the Select Algorithm box.
The box defaults to All Algorithms, which returns the similarity score of the two names as a percentage of similarity. A score of one is the highest similarity: exact similarity.
The other algorithms return their own scores of similarity.
A brief description of each algorithm and its scoring method is displayed in the large window upon selecting it in the Select Algorithm box.
When last, first, and middle names are entered into their respective slots, each last, first, and middle are checked individually for similarity to the ones in the other name.
Last, first, and middle names could be entered into the last name slots only, and the entire names will be compared to each other for similarity.
This means addresses, etc. could be entered into the last name slots for similarity comparison.
Even though The Approximate String Matching Engine is not specifically set up to deal with addresses as it is with names, it will still do a reasonable similarity assessment for entries other than names.
One algorithm, Double Metaphone, does not have a radio selection button; it has its own standard button instead.
That button opens another window for the algorithm, pre-filled with the names entered in the two name lines of the Algorithm Suite form.
Alternatively, another pair of names can be entered there.
Clicking the Ok button will generate the phonetic codes returned by the algorithm.
See screen shot next slide.
The algorithms will be reviewed in more detail in the technical section of this presentation.
The Approximate String Matching Engine is an MS Access database that has a corpus of 17,295 names in a table with unique ID numbers (field name is RefNum). These names are considered known good names.
The table name is FullNameRef.
Another table, FullNameTest, has a corpus of 1200 altered names derived from the FullNameRef table.
Each altered name was copied as a record from the FullNameRef table complete with RefNum (ID number) as OrigNum for tracking purposes.
Each name was changed so that it exhibited some of the many name variations or errors resulting from data entry, import from a foreign source, etc.
Some of the alterations include truncation of names (first or last letter), transposition of letters, wrong or missing letters, transposed first and middle names, a hyphenated last name in one record with only the name from one side of the hyphen appearing in the other record's last or middle name, a middle initial in one name vs. a full middle name in the other, one middle name present vs. the other middle blank, various misspellings, different naming conventions, etc.
A query was run comparing each name in the FullNameTest (altered names) table to each name in the FullNameRef table, submitting both names to the NameSimilarity function. NameSimilarity analyzes the names for similarity by using another function, IsSimilar, to compare each last, first, and middle name pair.
This is a screen shot of the tables in Access 97.
The IsSimilar function passes each pair of names to a suite of similarity algorithms to determine whether the pair is similar enough to be considered a good match.
If the NameSimilarity function determines that the Ref name and the Test name have a similarity value greater than .79, then the pair of names is appended to the FullNameResults table along with the similarity value and the RefNum of the matched name from the FullNameRef table.
After the run is completed, a query is run on the FullNameResults table to count how many original RefNums equal the newly matched RefNums and to calculate the recall count and percentage (reported as the Precision Pct in the results below).
The number of unmatched is the number of those that are in the FullNameTest table that do not reside in the FullNameResults table.
The other matches, whose matched RefNums do not equal the original RefNums, may be additional good matches or false positives.
Upon completion of coding the algorithms and the other supporting analysis functions, a run was done of the 1200 altered names in the FullNameTest table against the FullNameRef table with 17,295 names, and the results were:
Total Approximate Matches: 1188
(Recall) Precision Pct: 99.00%
Total Unmatched Names: 12
Unmatched Pct: 1.00%
Total Other Matches: 134
Other Matches Pct: .77%
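These percentages follow directly from the counts. Note that the Other Matches Pct only works out if it is computed against the 17,295-name reference table rather than the 1200 test names (a quick check; Python used here purely for arithmetic):

```python
matched = 1188      # test names matched back to their original RefNum
unmatched = 12      # test names with no match above the 0.79 threshold
other = 134         # matches whose RefNum differs from the original
test_total = 1200   # rows in FullNameTest
ref_total = 17295   # rows in FullNameRef

print(round(matched / test_total * 100, 2))    # 99.0
print(round(unmatched / test_total * 100, 2))  # 1.0
print(round(other / ref_total * 100, 2))       # 0.77
```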
The Approximate String Matching Engine is a test bed, a demo, and an analysis tool for programmers who need to customize the Approximate String Matching Engine to accommodate various approximate string matching scenarios.
It is a very powerful approximate string matching tool.
The IsSimilar function is a generic approximate matching function that can be used to compare the similarity between any two strings.
That means it could be used for address matching, organizational name matching, genetic sequencing, or any other string-matching need, short of long sentences, paragraphs, and large bodies of text. For those, there are other algorithms and algorithmic methods suited to that end.
This is a screen shot of the queries in Access 97.
This adaptation of the Approximate String Matching Engine is customized for last, first, and middle name matching in the NameSimilarity function.
The NameSimilarity function can be modified easily to deal with prefixes, suffixes, etc. for more comprehensive name matching.
Most string matching endeavors will need some sort of programmatic customization to accommodate the job at hand.
For example, addresses will need to have a reference table of postal/mail box acronyms, such as, P.O. Box, POB, Mail Stop, M Stop, M. Stop, PMBX, etc., and their translations.
Street abbreviations and translations would need to be in another reference table. Addresses are as daunting as names, and probably more so.
In the query screen shot the FullName – ApprxMatch query is the one that does all the work. It matches the names between the FullNameRef table and the FullNameTest table and appends the matched names into the FullNameResults table.
FullName – ApprxMatch query.
FullName – ApprxMatch SQL:
INSERT INTO FullNameResults ( last_name, first_name, middle_name, RefNum, OrigNum, MatchValue )
SELECT FullNameTest.last_name, FullNameTest.first_name, FullNameTest.middle_name, FullNameRef.RefNum,
FullNameTest.RefNum, NameSimilarity([FullNameRef].[last_name] & "~" & [FullNameRef].[first_name] & "~" &
[FullNameRef].[middle_name],[FullNameTest].[last_name] & "~" & [FullNameTest].[first_name] & "~" &
[FullNameTest].[middle_name]) AS Expr1
FROM FullNameRef, FullNameTest
WHERE (((NameSimilarity([FullNameRef].[last_name] & "~" & [FullNameRef].[first_name] & "~" &
[FullNameRef].[middle_name],[FullNameTest].[last_name] & "~" & [FullNameTest].[first_name] & "~" &
[FullNameTest].[middle_name]))>0.79))
ORDER BY FullNameTest.last_name;
The other queries are supporting queries for compiling the result numbers and are the underlying queries for viewing the results in other forms.
The False Positives query is a select query for viewing other matches that have no OrigNum, since they were not set up in the FullNameTest table.
Some of these other matches will be good matches and others will be false positives.
To get an assessment of good matches vs. false positives a form can be opened to view the other matches by clicking the button on the Run FullName form shown on an upcoming slide.
A checkbox is provided (GoodMatch) for isolating good matches and false positives.
The next interface (screen shot on next slide) is the form for running the fuzzy matching query (FullName – ApprxMatch) between the table of reference names and test names. The Vu buttons open respective forms for viewing the results of the run.
The form is self-explanatory, so there is no need to elaborate.
There is no need to run this set of names because the run has already been done. If you want to time the run on a particular computer, then you may want to run the matching query again. After the run is finished, the form displays the elapsed time.
Naturally, the runtime will vary because it is subject to the ability of the computer being used.
The runtime can be considerable because the query cross-joins the two un-joined tables, FullNameRef and FullNameTest, which amounts to 17,295 × 1200 (over 20 million) name comparisons.
On a Hewlett Packard dc7100 with a Pentium(R) 4, 3.20GHz, 504 MB of RAM, the run time was approximately 7 hours.
A run time of 7 hours is relatively cheap compared to the use of a human to resolve 1200 names against the known good base of 17,295 names. Chances are, the accuracy rate for a human in this case would fall far below the Precision Pct of 99%, more likely somewhere in the vicinity of 85% to 90%.
Also, the time cost of the human may be days.
In this database, you have the ability to delete the names from the FullNameTest table and create your own set of altered names from the FullNameRef table. Or you can reload the FullNameRef table with another corpus of names for testing.
I found this suite of algorithms to be the best for this application.
The strategy here was to use each algorithm in a certain order, as well as a threshold into the next algorithm.
Additionally, I had to find a way to evaluate the Double Metaphone encoding numerically.
Since the names are in three parts (last, first, and middle), the parsing strategy was to evaluate the full names against each other and then to evaluate each last, first, and middle part against its counterpart individually.
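To illustrate the cascade idea, here is a minimal sketch in Python. This is not the actual VBA code: the 0.2 Dice floor comes from the description further below, but the edit-distance ratio and the function names are invented for illustration.

```python
def _bigrams(s):
    # Padded bigrams, e.g. "Nelson" -> %n ne el ls so on n#
    s = "%" + s.lower() + "#"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def _dice(a, b):
    ba, bb = _bigrams(a), _bigrams(b)
    shared = sum(min(ba.count(g), bb.count(g)) for g in set(ba))
    return 2 * shared / (len(ba) + len(bb))

def _edit(a, b):
    # Plain Levenshtein distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_similar(a, b, dice_floor=0.2, edit_ratio=0.34):
    # Stage 1: cheap q-gram filter; reject outright at or below the floor.
    if _dice(a, b) <= dice_floor:
        return False
    # Stage 2: edit distance must be small relative to the longer string.
    # The 0.34 ratio is an assumed placeholder, not the engine's value.
    return _edit(a.lower(), b.lower()) <= edit_ratio * max(len(a), len(b))
```

The point of the ordering is cost: the bigram filter is cheap and discards obvious non-matches before the more expensive dynamic-programming algorithms run.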
The nuts and bolts:
The Dice Coefficient
The method used to feed the string to this algorithm is the linear q-gram (or n-gram) model.
Q or N would be the length of the gram.
Q-gram: monograms, bigrams, trigrams, etc.
Example bigram for “Nelson”: %N Ne el ls so on n#
The Dice Coefficient is used as the threshold algorithm.
D is Dice Coefficient
SB is Shared Bigrams
TBg1 is total number of bigrams in Qgram1
TBg2 is total number of bigrams in Qgram2
D = (2SB)/(TBg1+TBg2)
Double the number of shared bigrams and divide by total number of bigrams in each string.
A good Dice Coefficient value would be a value greater than 0.2.
If the Dice value is less than 0.2, the comparisons are rejected immediately and not analyzed by the other algorithms.
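A direct transcription of this scheme, padded bigrams plus the Dice formula, might look like the following (Python here purely for illustration):

```python
def bigrams(s):
    # Pad with start/end markers, as in the "Nelson" example above.
    s = "%" + s.lower() + "#"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice(a, b):
    ba, bb = bigrams(a), bigrams(b)
    # Shared bigrams, counting duplicates only as often as
    # they appear in both strings.
    shared = sum(min(ba.count(g), bb.count(g)) for g in set(ba))
    return 2 * shared / (len(ba) + len(bb))

print(bigrams("Nelson"))  # ['%n', 'ne', 'el', 'ls', 'so', 'on', 'n#']
print(round(dice("Nelson", "Nilsen"), 2))  # 0.43 -- passes the 0.2 floor
print(dice("Adam", "Nelson"))              # 0.0  -- rejected immediately
```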
Levenshtein Edit Distance
The Levenshtein Distance algorithm is named after the Russian scientist Vladimir Levenshtein, who devised it in 1965.
Levenshtein edit distance is the number of insertions, deletions, or replacements of single characters that are
required to convert one string to the other.
Character transposition is detected.
Transposition is given a cost of 1.
Transposition detection is taken from:
Berghel, Hal ; Roach, David : "An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm"
A Levenshtein Distance of 2, 3, or more characters is meaningless without evaluating the lengths of the two comparison strings as well.
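A sketch of the distance as described, with adjacent transpositions also costing 1 (this is the "optimal string alignment" variant; a minimal illustration, not the translated VBA):

```python
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deletions down to the empty string
    for j in range(n + 1):
        d[0][j] = j  # insertions up from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition, also at cost 1
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(edit_distance("Srivastava", "Shrivastava"))  # 1 (one insertion)
print(edit_distance("Nelson", "Nelosn"))           # 1 (one transposition)
```

A distance of 1 means very different things for a 4-character string and a 12-character string, which is why the lengths must be considered alongside the raw distance.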
Longest Common Subsequence
The LCS is calculated by the length of the longest common, not necessarily contiguous, sub-sequence of characters
divided by the average character lengths of both strings.
A good value for LCS is a value greater than or equal to 0.8.
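The score as defined can be sketched with standard dynamic programming for the subsequence length, then dividing by the average length (illustrative Python, not the original code):

```python
def lcs_len(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_score(a, b):
    # Length of the longest common (not necessarily contiguous)
    # subsequence divided by the average length of the two strings.
    return lcs_len(a, b) / ((len(a) + len(b)) / 2)

print(lcs_len("Nelson", "Nilsen"))             # 4 ("N", "l", "s", "n")
print(round(lcs_score("Nelson", "Nilsen"), 2)) # 0.67 -- below the 0.8 mark
```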
The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms.
He wrote it as a replacement for SOUNDEX.
These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic
matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic
relationship between "Adam" and "Nelson."
Double Metaphone is designed primarily to encode American English names (though it also encodes most English words well) while taking into account the fact that such words can have more than one acceptable pronunciation.
Double Metaphone can compute a primary and a secondary encoding for a given word or name to indicate both the most
likely pronunciation as well as an optional alternative pronunciation (hence the "double" in the name).
After obtaining these algorithms, I decided which were the best to use. Some came in many different flavors, so I had to decide which flavors to use as well.
I had to translate these algorithms from C, C++, Perl, and Visual Fox Pro. Debugging was the next step.
Each one of these algorithms uses various loops to assess the similarity of the strings passed to it.
As originally written, each algorithm used the substring method to parse each string, which is very inefficient for performance.
Therefore, to improve performance, I optimized the algorithms to accept the strings in arrays for loop parsing.
Now when the two strings enter the IsSimilar function, each character of each string is placed into a single array element of the respective string array as a one-time substring process.
In the case of bigrams, two characters are placed into one array element of each bigram array for each string using the one time substring process.
So the substring process is used only four times at the start before passing the string arrays to the respective algorithms – instead of possibly hundreds of times as the strings are processed in the algorithms.
One algorithm that is not included in this suite of algorithms and could possibly enhance it is the Porter stemming algorithm.
I am researching it at this time. Once again, there are many different flavors of the algorithm.
There are three files included with this package:
This file - ApprxStrMatchingEngine.ppt
The database - StrMatching.mde
The paper - IEEESoundexV5.pdf
I now call your attention to the IEEESoundexV5.pdf.
The document is an abstract about approximate string matching by David Holmes and M. Catherine McCabe.
It gives a fair overview of the fuzzy string matching process.
The peculiar thing about the document is that it mentions precision and recall frequently, but fails to touch on other possible good matches and false positives – the other side of the fuzzy matching coin.
Beyond in-house use, fuzzy string matching can be used to generate revenue in:
Research as mentioned in the abstract
Developing tailored applications for specific needs
Data standardization, etc.
I invite you to explore The Approximate String Matching Engine.
The PowerPoint presentation in the package is a full introduction into the world of fuzzy matching. Each algorithm is described, and the strategy of fuzzy matching is addressed. It also describes the database program.
PLEASE POST QUESTIONS TO THIS THREAD: http://www.kdkeys.net/approximate-string-matching-engine-fuzzy-matching-algorithm-and-source-code/#link-94#6460
OR CONTACT OPNSEASON DIRECTLY.