If you’re reading this and you haven’t read Part 1 of the Messy Data trilogy, then please take a few minutes to read Part 1 now. This will make much more sense if you’ve done so.
In Part 1, the method I used compared every single row in ListA to every single row in ListB. It ran the Similarity function twice for each algorithm we used (once to do the JOIN, and once again to show the MatchScore in the SELECT statement). So if you have a list of 100,000 records that you’re trying to clean, you will do 80 billion (100,000 in ListA * 100,000 in ListB * 4 Similarity algorithms * 2 times per comparison) calculations. That takes forever.
Here is the faster approach: we find the first word of every name in the list, then run the Similarity function only on records that have matching first words, instead of comparing every record to every record. Additionally, the default is to run only one Similarity algorithm. However, if that algorithm determines that the two records don’t match, we keep running the other three algorithms until we either get a match score above zero or run out of algorithms. Much better, right?
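To make the first-word idea concrete, the candidate-pairing step boils down to a self-join on the first word. Here is a conceptual sketch (the full stored procedure at the end of this post adds the City column and the score columns):

```sql
-- Conceptual sketch: only pairs that share a first word become candidates,
-- so the expensive Similarity calls run on a small fraction of all pairs.
SELECT ListA.[CompanyName], ListB.[CompanyName]
FROM [Company_FirstWord] ListA
JOIN [Company_FirstWord] ListB
    ON ListA.[CompanyFirstWord] = ListB.[CompanyFirstWord]
```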
Let’s look at the Pros and Cons of this method versus the slower method I described in Part 1:

Pro:
- This is hundreds of times faster than the other approach. What would have taken multiple days can now be done in less than an hour, which also means less stress on your server.

Cons:
- If two records don’t have perfectly matching first words, they never even get compared. In the data set I used in Part 1, I had a record called “Decisive Data” and a record called “Decisve Data”. Because these two have different first words, they can never be matched together.
- If the first algorithm returns a match score above zero, no further algorithms are run. But it’s possible that one of the other algorithms would have returned a higher score, so you may rule out a valid match when you filter for matches above a certain level.
Fuzzy matching is, by nature, an imperfect process. In my opinion, the disadvantages to this method are well worth the ability to match hundreds of thousands of records to each other. But, if you have the time and the computer power, you can always use the slower method for a more comprehensive mapping.
This approach is almost the same as Part 1, except for three things:
- You will have to find the first word of every name and create a new table that stores the first words in a separate column.
- When creating your TempMatch table, make four MatchScore columns instead of just one.
- The FuzzyMatch stored procedure is a bit different.
To find the first word of every name, I use a function that I found here. You can use this code to create the function:
CREATE FUNCTION [dbo].[GetFirstWord] (@StringVar VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @rtStringVar VARCHAR(MAX)

    SELECT @rtStringVar = CASE CHARINDEX(' ', LTRIM(@StringVar), 1)
        WHEN 0 THEN LTRIM(@StringVar)
        ELSE SUBSTRING(LTRIM(@StringVar), 1, CHARINDEX(' ', LTRIM(@StringVar), 1) - 1)
    END

    RETURN @rtStringVar
END
GO
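If you want to sanity-check the function before moving on, a few quick calls show how it behaves (LTRIM handles leading spaces, and single-word names come back unchanged):

```sql
SELECT dbo.GetFirstWord('Decisive Data')   -- returns 'Decisive'
SELECT dbo.GetFirstWord('  Decisive Data') -- leading spaces are trimmed; returns 'Decisive'
SELECT dbo.GetFirstWord('Acme')            -- no space, so the whole string comes back: 'Acme'
```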
Now, you’ll want to create a new table that has your original columns plus a column to store the first word of each name. Here’s how I made mine:
SELECT [CompanyID]
    ,[CompanyName]
    ,[City]
    ,TestDB.dbo.GetFirstWord([CompanyName]) AS CompanyFirstWord
INTO [TestDB].[dbo].[Company_FirstWord]
FROM [TestDB].[dbo].[Company]
OK, now you’re ready to create your TempMatch table, this time with four MatchScore columns.
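If you need a starting point for that table, here is a sketch of what it might look like. The column names come from the stored procedure at the end of this post; the data types are assumptions, so adjust them to match your source table:

```sql
CREATE TABLE [TestDB].[dbo].[Company_TempMatch]
(
    [FirstCompanyID] int,            -- assumed type; match your Company table
    [FirstCompanyName] varchar(255), -- assumed length
    [SecondCompanyID] int,
    [SecondCompanyName] varchar(255),
    [City] varchar(255),
    [MatchScore0] float,
    [MatchScore1] float,
    [MatchScore2] float,
    [MatchScore3] float
)
```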
Lastly, here is the code for the modified stored procedure, which I call dbo.FuzzyMatchBigData_Company:
USE [TestDB]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[FuzzyMatchBigData_Company]
(
    @MatchScore float = .8
)
AS

TRUNCATE TABLE [TestDB].[dbo].[Company_TempMatch]

-- Build the candidate pairs: only records sharing a first word (and City) are compared
INSERT [TestDB].[dbo].[Company_TempMatch]
SELECT ListA.[CompanyID] AS [FirstCompanyID]
    ,ListA.[CompanyName] AS [FirstCompanyName]
    ,ListB.[CompanyID] AS [SecondCompanyID]
    ,ListB.[CompanyName] AS [SecondCompanyName]
    ,ListA.[City]
    ,CAST(0 AS float) AS MatchScore0
    ,CAST(0 AS float) AS MatchScore1
    ,CAST(0 AS float) AS MatchScore2
    ,CAST(0 AS float) AS MatchScore3
FROM
(
    SELECT [CompanyID]
        ,[CompanyName]
        ,[CompanyFirstWord]
        ,[City]
    FROM [TestDB].[dbo].[Company_FirstWord]
) ListA
JOIN
(
    SELECT [CompanyID]
        ,[CompanyName]
        ,[CompanyFirstWord]
        ,[City]
    FROM [TestDB].[dbo].[Company_FirstWord]
) ListB
    ON ListA.[CompanyFirstWord] = ListB.[CompanyFirstWord]
    AND ListA.[City] = ListB.[City]

-- Run the first algorithm on every candidate pair
UPDATE [TestDB].[dbo].[Company_TempMatch]
SET MatchScore0 = MasterDataServices.mdq.Similarity([FirstCompanyName], [SecondCompanyName], 0, 1.0, @MatchScore)

-- Run each remaining algorithm only where every previous algorithm scored zero
UPDATE [TestDB].[dbo].[Company_TempMatch]
SET MatchScore1 = MasterDataServices.mdq.Similarity([FirstCompanyName], [SecondCompanyName], 1, 1.0, @MatchScore)
WHERE MatchScore0 = 0

UPDATE [TestDB].[dbo].[Company_TempMatch]
SET MatchScore2 = MasterDataServices.mdq.Similarity([FirstCompanyName], [SecondCompanyName], 2, 1.0, @MatchScore)
WHERE MatchScore0 = 0 AND MatchScore1 = 0

UPDATE [TestDB].[dbo].[Company_TempMatch]
SET MatchScore3 = MasterDataServices.mdq.Similarity([FirstCompanyName], [SecondCompanyName], 3, 1.0, @MatchScore)
WHERE MatchScore0 = 0 AND MatchScore1 = 0 AND MatchScore2 = 0
GO
There you have it! Now you can scan through your new TempMatch table using a WHERE clause like WHERE (MatchScore0 > 0.8 OR MatchScore1 > 0.8 OR MatchScore2 > 0.8 OR MatchScore3 > 0.8) and pick out which records will make it to FinalMatch.
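For example, a query like this pulls out the likely matches. The extra predicate on the IDs is my addition: since the self-join pairs every record with itself, you’ll probably want to filter those perfect self-matches out.

```sql
SELECT [FirstCompanyID], [FirstCompanyName], [SecondCompanyID], [SecondCompanyName]
FROM [TestDB].[dbo].[Company_TempMatch]
WHERE (MatchScore0 > 0.8 OR MatchScore1 > 0.8 OR MatchScore2 > 0.8 OR MatchScore3 > 0.8)
    AND [FirstCompanyID] <> [SecondCompanyID] -- skip records matched to themselves (my addition)
```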
Stay tuned for Messy Data Part 3, where I discuss some functions that can help clean up manually entered data.
Posted by Brian Pohl