In a perfect world, every database would be perfectly normalized, and nobody would ever manually enter a value into a table. Unfortunately, this is reality, and not everyone is a compulsively organized data analyst like me. The result? Messy data.
Nothing is more frustrating than seeing names that are only slightly off from each other like this:
CompanyID | CompanyName |
1000 | Decisive Data |
1001 | Decisive Data, Inc. |
1002 | Decisive Data Inc. |
A similar problem I often encounter is when you have to match two lists that only have names, and no common IDs or any other field to JOIN them on. For example, a client asks you to match their unique, internally maintained list of accounts to some universal or shared list of accounts. Fortunately, the solutions to both of these issues are almost identical. We can do a “fuzzy match” – the process of using algorithms to determine approximate (hence, fuzzy) similarity between two sets of data.
Performing this fuzzy match requires Master Data Services for SQL Server Management Studio. I’m using SQL Server 2014 Enterprise, but Master Data Services is available for the following versions:
- SQL Server 2014 Business Intelligence 64-bit
- SQL Server 2014 Enterprise 64-bit
- SQL Server 2012 Business Intelligence 64-bit
- SQL Server 2012 Enterprise 64-bit
- SQL Server 2008 R2 Datacenter 64-bit
- SQL Server 2008 R2 Enterprise 64-bit
If you have one of these versions, then you can follow these installation steps (if you’re not using SQL Server 2014, make sure to pick your appropriate version at the top of the page using the dropdown menu that says “Other versions”) to install Master Data Services.
Once you’ve got MDS installed, we can begin. Here is my starting list of messy data:
In the end, I will have a table to which I can JOIN dbo.Company that will allow me to map this messy list to a neat, distinct list. Before we get there, let’s create an intermediate table that we’ll use in the matching process. This table will show the ID and name of the potentially matching records, and since I’m going to make sure I only map companies in the same city, I will only need one city column. Lastly, there needs to be a column to hold the match percentage.
This table will hold all of the potential matches, which will be dumped there as the result of a stored procedure. I’m going to create another table called dbo.Company_FinalMatch, which will store the records that I determine are truly accurate matches. This Company_FinalMatch table is exactly the same as Company_TempMatch except it doesn’t have a blank for MatchScore.
Now comes the fun part! Here is the code for the stored procedure I will use to run the fuzzy match and dump the matches into dbo.Company_TempMatch.
USE [TestDB] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO --Use this code to execute the procedure once you've created it: EXEC [TestDB].[dbo].[FuzzyMatch_Company] @MatchScore = .8 CREATE procedure [dbo].[FuzzyMatch_Company] ( @MatchScore float = .8 --You can change this every time you execute the stored procedure, but here I've set it to default to an 80% match or greater ) AS TRUNCATE TABLE [dbo].[Company_TempMatch] INSERT INTO [dbo].[Company_TempMatch] SELECT ListA.CompanyID AS FirstCompanyID ,ListA.CompanyName AS FirstCompanyName ,ListB.CompanyID AS SecondCompanyID ,ListB.CompanyName AS SecondCompanyName ,ListA.City ,MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 0, 1.0, @MatchScore) as MatchScore FROM ( SELECT [CompanyID] ,[CompanyName] ,[City] FROM [TestDB].[dbo].[Company] ) ListA --ListB, in this example, is the same table as ListA (dbo.Company) because I want to clean up one list. If you are trying to match two different lists of names, --as I mentioned earlier, you would put your second list here in ListB JOIN ( SELECT [CompanyID] ,[CompanyName] ,comp.[City] FROM [TestDB].[dbo].[Company] comp --If you run this procedure more than once, you can use this to rule out records you've already matched LEFT JOIN [dbo].[Company_FinalMatch] fin ON comp.CompanyID = fin.SecondCompanyID WHERE fin.SecondCompanyID IS NULL ) ListB --This is the condition on which the two lists are joined - the match percentage must be above the MatchScore that you set when you call this procedure ON MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 0, 1.0, @MatchScore) >= @MatchScore --And, if the records are to be matched, the two companies must be in the same city AND ListA.City = ListB.City GO
I’ve provided some comments so that you can more easily adapt my code to your tables. As one of the comments mentions, if you are fuzzy matching two different lists, rather than just trying to clean one list as I am, you can change the “ListB” section to pull from your second list.
There is actually a little more modification I still need to make to this code before it is ready to be executed. This function has four different algorithms that it can run to compare two strings, and at the moment, it’s only using one of them. Below, I’m adding several UNIONs to my stored procedure so that it uses all four different algorithms.
USE [TestDB] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO --Use this code to execute the procedure once you've created it: EXEC [TestDB].[dbo].[FuzzyMatch_Company] @MatchScore = .8 ALTER procedure [dbo].[FuzzyMatch_Company] ( @MatchScore float = .8 --You can change this every time you execute the stored procedure, but here I've set it to default to an 80% match or greater ) AS TRUNCATE TABLE [dbo].[Company_TempMatch] INSERT INTO [dbo].[Company_TempMatch] SELECT ListA.CompanyID AS FirstCompanyID ,ListA.CompanyName AS FirstCompanyName ,ListB.CompanyID AS SecondCompanyID ,ListB.CompanyName AS SecondCompanyName ,ListA.City ,MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 0, 1.0, @MatchScore) as MatchScore FROM ( SELECT [CompanyID] ,[CompanyName] ,[City] FROM [TestDB].[dbo].[Company] ) ListA JOIN ( SELECT [CompanyID] ,[CompanyName] ,comp.[City] FROM [TestDB].[dbo].[Company] comp LEFT JOIN [dbo].[Company_FinalMatch] fin ON comp.CompanyID = fin.SecondCompanyID WHERE fin.SecondCompanyID IS NULL ) ListB ON MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 0, 1.0, @MatchScore) >= @MatchScore AND ListA.City = ListB.City UNION --The following are exact copies of the above code, except the matching algorithms are changed in each. If you want to be more concise, you could do this easily with a cursor and dynamic SQL! SELECT ListA.CompanyID AS FirstCompanyID ,ListA.CompanyName AS FirstCompanyName ,ListB.CompanyID AS SecondCompanyID ,ListB.CompanyName AS SecondCompanyName ,ListA.City ,MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 1, 1.0, @MatchScore) as MatchScore FROM ( SELECT [CompanyID] ,[CompanyName] ,[City] FROM [TestDB].[dbo].[Company] ) ListA JOIN ( SELECT [CompanyID] ,[CompanyName] ,comp.[City] FROM [TestDB].[dbo].[Company] comp LEFT JOIN [dbo].[Company_FinalMatch] fin ON comp.CompanyID = fin.SecondCompanyID WHERE fin.SecondCompanyID IS NULL ) ListB ON MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 1, 1.0, @MatchScore) >= @MatchScore AND ListA.City = ListB.City UNION SELECT ListA.CompanyID AS FirstCompanyID ,ListA.CompanyName AS FirstCompanyName ,ListB.CompanyID AS SecondCompanyID ,ListB.CompanyName AS SecondCompanyName ,ListA.City ,MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 2, 1.0, @MatchScore) as MatchScore FROM ( SELECT [CompanyID] ,[CompanyName] ,[City] FROM [TestDB].[dbo].[Company] ) ListA JOIN ( SELECT [CompanyID] ,[CompanyName] ,comp.[City] FROM [TestDB].[dbo].[Company] comp LEFT JOIN [dbo].[Company_FinalMatch] fin ON comp.CompanyID = fin.SecondCompanyID WHERE fin.SecondCompanyID IS NULL ) ListB ON MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 2, 1.0, @MatchScore) >= @MatchScore AND ListA.City = ListB.City UNION SELECT ListA.CompanyID AS FirstCompanyID ,ListA.CompanyName AS FirstCompanyName ,ListB.CompanyID AS SecondCompanyID ,ListB.CompanyName AS SecondCompanyName ,ListA.City ,MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 3, 1.0, @MatchScore) as MatchScore FROM ( SELECT [CompanyID] ,[CompanyName] ,[City] FROM [TestDB].[dbo].[Company] ) ListA JOIN ( SELECT [CompanyID] ,[CompanyName] ,comp.[City] FROM [TestDB].[dbo].[Company] comp LEFT JOIN [dbo].[Company_FinalMatch] fin ON comp.CompanyID = fin.SecondCompanyID WHERE fin.SecondCompanyID IS NULL ) ListB ON MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 3, 1.0, @MatchScore) >= @MatchScore AND ListA.City = ListB.City GO
In each statement I changed, in both the SELECT clause and the JOIN clause, the parameters entered into the Similarity function. Here are the different versions:
MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 0, 1.0, @MatchScore)
MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 1, 1.0, @MatchScore)
MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 2, 1.0, @MatchScore)
MasterDataServices.mdq.Similarity(ListA.CompanyName, ListB.CompanyName, 3, 1.0, @MatchScore)
If you’d like to learn more about the different algorithms and other parameters in the Similarity function, you can read this post.
Now I can finally run my stored procedure! I’ve set @MatchScore to equal 0.9, and I got 40 rows returned in my Company_TempMatch table. Since some of these are duplicated because of the different matching algorithms used, I’m going to change my SELECT statement to be DISTINCT and to not include MatchScore. Now I have 28 rows.
This point in the process is everyone’s least favorite – the manual confirmation of each match. Since I have only 28 rows, it’s very easy for me to confirm these. But usually, there are thousands if not millions of rows, and reading each one is not plausible. The main strategy I use is using WHERE to narrow my search down to records with a MatchScore of 0.98 or higher, briefly scanning those to make sure they’re accurate, and then inserting them into my Company_FinalMatch table. After that, I change my WHERE clause to say WHERE MatchScore >= 0.95 AND MatchScore < 0.98, and then repeat the process. Another strategy I use is to remove the repeated incorrect matches – sometimes you’ll find that one name like “ABC Company” is matching to “AXE Company” and “NBC Company” and “BBC Company”, or things like that. I will add AND SecondCompanyName <> ‘ABC Company’ to my WHERE clause in that case.
Ok, now I’ve added all of my matches into my Company_FinalMatch table. As I said above, this is still 28 rows. If you are like me, and you’re cleaning one list of messy names, then we still have a little more work to do. If you are using this process to map two different name lists to each other, then you’re done! Pat yourself on the back. You’ve earned it.
For the rest of us, at the moment, Company_FinalMatch is a many-to-many mapping. You can see in my screenshot above that there are lots of redundant mappings in this table, but we need a one-to-many mapping if we are going to get any practical use out of this. The below code can do exactly that. The “one” in the one-to-many will be picked somewhat arbitrarily, unfortunately, so if you want to pick a specific name for your name cleanup, it will require some extra work.
SELECT ROW_NUMBER() OVER (ORDER BY FirstID) AS RowNo ,x.* INTO #CFM FROM ( SELECT DISTINCT CASE WHEN [FirstCompanyID] > [SecondCompanyID] THEN [SecondCompanyID] ELSE [FirstCompanyID] END AS FirstID ,CASE WHEN [FirstCompanyID] > [SecondCompanyID] THEN [SecondCompanyName] ELSE [FirstCompanyName] END AS FirstName ,CASE WHEN [FirstCompanyID] > [SecondCompanyID] THEN [FirstCompanyID] ELSE [SecondCompanyID] END AS SecondID ,CASE WHEN [FirstCompanyID] > [SecondCompanyID] THEN [FirstCompanyName] ELSE [SecondCompanyName] END AS SecondName ,[City] FROM [TestDB].[dbo].[Company_FinalMatch] WHERE FirstCompanyID <> SecondCompanyID ) x DECLARE @MaxCounter AS INT ,@Counter AS INT SET @MaxCounter = ( SELECT MAX(RowNo) FROM #CFM ) SELECT FirstID, FirstName, SecondID, SecondName INTO #Final FROM #CFM WHERE RowNo = 1 SET @Counter = 2 WHILE @Counter <= @MaxCounter BEGIN INSERT INTO #Final SELECT FirstID, FirstName, SecondID, SecondName FROM #CFM WHERE RowNo = @Counter AND FirstID NOT IN ( SELECT SecondID FROM #Final ) SET @Counter = @Counter + 1 END DROP TABLE #CFM --Alter this code to insert the results from #Final into a permanent table --SELECT * --INTO --(Your table here) --FROM #Final --DROP TABLE #Final
Note that at the end of this, the final one-to-many mapping is stored in a temp table called #Final. You will have to edit the commented part of my code to insert this data into a permanent table.
With the SQL that I’ve given you in this post, you will have the power to systematically map two lists of names to each other, or to clean up a messy list of names. If you are finding that the FuzzyMatch stored procedure I gave is taking too long, you’ve probably got quite a bit of data. You will want to read Part 2 of my Messy Data trilogy, which will detail how to make this process much faster (though slightly less thorough).
Brian Pohl
Posted by Brian Pohl