As a Data Analyst at Decisive Data, I spend a lot of time taking disorganized, poorly maintained, and generally ugly data and straightening it out so that our clients can make sense of it. To review, here are some posts worth checking out on the topic of data maintenance:
- Part 1 of this trilogy introduces Fuzzy Matching, a way to assign a “match score” to two strings that represents how closely the strings match. This can be used to either match two lists to each other, or to take one list and consolidate its entries.
- Part 2 builds upon Part 1, using essentially the same Fuzzy Matching method, but introducing an alternate method that can accommodate much more data. While it is significantly faster and more efficient, it is less accurate.
- This post discusses best practices for keeping your data organized and clean so that it will be usable even as your BI evolves.
- Here you can learn how to normalize comma-separated data. This is a SQL trick to split comma-separated values into multiple rows.
- Lastly, you can use the SQL in this post to systematically find Unicode characters. This is important because many applications are not configured to support Unicode characters, and trying to display them can cause bugs or failure.
For this post, I’m going to address my least favorite of all data quality issues: handwritten data. This doesn’t necessarily mean somebody used a pen and paper, but rather they entered a string into a textbox, and there were no restrictions or guidelines for what they should enter. For example, take a look at this healthcare data. I was trying to perform a calculation based on patients’ hemoglobin levels that would require all of them to be decimals, like 5.0 or 9.2. But the field in which users input the hemoglobin levels is a textbox. Here is a small sample of some of the values.
6.3
Machine states "low hemoglobin, no results avail"
> 14%
HgbA1c is 6.6. This is second one, confirming diabetes DX.
11
Pre-diabetic range
Hgb A 1 C = 5.6 . / RM
The best we can hope for in a situation like this is to be able to pull the numbers out of these text strings, convert them to numeric data types, and then perform our calculations. There is nothing we can do about the results that are strictly text, such as “Pre-diabetic range”. I’m going to introduce you to some SQL functions that can clean up these strings; a quick demonstration of each follows the list below.
- SUBSTRING() is the foundation of all string cleanup. It returns a portion of a string using this syntax: SUBSTRING(Original string, Starting position, Length of new string).
- LEN() returns the length of a string. This is especially useful when used in combination with SUBSTRING(). Its syntax is LEN(String).
- LEFT() and RIGHT() are similar to SUBSTRING(), except that instead of specifying a starting position, they start from the beginning or the end of the string, respectively. Both have the syntax (Original string, Length of new string).
- LTRIM() and RTRIM() remove spaces from the start and end of a string, respectively. For example, LTRIM(‘ Active’) would return ‘Active’.
- CHARINDEX() allows you to determine if a string contains a certain character. If it does, it will return an integer that indicates the position in the string of your desired character. If the character is not present, it returns zero. The syntax is CHARINDEX(Character you’re searching for, String you’re searching in, [Position to start searching from]). The last parameter is optional; the default is 1 (the beginning of the string).
- ISNUMERIC() tells you if a string can be successfully converted into a numeric data type. If so, it returns one, and if not, zero. The syntax is ISNUMERIC(String).
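If you’d like to see these in action before we apply them, here is a quick scratch query (T-SQL, using made-up literal strings purely for illustration), with each function’s return value noted in the comments:

SELECT SUBSTRING('Hgb = 5.6', 7, 3)      -- '5.6'
     , LEN('Hgb = 5.6')                  -- 9
     , LEFT('Hgb = 5.6', 3)              -- 'Hgb'
     , RIGHT('Hgb = 5.6', 3)             -- '5.6'
     , LTRIM('   Active')                -- 'Active'
     , CHARINDEX('.', 'Hgb = 5.6', 1)    -- 8 (the position of the decimal point)
     , ISNUMERIC('5.6')                  -- 1
     , ISNUMERIC('Pre-diabetic range')   -- 0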
So let’s tackle this healthcare data. Since the hemoglobin readings are, when entered properly, decimals, I figure an approach that can get me pretty far is to search for a decimal point in the string. Once I know where the decimal point is, I can take the characters adjacent to it, and convert them into a decimal number.
SELECT Value FROM dbo.LabResults WHERE CHARINDEX('.', Value, 1) <> 0
That will get me all of the results that have a decimal somewhere inside them:
That’s a good start. Now let’s try using some of these nifty functions:
SELECT Value
     , SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4)
FROM dbo.LabResults
WHERE CHARINDEX('.', Value, 1) <> 0
With the WHERE clause, I filtered to only results that have a decimal somewhere in the string. Then, in the SELECT statement, the first column is the original string value. For the second column, I use SUBSTRING() to return the part of the string starting 2 characters before the decimal and ending one character after.
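To trace the arithmetic through one of the sample values above (a one-off illustration, not part of the final query): the first decimal point in “HgbA1c is 6.6. …” sits at position 12, so the substring starts at position 10 and runs for 4 characters.

SELECT SUBSTRING('HgbA1c is 6.6. This is second one, confirming diabetes DX.',
                 CHARINDEX('.', 'HgbA1c is 6.6. This is second one, confirming diabetes DX.', 1) - 2,
                 4)   -- returns ' 6.6' (note the leading space)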
This is pretty close, but there are still a few problems: Rows 1 and 2 have a leading space before them. I could eliminate this by using LTRIM(). Row 3 worked perfectly. Row 6 doesn’t have a number in it at all, so I will remove it using an ISNUMERIC() function in my WHERE clause. Now, what do we do with Rows 4 and 5? They will require a CASE statement.
SELECT Value
     , SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4)
     , CASE WHEN ISNUMERIC(LEFT(SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4), 1)) = 1
            THEN SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4)
            ELSE SUBSTRING(Value, CHARINDEX('.', Value, 1) - 1, 3)
       END
FROM dbo.LabResults
WHERE CHARINDEX('.', Value, 1) <> 0
  AND ISNUMERIC(
        CASE WHEN ISNUMERIC(LEFT(SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4), 1)) = 1
             THEN SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4)
             ELSE SUBSTRING(Value, CHARINDEX('.', Value, 1) - 1, 3)
        END
      ) = 1
This CASE statement in the third column uses LEFT() to find the very first character of the second column, and then ISNUMERIC() checks whether that first character is a number. The result tells us whether we need 3 or 4 characters. This CASE statement corrected Rows 1, 2, 4, and 5 from above, and Row 6 was removed by applying ISNUMERIC() to the same CASE expression in the WHERE clause.
To wrap up this issue, I also added a UNION in which I searched for Values that did not contain a decimal point but were still numeric. This returned all of the values that were integers: “11”, “9”, etc.
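For reference, the combined query looks roughly like this (a sketch rather than the exact production query; the CleanValue alias is just an illustrative name):

SELECT Value
     , CASE WHEN ISNUMERIC(LEFT(SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4), 1)) = 1
            THEN SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4)
            ELSE SUBSTRING(Value, CHARINDEX('.', Value, 1) - 1, 3)
       END AS CleanValue
FROM dbo.LabResults
WHERE CHARINDEX('.', Value, 1) <> 0
  AND ISNUMERIC(
        CASE WHEN ISNUMERIC(LEFT(SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4), 1)) = 1
             THEN SUBSTRING(Value, CHARINDEX('.', Value, 1) - 2, 4)
             ELSE SUBSTRING(Value, CHARINDEX('.', Value, 1) - 1, 3)
        END
      ) = 1
UNION
SELECT Value
     , Value AS CleanValue   -- integers like '11' pass through as-is
FROM dbo.LabResults
WHERE CHARINDEX('.', Value, 1) = 0
  AND ISNUMERIC(Value) = 1

From there, wrapping the whole thing in an outer SELECT and applying something like CAST(LTRIM(CleanValue) AS DECIMAL(4,1)) produces a true numeric column to calculate against.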
So concludes the Cleaning Messy Data blog trilogy. I hope you learned some valuable tricks!
Posted by Brian Pohl