Ever been tasked with analyzing data from a table when you have little-to-no understanding of the "origin" or "meaning" of its various columns?
- Perhaps your table has columns like "TypeID", "Code", or some other grouping or hierarchy and you need some quick insight into the cardinality of these groups.
- Perhaps you have a staging table that was imported directly from a CSV or Excel file, and you need to ETL this data into a target table that has columns with more appropriate data types and sizes.
- Perhaps the source file had pivoted, un-normalized data, where a list is stored in columns ending in "_1", "_2", and so on up to 10, 20 or beyond, and you need to quickly understand how many rows actually use more than just one or two.
Often, even if you are fortunate enough to be provided with a data dictionary, it can fall far short of being particularly useful, since it tends to document the structural aspects of each column, but not necessarily the topology/distribution of data within the column.
I consider this to be similar to the distinction between DDL (Data Definition Language), which defines the structure of a database table, and the statistics that an RDBMS maintains to represent the distribution of data within the table. For example, knowing that a million-row data set has only 5 distinct "TypeID" values in an integer column can be very useful information in understanding the nature of the data in the table.
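Before any tooling is in place, that kind of one-off check might look like the following (the table and column names here are purely hypothetical; this is the sort of per-column query that the rest of this post automates):

/* Hypothetical example: how many distinct TypeID values exist, and how are rows spread across them? */
SELECT COUNT(DISTINCT TypeID) AS DistinctTypeIDs
FROM dbo.SomeLargeTable

SELECT TypeID, COUNT(*) AS RowsPerType
FROM dbo.SomeLargeTable
GROUP BY TypeID
ORDER BY RowsPerType DESC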
Recently, I was involved in a project for a client in the healthcare industry where I was tasked with wrangling data from several (10-20) disparate data sources (typically CSV files). Though not incredibly large, these files were large enough (some > 1GB, with a few million rows) that I felt compelled to load them into an RDBMS (SQL Server in my case) in order to slice and dice the data more effectively.
Almost immediately, I realized that I needed insight into the data - especially the cardinality - so that I could determine appropriate data types and sizes, and also to aid in determining any unique composite keys. The latter could be deduced by identifying the columns with the greatest cardinality (and if the cardinality of a column matches the number of rows in the data set, then that column alone would likely be a unique key). But I had several tables, some of which had hundreds of columns each. That is not the type of analysis I wanted to do with manual queries against each column, so I wrote a meta-script that greatly reduced the monotony of the task. Here is a cleaned-up version wrapped up into a parameterized stored procedure…
CREATE PROCEDURE pGetColumnStats
(
     @ObjectName  SYSNAME
    ,@ColName     SYSNAME = NULL   /* NULL = ALL columns in table */
    ,@SampleSize  INT     = 0      /* 0 = ALL, any +ve value = SUBSET of that size */
)
AS
BEGIN
    SET NOCOUNT ON

    SET @SampleSize = ISNULL(@SampleSize, 0)

    DECLARE @ObjectID INT

    SELECT @ObjectID = o.object_id
      FROM sys.objects o
     WHERE o.object_id = OBJECT_ID(@ObjectName)
       AND o.type IN ('U','V')   /* Table, View */

    IF (@ObjectID IS NULL)
    BEGIN
        RAISERROR('The specified object name [%s] does not exist.', 10, 1, @ObjectName)
        RETURN   /* Severity 10 does not abort the batch, so exit explicitly */
    END

    DECLARE @ColStats TABLE
    (
         SeqNum          INT IDENTITY(1,1) NOT NULL
        ,ColName         SYSNAME           NOT NULL
        ,NumDistinctVals INT               NOT NULL
        ,NumNull         INT               NOT NULL
        ,NumEmpty        INT               NOT NULL
        ,MinLen          INT               NULL
        ,MaxLen          INT               NULL
        ,MinValueStr     NVARCHAR(MAX)     NULL
        ,MaxValueStr     NVARCHAR(MAX)     NULL
    )

    DECLARE ColCursor CURSOR FOR
        SELECT c.name
          FROM sys.columns c
         WHERE c.object_id = @ObjectID
           AND (@ColName IS NULL OR c.name = @ColName)
         ORDER BY c.column_id

    OPEN ColCursor

    DECLARE @ColNameTmp SYSNAME
           ,@ExecStr    VARCHAR(MAX)

    FETCH NEXT FROM ColCursor INTO @ColNameTmp

    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        /* Build one aggregate query per column via dynamic SQL */
        SET @ExecStr = '
            SELECT ''' + @ColNameTmp + '''
                  ,COUNT(DISTINCT [' + @ColNameTmp + ']) AS DistinctVals
                  ,SUM(CASE WHEN [' + @ColNameTmp + '] IS NULL THEN 1 ELSE 0 END) AS NumNullVals
                  ,SUM(CASE WHEN CAST([' + @ColNameTmp + '] AS VARCHAR(128)) = '''' THEN 1 ELSE 0 END) AS NumEmptyVals
                  ,MIN(LEN(NULLIF([' + @ColNameTmp + '],''''))) AS MinLen
                  ,MAX(LEN(NULLIF([' + @ColNameTmp + '],''''))) AS MaxLen
                  ,MIN(NULLIF([' + @ColNameTmp + '],'''')) AS MinVal
                  ,MAX(NULLIF([' + @ColNameTmp + '],'''')) AS MaxVal
            FROM '
            + CASE WHEN @SampleSize > 0
                   THEN '(SELECT TOP ' + CAST(@SampleSize AS VARCHAR(20)) + ' * FROM ' + @ObjectName + ' (nolock)) x'
                   ELSE @ObjectName + ' (nolock) x'
              END + ' '

        INSERT INTO @ColStats
        (
             ColName
            ,NumDistinctVals
            ,NumNull
            ,NumEmpty
            ,MinLen
            ,MaxLen
            ,MinValueStr
            ,MaxValueStr
        )
        EXEC(@ExecStr)   /* Execute dynamic SQL and return row as result set */

        FETCH NEXT FROM ColCursor INTO @ColNameTmp
    END

    CLOSE ColCursor
    DEALLOCATE ColCursor

    SELECT ColName
          ,NumDistinctVals
          ,NumNull
          ,NumEmpty
          ,MinLen
          ,MaxLen
          ,MinValueStr
          ,MaxValueStr
      FROM @ColStats
     ORDER BY SeqNum
END
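To make the dynamic SQL a little easier to follow, here is roughly what @ExecStr expands to for a single column when @SampleSize = 0 (the column name [State] is just a stand-in; the table name is the one used later in this post):

SELECT 'State'
      ,COUNT(DISTINCT [State]) AS DistinctVals
      ,SUM(CASE WHEN [State] IS NULL THEN 1 ELSE 0 END) AS NumNullVals
      ,SUM(CASE WHEN CAST([State] AS VARCHAR(128)) = '' THEN 1 ELSE 0 END) AS NumEmptyVals
      ,MIN(LEN(NULLIF([State],''))) AS MinLen
      ,MAX(LEN(NULLIF([State],''))) AS MaxLen
      ,MIN(NULLIF([State],'')) AS MinVal
      ,MAX(NULLIF([State],'')) AS MaxVal
FROM dbo.PhysicianCompare (nolock) x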
A few quick pointers:
• The @ObjectName parameter is intended (by me) to hold the name of either a table or view, which is why I included the "o.type IN ('U','V')" clause. I am not sure if there are any other object types that can be selected from. If so, have at it.
• Depending on the way the @ColName parameter is set, the stored procedure can be executed to process an ENTIRE TABLE (@ColName omitted or set to NULL), or just a SINGLE COLUMN OF A TABLE (@ColName set to a specific column name); example calls for both appear after this list.
• The @SampleSize parameter is something that I incorporated to get a quick-and-dirty (and therefore *incomplete*) sampling of the top N rows (no sort specified) in the table, after waiting far too long for the procedure to finish against my 4.4+ million row, 300+ column real-world table. And since this is intended to be as fast as possible, I did not incorporate logic to pick random rows, due to the computational expense of doing so. But if you find this script useful and want to extend it for that purpose, feel free to do so; it is a rather trivial modification (a sketch of two possible approaches follows this list).
• The MIN/MAX length logic is based on the length of the values cast as a string. So, for example, my DATE columns all come back as 10 (MM/DD/YYYY). Just be aware that the length is not literal bytes, but rather the number of characters in the string representation (after an IMPLICIT CAST).
• Similarly, the MIN/MAX value logic does an implicit conversion to a string, which might not be suitable in all situations (such as sorting 1, 11, 9, etc). This was done to avoid having empty strings come back as the MIN. But this logic can easily be tweaked to suit your needs.
• The NumDistinctVals count includes the empty string as a distinct value, but note that COUNT(DISTINCT) ignores NULLs entirely, so these counts may need to be adjusted for columns that contain NULLs or empty strings.
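To make the parameter combinations above concrete, here are a few example calls (the table name is the one used later in this post; the column name and sample size are only illustrative):

/* All columns, all rows */
EXEC pGetColumnStats @ObjectName = 'dbo.PhysicianCompare'

/* A single column (illustrative column name) */
EXEC pGetColumnStats @ObjectName = 'dbo.PhysicianCompare', @ColName = 'State'

/* All columns, but only an arbitrary 100,000-row sample */
EXEC pGetColumnStats @ObjectName = 'dbo.PhysicianCompare', @SampleSize = 100000

And if you do want random rows rather than an arbitrary TOP N, here is a minimal sketch of two patterns you could swap into the CASE expression that builds the FROM clause: TABLESAMPLE is cheap but approximate and page-based (and only works against tables, not views), while ORDER BY NEWID() is exact but requires scanning and sorting the whole table.

/* Cheap, approximate page-level sample of roughly 100,000 rows (tables only): */
SELECT COUNT(*) FROM dbo.PhysicianCompare TABLESAMPLE (100000 ROWS)

/* Exactly 100,000 random rows, at the cost of a full scan and sort: */
SELECT COUNT(*) FROM (SELECT TOP (100000) * FROM dbo.PhysicianCompare ORDER BY NEWID()) x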
Here is a sample execution against a real-world dataset (from the Medicare website) which has over 2.1 million (2,108,618) rows and 43 columns:
EXEC pGetColumnStats @ObjectName = 'dbo.PhysicianCompare'
And here is the output that it produced:
From this, we can quickly see that:
• There are no null values in any column, BUT there are empty strings in several columns. I will choose to treat these empty strings as NULLs in my target (scrubbed) table.
• No single column is truly a unique key (which was a bit of a surprise to me). Good thing to know from the get-go.
• Very few "Secondary specialty" values are specified (even #1 is 86% empty, and #2 through #4 are even emptier).
• The Credential group has only 22 distinct members (one of which is the empty string), and it is therefore a very low-cardinality group.
• The earliest graduation year is (apparently) 1773. Leads me to believe that we might have some data quality issues upstream.
• There are several Y/N flag columns (which can be converted to an appropriate bit/Boolean column in my target table; a scrubbing sketch follows this list).
• There is also one M/Y flag column, which seems curious. Knowing that there are only 2 distinct values for this column makes it easy to see that N is not a valid value here.
• There are 56 states represented in the data set, which was a bit of a surprise to me. A quick histogram of row count by state shows that US territories (Guam, Virgin Islands, etc) are included in this data set. That was not obvious to me until I saw this statistic.
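For reference, the "histogram of row count by state" mentioned above is just a simple GROUP BY (the column name is a guess at the actual header, so adjust it to match your staging table):

/* Row count per state, largest first (column name is illustrative). */
SELECT [State], COUNT(*) AS NumRows
FROM dbo.PhysicianCompare
GROUP BY [State]
ORDER BY NumRows DESC

And as a sketch of the scrubbing logic mentioned above for the empty strings and the Y/N flag columns (again, the column names are hypothetical), the transformation into the target table might look something like this:

/* Hypothetical column names: empty strings become NULL; Y/N flags become BIT (anything else stays NULL). */
SELECT NULLIF([Credential], '') AS Credential
      ,CASE [SomeYNFlag]
            WHEN 'Y' THEN CAST(1 AS BIT)
            WHEN 'N' THEN CAST(0 AS BIT)
            ELSE NULL
       END AS SomeYNFlag
FROM dbo.PhysicianCompare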
In closing, I have found tremendous value in gathering metadata statistics and sample values from data sets of which I have little-to-no prior knowledge, as an aid to understanding the data better. And I have found that using a meta-script to pull column information (from the system catalog in the RDBMS) given just the table name is an effective approach to the problem at hand. I hope that you find it to be of some value to you as well.
Darren Gardner