Our team at Intrepid Insight is in the middle of several pretty cool projects and leads with clients (stay tuned to hear more soon!), so I thought it was a good time to post some more blog-like content!
This week marks the end of my time as an associate economist at Welch Consulting. Much of my work at the company involved processing and cleaning data for use in statistical analyses. While I was working at Welch, I also worked part-time as a volunteer data analyst for the California Public Policy Lab. These experiences taught me what it really means to “clean data,” and before I get into my five tips, I think it is useful to give my definition for the often thrown around phrase:
Data Cleaning: The process of combining raw records into a format that can be used directly for an analysis.
A few great examples of common data cleaning tasks are:
- Creating numeric dates and times from strings (e.g., “July 4, 2018 10:00 AM” becomes the number of milliseconds since January 1, 1960 12:00 AM, which is Stata’s %tc encoding).
- Putting together (merging) data from several sources into one file that uses a common unit of observation (like days, employees, employee-days, etc).
- Assessing missing values in control variables prior to running a regression.
- Identifying gaps in data coverage, whether that is missing individuals or missing time periods or something else.
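The first task above, for instance, looks roughly like this in STATA (a sketch; the variable name eventstr and the exact display mask are illustrative):

```stata
* Parse a string timestamp into STATA's %tc datetime encoding
* (milliseconds since 01jan1960 00:00). "eventstr" is a
* hypothetical variable holding values like "July 4, 2018 10:00 AM".
* Note the "double": %tc values overflow the default float type.
gen double event_tc = clock(eventstr, "MDY hm")
format event_tc %tc

* Always check how many values failed to parse:
count if missing(event_tc) & !missing(eventstr)
```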
Surprisingly, data cleaning is quite time-intensive and often not something that can be automated. In my experience, the skill set that makes someone good at data cleaning is not exactly the same as the skill set that makes someone a good researcher or a good analyst. There is overlap, of course – creativity helps in overcoming data cleaning challenges just as it does in developing analytical models. But being a good analyst often requires one to be alert to the big picture, and to make simplifying assumptions when necessary, whereas data cleaning is all about the odd cases – in fact, I think I have spent the vast majority of my time dealing with the 0.1% of observations that do not quite look like the other 99.9%.
With all that in mind, here are my top five data cleaning tips that I learned the hard way.
Note that because most of my experience with data cleaning is in STATA, these tips are a little more geared towards STATA.
- Develop a consistent organization scheme for programs and folders: I am not a naturally organized person, and when I first started doing data analysis I was beyond irritated by my colleagues’ suggestions to number programs and adopt a shared folder structure. It felt tedious and like a waste of time. I begrudgingly began to follow along. Then one day I came back to a project and felt like Gandalf in Moria: I had no memory of this place. I had to backtrack through timestamps to associate work product with programs and search through code just to recall what I was doing. I learned that organizing your folders and programs is like taking notes: it’s not just that you always have the notes to go back to; taking the notes also improves your memory of the project.
- Tools: None. Just determination and an operating system.
- When in doubt, keep a count: Data cleaning involves a lot of seemingly harmless steps. Dropping non-analyzable observations here, incorporating outside information to update a variable there. But at the end of it all, the working data that is produced is often significantly different from the original inputs. With large data sets that require lots of processing, data cleaning programs can take hours or even days to run. And because most data cleaning tasks operate sequentially (you do task A, then on the resulting data you do task B, and so on), the effect of a given change at one point will not always be the same as if that change were performed at a different point. One too many times I found myself scrambling to figure out just how many observations were excluded for a particular reason, or just how many conflicts were resolved in a particular fashion. Keeping track of these counts is particularly difficult to do after the fact from a log if the data was not collapsed to the final unit of analysis during processing. For example, say you are interested in looking at person-weeks, but your data was originally person-days. In STATA or other languages, log files will record counts of dropped observations, but you (or your client) may be interested in the number of dropped relevant units. More on this in the tools section!
- Tools: The biggest and most basic thing you can do to keep track of drops and counts is to embed “count” commands in your program. If you are encountering the issue where the current observation unit is not the relevant unit, you can make use of the SSC command “unique,” discussed more here. Once it is installed, you can even reference the number of distinct observations in the return value “r(sum).”
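Putting the two together, the counting pattern might look like this (a sketch; the file persondays.dta and the variables exclude_flag and person_id are hypothetical):

```stata
* Count both raw rows and relevant units before dropping anything.
use persondays, clear                    // hypothetical person-day file

count if exclude_flag == 1
di "person-day rows to drop: " r(N)

* "unique" is a user-written command: ssc install unique.
* Depending on your version, the count is stored in r(sum) or r(unique).
unique person_id if exclude_flag == 1
di "distinct persons affected: " r(sum)

drop if exclude_flag == 1
```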
- Original sort order always matters, even when it doesn’t: Okay, admittedly that was hyperbole. There are some situations where the original sort order is mostly irrelevant, but there are a surprising number of situations where it is helpful to keep it around. That is why I always keep it around until it is no longer possible to do so. I have found that more often than not the sort order of the original data contains useful information. Sometimes, when data does not have time stamps, it can reveal a potential order in which the data was collected. Other times it may assist in grouping data that was appended together from many sources prior to its current form. Finally, it provides a defensible, clear way to keep data when you encounter duplicative data (in terms of the variables you care about) that cannot be meaningfully resolved by outside knowledge or other variables. It sounds a lot better to say “When I encountered duplicates, I kept the observation that was recorded first in the original data” than “I randomly kept the observation that happened to be sorted first in the current state of the data.”
- Tools: I make it a habit of storing a variable called ord or raworder or orig_ord that records the order of the data after conversion into STATA format. I also store the source filename or a source-file abbreviation. Finally, I try to avoid the “duplicates drop” command unless the observations are complete duplicates across all variables; duplicates drop can always be replaced by a well-worded “bysort” command.
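As a sketch, the habit looks like this (the file name and the key variables person_id and pay_date are made up for illustration):

```stata
* Immediately after converting the raw file, freeze its order.
import delimited using "payroll_2018.csv", clear   // hypothetical file
gen long raworder = _n
gen source = "payroll_2018.csv"

* Later, rather than "duplicates drop" on the key variables,
* keep the observation recorded first in the original data:
bysort person_id pay_date (raworder): keep if _n == 1
```

Because raworder enters the bysort sort key, the row kept within each duplicate group is deterministic and easy to defend.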
- Be wary of old commands, and be cautious of new commands: This is less of a problem in a centrally controlled, closed environment like STATA, but it is still a concern. I have frequently found that old commands get deprecated or quietly change behavior between versions, while new, user-written commands have not always been thoroughly tested. In either case, it is worth verifying a command’s behavior on a small sample before trusting it with your data.
- Just because an analysis ran, does not mean the data was cleaned correctly: This is one I constantly have to catch myself on. It is a real joy to run a program and watch the lines of code fly beautifully across the screen without any errors, ending with a cleanly closed log. It has a sense of finality, just like hitting the send button on an email with a completed work product. But just because you generated something, and just because that something has reasonable looking numbers, does not mean that the underlying data was cleaned well. Strange values in your end result are a hint that something is probably wrong, but their absence is not a confirmation that everything is right. The old adage that “even a broken clock is right twice a day” rings true here as well: the data analyst could add $10 billion to the highest salary and subtract $10 billion from the lowest salary and a reported average would still be the true average. This is why final results can never be the final validation of initial processes.
- Tools: Embedding “assert” commands throughout a program to verify logic prior to analysis helps check that the data was cleaned correctly. They may make your program less flexible, but for targeted projects this is a low cost to pay.
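A few defensive checks along these lines might look like the following (a sketch; person_id, week, and salary are hypothetical variables, and the salary bound is an arbitrary illustration):

```stata
* Fail loudly, before the analysis stage, if cleaning went wrong.
assert !missing(person_id)                        // no unidentified rows
assert inrange(salary, 0, 1e7) if !missing(salary) // illustrative bound
isid person_id week        // exactly one row per person-week, else error
```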
PS – Tip #5 is especially relevant when running logistic, probit or other binary choice models in STATA. These types of regressions will often run and report results even if the data has a number of significant problems, like multicollinearity, perfect prediction, and separation. This is because STATA has a pretty aggressive (when compared to other software) default behavior for handling these problems: namely it will drop variables and observations from the specification prior to fitting the model. Because errors in data cleaning can create sparse categorical variables or continuous variables with clusters of high outliers, which in turn can artificially cause issues like separation, perfect prediction, and multicollinearity, proper data cleaning is extra important, and the fact that the regression ran is very weak evidence of good data cleaning.
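One cheap sanity check in this spirit is to compare the estimation sample to the full data after fitting (a sketch; the model and variable names are made up):

```stata
* If STATA dropped observations (perfect prediction, missing values,
* etc.), e(sample) will flag fewer rows than the dataset contains.
logit outcome x1 x2 i.group
count if e(sample)
di "rows used in the fit: " r(N) " of " _N
assert r(N) == _N   // fails loudly if anything was silently excluded
```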