With data scientist being hailed as the sexiest job of the 21st century, there has been an influx of “big data” companies, visualization tools, and other products. But unless the input data is cleaned and managed, all these products are fairly useless. As the saying goes: Garbage in, garbage out!
This blog post is about the un-sexy aspects of data science – the practices of cleaning and managing large volumes of data.
When do you need to clean data?
Cleaning is needed when there are possible errors in the data. In a pen and paper world, errors happen at multiple stages during a survey process – while entering the data into the questionnaire, translating from one language to another, and digitizing the data. It takes lot of effort, time, and money to correct the anomalies in the data.
To cite an example of how unclean data might impact final analysis, “China” can be written as “China” and “PRC” (People’s Republic of China) at two different places in the same dataset. Now the entire dataset needs to be standardized so that tabulating it will count them as one country (#duplicates).
Similarly, if data from one source gives population information and another one gives number of deaths due to natural disasters, then the country names need to be standardized (best if coded) in both the data sets before integrating (#merging) them.
Another example would be comparing data for districts between two censuses. A district might be merged or split between two censuses although the district name remains the same. For instance, Tarn Taran district was carved out of Amritsar district in Punjab in 2006. In Census 2011, these are two separate districts, but the data for “Amritsar” district from Census 2001 cannot be compared with that of 2011 since it geographically is not the same unit. In such cases, special care should be taken in cleaning before comparing the data for the district between the two periods.
Traditionally, to get rid of data entry problems, many surveyors also adhere to double data entry methods, where the same data is entered twice by two different teams and then the results are compared to get rid of problem cases. This is obviously a time and cost intensive alternative.
But all these problems get reduced to a large extent by smart data collection tools. At Atlan, our Android-based data collection tool Collect has been built by learning from existing problems arising out of manual data collection process. For example, features such as a smart skipping pattern, multiple visits, and multi-conditional error checking have helped to greatly improve data quality and reduce errors.
How to clean data
Manual cleaning should be the last resort of a data scientist, and it should only be considered when the concerned data sets are small. Moreover, every organization should have its own standardized data cleaning protocol. Organizations can use tools like Excel VBA or Python to clean hundreds of sheets of cluttered but patterned data sets in small amounts of time, which would otherwise take months to clean.
Irrespective of data collection tools (pen and paper or software tools), cleaning the data is always necessary.
1. Check for missing values
The first step is to check for missing values and outliers. There are standard ways of treating missing value problems. For example:
- Replace the missing values by average (mean, median, mode) of same variable for other observations. You can also replace the value by the 3rd quartile or 1st quartile values, depending on the shape of the distribution of data. Read more about how to replace missing values in this NCBI paper.
- Sometimes there can be subjective treatment of missing value problems. For example, missing value for a state or region can be filled by values from the nearest state or regions if both the areas replicate similar patterns in other parameters.
2. Check for outliers
If some values of the variable are far away, in either direction, from overall data distribution, then they are considered outliers. Outliers can be genuine or can be due to errors such as data errors, sampling error, standardization failure, etc.
One of the common practices for getting rid of outlier problems is to do a capping and flooring for the variable. For example, if one is analyzing an opinion poll data for voter preferences aimed at people aged 18 or more and it turns out that there are few observations for age less than 18, then in that case age can be replaced as 18 – hence a floor. Similarly, one can set a cap value after looking at the distribution of the data. Read more in this NESUG paper.
3. Typographical errors
Typographical errors can happen in cases of manual entries. Codes can help get rid of this problem.
For example, if there is a question about marital status of the respondents, typographic errors can lead to multiple representation such as “Single” for “Unmarried” or “Maried” (as a typo for “Married”). Codes, such as 1 for “Married” and 2 for “Unmarried”, can help in standardizing these types of questions.
4. Data referencing
Referencing helps in improving credibility of data. It involves checking the values against some standard reference data.
For example, if the Census says that the average household size in India is around 5, but your survey gives the average household size as 3, then the credibility of your data will be questioned.
How to manage data
Once the data is cleaned, its management comes into picture. Data management helps in organizing the data in a way that improves efficiency in work.
1. Building a great directory structure
This is often the most overlooked aspect of data management. Most folks working with data end up having to learn this the hard way! A good directory structure or file naming convention increases the efficiency by 10 times.
There is no right or wrong way of doing these things, but life will be easy when best practices are followed. For instance, many times people name their file as ‘final’ only to learn that it will go through many more rounds of revision! Instead, having a ‘file version’ is a better practice for naming files.
The directory structure has to be such that both the parent directory and child directory should be self-explanatory and the name should not be too long. Long names may create problems when putting the ‘path’ in a programming environment for working with data.
Similarly, good file names go a long way in identifying the files at a later stage. One of the good practices in file naming is to incorporate version control along with title of the file, date, and author name. For example, the file name “Poverty_Data_V1.1_20Jan2015_Sanjib” says that this file contains data on poverty, created by Sanjib on 20th January, with the version 1.1. Any modification of this file can be renamed as version 1.3 or 2, with a new date.
This process will help with easier referencing in the long run. If we compare this file naming convention with other file names like only “Poverty Data” or “Poverty Data1”, the benefit of the former naming convention is obvious.
2. Creating a data dump
If you are working on spreadsheet-type applications, you should create a data dump before working. If anything goes wrong with the spreadsheet data, the original backup is essential.
3. Fine-tune variable names
Once you take the data dump and set up the file name, the variable names in the file should be fine-tuned.
The best practices in variable naming convention is to keep names short and self-explanatory. Short variable names make typing the names easier, which is helpful in the programming environment for analysis or variable transformation. Transformed variable names can be even longer than the original variables. So unless the original variable is small, the transformed variable name will take an ugly shape.
For example, rural male population percentage of a person belonging to scheduled caste category can have a variable like “pop_sc_r_m_per” or “population_sc_male_rural_percentage”. The value of former variable naming convention will be obvious in the long run.
Ultimately, there is no good way or bad way to name variables but, as the file and data size increases, optimized variable names help in improving efficiency.
4. Create a data dictionary
Once variable names are defined, you need to create metadata (data about your data). Metadata gives a detailed description of variables along with other information about the data set.
For example, a survey data set might have more than 100,000 observations. The metadata can be helpful for documenting the data and its problems. It might include information like missing data for a variable, which would take someone hours to find. If this problem is not documented, then future users do not have to waste hours to find the same problem in the data set.
Thus, we highly recommend creating metadata to help future users save time and increase productivity.
5. Tagging data
Once the data is cleaned and set up in a way for larger use, the data set needs tagging. If each variable can be tagged by every possible dictionary meaning, then a smart search engine will be able to find the right data set for the right variables.