For data collected through both paper and digital surveys, you should conduct some basic data checks before carrying out thorough data cleaning. Keep reading for 4 basic data checks that you can use to check for underlying errors in almost any data set.
Number of Respondents vs. Rows
For any kind of survey, you should always match the number of rows in your data to the number of respondents surveyed to ensure data completeness.
For example, for a survey involving 500 households, the first step will be to ensure the number of rows in your data set equals 500.
This is a good first check. However, it doesn’t discount instances of simultaneous exclusion and duplication of a certain number of households.
You can use a simple hack to check if all your unique identifiers (UIDs) are unique.
Copy the values in the column to a different sheet in Excel. Select the entire column, and then click on the “Remove Duplicates” option under the “Data” tab of the Excel ribbon. A pop-up like this will appear:
You know you’re in the clear if you receive a pop-up like this:
Not sure what a UID is or why it’s important? Check out our blog on how to build unique identifiers (UIDs) into any survey.
Number of Questions vs. Columns
Quickly match the total number of columns in your spreadsheet with the total number of questions in your survey.
Make sure to be well-versed with the structure of your survey — have a thorough understanding of all question and response types.
It will be of immense help when you validate the responses received for these questions. You can then quickly match the total number of columns in your spreadsheet with the total number of questions. Similar to the rows check done earlier, this will help you identify missed or duplicate values.
Make sure that all your lat-long combinations are within the desired range of geo-coordinates.
Often, surveys require recording the location of the respondent. The common practice is to add the location in the form of a latitude-longitude (lat-long) combination.
Sometimes, especially in a digital survey, there could be outliers in the location due to issues with the collection device’s accuracy. For example, while conducting a survey in a certain region, you should be aware of the north, south, east, and west boundaries of the region and should also run a basic check on whether the collected geo-coordinates are within this predefined range.
Pro Tip: Say that your region is the rectangular area whose vertices are defined by the geo-coordinates
- (1.644745, 49.582651)
- (1.644745, 91.492333)
- (-9.773677, 91.492333)
- (-9.773677, 49.582651)
You can check if a point is in this area with the following formula:
=IF(AND(lat>-9.773677,lat<1.644745,long>49.582651,long<91.492333),"In area","Not in area")
Make sure that your time stamps are consistent and valid throughout the data set.
Time stamps can be in different formats. Some common formats are listed below:
- YYYY-MM-DD HH:MM:SS (Year-Month-Day Hour:Minutes:Seconds)
No matter what format you choose, it is important to keep it consistent throughout the data set. Also ensure that the second, minute, hour, day, month, and year are valid. For instance, in the DD-MM-YY format, the date 35-13-16 would be incorrect because the day and month cannot go beyond 31 and 12 respectively.
In case you need to convert from one format to another, you should be very careful, since this is a highly error-prone step. Make sure to check the date ranges before and after conversion to check if the minimum and maximum dates match.