Data collection is a minefield of errors. It doesn’t matter whether you’re a researcher with one survey in the field, an NGO with 10 data collection drives per month, or a market research agency with tens of thousands of surveyors — any survey is full of opportunities for errors to slip into your data.
Surveyors may be poorly trained, or well-trained ones may misunderstand a question’s responses or units. They may mistype a respondent’s answer or enter fake data to save time. Respondents may lie to save face, or they may answer randomly if they don’t understand questions. And anyone involved may get confused or distracted, even during short surveys.
In short, errors are inevitable. So how can you deal with them?
Extensively training your data collection team and conducting a thorough pilot are a great start, and they are essential steps of any good survey. But they won’t completely end data errors. To reduce errors even further, turn to data validations. Digital data collection tools have diverse question types and technological tricks to help you easily build validations into your survey and collect spotless data.
Keep reading for an overview of all the different data validations you can add to any survey. This guide is a bit long, so pour yourself some coffee, settle in, and use the table of contents to the right to skip to your favorite section.
What are data validations?
Data validations are checks built into a survey that allow you to control what data is submitted.
The most basic data validations are question types themselves. These are a simple way to ensure that data for each question can only be submitted in one standardized format.
For example, a basic question asking for someone’s age could receive the answers “17”, “seventeen”, and “seventine”. All represent the same value, but a computer won’t know that. A numerical question restricts this to “17”.
Specialized question types are also helpful for validating data. For example, an email question will automatically check whether the entered text is a valid email address. A phone number question can check whether the phone number has the right number of digits, based on its country code.
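As a rough sketch of the kind of check an email question type might run behind the scenes (real tools use far more thorough validation; the function name and pattern here are illustrative, not any particular tool's implementation):

```python
import re

# Minimal email shape: something@something.something, no spaces.
# Real-world email validation is much more involved than this.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(value):
    """Return True if the value roughly matches an email address."""
    return bool(EMAIL_RE.match(value))

looks_like_email("a@b.com")        # True
looks_like_email("not-an-email")   # False
```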
Data validations also go way beyond different question types. You can place limits within questions, specifying what data can be entered. You can create questions just to keep an eye on surveyors. You can control which questions appear based on someone’s previous answers. You can even flag and re-collect data while your survey is still in progress.
Each type of data validation has pros and cons. Use the right data validations at the right time, and you’ll be amazed how much data quality increases and data cleaning time decreases.
Make your most important questions mandatory
Use mandatory questions for fields that you absolutely need for analysis. Just be sure that every respondent will understand the question and know the answer.
Anyone who has filled out an online form is familiar with this data validation — mandatory questions. Users are required to fill mandatory questions before they can submit a survey, but they can choose whether to answer non-mandatory questions.
Mandatory questions are helpful because they guarantee that every respondent submits the most critical data. Trying to analyze data without a UID (like an Aadhaar number or employee ID) or draw conclusions without key impact questions, for example, is just a waste of time.
A crucial step in survey design is making sure that every unit in the survey — person, household, program, location, etc. — has a UID. Learn more about how to add them to any survey.
Mandatory questions are simple, but they can be incredibly frustrating if not set up correctly. Imagine a survey that asks Americans what browser they use to access the internet. Mark this question mandatory, and the 11% of Americans who don’t use the internet will be stuck. Some will enter fake data (Netscape it is!) to complete the survey, which skews the final data analysis. Other people will abandon the survey, which means losing the rest of their data.
Bulletproof your multiple choice questions
Multiple choice questions seem easy, but they’re prone to bad data. There are several data validations you can add to improve data quality — dynamic or static limits on the number of choices someone can select, randomized options, and logical checks on special options like “All of the above”.
One of the most common question types, multiple choice questions (MCQs) seem simple enough. With a set list of options, it should be a breeze to keep respondents from submitting bad data, right? If only.
Even the most basic MCQ needs data validations. Without them, someone can select options that don’t make sense, like a 20-year-old who selects both 13-18 and 19-24 as their age bracket.
Here are a few key data validations for MCQs that any data collector should know.
Set a maximum and minimum number of choices
Most MCQs need a limit on the number of choices a respondent can select. Limits help to reduce contradictory or illogical data, since surveyors won’t be able to submit data until they’ve selected the correct number of choices.
The minimum number of choices could be zero. (However, it’s usually better to include an “NA” or “None of the above” option to cover this possibility, rather than letting the question go unfilled.) The maximum shouldn’t be greater than the number of options.
Sometimes the maximum and minimum may be the same. Asking someone to report their total yearly income? Restrict respondents to one income bracket. Being in zero or two income brackets isn’t possible, as long as the brackets are MECE (mutually exclusive and collectively exhaustive).
Pro tip: If the question itself mentions limits, it’s easy to think you don’t need the limits built into the MCQ options. After all, people will just follow the question, right? Unfortunately, no. For example, our job applications don’t support data validations on multiple choice questions. One of our applications asks candidates for their 3 strongest and 3 weakest skills. We’ve gotten all sorts of unexpected responses, like candidates who select no options or 5 options!
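The income-bracket rule above boils down to a simple count check. A minimal sketch in Python (the function name is hypothetical; a good data collection tool applies this check for you):

```python
# Hypothetical validator for a static limit on MCQ selections.
def valid_choice_count(selected, min_choices, max_choices):
    """True if the number of selected options falls within the limits."""
    return min_choices <= len(selected) <= max_choices

# An income-bracket question needs exactly one answer:
valid_choice_count(["$25k-$50k"], 1, 1)   # True
valid_choice_count([], 1, 1)              # False
valid_choice_count(["a", "b"], 1, 1)      # False
```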
Use dynamic limits
Want to get fancy with minimums and maximums? Make them dynamic, which means the limit is based on the answer to previous questions.
For example, imagine you’re asking which mobile phone brands someone has purchased. To get more accurate data, you can set the maximum number of options to the number of phones they’ve purchased before. Someone who has only purchased 3 mobile phones can’t have purchased from 6 mobile phone brands.
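In code, a dynamic limit just means the maximum comes from an earlier answer instead of a constant. A sketch of the phone-brand example (names are illustrative):

```python
def valid_brand_count(selected_brands, phones_purchased):
    # Dynamic maximum: someone can't have bought from more brands
    # than the number of phones they've purchased.
    return 1 <= len(selected_brands) <= phones_purchased

valid_brand_count(["Nokia", "Samsung"], phones_purchased=3)        # True
valid_brand_count(["A", "B", "C", "D", "E", "F"], phones_purchased=3)  # False
```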
Add logic to “None of the above” and “All of the above”
“None of the above” and “All of the above” are helpful for ensuring that an MCQ is MECE (mutually exclusive and collectively exhaustive). But it’s important to make sure that these options behave correctly, or they can lead to confusing data.
Imagine that you’re surveying homeowners, and one question asks which household appliances they own. These special options should definitely be included. “None of the above” is helpful for new homeowners with few assets, and “All of the above” is helpful for wealthy homeowners with fully stocked homes.
If these options behave like normal — where users can select any combination of options they want — it can lead to errors. What if a user selects “Fridge”, “All of the above” and “None of the above”? How will analysts interpret that data? They’ll most likely end up throwing it out.
We logically know how these options should behave. If someone selects “All of the above”, they shouldn’t be able to select anything else. It’s all already covered. The same is true for “None of the above”. Though this seems intuitive, it’s important to make sure your survey includes this logic. (P.S. Good data collection tools, like our app Collect, will have this logic built into special options for MCQs.)
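The logic is small enough to sketch directly. If your tool doesn’t build it in, the check looks something like this (hypothetical function, assuming the special options are labeled exactly as shown):

```python
SPECIALS = {"All of the above", "None of the above"}

def valid_special_options(selected):
    # If a special option is chosen, it must be the only selection.
    if SPECIALS & set(selected):
        return len(selected) == 1
    return True

valid_special_options(["Fridge", "All of the above"])   # False
valid_special_options(["None of the above"])            # True
valid_special_options(["Fridge", "Oven"])               # True
```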
Randomize the order of your options
People can only hold a small amount of information in mind at any time. This makes us terrible at choosing from lists, since we have trouble remembering all the items in them.
People are likely to pick one of the first options in a list (which is called “primacy bias”). People generally rush to finish surveys as quickly as possible, so they pay less attention as lists of options go on. For example, in New York City’s Democratic primary elections and North Dakota’s general elections, candidates received more votes when they were listed first on the ballot.
Alternatively, people are also likely to pick one of the last options they hear (called “recency bias”), since it’s in their short-term memory.
A great way to mitigate these biases is randomizing the order of MCQ options. This means that the order of your options will randomly change for each respondent. (“All of the above” or “None of the above” will always come as the last option though.)
Randomized order is great for most unordered options, like a list of assets that a household has or a list of books. It helps to ensure that surveyors aren’t just going through the motions when they select answers. However, randomization shouldn’t be used if the options have an inherent order, such as age or income brackets.
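A minimal sketch of this shuffle, with the special options pinned to the end (function and constant names are illustrative, not any specific tool's API):

```python
import random

# Special options that should always stay at the end of the list.
PINNED_LAST = ["None of the above", "All of the above"]

def randomized_options(options):
    """Shuffle the regular options; keep special options last."""
    regular = [o for o in options if o not in PINNED_LAST]
    random.shuffle(regular)
    return regular + [o for o in options if o in PINNED_LAST]
```

Each respondent would get a fresh call to this function, so the regular options appear in a different order every time.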
Set maximums and minimums for numerical questions
If you know the valid range for numbers in your survey, build in hard or soft limits to keep out bad numbers.
Numerical questions — which only take numbers as answers — help you control what numbers can be submitted. There are two types of numerical data validations.
Hard value-based validations (also called “hard limits”) set minimum and maximum values that data must fall between. For example, if you are surveying schoolchildren, you might select 5 as the minimum age and 18 as the maximum age that can be entered.
Soft value-based validations (also called “soft limits”) are similar but more flexible. Soft limits let you set minimum and maximum values. If data collectors try to submit data outside of this range, they’ll get a warning. However, they can still choose to submit the data anyway.
Soft limits are useful when there is an expected range of values, but you expect some outliers. For instance, some children start school early or finish late, so you might make their age question a soft limit. Then surveyors can submit data for children who fall outside of the expected age range, after checking to make sure that this age is correct.
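Hard and soft limits can be sketched as a three-way check. A hypothetical example for the schoolchildren age question (the hard bounds here are assumptions for illustration):

```python
def check_age(age, hard_min=0, hard_max=120, soft_min=5, soft_max=18):
    """Classify an age entry against hard and soft limits."""
    if not hard_min <= age <= hard_max:
        return "reject"   # hard limit: submission is blocked
    if not soft_min <= age <= soft_max:
        return "warn"     # soft limit: surveyor can confirm and submit anyway
    return "ok"

check_age(7)    # "ok"
check_age(20)   # "warn"  (a late finisher: confirm, then submit)
check_age(-3)   # "reject"
```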
Limit the number of digits or characters
Use digit-based data validations to limit the number of digits someone can enter into a numerical question. Use character-based data validations to limit the number of characters entered in a text question.
Imagine that you’re asking for a phone number, but your survey tool doesn’t have a specific phone number question. How can you ensure that you are getting a valid phone number? If you know your country’s phone number has 10 digits, you can set a minimum value of 1,000,000,000 and maximum of 9,999,999,999.
However, that’s a pretty clumsy workaround. There’s a much easier solution — set a limit on the number of digits.
Digit-based validations help you improve data quality by limiting the number of digits in a numerical question. Similarly, character-based validations limit the number of characters in a text question. These data validations are useful when you know exactly how many digits or characters something should have — like an ID code or phone number.
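A digit-based check is much simpler than the min/max workaround above. A sketch for a 10-digit phone number (hypothetical function name; this assumes the number is entered without spaces or country code):

```python
def valid_fixed_digits(value, digits=10):
    # Exactly `digits` numeric characters, nothing else.
    return value.isdigit() and len(value) == digits

valid_fixed_digits("9876543210")   # True
valid_fixed_digits("98765")        # False (too short)
valid_fixed_digits("98765abcde")   # False (non-digit characters)
```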
Add data validation questions
Photo, geo-tagged location, audio, or signature questions can be added to any survey to check if surveyors are following survey protocols and entering accurate data.
Adding data validations to existing questions is helpful. However, that doesn’t stop surveyors from sitting at home and filling out your forms themselves, or from breaking survey protocols when they are in the field. There are several question types you can add to catch these rogue surveyors and verify that your data actually comes from respondents.
Location questions allow surveyors to geotag the location of each survey, even if they don’t have internet. This helps you ensure that surveyors are filling forms from the correct locations.
Besides being a source of valuable data, photo questions are a great way to guard against bad data. Asking surveyors to take a picture of a person, location, or any other data point proves that a data collector actually engaged with what they were supposed to. It’s also useful for verifying that responses are unique, so there won’t be any duplicates in your data set.
Geotagged photo questions kill two birds with one stone. The photo shows that a surveyor engaged with what they were supposed to, while the geotagged location shows exactly where that photo was taken.
Like photo questions, there are two types of audio features that are great for verifying data. Audio questions, which record a snippet of audio, can be used to verify a particular piece of data and hear it in the respondents’ own words.
You can also use audio audits, which randomly record background audio while surveys are in progress. These verify that surveyors are asking questions accurately and recording what respondents actually say.
Audio questions are also helpful for collecting qualitative data. Learn more about how to carry out qualitative research through interviews, focus group discussions or observations.
Signature questions, which allow respondents to submit their signature, can be useful for verifying that a user has actually submitted a form. They also can be used to confirm or acknowledge any agreements or allow a user to give informed consent for a survey.
No one should carry out research or collect personal data without a good understanding of what informed consent is and how to get it. Learn more here.
Customize your survey for different users
Skip logic, which ensures that each question is only shown to the right people, is a great tool for making surveys shorter, more targeted, and less susceptible to bad data.
Different users have different needs, so why give them the same survey? Long surveys with irrelevant questions open the door for bad data.
Skip logic (also called conditionality) is a simple way to create surveys that flow seamlessly for respondents and surveyors alike. It allows you to show or hide questions based on the answers to previous questions. For example, with skip logic, only women will be asked if they’re pregnant. Men can skip straight to the next question, without having to waste time marking “NA”.
Skip logic can improve data quality because it reduces the length of a survey, ensures that people don’t answer a question that’s not meant for them, and lets you make questions more relevant for the people who see them. It also lets a survey have more mandatory questions (since each question will only be shown to the right people), which can lead to less missing data.
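Conceptually, skip logic attaches a condition on earlier answers to each question. A minimal sketch of the pregnancy example (the data structure here is illustrative, not any particular tool's format):

```python
# Each question may carry a "show_if" condition over earlier answers.
questions = [
    {"id": "sex"},
    {"id": "pregnant",
     "show_if": lambda answers: answers.get("sex") == "female"},
]

def visible_questions(questions, answers):
    """Return the ids of questions whose conditions pass."""
    return [q["id"] for q in questions
            if "show_if" not in q or q["show_if"](answers)]

visible_questions(questions, {"sex": "male"})     # ["sex"]
visible_questions(questions, {"sex": "female"})   # ["sex", "pregnant"]
```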
Flag and correct bad data on the fly
As data is being collected, you might spot incorrect or incomplete responses. Flagging lets you mark them so that they can be resurveyed back in the field.
While checking your data, you might come across iffy responses, like a school with zero rooms. If you discover this during data cleaning, you’ll be stuck asking what the data means and whether you should keep it. But what if you could flag iffy responses before the data is finalized?
Flagging (also called “resurveying”) is a feature built for that situation. It lets you mark questionable data as it’s being collected. Then your surveyors can recheck and, if necessary, re-collect that data point or survey immediately.
For example, you can send surveyors back to the zero-room school to see exactly what’s going on. Was the zero a mistyped “10”? Or is the school actually a teacher and students under a tree? In either case, the surveyor can check and re-collect accurate data immediately, leading to fewer questions during data cleaning and analysis.
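Flagging rules can be as simple as checks run over each record as it arrives. A hypothetical rule for the zero-room school (field names are made up for illustration):

```python
def flags_for(record):
    """Return a list of reasons this record should be resurveyed."""
    flags = []
    # Example rule: a school reporting zero rooms is suspicious.
    if record.get("num_rooms") == 0:
        flags.append("num_rooms: zero rooms reported; recheck on site")
    return flags

flags_for({"school": "A", "num_rooms": 0})    # one flag, send surveyor back
flags_for({"school": "B", "num_rooms": 10})   # [] — nothing to recheck
```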
Are data validations right for you?
It’s always a good idea to add data validations to your survey. They’re a great way to improve data quality. However, it’s important to pilot your data validations to make sure they’re correct before you roll them out.
The short answer — absolutely. Data validations are a simple, affordable way to increase data quality. All you need is a digital data collection tool (which you should be using anyway!) and a bit of time spent building a better survey.
Though there is one catch — data validations need to be added thoughtfully. Create mandatory questions that people won’t be able to answer, add incorrect character limits, or ask for the wrong number of MCQ options, and your respondents will be stuck and confused.
Not sure if you’ve added the right data validations? Pilot them. Run them by your team, sector experts, and field staff. Test them out with a small group of people. Listen to your surveyors and learn what questions are confusing them. Watch the pilot data and see what’s going wrong. The more you can test your data validations before you roll them out, the more likely you’ll have them right when you start collecting data at scale.
Piloting helps you identify and fix issues that would have led to poor quality data. Learn more about how to use pilots to test all aspects of your survey.
Piloting and flagging can seem like a lot of work up front, but they’re worth the effort. The time you spend on piloting and resurveying will be saved several times over during data cleaning and analysis.
This blog was co-authored by Carson Whisler and Christine Garcia.