Enforcing unique study IDs - how to avoid duplicate IDs

How SurveyCTO identifies records internally

Before we get into the how-to portion of this article, I should clarify that even if you never set up a way to uniquely identify your own records, SurveyCTO has an internal system to prevent duplicate submissions. If you have ever taken a close look at your exported .csvs you’ve probably noticed the "KEY" column, and in that column, values that start with ‘uuid:’ followed by what might look like an indecipherable code. UUID is an acronym for “universally unique identifier” and when one is generated using standard methods, the chance of it matching any other in the world is so marginal as to functionally be zero. When you finalize a SurveyCTO survey, either in the SurveyCTO Collect app or via the web interface, SurveyCTO generates a UUID for that record that the server will use to make sure that there is only ever one copy of that record in your ‘master’ database. If you submit that same survey more than once, it will only be accepted the first time.
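The behaviour described above can be pictured with a short Python sketch. It only illustrates the idea and is not SurveyCTO's actual implementation; the Server class, its submit method, and the study_id field are hypothetical names used for this example.

```python
import uuid


class Server:
    """Toy model of a server that keeps one record per KEY (hypothetical)."""

    def __init__(self):
        self.records = {}  # KEY -> submitted data

    def submit(self, key, data):
        """Accept a submission only if its KEY has not been seen before."""
        if key in self.records:
            return False  # same record sent again: not stored a second time
        self.records[key] = data
        return True


# A KEY is generated once, when the form is finalized (modelled here with uuid4).
key = "uuid:" + str(uuid.uuid4())
server = Server()
first = server.submit(key, {"study_id": "1234"})
second = server.submit(key, {"study_id": "1234"})  # re-sending the same record

# A separately finalized form gets its own KEY, even with the same study ID.
other_key = "uuid:" + str(uuid.uuid4())
third = server.submit(other_key, {"study_id": "1234"})
```

In this sketch the re-sent record is rejected, while the separately finalized form is accepted even though it carries the same study ID, because only the KEY is compared.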

Note: KEY is one of a few additional columns that appear in exported data. To understand additional columns including KEY, read more here: understanding the format of exported data.

Strategies for preventing duplicate entries

What you have in mind as a unique ID is probably different than the automatically generated UUID. You're probably thinking in terms of your study ID system, right? One household or individual has a study ID and you wish to prevent the same household or individual from being assessed/interviewed a second time, and you wish to prevent the same study ID from being used again in error, for the wrong household or individual.

When a blank form opens, it is blind as to what is on the server and what is being recorded inside other blank instances of the same form. As a result, there is no 100% airtight solution for guarding against duplicate study IDs. There are, however, measures for minimising the problem and making duplicates less likely.

Firstly, there isn't a direct feature for the server to control submissions according to a duplicate value which isn't the KEY value. The KEY is actually determined at the moment of finalization of the form, and that is the only factor the server takes into account when deciding whether to accept a submission or not. It is not possible to designate another field in the form which contains your study ID for the server to monitor.

Even if there were such a feature, there would be unavoidable conflicts that you'd need to work through. For example, let’s say your enumerators work offline for a few days. Enumerator A uploads submissions first, with study ID 1234 amongst them. Then enumerator B uploads records after, also including a submission with study ID 1234. A system like this might declare the first instance of ID 1234 as valid and any others that follow invalid. That might not be true though! What if the first instance of ID 1234 was an error? In this case you wouldn’t want to discard one of the instances - you'd want to receive that data to protect it and resolve conflicts later.
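Since both submissions are accepted, duplicates like this are resolved after the fact, in the exported data. The following Python sketch shows one way to flag them. It is a minimal illustration that assumes the export has been read into a list of rows; the study_id and enumerator column names are assumptions, and only KEY comes from the export format described earlier.

```python
from collections import defaultdict


def find_duplicate_study_ids(submissions):
    """Group submissions by study ID and return the IDs that occur more than once."""
    by_id = defaultdict(list)
    for row in submissions:
        by_id[row["study_id"]].append(row["KEY"])
    return {sid: keys for sid, keys in by_id.items() if len(keys) > 1}


# Hypothetical exported rows: both enumerators used study ID 1234 while offline.
submissions = [
    {"KEY": "uuid:aaa", "study_id": "1234", "enumerator": "A"},
    {"KEY": "uuid:bbb", "study_id": "5678", "enumerator": "A"},
    {"KEY": "uuid:ccc", "study_id": "1234", "enumerator": "B"},
]
duplicates = find_duplicate_study_ids(submissions)
```

The result maps each repeated study ID to the KEY values of the records that share it, so both records stay available while someone decides which one is correct.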

So, keeping that in mind, here are some strategies that you can consider depending on the specifics of your project.

Baseline or new respondents

If this is a baseline project where you are visiting new respondents for the first time, you might want to create IDs using values in your form using the concat() function in a calculate field. This could include parts of the date and time, area levels (e.g. province, district, village) as part of a cascading select, and even a random number or sequence.
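As a rough illustration of the kind of ID that concat() could assemble, here is a Python sketch of the same idea. The location codes, the timestamp format, the separator, and the build_study_id name are all assumptions made for this example. In a form, the equivalent would be a calculate field that joins the corresponding field values.

```python
from datetime import datetime


def build_study_id(province, district, village, started_at, sequence):
    """Join location codes, a timestamp, and a sequence number into one ID."""
    stamp = started_at.strftime("%Y%m%d%H%M")
    return "-".join([province, district, village, stamp, f"{sequence:03d}"])


# Hypothetical location codes chosen through a cascading select.
example_id = build_study_id("P2", "D14", "V07", datetime(2024, 3, 5, 9, 30), 12)
```

An ID built this way is longer than a plain number, but each part narrows down where and when the record was created.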

Alternatively, you can rely upon a randomly generated ID. In our experience, a random number of 7 digits is sufficiently random to avoid duplicate IDs. The simplest way to do this is with SurveyCTO's uuid() function in a calculate field, which creates an ID that is a random combination of numbers and letters. uuid() should be used with once() (as in "once(uuid())") so that the ID is generated only once and does not change afterwards. Give uuid() an integer value to get a random ID of that length (e.g. "once(uuid(7))" to return a 7-character random ID). Alternatively, click here to see an example form demonstrating a longer process which can be used to generate a more customisable ID. In this example form, a 6-character random ID is generated, and you can choose a custom ID length.
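To sanity-check an ID length against your expected sample size, you can estimate the chance of a collision with the standard birthday-problem approximation. The Python sketch below generates a random ID and computes that estimate. The 36-symbol alphabet is an assumption made for illustration (the exact character set used by uuid() is not specified here), and random_id and collision_probability are hypothetical helper names.

```python
import math
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols (assumed)


def random_id(length=7):
    """Return a random ID of the given length drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(n_ids, id_space):
    """Birthday-problem approximation: chance that at least two of n_ids collide."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2.0 * id_space))


sample = random_id(7)
p_digits = collision_probability(1000, 10 ** 7)  # 7 numeric digits
p_alnum = collision_probability(1000, 36 ** 7)   # 7 letters or digits
```

For 1,000 IDs, this approximation gives roughly a 5% chance of at least one collision among purely numeric 7-digit IDs, and a far smaller chance with a 36-symbol alphabet. Whether a given length is enough therefore depends on how many IDs you expect to generate.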

However, if you do want to incorporate a randomly generated ID, we’d suggest using some kind of location information in the ID as well. If you have to revisit the same household again, having a hint about the location in the ID could help surveyors be sure they are pulling up the correct record when they revisit the respondent. Of course there are other ways to ensure this, but the more fail-safes you have against mistakes, the more straightforward your monitoring and data-cleaning steps will be.  

Something else to take into account is that if you are collecting baseline information and expect that your project might need to revisit the same respondents at a later date, you should set yourself up for success now. Collect as much information as possible so that your future surveyors will be able to find the same person or household again. Phone numbers for household members (and even neighbors, if you might not revisit for a long time), GPS points, and pictures of the front door (if you are operating in an area where street addresses aren’t the norm) can all be useful later on when surveyors need to verify that they’re revisiting the correct household.

Revisiting respondents

Initial design

You might be at a stage of your project where you are working with pre-determined, more user-friendly study IDs. With such IDs it’s harder to guarantee uniqueness but there are some approaches you can try. First, intelligent form design goes a long way here. Giving your enumerators a few opportunities to narrow down the list of respondents across many levels of identifying information can cut down on a lot of mistakes.

For example, assuming you are working in many locations throughout a country, allowing your enumerators to select from a final list that has been narrowed down to a town, village, or even smaller unit (all the way down to individual households or respondents), instead of a higher level like district, will help prevent mistakes. The longer the list your surveyors are selecting from, the more potential there is for error. Cascading selects and filtering can help with this. And once surveyors have made a selection, provide them with enough pre-loaded information, such as names and ages of household members, identifying features of the household, etc., to confirm they’ve found the correct respondent.

Something else to consider is how much identifying information you provide to your surveyors. Did you collect pictures or GPS data last time? Once a surveyor thinks they’ve identified the correct household or person, you can include a pre-loaded picture of the correct person, or a picture of the front of their home. If you have GPS data from a previous round of data collection, take a look at this article about using Google Maps to help with follow-ups.

Using server datasets

See this method illustrated in this sample form. Please read the following and test the sample form. Either save a copy of the sample form in your Google Drive or download as an Excel workbook.

If you wanted to incorporate another layer of precaution, SurveyCTO's advanced publishing with server datasets feature (available on our Professional Premium plan) offers a good but limited approach. Server datasets are intermediary table repositories for data which sit on the server. Forms can publish data into server datasets and forms can pre-load data from server datasets, just like you would from an attached CSV file.

So, the thing to do would be to publish your study IDs to a server dataset. At the same time, the form will be set up to pre-load data from this same server dataset. Imagine that the study ID is a 4-digit numeric value, recorded in a text field with the "numbers" appearance. This same field has a constraint like this:

string-length(pulldata('studyid', 'id_key', 'id_key', .)) = 0

The following is happening here:

  • The current value entered into the field (represented by “.” in the 4th parameter in the pulldata() expression) is being used to pull a study ID value ("id_key") of the same value.
  • This value is being tested with string-length(). If there's no match, then the string length is zero, which means that the study ID value in question has not yet been submitted to this server dataset, i.e. there is no duplicate.
  • However, if there is a match, this means that the study ID value has been submitted already. Thus, the string length will be greater than 0 (or not equal to 0). In this case, you know that there was a match. The constraint is violated in this scenario, and the enumerator will not be allowed to proceed using that ID.
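The logic of this constraint can be traced with a small Python model. This is a sketch for illustration only, not SurveyCTO code. The pulldata function below is a simplified stand-in that returns an empty string when nothing matches, and the rows in studyid are hypothetical.

```python
def pulldata(dataset, return_column, lookup_column, lookup_value):
    """Return return_column from the first row whose lookup_column matches, else ''."""
    for row in dataset:
        if row.get(lookup_column) == lookup_value:
            return row.get(return_column, "")
    return ""


def constraint_passes(dataset, entered_id):
    """Mirror of: string-length(pulldata('studyid', 'id_key', 'id_key', .)) = 0"""
    return len(pulldata(dataset, "id_key", "id_key", entered_id)) == 0


# Local copy of the 'studyid' server dataset as last downloaded (hypothetical rows).
studyid = [{"id_key": "1234"}, {"id_key": "2087"}]

new_id_ok = constraint_passes(studyid, "9999")    # no match, so the ID is allowed
used_id_ok = constraint_passes(studyid, "1234")   # match found, so the ID is blocked
```

Note that the check only sees the rows in the local copy of the dataset, so an ID submitted by someone else since the last download would still pass.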

The above approach has one main limitation: the delay in updating the local copy of the server dataset on each device. Server datasets are updated through form updates. To promote regular updates, enable the Auto download option and the Send/receive status option on the device under General Settings (click on the 3-dot icon in the top right-hand corner of the screen). The user will still have to install the form updates from the Install Form Update menu item that will appear at the top of the screen. Send/receive status provides a button on the main menu for synchronising with the server. Lastly, use the Get Blank Form menu to download the latest form version, which also updates the attached server dataset. However, because this isn't 100% automated, if enumerators can't (or forget to) update the form, the local copy of the server dataset will be out of date and won't prevent duplicates as described above. This approach also requires that your enumerators have fairly steady access to the internet so that they can both send and receive data updates on a regular basis.

To mitigate this limitation, you can take a slightly different approach: the form could instead be programmed to alert the user that the same study ID has already been submitted, and to display other pre-loaded data from the server dataset in a label, like the date of the first interview with that study ID and the name of the enumerator. This will help the user with decision-making. This softer approach would allow the user to proceed anyway, but they would do so knowing that this study ID had already been submitted, prompting them to double-check the study ID and perhaps alert a team leader to investigate if and how a mix-up happened.
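A minimal Python sketch of this softer approach follows. It models the decision only, and the column names interview_date and enumerator, the message wording, and the duplicate_warning helper are all assumptions made for the example.

```python
def duplicate_warning(dataset, entered_id):
    """Return a warning message if entered_id is already in the dataset, else ''."""
    for row in dataset:
        if row["id_key"] == entered_id:
            return (
                f"Study ID {entered_id} was already submitted on "
                f"{row['interview_date']} by {row['enumerator']}. "
                "Please double-check before continuing."
            )
    return ""


# Hypothetical local copy of the server dataset with extra columns for the label.
studyid = [
    {"id_key": "1234", "interview_date": "2024-03-05", "enumerator": "Enumerator A"},
]
warning = duplicate_warning(studyid, "1234")
no_warning = duplicate_warning(studyid, "4321")
```

Unlike a constraint, the message does not block the enumerator. It gives them the context needed to decide whether to continue or to stop and check.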

Using case management

Another great way to prevent your enumerators from accidentally interviewing the wrong respondent is to use our case management feature. The standard SurveyCTO workflow is for enumerators to open blank versions of a form, enter the ID of the respondent, and then proceed with an interview. A case management workflow is slightly different. Instead of starting from a blank form, enumerators open a list of cases, each of which will have one or more forms associated with it. When a case is selected, the form is automatically populated with the correct study ID. The wrong case might still be selected in error, but confirmation steps, including pre-loading and displaying pre-existing information about that case, can help guard against this.

One of the features of case management is being able to assign specific cases to specific enumerators. This can be a powerful way to prevent duplicate IDs, since each enumerator will only have access to the cases specifically assigned to him or her. Forms using case management can also be designed to publish data to the cases dataset, updating that information. So once an enumerator has submitted a form for a particular case, the publication of that data into the cases server dataset could un-assign that case from that user, helping to prevent a form submission with the same ID a second time.
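The assign-then-un-assign idea can be sketched in Python as follows. This is a toy model rather than SurveyCTO's implementation. The id and users column names and the cases_for and submit_case helpers are assumptions used for illustration.

```python
def cases_for(cases, enumerator):
    """Return the case IDs currently assigned to the given enumerator."""
    return [c["id"] for c in cases if enumerator in c["users"]]


def submit_case(cases, case_id, enumerator):
    """Record a submission and un-assign the case from the submitting enumerator."""
    for case in cases:
        if case["id"] == case_id and enumerator in case["users"]:
            case["users"].remove(enumerator)
            return True
    return False  # case not assigned to this enumerator (or already submitted)


# Hypothetical cases dataset: each case lists the users who may open it.
cases = [
    {"id": "1234", "users": ["enum_a"]},
    {"id": "5678", "users": ["enum_b"]},
]
before = cases_for(cases, "enum_a")
wrong_case = submit_case(cases, "5678", "enum_a")   # not assigned to enum_a
first_try = submit_case(cases, "1234", "enum_a")    # accepted, then un-assigned
second_try = submit_case(cases, "1234", "enum_a")   # no longer in enum_a's list
after = cases_for(cases, "enum_a")
```

As with server datasets, the un-assignment only reaches a device once its local copy of the cases dataset has been updated.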

Other ideas

Otherwise, the way you organise your fieldwork with shorter user-friendly study ID systems can help guard against duplicates too. Does each enumerator get a list of all the IDs, or just the IDs assigned to them? They are less likely to use study IDs assigned to other enumerators if they don't get the whole list.

ID value sequence could help or hinder too. Are your ID values sequential, as in 0001, 0002, 0003, etc.? It is much easier to make mistakes with sequential IDs, because a slip of a single digit is likely to produce another valid ID. Non-sequential IDs can help lower the incidence of errors that might contribute to duplicates.

Also, if non-sequential IDs have a pattern to them, you can validate them with a constraint. For example, the constraint “. mod 3 = 2” would require that the remainder after dividing the ID by 3 equals 2, so only numbers in the sequence 2, 5, 8, 11, and so on are allowed (you'd need to work out the pattern and assign matching IDs in advance).
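The effect of such a constraint is easy to check with a few lines of Python. This sketch simply mirrors the arithmetic of the example constraint, and passes_mod_check is a hypothetical helper name.

```python
def passes_mod_check(study_id, divisor=3, remainder=2):
    """Mirror of the constraint '. mod 3 = 2' for a numeric study ID."""
    return study_id % divisor == remainder


# IDs between 1 and 20 that the constraint would accept.
valid_ids = [n for n in range(1, 21) if passes_mod_check(n)]

typed_correctly = passes_mod_check(1235)  # 1235 leaves remainder 2
off_by_one = passes_mod_check(1234)       # 1234 leaves remainder 1
```

With this pattern, an ID that is mistyped as the number immediately above or below it will always fail the check, because neither neighbour leaves a remainder of 2.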

Summary

We’ve covered quite a bit of material here, so here are some key takeaways:

  • No matter what approach you use, SurveyCTO internally maintains its own system for uniquely identifying records.
  • When you open a form on a device, it is unable to access any information about other forms on the same device, or any real-time data from form submissions sent to the server after the most recent time that data was transmitted between the device and the server (so there's no way to compare your ID value with other values directly until that data is submitted).
  • If you are collecting data at the baseline of a longitudinal study, you have a good opportunity to set yourself up for future success. Unless you are working with data that is sensitive enough to warrant extra precautions, assign IDs that provide some context just by looking at them. For example, if you are collecting data on a household in Region A -> Town B -> Division C, the ID could be something like ABC-[randomly_generated_number]. In the event of a mixup, partly descriptive IDs could make it easier to sort out problems.
  • If you are revisiting respondents, allow enumerators to home in on a small geographic area from which to select respondents by filtering choice lists. Enumerators should not have to scroll through long lists to find the correct respondent or household.
  • Server datasets can be useful for baseline and follow-up data collection. When an ID is entered or selected, it can be checked against a server dataset storing IDs that have already been collected in the current round. A form could be programmed so that if a surveyor attempts to interview a respondent with an ID that already exists in the server dataset, they can either be prevented from proceeding, or receive a warning that they can acknowledge before proceeding.
  • Making use of our case management workflow can prevent enumerators from submitting data using IDs which are not assigned to them as well as preventing enumerators from submitting data more than once with an ID that was assigned to them.

Depending on your project’s needs, you can use a combination of the above strategies to help prevent duplicate study IDs from ending up in your data.


The above features the following functions and operators: uuid(), concat(), string-length(), pulldata() and mod. Consult the product documentation on using expressions in your forms to read about these and other functions and operators. 
