Using sensor meta-data

In SurveyCTO 2.50, we introduced new, experimental sensor meta-data field types. If you want to collect better and more accurate data, you can use device sensor data to get valuable insights about whether data is being collected according to plan.

To help you make the most of this feature, this article explains how to combine sensor meta-data with SurveyCTO’s other unique quality control features: geodata (which can be secretly collected in the background on Android), form duration, audio audits, text audits, automated quality checks, the ability to review and correct data, and data visualisation in the Data Explorer. Together, these give you a powerful and efficient system for detecting and correcting issues in your data collection process, so that you collect the most accurate data possible.

Before you dive into this article, make sure you are familiar with the basics! Read the product documentation on this feature here. It is a prerequisite for properly understanding the explanations and recommendations below.

What do the new field types do?

Android devices can come with a number of sensors beyond GPS, including an accelerometer, gyroscope, light sensor, and microphone, among others. The new field types use these sensors to capture data during the survey that can give users an idea of:

  • The light conditions around the device.
  • How much the device moved.
  • How loud the sounds were around the device.
  • The pitch of the sounds around the device.
  • An estimate of whether a conversation was taking place around the device.

Why is it useful to know the above information?

  • Different types of data collection projects will involve different light conditions. For example, a household interview probably takes place inside the home of the respondent, in light conditions that may vary – but the intensity of that light will still be lower than the intensity of full sunlight outdoors. In general, if an observation should take place mostly indoors, outliers for light levels might be worth checking on.
  • Most surveys involve an enumerator sitting, reading questions off a tablet screen, and typing answers. The enumerator might show the screen to the respondent from time to time, but on the whole the tablet shouldn’t move a whole lot. A high level of movement might suggest the device was being held while walking, or while driving in a vehicle, which may not have been according to plan. In contrast, in a form where enumerators are expected to walk around a school, clinic or farm, completing a series of observations on a checklist, you might expect plenty of movement.
  • The loudness of sounds around the device during a survey can tell you something about the conditions under which data was generated. Certainly, human voices fall inside a range in terms of volume. Is it expected that forms would be completed in loud places, with lots of background noise? Or in the context of your project, would that be more of an outlier? In either case, this is something you can get an indication of using device sensor data.
  • The most common use case for SurveyCTO forms is to collect responses from a respondent during a face-to-face interview. This involves a conversation: the enumerator reads out questions and the respondent answers. Hence we’ve tried to measure the percentage of time during a survey that a conversation seems to be happening, using the volume and pitch of the sounds around an Android device as indicators. More than other statistics and stream data, the conversation sensor data should be regarded with some skepticism, as it is possible that the current version of this feature will falsely identify non-human voice sounds as conversation in some cases. As explained in the documentation, these are experimental features, so they should be used with that in mind.

While each type of sensor data can be useful individually, they are even more powerful when used in combination. For example, in a household study, one might expect mostly low light readings (reflecting being indoors), only nominal movement, and moderate sound levels, with a conversation taking place for the majority of the time that the form is open.

What does that look like in a form, in practice?

You might use the following sensor_statistic fields with these parameters:

  • sensor_statistic pct_light_level_between, with the appearance “min=5;max=500”, to get the percentage of time spent inside the lux range that likely reflects being indoors.
  • sensor_statistic pct_sound_level_between, with the appearance “min=0;max=60”, for an indication of the amount of time spent in a mostly quiet space where a conversation could be heard, or with the appearance “min=80”, for the amount of time that loud sounds were detectable, during which you might doubt whether the respondent’s answers could be heard properly and recorded accurately.
  • sensor_statistic pct_movement_between, with the appearance “min=25;max=65”, for a rough sense of how long the device experienced low to moderate movement, which would be consistent with it being held in someone’s hands. Equally, one might try using the sensor_statistic mean_movement field to see how it compares.
  • sensor_statistic pct_conversation, to see what percentage of time a conversation is taking place. As noted in the product documentation, this is an experimental feature, so it may not always work well.

The above is by no means a firm recommendation on specifically how to use sensor statistics. Rather, this is a suggestion on where to start, if you happen to be doing household interviews.
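To build intuition for what a “percentage of time between” statistic represents, the following Python sketch computes a comparable figure from a list of readings and a “min=5;max=500” style range. It is an illustration only, not SurveyCTO’s implementation: the function names parse_range and pct_between are hypothetical, the readings are made up, samples are assumed to be evenly spaced in time, and treating the bounds as inclusive is an assumption.

```python
from typing import Optional, Sequence, Tuple


def parse_range(appearance: str) -> Tuple[Optional[float], Optional[float]]:
    """Parse a 'min=5;max=500' style string into (min, max).

    A bound that is not given (as in 'min=80') is returned as None.
    """
    bounds = {"min": None, "max": None}
    for part in appearance.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        key = key.strip()
        if key in bounds:
            bounds[key] = float(value)
    return bounds["min"], bounds["max"]


def pct_between(readings: Sequence[float], appearance: str) -> float:
    """Percentage of readings inside the range.

    Assumptions: readings are evenly spaced in time, and both bounds
    are inclusive.
    """
    if not readings:
        raise ValueError("no readings")
    low, high = parse_range(appearance)
    inside = sum(
        1
        for r in readings
        if (low is None or r >= low) and (high is None or r <= high)
    )
    return 100.0 * inside / len(readings)


# Made-up light readings in lux: five of the eight fall in the indoor range.
lux = [3, 40, 120, 300, 800, 12000, 150, 90]
print(pct_between(lux, "min=5;max=500"))  # 62.5
```

A low percentage for the indoor range in a household interview would then be a prompt to look more closely at that submission, not proof of a problem.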

Important caveats

A lot of natural variation is possible in sensor statistics: different devices have sensors of different sensitivities, and the ranges of light, movement, and sound conditions that are normal for one project might be abnormal for another. So, if you are interested in sensor statistics, you first need to establish a baseline. Add these fields to your forms and observe some initial surveys. Then compare the sensor data gathered with the conditions you observed, to learn what light, movement, and sound statistics are recorded in normal surveys. This will give you a sense of what thresholds to use for your sensor statistics and what value ranges should be flagged as outliers.
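One simple way to turn pilot observations into a candidate “normal” range is to take percentiles of the values recorded in surveys you know went according to plan. The Python sketch below does this with the standard library. The function name baseline_range, the choice of the 5th and 95th percentiles, and the pilot values are all assumptions for illustration; pick cut-offs that suit your own project.

```python
import statistics
from typing import List, Tuple


def baseline_range(values: List[float]) -> Tuple[float, float]:
    """Return the (5th, 95th) percentiles of pilot values.

    The result is a candidate range for "normal" readings. The
    percentile choice is an assumption, not a recommendation.
    """
    if len(values) < 2:
        raise ValueError("need at least two pilot observations")
    # n=20 yields cut points at 5%, 10%, ..., 95%. The "inclusive"
    # method keeps every cut point within the observed min and max.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]


# Made-up mean light levels (lux) from ten observed pilot interviews.
pilot_mean_light = [50, 80, 120, 150, 200, 250, 90, 110, 175, 60]
low, high = baseline_range(pilot_mean_light)
print(f"candidate normal range: {low:.1f} to {high:.1f} lux")
```

With only a handful of pilot surveys the percentiles will be rough, so treat the result as a starting point and revisit it as more data comes in.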

In general, you also have a choice as to whether to flag desirable or undesirable ranges of values. Also consider adding a few questions for the enumerator at the end of your form designs to help gauge their sense of how long they were indoors or outdoors, or for how long a conversation was taking place while the form was open, to give additional context to any readings you get.

Given the experimental nature of these fields, aside from taking this data with a pinch of salt, we also ask that you include the related sensor_stream fields with the default time period (by leaving the field’s appearance blank) alongside your sensor_statistic fields, even if you don’t have any specific plans to analyze the sensor stream data. It can still be useful for you to review a few sensor_stream files as a check against the sensor statistics being reported. Stream data can also be helpful for our team to review when you are giving us advice on how to improve these features. So, at least for the first project where you experiment with sensor data, add the relevant sensor_stream fields that inform the sensor_statistic fields you’re using. For example, if you’re using the sensor_statistic pct_conversation field, you should include the sensor_stream conversation field as well as the sensor_stream sound_level and sensor_stream sound_pitch fields in your form.

Combining sensor data with other features to create a cohesive data quality and review system

Once you have an idea of what good and bad sensor statistic values look like for your devices and your project, you can start designing controls that make use of this data. For example, you can combine this sensor data with SurveyCTO’s automated quality checks, which are statistical checks that establish thresholds for flagging outlying submissions for review. A very popular check amongst our users is to flag surveys that have very short durations compared to the sample as a whole. Now, you can do the same with sensor statistic data. Using the household interview example from above, let’s say that during a pilot survey mean light readings fell between 50 and 250 lux. You might then decide to flag records where the lux value is greater than 500, just to allow for some variation (keeping in mind that an overcast day is about 1,000 lux and full sunlight can be 10,000+ lux).
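Automated quality checks are configured in SurveyCTO itself, but the same rule can be tried out offline on exported data while you settle on a threshold. The Python sketch below is one way to do that, under stated assumptions: the column names KEY and mean_light, the function name flag_bright_submissions, and the sample rows are hypothetical, and blank readings (for example, from a device without a light sensor) are skipped, not flagged.

```python
import csv
import io
from typing import List

LUX_THRESHOLD = 500.0  # threshold from the household pilot example


def flag_bright_submissions(
    csv_text: str,
    field: str = "mean_light",  # hypothetical column name
    threshold: float = LUX_THRESHOLD,
) -> List[str]:
    """Return the KEY of each row whose light statistic exceeds the threshold."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(field) or "").strip()
        if not raw:
            continue  # no reading recorded for this submission
        if float(raw) > threshold:
            flagged.append(row["KEY"])
    return flagged


# Made-up export: two submissions look brighter than an indoor interview.
sample = """KEY,mean_light
uuid:a1,120
uuid:b2,950
uuid:c3,
uuid:d4,10400
"""
print(flag_bright_submissions(sample))  # ['uuid:b2', 'uuid:d4']
```

Running a rule like this against pilot data first shows how many submissions a given threshold would flag before you commit to it.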

You can improve your process even further: use the review and corrections workflow feature to automatically flag records for review based on automated quality checks. That is, submissions with quality check violations (for example, those with lux values above 500) would automatically be moved to the “awaiting review” queue in the Data Explorer.

Finally, if you are using the pct_conversation statistic, you may also want to record audio audits to provide context and validate the findings of that statistic. You might add several short audio audits to your form for review: firstly to confirm expected pct_conversation values, and secondly to investigate values that deviate from what you expect.

This sort of sensor statistic + automated quality check + review and corrections workflow approach can be an invaluable data review system that automates a lot of your work, so that you can focus your time on the specific records among thousands that need the closest scrutiny.

Over time, we aim to use machine learning technology to automate even more of the quality control review that users do, and sensor data is critical to that effort (though it is still some time away). If you’re excited about sensor data and have experiences to share, we would love to hear from you, so do get in touch! You can help make sensor meta-data as useful as possible for all SurveyCTO users.
