From response to visualisation: how to streamline survey data processing

12 01

Before a response given in a survey makes it into a report or spectacular visualisation, it must go through a multistage data handling process. We provide a step-by-step guide to improving data acquisition and cleaning in order to accelerate the presentation of research outcomes and produce better quality responses.

Karol Piekarski

Process

With just three coordinators supervising a dozen interviewers, whose job is to collect several thousand responses from participants in dozens of different events, our study of participants in Katowice’s cultural events is a major logistical challenge. In addition to routine questions that recur at every event, the surveys contain questions specific to particular studies, such as whether the audience of the Silesian Jazz Festival also attends other Katowice jazz festivals.

Survey

The quality of the results tends to be inversely proportional to the complexity of the survey, because elaborate questions require more time and attention from respondents. For this reason, designing a clear graphic layout for the survey is an additional hurdle to overcome. We had to fit all the questions on one page without overwhelming the respondent with excess content or compromising the clarity and legibility of the form. Legibility can suffer further because of, for example, the respondent’s age, poor vision or the venue’s dimmed mood lighting. At the first few events, we observed how respondents coped with the questionnaires and introduced minor improvements accordingly.

Form / spreadsheets

To make it easier to transfer data from paper questionnaires to spreadsheets, we use Google Forms. With Google Spreadsheets, data entered simultaneously by several people ends up in one file, which helps to avoid copying mistakes and to control the consistency of the content being added. We already perform basic cleaning operations at this stage, such as removing unneeded columns or renaming them, though the main cleaning is carried out in an application called OpenRefine.
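The basic cleaning described above (dropping unneeded columns and renaming the rest) can also be sketched in a few lines of Python. This is only an illustration: the column names and sample rows below are hypothetical and are not taken from our actual form.

```python
import csv
import io

# Hypothetical raw export: a timestamp column plus the full question
# text used as column headers. None of this is real survey data.
RAW_EXPORT = """Timestamp,How old are you?,Where do you live?,Interviewer notes
2016-01-12 18:03:11,34,Katowice,
2016-01-12 18:05:40,27,Gliwice,filled in the dark
"""

# Columns to keep, mapped to short machine-friendly names (assumed names).
RENAME = {
    "How old are you?": "age",
    "Where do you live?": "city",
}


def basic_clean(raw_text, rename):
    """Keep only the columns listed in `rename` and give them new names."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {new: row[old].strip() for old, new in rename.items()}
        for row in reader
    ]


rows = basic_clean(RAW_EXPORT, RENAME)
```

Every column missing from the mapping is dropped, so the mapping doubles as a record of which questions are carried into the later stages.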

OpenRefine

This versatile and user-friendly application lets users efficiently accomplish tasks that would be too time-consuming or even impossible in Google Docs. At the same time, OpenRefine requires far less effort than writing custom scripts, which makes it a great tool for solving problems specific to particular data sets. For instance, students and graduates of the Medical University of Silesia gave the name of their university in a variety of ways: as abbreviations, the full official name or colloquial forms, written in uppercase, lowercase or with each word capitalised, and with occasional misspellings.

OpenRefine’s algorithms identify similarities between different records, and any unrecognised phrases can be changed to appropriate alternatives with just a few clicks. The same applies to other variables, such as university degrees pursued or names of cities and districts. The application also spots numbers and dates, organises numerical data into columns, provides quick data filtering and performs, automatically or semi-automatically, dozens of other operations useful at this stage, such as removing unnecessary characters from text fields. After checking the consistency of the data, we export it to a csv text file, the most universal tabular data format, which is compatible with all the applications used in later stages of data analysis and visualisation.
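To give a feel for how spelling variants of one name can be grouped, here is a minimal Python sketch in the spirit of the "fingerprint" key-collision method that OpenRefine offers for clustering. It is a simplified approximation and not OpenRefine's own code; the function names and the sample answers are assumptions made for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified key in the spirit of OpenRefine's fingerprint method."""
    # Decompose accented letters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then turn punctuation into spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Sort the unique tokens so word order and repeats do not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical answers, not real survey responses.
answers = [
    "Medical University of Silesia",
    "MEDICAL UNIVERSITY OF SILESIA",
    "medical university of silesia",
    "University of Silesia, Medical",
    "Medical  University of Silesia.",
    "SUM",
]
clusters = cluster(answers)
```

A key of this kind catches differences in case, punctuation, spacing and word order. Abbreviations and misspellings produce different keys, so they still need a manual merge or a distance-based method.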

CSV

We expect the data contained in csv files to be of benefit to two groups of users. The first are programming-savvy analysts, who will have no problem reading them (for people working in R, we have developed a starter kit in the form of a script that retrieves the current version of the database from the server and prepares the data frame in appropriate formats, allowing the user to start their own analyses instantly). The other group consists of researchers, journalists or commentators who are not necessarily familiar with programming, but would like to have a look at the data or produce simple calculations and visualisations. Applications well suited to this purpose include any spreadsheet, or even WTFcsv, a concise online app that gives an instant preview of the content of a csv file. One may also be tempted to use slightly more sophisticated tools that still require little skill, such as Exploratory (we will describe easy-to-use online tools for automatic data visualisation in a separate post).
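Our starter kit is written in R, but the same idea can be sketched in Python: read the csv, convert columns to suitable types and run a first summary. The example below is not the starter kit itself. It reads a small inline sample instead of the file on the server, and the column names and values are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the published csv file.
SAMPLE_CSV = """event,age,city
Silesian Jazz Festival,34,Katowice
Silesian Jazz Festival,,Gliwice
Another event,27,Katowice
"""


def load_responses(text):
    """Parse csv text into dicts, converting `age` to int (None if blank)."""
    responses = []
    for row in csv.DictReader(io.StringIO(text)):
        row["age"] = int(row["age"]) if row["age"] else None
        responses.append(row)
    return responses


responses = load_responses(SAMPLE_CSV)

# Two simple summaries: respondents per city and the mean age,
# ignoring respondents who left the age question blank.
by_city = Counter(r["city"] for r in responses)
ages = [r["age"] for r in responses if r["age"] is not None]
mean_age = sum(ages) / len(ages)
```

Blank answers become `None` rather than zero, so they are left out of the average instead of distorting it.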

Tabular Data Package

Problems with csv files may arise from the fact that the format was created for developers working with data processing tools rather than for users who want to quickly inspect the structure of the content. As a csv file offers no way to describe the necessary metadata in detail, we have opted for the Tabular Data Package, an industry standard for the publication of data in tabular form, created and promoted by the Open Knowledge Foundation. In addition to the data set itself, the package includes a second (json) file, which contains detailed metadata about the associated csv file – in our case, the survey questions – as well as information on data authorship and publication context.
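A minimal sketch of such a metadata file, built in Python, might look like the following. The property names (`resources`, `schema`, `fields`, `licenses`) follow the Data Package and Table Schema conventions as we understand them; the package name, file path, column names and question texts are hypothetical placeholders rather than our published descriptor.

```python
import json

# Hypothetical descriptor: all names, paths and question texts are placeholders.
descriptor = {
    "name": "example-survey",
    "title": "Example survey of cultural event participants",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "responses",
            "path": "responses.csv",
            "schema": {
                "fields": [
                    {
                        "name": "age",
                        "type": "integer",
                        "description": "How old are you?",
                    },
                    {
                        "name": "city",
                        "type": "string",
                        "description": "Where do you live?",
                    },
                ]
            },
        }
    ],
}


def undocumented_columns(descriptor, header):
    """Return csv columns that have no matching field in the descriptor."""
    fields = descriptor["resources"][0]["schema"]["fields"]
    documented = {field["name"] for field in fields}
    return [column for column in header if column not in documented]


# The text that would be saved next to the csv file as datapackage.json.
descriptor_json = json.dumps(descriptor, indent=2, ensure_ascii=False)
```

The helper `undocumented_columns` is our own illustrative addition. It shows one practical use of the descriptor: checking that every column in the csv header has a documented survey question behind it.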

GitHub

The files created in this way are published on GitHub, where the newest version of our database is kept. We share our data in an open format under the Creative Commons CC BY 4.0 licence, which means that it can be used in any way or form, even in commercial projects, as long as users credit its author, i.e. Medialab Katowice. This way, the content we have produced can take on a new lease of life, serving other researchers, employees of cultural institutions and all those looking for knowledge about Katowice’s cultural consumers.

Data provided in this form can then serve as input for analytical software and visualisation tools. Of course, we continue to improve the process of content acquisition and refinement. Currently, we are developing an online version of the survey, designed with the tablet-equipped interviewer in mind. It will also be available as part of a Facebook app, which we will launch to explore social media users participating in cultural events. By streamlining and automating data acquisition and processing, we seek not only to reduce costs and speed up the work while eliminating errors, but also to collect better-quality answers. Respondents are bound to get more involved in the study if completing the survey is engaging, with their answers immediately converted into an aesthetically pleasing visualisation.