Marcin Chojnacki − His main involvement in the project has been the acquisition of data from external database systems. Due to the nature and scope of Shared Cities, the data came primarily from Facebook (open API) and from newsletters about various Katowice events, such as silesiaspace.pl. This includes posts and comments from the pages suggested by the other members of the workgroup. What lies in the pipeline for him as part of the project is converting the format of the available data, so that the information from Facebook can be combined with the data describing the cultural circuits of other services, e.g. Katowice’s local Ultramaryna.
Dawid Górny − His responsibilities include, among others, the selection of tools and environments that the entire workgroup uses for communication and data work. The key criterion in selecting software is its capacity for exchanging and combining different types of data. Along with Marcin, he prepares scripts to download data from e.g. social media. He is also involved in developing new methods of obtaining data from other sources. In the latter part of the project, he is going to create tools and interfaces for data analysis and online presentation.
Marcin: Raw data requires appropriate processing. External services do not always allow direct access to the data we are interested in. We need to systematise it first, then save it locally on our server to keep its current state and have continuous access to it. This is a necessary prerequisite for the ‘proper’ research to begin.
Dawid: Also, you mustn’t forget about the essentials – before you download any data, you need to prepare the server, the repository, the database, and any scripts you may need, as these are the key tools without which the research work using the acquired data would simply be impossible.
Marcin: That’s exactly why we use a wide variety of tools and environments. We are currently working with MySQL Workbench and the Facebook Graph API Explorer. Data is downloaded using the Facebook SDK v5 for PHP linked to the database with MySQL Improved (mysqli). Sometimes it is necessary to develop your own tools just to do a single task, e.g. to filter out invalid characters using a custom application written in C#.
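The character-filtering task Marcin mentions was handled by a C# application; the idea behind it can be sketched in a few lines of JavaScript (a swapped-in language, shown here only to illustrate the technique): strip control characters and unpaired surrogate halves that can break database inserts, while keeping legitimate Unicode intact.

```javascript
// Sketch of the invalid-character filtering step (the actual tool was
// written in C#). Removes C0/C1 control characters (except tab, LF, CR)
// and lone surrogate halves, which are invalid on their own.
function stripInvalidChars(text) {
  let out = '';
  for (const ch of text) { // for-of iterates by code point
    const cp = ch.codePointAt(0);
    const isControl =
      (cp <= 0x1f && cp !== 0x09 && cp !== 0x0a && cp !== 0x0d) ||
      (cp >= 0x7f && cp <= 0x9f);
    // An unpaired surrogate shows up as a single code point in 0xD800-0xDFFF.
    const isLoneSurrogate = cp >= 0xd800 && cp <= 0xdfff;
    if (!isControl && !isLoneSurrogate) out += ch;
  }
  return out;
}

console.log(stripInvalidChars('Kato\u0000wice\u0007')); // prints "Katowice"
```

Polish diacritics, emoji and other multi-byte characters pass through untouched; only bytes that databases and downstream tools tend to choke on are dropped.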
Dawid: Also worth mentioning are JavaScript, Node.js, the Google Places API, Mapbox, the Microsoft Cognitive Services APIs and OpenCV. Working with data in a project as vast as Shared Cities requires the development of a system of interlinked software packages, hence the variety of tools and methods.
Marcin: Problems can be encountered as early as the preparation stage, when reading the data source documentation, which is sometimes outdated or incomplete. For popular data sources, such as Facebook’s Graph API, you can also get information from threads in developer forums on the web. However, some API functions are deliberately left undocumented, because access to them would offer too much leeway to “outsiders.” When the documentation is complete and has been in use for some time, the work scenario is quite easy to develop. At the Facebook data acquisition stage, I repeatedly came across things either not working according to the documentation, or some “extra” functionalities which enabled me to access data beyond what the documentation describes, and thus improve our work efficiency.
Dawid: I would consider it to be not so much a scenario, but rather a series of related stages that make up a not-so-linear process. We often need to go back to a previous stage, e.g. when we want to supplement the data set or process it again in search of answers to new questions we had not thought of before. It may also turn out that the tool or library you are using requires you to export the data or reformat the structure in which it is stored.
Marcin: Definitely. So far, all the datasets I have acquired information from had discrepant formats and required an individual approach. Some information needs to be broken down into different columns. For instance, a one-field address of the event venue may have to be divided into separate fields: street address, city and post code, as not all databases allow you to download these fields separately. The data set at hand should also be stripped of all data that is of little or no interest to our study, e.g. items referring to events taking place in another voivodeship which were erroneously classified as happening in Katowice.
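The address-splitting step can be sketched as follows. This is a simplified illustration, not the team’s actual parser: it assumes the common Polish single-field format “Street 12, 40-001 Katowice”, and real-world data would need more defensive handling.

```javascript
// Hypothetical sketch: split a one-field venue address into street,
// post code and city. Assumes the format "Street 12, 40-001 Katowice".
function splitAddress(address) {
  const [streetPart, cityPart] = address.split(',').map(s => s.trim());
  // Polish post codes have the form NN-NNN, followed by the city name.
  const match = cityPart ? cityPart.match(/^(\d{2}-\d{3})\s+(.+)$/) : null;
  return {
    street: streetPart || null,
    postCode: match ? match[1] : null,
    city: match ? match[2] : (cityPart || null),
  };
}

console.log(splitAddress('Mariacka 12, 40-014 Katowice'));
// { street: 'Mariacka 12', postCode: '40-014', city: 'Katowice' }
```

Once the city sits in its own field, filtering out records erroneously classified as happening in Katowice becomes a one-line comparison on that field.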
Dawid: It turned out on multiple occasions that for some Facebook data we were missing detailed topographic or address information. In such cases, we would get around the problem with a script based on the Google Places API, designed to search for a place name or a description fragment containing the address. Our data sets consist of tens of thousands of records, so you need to automate the process of supplementing the data and exporting it to more user-friendly formats such as Google Spreadsheets, as this allows the ‘manual’ verification of the sets by volunteers.
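A gap-filling script of the kind Dawid describes could be built around the Google Places Text Search API. The sketch below shows the two testable halves of such a script, the request construction and the response handling, while omitting the network call itself; the appended “Katowice” bias and the sample response are illustrative assumptions, not the team’s actual code.

```javascript
// Sketch of the address gap-filling step: look a place name up via the
// Google Places Text Search API and take the best match's address.
function buildSearchUrl(placeName, apiKey) {
  // Bias the search toward Katowice (an assumption for this sketch).
  const params = new URLSearchParams({ query: placeName + ' Katowice', key: apiKey });
  return 'https://maps.googleapis.com/maps/api/place/textsearch/json?' + params;
}

function extractAddress(placesResponse) {
  // The Text Search API returns matches in a "results" array, each with
  // a "formatted_address" field; take the top-ranked one, if any.
  const best = (placesResponse.results || [])[0];
  return best ? best.formatted_address : null;
}

// Abridged example of a Text Search response shape.
const sample = {
  results: [{ name: 'Kinoteatr Rialto', formatted_address: 'św. Jana 24, 40-012 Katowice, Poland' }],
};
console.log(extractAddress(sample)); // prints the matched address
```

With tens of thousands of records, running this lookup only for rows whose address field is empty keeps the API usage manageable, and the filled-in rows can then be exported for the volunteers’ manual check.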
Dawid: I must admit that early on in the project I wasn’t aware of the volume of data we would be dealing with. It is something of a unique endeavour to even carry out cultural circuit surveys on such a scale. Combining both types of data will be a challenge not only in formal, technical terms but also due to methodological concerns reported by our colleagues from the research team. At this stage, it is difficult to specify whether the findings obtained from their analysis will be complementary to each other, or mutually exclusive. Either way, however, they will certainly widen the research field and allow us to formulate new questions.