We have become accustomed to cashiers in superstores asking for our post codes as part of their market research. We had our doubts, however, about how respondents to our surveys (conducted during cultural events) would react when asked to provide their home address. We also wanted more detailed data that would allow us to conduct studies at least at the level of city districts.
For this reason, we introduced a graded question: ‘If this is not a problem for you, please provide the post code, street address or district name; if you do not know the post code, please provide the name of the town/city’. We left out the question about the voivodship in order not to drag out what was already a sizeable survey, usually conducted in unfavourable conditions – e.g. in a dark foyer a few minutes before the beginning of a theatre performance.
place of residence | example data |
---|---|
The solution we adopted, i.e. to gather as much data as possible even though it may be incomplete and of inferior quality, caused us many problems, as did working with data obtained from social media. Users arbitrarily declared their places of residence, providing e.g. German names of towns and districts (Kattowitz, Bogutschütz). The process of cleaning up the survey data therefore first had to deal with a number of issues, including:
problem | example data |
---|---|
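One of the clean-up steps mentioned in the text, replacing German place names with their Polish equivalents, can be sketched as a simple lookup. This is a minimal illustration only: the function name `normalise_place` and the two-entry `EXONYMS` table are assumptions made for the example, not the lookup actually used in the study.

```python
import unicodedata

# Hypothetical lookup of German exonyms; only the two names quoted in the
# text are included, with their assumed Polish equivalents.
EXONYMS = {
    "kattowitz": "Katowice",
    "bogutschütz": "Bogucice",
}


def normalise_place(raw):
    """Trim whitespace, unify the Unicode form and replace known exonyms."""
    text = unicodedata.normalize("NFC", raw)
    text = " ".join(text.split())  # collapse repeated and outer whitespace
    # casefold() gives a case-insensitive key; unknown names pass through.
    return EXONYMS.get(text.casefold(), text)
```

Names that are not in the table are returned unchanged, so they can still be corrected manually later.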
After performing a series of automatic actions and manually correcting the wrong names, we went on to the next step, which was to assign geographical coordinates to all addresses and locale names, in order to enable the visualisation of the content on the maps. For places other than Katowice, we were not interested in details such as district or street address, so we adopted two complementary solutions.
For post codes outside Katowice, we first matched them to the relevant names of municipalities using Poland’s postal code database. In the case of small localities which do not have a separate post code, we provided the names of the municipalities to which they belong (e.g. Sarnów = Psary). Then, using a database of geographic coordinates for Poland’s municipalities, we were able to identify and assign their geolocation data. One problem at this stage was that some municipality names occur more than once within the country. For instance, each record containing Chrzanów ended up with two sets of coordinates assigned, one for a town in Lesser Poland Voivodship and one for a village in Lublin Voivodship.
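The two-step lookup described above (post code to municipality, then municipality to coordinates) might look roughly as follows. This is a sketch under stated assumptions: the dictionaries, the function name `geocode_postcode` and the coordinate values are placeholders invented for illustration and are not taken from the study's data.

```python
# Hypothetical sample rows standing in for the postal directory.
POSTCODE_TO_MUNICIPALITY = {
    "42-512": "Psary",     # a locality such as Sarnów resolves to its municipality
    "32-500": "Chrzanów",
}

# Hypothetical centroid table; coordinates are rough placeholder values.
# A name shared by two municipalities yields two candidate points.
MUNICIPALITY_COORDS = {
    "Psary": [(50.37, 19.13)],
    "Chrzanów": [(50.14, 19.40), (50.78, 22.61)],
}


def geocode_postcode(postcode):
    """Return (municipality, list of candidate coordinates) for a post code."""
    municipality = POSTCODE_TO_MUNICIPALITY.get(postcode.strip())
    if municipality is None:
        return None, []
    return municipality, MUNICIPALITY_COORDS.get(municipality, [])
```

Looking up the name alone returns every namesake, which reproduces the duplicate problem noted for Chrzanów.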
First, we simplified the database by no longer distinguishing between neighbouring municipalities of the same name (e.g. the rural commune of Siedlce and the separate city of Siedlce). Next, after merging these, we verified duplicate records based on post codes, thanks to which we knew that we were dealing with e.g. Chrzanów, Lesser Poland Voivodship, or Olsztyn, Warmian-Masurian Voivodship, as opposed to its much smaller namesake in Silesian Voivodship. We used a similar method when processing data from Facebook page statistics, where the name of the voivodship features alongside the town/city name.
More problematic were cases where we had only the name of the municipality or commune, e.g. Bobrowniki. Here, it helped to know the context in which the data was obtained. If it came from a survey at a local event, such as a small concert in a pub, it would most likely mean Bobrowniki in Silesian Voivodship rather than the one in Kuyavian-Pomeranian Voivodship. However, such an inference would be risky in relation to Katowice’s Off Festival, which gathers audiences from all over Europe.
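The disambiguation step can be sketched by keying the coordinate table on the pair (name, voivodship) and letting a post code, where one is available, supply the voivodship. The table contents, coordinates and the function name `resolve` are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical centroids keyed by (municipality name, voivodship);
# coordinates are rough placeholder values.
CENTROIDS = {
    ("Chrzanów", "Lesser Poland"): (50.14, 19.40),
    ("Chrzanów", "Lublin"): (50.78, 22.61),
}

# Hypothetical post code table carrying the voivodship for each code.
POSTCODES = {"32-500": ("Chrzanów", "Lesser Poland")}


def resolve(name, postcode=None, voivodship=None):
    """Return coordinates if the record is unambiguous, otherwise None."""
    if postcode in POSTCODES:
        # A known post code overrides the declared name and fixes the region.
        name, voivodship = POSTCODES[postcode]
    candidates = [
        key for key in CENTROIDS
        if key[0] == name and (voivodship is None or key[1] == voivodship)
    ]
    if len(candidates) == 1:
        return CENTROIDS[candidates[0]]
    return None  # unknown or still ambiguous: leave for manual review
```

A record with a name only and several namesakes is deliberately left unresolved, so that it can be decided by hand from the survey context.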
A parallel approach was used for addresses in Katowice, where there is a markedly denser grid of postal codes. Here, we were able to use a more precise database to match the post code – rather than the name of commune/municipality – with a precisely defined area in space. As a result, we were able to obtain data of varying degrees of detail, usually containing different combinations of the following: post code, district name, street name, house number.
Of course, the most convenient situation is to have a complete address including the building number, but postal codes are also capable of providing detailed information. The same applies to districts, at least for certain types of studies, such as the level of cultural participation in particular neighbourhoods of Katowice (e.g. whether the residents of the remote district of Murcki are deprived of cultural access). The city’s longest streets could also be problematic when the house number was missing: Kościuszki Street, for example, runs from the city centre through several districts to the city border with Mikołów.
However, even those addresses which contained the street name and house number were not without their complications. On the one hand, we had to deal with e.g. new addresses not yet available in our database or non-existent numbers (such as No. 7 Dyrekcyjna Street in the city centre). On the other hand, it was also a challenge at times to combine the acquired addresses with our geographic coordinates because of the different forms of certain street names, especially those that consisted of many words, mostly personal names, like ‘Gen. Józefa Hallera’ or ‘Generała Józefa Zajączka’.
As the residents of these streets customarily use the surname alone, they mostly provided only the last part of the name, or, at best, its shortened form, e.g. ‘gen. Zajączka’. In order to solve this problem, we used an algorithm that combined the databases by the last two components first (Kościuszki 23), and then, as required, by three or more components (Tadeusza Kościuszki 23, Generała Józefa Zajączka 23).
The last step was to check for any errors and empty records. Despite our attempts to automate the entire data processing, it was difficult to avoid manual additions and corrections. Fortunately, with a few thousand addresses, we came across no more than a few dozen faulty records, so their verification proved a relatively smooth process.
Naturally, a question arises as to whether these operations could have been avoided by using an online tool to automatically geocode desired localities and addresses.
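Matching on trailing components can be sketched as below. This is one possible reading of the procedure, not the algorithm actually used: the function names `tail` and `match_address`, the `max_components` limit and the decision to give up when candidates remain ambiguous are all assumptions.

```python
def tail(address, n):
    """Last n whitespace-separated components of an address, lower-cased."""
    tokens = address.lower().replace(",", " ").split()
    return tuple(tokens[-n:])


def match_address(survey_address, reference, max_components=4):
    """Match a survey address to a reference address by trailing components.

    Starts with the last two components (surname + house number) and adds
    one more component at a time while several candidates remain.
    """
    candidates = list(reference)
    for n in range(2, max_components + 1):
        key = tail(survey_address, n)
        if len(key) < n:
            break  # the survey address has no further components to compare
        narrowed = [ref for ref in candidates if tail(ref, n) == key]
        if len(narrowed) == 1:
            return narrowed[0]
        if not narrowed:
            break
        candidates = narrowed
    return None  # no match or still ambiguous: check manually
```

With a reference list containing 'Generała Józefa Zajączka 23', the survey answer 'gen. Zajączka 23' already matches on its last two components, even though the abbreviated rank differs.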
Unfortunately, with incomplete data, tools such as Google Maps (tested with Google Fusion) failed to deliver adequately accurate results, especially for geographic identification of post codes. Although going through the above process is time-consuming, it allows for better understanding of the data, making it easier to detect errors that could lead to undesirable consequences at a later stage.
The data thus developed will allow us to perform a number of spatial analyses of Katowice’s cultural life. We will investigate, among other things, the scope of activities of particular institutions as well as the distance travelled by participants to get to their chosen events. We will be interested both in geographic relations at the national level and in more detailed issues regarding Katowice districts.
In order to perform the operations described above, we have used various public data sets.
The first of these was the Official Directory of Postal Address Numbers, which is available on the Polish Post website as a PDF file. We needed an editable database, so we converted the document into a CSV file, saving data on post codes and names of communes/municipalities and voivodships, thus enabling their clear identification.
Another database contained polygons representing the footprints of particular communes/municipalities. We obtained the data from the National Register of Boundaries [Polish: Państwowy Rejestr Granic, PRG], made available by the Central Office for Surveying and Cartographic Documentation. We used the QGIS application to prepare a set of centroids for all selected areas in the WGS84 coordinate system, which we additionally combined with the post code database for the whole country. All post code points corresponding to a given city were assigned the same location, which is sufficient considering the scale of our geographic analysis.
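For the planned distance analyses, a common starting point is the great-circle (haversine) distance between a participant's geocoded location and the venue. The sketch below assumes WGS84 latitude/longitude in degrees and a mean Earth radius of 6371 km; it gives straight-line distance, not the actual route travelled, and the function name is illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```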
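The centroids themselves were prepared in QGIS. Purely as an illustration of what such a computation involves, the area-weighted centroid of a simple polygon can be obtained with the shoelace formula. The function names below are hypothetical, and the sketch treats coordinates as planar, which is only an approximation for longitude/latitude values.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as a list of (x, y)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon")
    return cx / (3 * area2), cy / (3 * area2)


def locate_postcodes(postcodes, ring):
    """Assign the same centroid to every post code of one municipality."""
    centroid = polygon_centroid(ring)
    return {code: centroid for code in postcodes}
```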
The National Register of Boundaries was also the basis for two further sets. Data on address points was used directly for geocoding detailed survey information, as well as indirectly, as the foundation to prepare the database of post code areas within Katowice. The second database was developed by creating a Voronoi diagram for these points (i.e. dividing the city into smaller areas closest to each point) and merging the areas with the same code. Finally, we created a data set containing their centroids.
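A point lies in the Voronoi cell of its nearest site, so once cells with the same code are merged, finding the post code area of a location reduces to finding the post code of the nearest address point. The brute-force sketch below illustrates that equivalence. It is not the GIS workflow described above; it assumes planar (projected) coordinates, and the sample points and function name are invented.

```python
def postcode_at(location, address_points):
    """Post code of the address point nearest to `location`.

    `address_points` is a list of ((x, y), postcode) pairs. The result is
    the same as locating the point in Voronoi cells merged by post code.
    """
    x, y = location

    def squared_distance(item):
        (px, py), _ = item
        return (px - x) ** 2 + (py - y) ** 2

    _, postcode = min(address_points, key=squared_distance)
    return postcode


# Hypothetical address points in arbitrary planar units.
SAMPLE_POINTS = [
    ((0.0, 0.0), "40-001"),
    ((0.0, 1.0), "40-001"),
    ((10.0, 0.0), "40-002"),
]
```

For a whole city, a spatial index or a GIS Voronoi tool would be a more practical choice than scanning every address point for every query.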