Analysing the Impact of Large Data Imports in OpenStreetMap

OpenStreetMap (OSM) is a global mapping project which generates free geographical information through a community of volunteers. OSM is used in a variety of applications and for research purposes. However, it is also possible to import external data sets to OpenStreetMap. The opinions about these data imports are divergent among researchers and contributors, and the subject is constantly discussed. The question of whether importing data, especially large quantities, is adding value to OSM or compromising the progress of the project needs to be investigated more deeply. For a recent study by Witt et al. published Open Access, OSM’s historical data were used to compute metrics about the developments of the contributors and OSM data during large data imports which were for the Netherlands and India. Additionally, one time period per study area during which there was no large data import was investigated to compare results. For making statements about the impacts of large data imports in OSM, the metrics were analysed using different techniques (cross-correlation and changepoint detection). It was found that the contributor activity increased during large data imports. Additionally, contributors who were already active before a large import were more likely to contribute to OSM after said import than contributors who made their first contributions during the large data import. The results show the difficulty of interpreting a heterogeneous data source, such as OSM, and the complexity of the project. Limitations and challenges which were encountered are explained, and future directions for continuing in this field of research are given.

In this study, only the time period was investigated where approximately 80% of the data were added. Therefore, potential influences and impacts outside the observation periods have not been considered.
For the computation of the contributor activity, all contribution types (i.e., creation, deletion, tag changes and geometry changes) were included to get a general understanding of the number of active users per timestamp. As a consequence, contributors who were deleting data were weighted as much as users who were creating or updating elements.
The changepoint detection was used to compute the most significant changepoint in the observation period. For most of the imports, this approach provided useful results which showed the distinct changes of the contributor activity during or after the import. However, for imports such as the 3dShapes import or the BAG import, the development of the user activity was more complex. During the 3dShapes import, the contributor activity was increasing with the start of the import, and dropped down after most of the data were imported. Then, the contributor activity increased again. A similar pattern could be seen during the BAG import, where the contributor activity increased throughout the conduction of the import but decreased afterwards. By computing only one changepoint, these processes were simplified. Additionally, other OSM events could have influenced the changepoint detection. However, this problem is omnipresent when working with OSM data.
The algorithm for finding peaks in the development of contribution types used a multiple of the mean as the detection criterion. Therefore, other OSM events which happened in the same time period might have been detected as a peak, even though they were maybe not related to the import. Moreover, also peaks that happened before an import but within the observation period were counted. Additionally, more analysis is needed to investigate the peaks in more detail to ensure that the peaks are directly related to the import.
For the import, dedicated user accounts have to be used for importing data into OSM. These accounts were included in the results. Moreover, the differences in the total number of contributors who were involved in the imports (e.g., AND import in India with 18 contributors; AND import in the Netherlands with 108 contributors) have to be considered when evaluating and comparing results.
Furthermore, this study did not distinguish between different types of large imports (e.g., automatically or manually conducted imports, or a combination of both). The quantity of imported data was the only criterion for the selection.
This study presented a first approach for getting a deeper understanding about the impact of large data imports in OpenStreetMap by investigating large data imports in the Netherlands and India.
The results were manifold. It was found that for most of the large data imports which were analysed, the contributor activity increased during or after the conduction of the import. Looking at the imports in the early stage of OSM, especially the AND imports in 2007 and 2008, one can see that significantly more contributors were active than before. Imports which happened at a later stage did not show such a strong impact. During the BAG import in the Netherlands and the building import in India, the number of contributors increased. However, after most of the data were imported, the contributor activity slightly decreased again. Nonetheless, one can see that during a large data import the number of unique active contributors rose.
The analysis of the contributor engagement pointed out that the majority of users who were involved in an import were import-inspired; i.e., their first contributions happened during an import. Again, this finding supports the argument that with large data imports, more contributors were actively joining the project. However, mappers who were active beforehand were more likely to keep contributing in the time after the import was concluded. Therefore the study showed that already activate mappers were not driven away from the project. This study did not differentiate between dedicated user accounts which were created only for importing data and regular user accounts which need to be considered when reasoning about the findings.

Regarding the contribution patterns and the development of tag keys, no specific impact of large data imports was found in this study. The number of unique tag keys increased as the number of elements increased, given that external information was mapped to OSM tags. More research is needed to understand how the community is changing OSM data after a large data import.

The study considered the impact of large data imports from a data perspective on a small subset of imports that were conducted. For future research, the analysis of different data imports might also incorporate other aspects of OSM—for example, community events or mapping events and how they are related to imports. The investigation of automated processes, e.g., scripts or bots, could lead to better understanding about how large chunks of imported data are changed. Moreover, the phase of OSM in which an import is conducted could be analysed more thoroughly. This might help to understand if an import could be performed to also support the establishment or growth of a community in a specific region. Additionally, in that regard, the effect of the media or the OSM community creating awareness about data donations and respective data imports needs to be investigated. Additionally, the analysis of OSM contributors could be extended, for example, by considering the locations of contributors who are involved in an import process. Emerging spatial patterns could help to understand how local communities are developing during an import. The attributes of imported elements and how they are evolving over time could be analysed with a focus on the semantics of the data.

Witt, R.; Loos, L.; Zipf, A. (2021): Analysing the Impact of Large Data Imports in OpenStreetMap. ISPRS Int. J. Geo-Inf. 2021, 10, 528.

Selected earlier & related work: