IDEAL-VGI: Analyzing and Improving the Quality and Fitness for Purpose of OpenStreetMap as Labels in Remote Sensing Applications

We are happy to announce that the IDEAL-VGI project by GIScience has been successfully completed. IDEAL-VGI was a tandem project in cooperation with Begüm Demir from TU Berlin and was conducted under the umbrella of the VGIscience Second Phase Projects, which ran from 2020 to 2022. VGIscience received funding as a Priority Programme (SPP) of the German Research Foundation (Deutsche Forschungsgemeinschaft – DFG). The VGIscience SPP was completed with the release of the open access book “Volunteered Geographic Information. Interpretation, Visualization and Social Context” on December 12, 2023, to which IDEAL-VGI contributed a chapter on “Analyzing and Improving the Quality and Fitness for Purpose of OpenStreetMap as Labels in Remote Sensing Applications”.

The team behind IDEAL-VGI assessed the quality and fitness for purpose of OpenStreetMap (OSM) as labels for the remote sensing (RS) domain. Volunteered Geographic Information (VGI), such as OSM, is a productive way to enrich geographic databases through the work of volunteers. User Generated Content (UGC), such as VGI, promises to be a large source of data labels, which are needed in ever-growing amounts. Especially in machine learning, the quality of the data labels is of high importance. Currently, the majority of state-of-the-art land-use and land-cover classification models take the input labels given for training at face value. As a consequence, the lack of authoritative involvement in the data collection process can lead to doubts among the RS community regarding the data quality. It is assumed that noisy labels, such as redundant, incomplete, heterogeneous or incorrect data, will lead to poor model performance and wrong model output.

Two approaches were used to evaluate OSM land-use and land-cover information: (1) an assessment of OSM fitness for purpose for samples in relation to intrinsic and semi-intrinsic data quality indicators at the scale of individual OSM objects, and (2) an assessment of OSM-derived multi-labels at the scale of remote sensing patches (1.22 x 1.22 km) in combination with deep learning approaches.

The first approach showed that the high variability within indicator distributions is the main difficulty in predicting the data quality of individual land-use objects. The intrinsic indicators (such as mapper experience, number of mappers per region, and the number of edits made to an object) were derived with a newly developed tool called OSM Element Vectorisation, which is based on the OpenStreetMap History DataBase (OSHDB). The authors could show, for example, that element size is a strong quality indicator, yet it has to be interpreted in combination with other indicators. For instance, meadows exceeding a certain size in central Europe tended to be of lower quality, meaning that a large share of the covered area was in fact not meadow.
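To give a feeling for what intrinsic indicators look like, here is a minimal sketch that derives a few of them from the edit history of a single object. It is not the OSM Element Vectorisation tool and does not use the OSHDB API. The `Version` record, the indicator names and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Version:
    """One stored version of an OSM object (hypothetical, simplified record)."""
    user: str
    timestamp: date
    area_m2: float


def intrinsic_indicators(history: list[Version]) -> dict[str, float]:
    """Derive simple intrinsic indicators from the history of one object."""
    if not history:
        raise ValueError("history must contain at least one version")
    ordered = sorted(history, key=lambda v: v.timestamp)
    first, latest = ordered[0], ordered[-1]
    return {
        # every version after the initial creation counts as one edit
        "n_edits": len(ordered) - 1,
        # distinct contributors who touched this object
        "n_mappers": len({v.user for v in ordered}),
        # time span between creation and the most recent version
        "age_days": (latest.timestamp - first.timestamp).days,
        # element size, taken from the most recent version
        "area_m2": latest.area_m2,
    }


# Made-up history of a single land-use polygon
history = [
    Version("alice", date(2015, 3, 1), 12_000.0),
    Version("bob", date(2018, 6, 15), 11_500.0),
    Version("alice", date(2021, 1, 10), 11_800.0),
]
print(intrinsic_indicators(history))
```

Indicators such as mapper experience or the number of mappers per region need information beyond a single object's history and are left out of this sketch.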

Feature importance in relation to data quality. The importance was derived from a quantile random forest for 1000 randomly selected OSM objects. The features are sorted by the percentage increase in mean squared error if the feature were dropped. In addition, node purity is provided as a second feature importance indicator. To ease interpretation, the second indicator is displayed together with its position relative to the median node purity value across all selected features.
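The "percentage increase in mean squared error" can be illustrated with a small sketch. One common way to approximate dropping a feature is to shuffle its values and measure how much the error grows; whether the figure was produced in exactly this way is an assumption on our side. The sketch uses a fixed toy function as the model instead of a quantile random forest, and the data and the feature names `element_size` and `unused_feature` are made up.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]
Model = Callable[[Row], float]


def mse(predict: Model, X: Sequence[Row], y: Sequence[float]) -> float:
    """Mean squared error of a prediction function on (X, y)."""
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def pct_increase_mse(predict: Model, X: Sequence[Row], y: Sequence[float],
                     feature: int, n_repeats: int = 20, seed: int = 0) -> float:
    """Mean percentage increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)  # must be non-zero for a percentage
    total_increase = 0.0
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and y
        X_perm: List[List[float]] = []
        for row, value in zip(X, column):
            new_row = list(row)
            new_row[feature] = value
            X_perm.append(new_row)
        total_increase += mse(predict, X_perm, y) - baseline
    return 100.0 * (total_increase / n_repeats) / baseline


# Toy data: the target depends on the first feature only, plus small noise.
X = [[i / 10, float((i * 7) % 5)] for i in range(20)]
y = [2 * row[0] + (0.05 if i % 2 == 0 else -0.05) for i, row in enumerate(X)]


def toy_model(row: Row) -> float:
    """Stand-in for a trained model; it ignores the second feature."""
    return 2 * row[0]


names = ["element_size", "unused_feature"]  # hypothetical feature names
importance = {name: pct_increase_mse(toy_model, X, y, idx)
              for idx, name in enumerate(names)}
for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.1f} % increase in MSE")
```

A feature the model never uses gets an increase of zero, while shuffling an informative feature raises the error sharply. Sorting by this value gives the kind of ranking described in the caption.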

The second approach employed a label-noise robust deep learning method, developed by TU Berlin, on remote sensing data with OSM labels. First, it was shown that OSM was indeed capable of providing high-quality multi-labels in the study area. Multi-labels are multiple LULC (land-use and land-cover) class tags assigned to a given area, each indicating the presence of a specific class in that area. These OSM-derived labels were then used as training data for a machine learning model. The method was capable of identifying correct multi-labels, even in situations where significant levels of artificial noise had been added to the original training data.
In a subsequent step, the method was also used to identify areas where the input labels were likely wrong. This makes it possible to flag areas of concern and provide feedback to the OSM community.
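The ideas of multi-labels, artificial label noise and flagging can be sketched in a few lines. This is only an illustration and not the method developed at TU Berlin. The class list, the presence rule, the noise model and the disagreement threshold are all assumptions, and the predicted probabilities are made up rather than produced by a trained model.

```python
import random

CLASSES = ["forest", "meadow", "farmland", "residential"]  # illustrative subset


def multi_label(cover: dict[str, float], min_share: float = 0.0) -> list[int]:
    """Multi-hot vector: 1 if a class covers more than min_share of the patch."""
    return [1 if cover.get(name, 0.0) > min_share else 0 for name in CLASSES]


def add_label_noise(labels: list[int], flip_prob: float, seed: int = 0) -> list[int]:
    """Flip each class label independently with probability flip_prob."""
    rng = random.Random(seed)
    return [1 - bit if rng.random() < flip_prob else bit for bit in labels]


def flag_suspect_classes(labels: list[int], probabilities: list[float],
                         threshold: float = 0.8) -> list[str]:
    """Classes where label and predicted probability strongly disagree."""
    return [name
            for name, bit, prob in zip(CLASSES, labels, probabilities)
            if abs(bit - prob) >= threshold]


# Share of the patch area covered by each OSM-derived class (made-up values)
cover = {"forest": 0.6, "meadow": 0.3}
labels = multi_label(cover)
print(labels)

# Synthetic noise, in the spirit of the robustness experiments
print(add_label_noise(labels, flip_prob=0.25, seed=42))

# Made-up model output: the model is confident that there is no meadow
probabilities = [0.95, 0.05, 0.10, 0.02]
print(flag_suspect_classes(labels, probabilities))
```

In this toy case only the meadow label is flagged, because the label says the class is present while the assumed model output says it almost certainly is not.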

For more results from the IDEAL-VGI project, visit the project's website or our past blog posts. We especially recommend the introduction video that was created within the project, which gives a good overview of the topic of UGC quality analyses.
We also encourage you to have a look at the results from the other projects conducted during the SPP here.

The team of the IDEAL-VGI project would like to thank the DFG, the SPP and especially Dirk Burghardt for the great opportunity to carry out this relevant and interesting project and to produce all these results.

Reference:

Schott, M., Zell, A., Lautenbach, S., Sumbul, G., Schultz, M., Zipf, A., Demir, B. (2024). Analyzing and Improving the Quality and Fitness for Purpose of OpenStreetMap as Labels in Remote Sensing Applications. In: Burghardt, D., Demidova, E., Keim, D.A. (eds) Volunteered Geographic Information. Springer, Cham, pp. 21-42. https://doi.org/10.1007/978-3-031-35374-1_2