How big is OSM? Estimating the size of OpenStreetMap’s history

In this blog post we report on a small study of the size of the OpenStreetMap (OSM) history data set, which we carried out in the context of developing the OSHDB. We estimated the size of OSM’s history in terms of raw data as it would be represented in a systems programming language such as C or Rust. We were interested in the raw amount of space the data would occupy, i.e., without the overhead of general-purpose data formats like XML and without employing data compression, which would introduce computational overhead.

This size estimate gives us

  1. an estimate of the hardware resources required to store and process the full OSM history;
  2. a baseline against which we can compare our efforts to represent the data more compactly.

Overall Methodology and Results

We first defined systems-level data structures for the OSM data objects changeset, member, nd (node reference), node, relation, tag and way. We then computed the size of each data structure and multiplied it by the number of corresponding XML elements in the OSM history dump. The results are presented in the following table.

Table 1: Number of XML elements, size of the data structures, and total data size (computed from the full history planet XML file downloaded on 2018-08-08).

XML element name    data structure size (bytes)    XML element count    OSM history size (bytes)
changeset                    65                         61,149,797              3,974,736,805
member                       13                      2,127,505,107             27,657,566,391
nd                            8                     14,537,932,143            116,303,457,144
node                         57                      7,024,909,604            400,419,847,428
relation                     56                         24,631,930              1,379,388,080
tag                          16                      4,861,971,375             77,791,542,000
way                          50                        957,849,585             47,892,479,250
total size                                                                    675,419,017,098
variant                                                                       603,352,497,027
compressed XML                                                                111,691,457,431
uncompressed XML                                                            2,001,921,015,360

In Table 1, one immediately sees that the nodes consume more space than the rest of the data, with the ways’ node references (nd) coming second. Overall, OSM’s history occupies between 600 GB and 700 GB, which is roughly 6 times the size of the compressed XML (or 10 times the .pbf) and about 1/3 of the uncompressed XML. This estimate does not include the concrete string representations of the tags, which we represent by unsigned integer IDs. In our experience, these strings take up less than 1% of the total size of the data and are hence not a major factor in our estimate.
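
The totals in Table 1 are simply the products of the per-record sizes and the element counts. As a minimal sketch of this arithmetic (in Rust, with the counts copied from Table 1):

// Reproduce the totals of Table 1: per-record size in bytes times the
// number of corresponding XML elements, summed over all element types.
fn main() {
    // (element, record size in bytes, element count in the history dump)
    let rows: [(&str, u64, u64); 7] = [
        ("changeset", 65, 61_149_797),
        ("member",    13, 2_127_505_107),
        ("nd",         8, 14_537_932_143),
        ("node",      57, 7_024_909_604),
        ("relation",  56, 24_631_930),
        ("tag",       16, 4_861_971_375),
        ("way",       50, 957_849_585),
    ];
    let mut total: u64 = 0;
    for (name, size, count) in rows {
        println!("{:10} {:>20}", name, size * count);
        total += size * count;
    }
    println!("{:10} {:>20}", "total size", total); // 675,419,017,098 bytes
}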

The raw data structures have been designed to closely mirror the XML data structure, so there is still some room for space optimizations. We could leave out the redundant user id stored at each node, way and relation, because it is also stored in the corresponding changeset. Further, we could store the visibility bit in the highest bit of the version number instead of representing it as an otherwise unused byte. With these two simple optimizations we would save 9 bytes per node, way and relation, which results in the total size reported as “variant”. With the current number of tag keys and tag values, we could also halve the size of the tag data structure, which would save roughly another 39 GB.
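
To illustrate the second optimization, the sketch below packs the visibility flag into the most significant bit of a 64-bit version field; the helper names are ours and only illustrate the idea, they are not taken from the OSHDB code.

// Store the visible flag in the most significant bit of the version number,
// so the separate one-byte `visible` field can be dropped. OSM version
// numbers stay far below 2^63, so this bit is otherwise unused.
const VISIBLE_BIT: u64 = 1u64 << 63;

fn pack_version(version: u64, visible: bool) -> u64 {
    debug_assert!(version < VISIBLE_BIT);
    if visible { version | VISIBLE_BIT } else { version }
}

fn unpack_version(packed: u64) -> (u64, bool) {
    (packed & !VISIBLE_BIT, (packed & VISIBLE_BIT) != 0)
}

fn main() {
    let packed = pack_version(3, true);
    assert_eq!(unpack_version(packed), (3, true));
    // Together with dropping the redundant 8-byte user_id, this saves
    // 9 bytes per node, way and relation (the “variant” row in Table 1).
}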

Concerning question (1), the hardware resources, this result is quite promising. Since the above numbers report only the raw data size, we would have to add some indices to make effective use of the data. However, assuming the indices are smaller than the data itself, the data would still fit into less than 1 TB. In the era of terabyte SSDs, this is not prohibitively large.

Technical Details

For the technically interested reader, the remainder of this blog post discusses the detailed layout of the data structures and how we computed the above numbers. Recall that our goal is to estimate the size of OSM’s history if we were to store it in plain old data structures. This serves as a baseline for evaluating our efforts to find more compact representations.

Data Model

We describe the data structures in a notation roughly borrowed from Rust, which makes it easy to compute their sizes. That is, we use types like u64, denoting a 64-bit unsigned integer, or i32, denoting a 32-bit signed integer, which consume 8 and 4 bytes respectively; time64 denotes a 64-bit timestamp (8 bytes), and bool is counted as one byte. A type in square brackets denotes an array of objects of the given type. The data structures closely follow OSM’s XML format, so there is not much to explain here. The comment in the first line of each structure reports the total size of the data structure, not including the arrays, because the latter are counted separately.

changeset { // 65 bytes
  id: u64,
  creation_time: time64,
  closing_time: time64,
  open: bool,
  user_id: u64,
  min_lon: i32,
  min_lat: i32,
  max_lon: i32,
  max_lat: i32,
  num_changes: u32, // with current limit of 10000, u16 would suffice here
  comments_count: u32,
  tags_count: u64,
  tags: [tag],
}

node { // 57 bytes
  id: u64,
  lat: i32,
  lon: i32,
  version: u64,
  timestamp: time64,
  changeset: u64,
  user_id: u64,
  visible: bool,
  tags_count: u64,
  tags: [tag],
}

member { // 13 bytes
  type: u8,
  reference: u64,
  role: u32,
}

relation { // 56 bytes
  id: u64,
  version: u64,
  timestamp: time64,
  changeset: u64,
  user_id: u64,
  visible: bool,
  members_count: u64,
  members: [member],
  tags_count: u64,
  tags: [tag],
}

tag { // 16 bytes (maybe 2x u32 = 8 bytes are sufficient here)
  key: u64,
  value: u64,
}

way { // 50 bytes
  id: u64,
  version: u64,
  timestamp: time64,
  changeset: u64,
  user_id: u64,
  visible: bool,
  nodes_count: u16,
  nodes: [u64],     // node ids
  tags_count: u64,
  tags: [tag],
}
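
As a sanity check on the hand-computed sizes, the fixed-size part of a record can be written as a packed Rust struct and measured with std::mem::size_of. The following sketch does this for the node record; the variable-length tags array is omitted because its entries are counted separately as tag records.

use std::mem::size_of;

// Fixed-size part of the node record from the data model above.
#[allow(dead_code)]
#[repr(C, packed)]
struct NodeFixed {
    id: u64,
    lat: i32,
    lon: i32,
    version: u64,
    timestamp: i64, // time64: a 64-bit timestamp
    changeset: u64,
    user_id: u64,
    visible: bool,  // counted as one byte
    tags_count: u64,
}

fn main() {
    // 8 + 4 + 4 + 8 + 8 + 8 + 8 + 1 + 8 = 57 bytes, as in Table 1.
    assert_eq!(size_of::<NodeFixed>(), 57);
    println!("node record (without tags): {} bytes", size_of::<NodeFixed>());
}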

Computing the Estimates

Our first step was to determine the size of the full history planet XML file itself. The size of the compressed XML file was easily read from the file system. Because the uncompressed XML file would not fit on the hard disk of the machine we used, we obtained the uncompressed size by running lbzcat history-latest.osm.bz2 | wc -c.

The next step was to process the XML file in order to count the number of occurrences of each OSM data type. To this end, we implemented a small program that counts the occurrences of XML tags in the file. Assuming the file is well-formed, the sum of start tags and empty-element tags equals the number of XML elements. We used lbzcat to uncompress the packed XML file in parallel and piped the result into our program. In order to validate the results and to gain some experience with how well different programming approaches suit our purposes, we implemented this program in three different programming languages: Python, Haskell and Rust.

The Python and Haskell implementations exploit the fact that OSM’s XML dump stores each XML tag on a separate line, so counting the occurrences boils down to inspecting the first word on each line. The Rust implementation employs the full-fledged XML pull parser quick_xml, as we wanted to be sure that this shortcut yields correct counts.
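
The line-based shortcut can be sketched in a few lines. The following Rust version uses only the standard library and reads the uncompressed XML from stdin; it mirrors what the Python and Haskell one-liners do (the actual Rust implementation instead parses the XML with quick_xml).

use std::collections::HashMap;
use std::io::{self, BufRead};

// Count XML elements by looking only at the first token of each line,
// exploiting the fact that the OSM history dump puts every tag on its
// own line (e.g. `  <node id="1" ...>` or `  <nd ref="42"/>`).
fn main() {
    let mut counts: HashMap<String, u64> = HashMap::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.expect("reading stdin failed");
        let trimmed = line.trim_start();
        if let Some(rest) = trimmed.strip_prefix('<') {
            // Skip end tags ("</..."), the XML declaration ("<?xml") and
            // other markup ("<!..."): only start tags and empty-element
            // tags count as XML elements.
            if rest.starts_with('/') || rest.starts_with('?') || rest.starts_with('!') {
                continue;
            }
            let name: String = rest
                .chars()
                .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
                .collect();
            *counts.entry(name).or_insert(0) += 1;
        }
    }
    for (name, count) in &counts {
        println!("{}\t{}", name, count);
    }
}

Piping the output of lbzcat history-latest.osm.bz2 into such a program then yields the element counts reported in Table 1.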

Performance Considerations

We computed the above numbers on an Intel® Core™ i7-7500U CPU @ 2.70 GHz × 4. The Python and Haskell implementations took about 10 hours to process the data, the Rust implementation only 4 hours. This is particularly remarkable because the Rust implementation solves the more complex task of fully parsing the XML.

We also have to consider that the bzip2 compression causes significant overhead. For instance, the Rust implementation was running at 60% CPU while lbzcat was eating up the remaining 340% of the 400% CPU provided by the four (logical) cores. Limiting lbzcat to three cores made our implementation use even less CPU, which indicates that decompression is the limiting factor here. In fact, merely decompressing the file by running lbzcat history-latest.osm.bz2 > /dev/null took 3 hours. We therefore measured the performance of our implementation on a smaller region that we were able to uncompress in advance. Processing the uncompressed XML directly improved performance by 30–40% compared to uncompressing the file on the fly. This roughly matches the observation that our program runs at 100% CPU with a priori decompression, and it would take our 4 hours down to roughly 2.5 hours. By exploiting the fact that there is only a fixed set of possible XML tags, we can speed up our implementation by a factor of 2 on the small data set, which would bring the processing time for the global history dump down to roughly 1.25 hours.

Of course, XML is a less than optimal data format for such a task. Our main reason for working with the XML dump was the ease of implementing the processing: both the Python and the Haskell program have fewer than 10 lines of code and were implemented in a few minutes each. With about 20 lines of code, the Rust implementation is slightly more complex, as it actually parses the XML. The obvious alternative is the full history planet PBF file, which is designed to be friendlier to the machine but involves considerably more work for the programmer. For comparison, we implemented the same counting process for the PBF dump in about 50 lines of C++ using libosmium. The computation took 0.5 hours, with the process running at 200% CPU. Apparently, reading the data is still a bottleneck that prevents the process from running at 400% CPU.

The major factors we identified for the performance benefit of the C++ implementation are:

  • The PBF dump is easier on disk IO: 68 GB vs. 2 TB is about 1/30 of the data transfer;
  • The PBF dump is easier on processing: an estimated 700 GB of structured data vs. 2 TB of unstructured data is about 1/3 of the data volume and reduces the work of parsing and structuring the data;
  • The PBF dump is easier to parallelize, because the data is sharded.

Considering these factors, the performance of the Rust implementation is not bad.

