SUM: A Benchmark Dataset of Semantic Urban Meshes

Figure 1: Overview of the semantic urban mesh benchmark. Left: the textured meshes. Right: the ground truth meshes.


We introduce a new benchmark dataset of semantic urban meshes which covers about 4 km² in Helsinki (Finland), with six classes: Ground, Vegetation, Building, Water, Vehicle, and Boat.

We used the Helsinki 3D textured meshes as input and annotated them to create a benchmark dataset of semantic urban meshes. The raw Helsinki dataset covers about 12 km² and was generated in 2017 from oblique aerial images with a ground sampling distance (GSD) of about 7.5 cm, using the off-the-shelf commercial software ContextCapture.

The entire region of Helsinki is split into tiles, each covering about 250 m × 250 m. As shown in the figures below, we selected the central region of Helsinki as the study area, which includes 64 tiles.
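As a quick consistency check between the tile count and the stated coverage, the short sketch below multiplies the number of tiles by the footprint of one tile. It assumes each tile is a square of 250 m by 250 m; the constant and function names are illustrative only.

```python
# Assumption: every tile is a 250 m x 250 m square.
TILE_SIDE_M = 250.0
NUM_TILES = 64


def total_area_km2(num_tiles: int, side_m: float) -> float:
    """Total footprint in km^2 of `num_tiles` square tiles with side `side_m`."""
    return num_tiles * side_m * side_m / 1_000_000.0


print(total_area_km2(NUM_TILES, TILE_SIDE_M))  # 4.0
```

Under that assumption the 64 selected tiles give the roughly 4 km² quoted for the benchmark.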

Figure 2: Selected area of Helsinki

Figure 3: The input textured mesh (part of our dataset).

Figure 4: The labelled mesh (part of our dataset).

Data Download

The mesh data can be visualized in MeshLab and in our 3D Annotator. We also provide sampled point clouds with semantics, colours, and corresponding face ids at two sampling densities (relative to surface area): 30 pts/m² and 300 pts/m². All data are provided in PLY format only, and the semantic classes and colours are defined as follows:

Label   Semantics      RGB
-1      unclassified   0, 0, 0
1       ground         170, 85, 0
2       vegetation     0, 255, 0
3       building       255, 255, 0
4       water          0, 255, 255
5       car            255, 0, 255
6       boat           0, 0, 153
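For scripts that consume the labels, the table above can be written as a small lookup. The following is a minimal sketch: the dictionary values are copied from the table, while the names `SUM_CLASSES` and `colourize` and the fallback behaviour for unknown ids are our own illustrative choices, not part of the dataset tooling.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Label id -> (class name, RGB colour), copied from the table above.
SUM_CLASSES: Dict[int, Tuple[str, RGB]] = {
    -1: ("unclassified", (0, 0, 0)),
    1: ("ground", (170, 85, 0)),
    2: ("vegetation", (0, 255, 0)),
    3: ("building", (255, 255, 0)),
    4: ("water", (0, 255, 255)),
    5: ("car", (255, 0, 255)),
    6: ("boat", (0, 0, 153)),
}


def colourize(labels: List[int]) -> List[RGB]:
    """Map label ids to display colours.

    Assumption: ids that are not in the table are shown as unclassified.
    """
    fallback = SUM_CLASSES[-1][1]
    return [SUM_CLASSES.get(label, ("", fallback))[1] for label in labels]


print(colourize([1, 3, 6]))
```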

Download link: SUM Helsinki 3D

Labelling Workflow

Rather than manually labelling each individual triangle face of the raw meshes, we designed a semi-automatic mesh labelling framework to accelerate the process. First, we over-segment the input meshes into a set of planar segments. To acquire the first ground truth data, we manually annotate the segmented mesh that has the highest feature diversity. We then add this first labelled mesh to the training dataset for supervised classification. Specifically, the classifier takes segment-based features as input and outputs a pre-labelled mesh dataset. Next, we use the mesh annotation tool to manually refine the pre-labelled mesh selected according to feature diversity. Each newly refined mesh is added to the training dataset to incrementally improve the automatic classification accuracy.

Figure 5: The pipeline of the labelling workflow
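The incremental loop of the workflow can be illustrated with a toy sketch. This is not our actual implementation: the diversity score (summed per-dimension feature variance), the nearest-centroid classifier, and all function names are hypothetical stand-ins, and the human annotator is simulated by an `annotate` callback.

```python
from statistics import pvariance
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Feature = Tuple[float, ...]
Tile = List[Feature]  # one feature vector per planar segment


def diversity(tile: Tile) -> float:
    """Hypothetical diversity score: summed per-dimension feature variance."""
    return sum(pvariance(dim) for dim in zip(*tile))


def fit_centroids(samples: Sequence[Tuple[Feature, int]]) -> Dict[int, Feature]:
    """Nearest-centroid stand-in for the supervised classifier."""
    grouped: Dict[int, List[Feature]] = {}
    for feat, label in samples:
        grouped.setdefault(label, []).append(feat)
    return {
        label: tuple(sum(dim) / len(dim) for dim in zip(*feats))
        for label, feats in grouped.items()
    }


def predict(centroids: Dict[int, Feature], feat: Feature) -> int:
    """Return the label whose centroid is closest to `feat`."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(feat, centroids[lab])),
    )


def label_incrementally(
    tiles: List[Tile], annotate: Callable[[Feature], int]
) -> List[int]:
    """Process tiles from most to least diverse.

    Returns, per tile, how many segments the annotator had to set or fix.
    """
    training: List[Tuple[Feature, int]] = []
    corrections: List[int] = []
    for tile in sorted(tiles, key=diversity, reverse=True):
        pre: List[Optional[int]]
        if training:
            centroids = fit_centroids(training)
            pre = [predict(centroids, feat) for feat in tile]  # pre-labelling
        else:
            pre = [None] * len(tile)  # first tile is labelled fully by hand
        truth = [annotate(feat) for feat in tile]  # manual refinement
        corrections.append(sum(p != t for p, t in zip(pre, truth)))
        training.extend(zip(tile, truth))  # grow the training set
    return corrections


def oracle(feat: Feature) -> int:
    """Simulated annotator: label 1 (ground) or 3 (building)."""
    return 1 if feat[0] < 2.5 else 3


tiles = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)],
    [(0.2, 0.1), (4.9, 5.2)],
    [(0.0, 0.0), (0.1, 0.1)],
]
print(label_incrementally(tiles, oracle))  # [4, 0, 0]
```

In this toy run the first (most diverse) tile needs all four segments labelled by hand, while the later tiles are pre-labelled correctly and need no fixes, which is the effect the framework aims for.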

Following the proposed framework, a total of 19,080,325 triangle faces have been labelled, which took around 400 working hours. Compared with a triangle-based manual approach, we estimate that our framework saved us more than 600 hours of manual labour.

Comparison of Existing 3D Urban Benchmark Datasets

Urban datasets can be captured with different sensors and reconstructed with different methods, so the resulting datasets have different properties. The input to the semantic labelling process can be raw or pre-processed urban data, such as the pre-labelled results of semantic segmentation. Regardless of the input, the data still need to be manually checked and annotated with a labelling tool, in which users select the correct semantic label from a predefined list for each triangle (or point, depending on the dataset). Some interactive approaches can make the labelling process semi-manual. However, unlike our proposed approach, the labelling of most 3D benchmark datasets does not take full advantage of pre-processing steps such as over-segmentation and semantic segmentation of the 3D data, or of interactive annotation in 3D space.

Table 1: Comparison of existing 3D urban benchmark datasets.

Data Split

To perform the semantic segmentation task, we randomly select 40 tiles from the annotated 64 tiles of Helsinki as training data, 12 tiles as test data, and 12 tiles as validation data. For each of the six semantic categories, we compute the total area in the training and test dataset to show the class distribution.
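A random 40/12/12 split of this kind can be reproduced with a seeded shuffle. The sketch below is illustrative only: the tile names, the seed, and the function `split_tiles` are hypothetical, and it does not reproduce the actual tile assignment used in our experiments.

```python
import random
from typing import Dict, List


def split_tiles(
    tile_ids: List[str],
    n_train: int = 40,
    n_test: int = 12,
    n_val: int = 12,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Randomly partition tiles into training, test, and validation sets."""
    if n_train + n_test + n_val != len(tile_ids):
        raise ValueError("split sizes must add up to the number of tiles")
    shuffled = tile_ids[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "validation": shuffled[n_train + n_test:],
    }


tiles = [f"tile_{i:02d}" for i in range(64)]  # hypothetical tile names
split = split_tiles(tiles)
print({name: len(ids) for name, ids in split.items()})
```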

Figure 6: Overview of the data used in our experiment. Left: The distribution of the training, test, and validation dataset. Right: Semantic categories of training (including validation data) and test dataset.


We sample the meshes into coloured point clouds with a density of about 30 pts/m² as input for the competing deep learning methods. To compare against current state-of-the-art 3D deep learning methods that can be applied to large-scale urban datasets, we select five representative approaches (PointNet, PointNet++, SPG, KPConv, and RandLA-Net) and run the experiments on an NVIDIA GeForce GTX 1080 Ti GPU. We also compare with the joint RF-MRF, which is the only competing method that takes the mesh directly as input and does not use a GPU for computation. The hyper-parameters of all competing methods were tuned to achieve the best results we could obtain. Our baseline approach achieves about 93.0% overall accuracy and 66.2% mIoU. Specifically, it outperforms RF-MRF by a margin of about 5.3% mIoU and the deep learning methods by 0.7% to 28.1% mIoU.
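The mesh-to-point-cloud sampling step mentioned at the start of this experiment can be sketched as follows. This is a generic technique (area-weighted face selection followed by uniform barycentric sampling), not necessarily the exact sampler used to produce the released point clouds; the function names are illustrative. Each sample keeps the id of the face it came from, so labels and colours can be transferred from the mesh.

```python
import math
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def triangle_area(tri: Triangle) -> float:
    """Area of a 3D triangle from the cross product of two edges."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = tri
    ux, uy, uz = bx - ax, by - ay, bz - az
    vx, vy, vz = cx - ax, cy - ay, cz - az
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(nx * nx + ny * ny + nz * nz)


def sample_mesh(
    triangles: Sequence[Triangle], density: float, seed: int = 0
) -> List[Tuple[Vec3, int]]:
    """Sample about `density` points per unit of surface area.

    Returns (point, face_id) pairs.
    """
    rng = random.Random(seed)
    areas = [triangle_area(t) for t in triangles]
    n_points = round(density * sum(areas))
    if n_points == 0:
        return []
    # Pick faces with probability proportional to their area.
    face_ids = rng.choices(range(len(triangles)), weights=areas, k=n_points)
    samples: List[Tuple[Vec3, int]] = []
    for fid in face_ids:
        a, b, c = triangles[fid]
        # Square-root trick for uniform sampling inside a triangle.
        s, r2 = math.sqrt(rng.random()), rng.random()
        w_a, w_b, w_c = 1.0 - s, s * (1.0 - r2), s * r2
        point = (
            w_a * a[0] + w_b * b[0] + w_c * c[0],
            w_a * a[1] + w_b * b[1] + w_c * c[1],
            w_a * a[2] + w_b * b[2] + w_c * c[2],
        )
        samples.append((point, fid))
    return samples


# Two triangles forming a flat 10 m x 10 m square (100 m^2).
square = [
    ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 10.0, 0.0)),
    ((0.0, 0.0, 0.0), (10.0, 10.0, 0.0), (0.0, 10.0, 0.0)),
]
points = sample_mesh(square, density=30.0)
print(len(points))  # 3000
```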

Table 2: Comparison of various semantic segmentation methods on the new benchmark dataset. The results reported in this table are per-class IoU (%), mean IoU (mIoU, %), Overall Accuracy (OA, %), mean class Accuracy (mAcc, %), mean F1 score (mF1, %), and the running times for training and testing (minutes). The running times of RF-MRF and the baseline (ours) methods also include feature computation.
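For readers who want to recompute the metrics named in the caption, the sketch below derives them from a confusion matrix using the standard definitions. It assumes unweighted means over classes and plain sample counts; whether our evaluation weights samples (for example by face area) is not specified here, so treat this as an approximation of the protocol rather than the evaluation code itself.

```python
from typing import Dict, List


def segmentation_metrics(conf: List[List[int]]) -> Dict[str, object]:
    """Compute IoU, OA, mAcc and mF1 from a confusion matrix.

    conf[i][j] is the number of samples of true class i predicted as class j.
    Assumption: class means are unweighted.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    ious: List[float] = []
    accs: List[float] = []
    f1s: List[float] = []
    for k in range(n):
        tp = conf[k][k]
        fn = sum(conf[k]) - tp                       # missed samples of class k
        fp = sum(conf[i][k] for i in range(n)) - tp  # wrongly assigned to k
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        accs.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return {
        "IoU": ious,
        "mIoU": sum(ious) / n,
        "OA": sum(conf[k][k] for k in range(n)) / total if total else 0.0,
        "mAcc": sum(accs) / n,
        "mF1": sum(f1s) / n,
    }


# Toy two-class example (rows: ground truth, columns: prediction).
metrics = segmentation_metrics([[8, 2], [1, 9]])
print({name: value for name, value in metrics.items() if name != "IoU"})
```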

We also evaluated semantic segmentation performance with different amounts of training data for our baseline approach and KPConv, to understand how much data is required to obtain decent results. Our baseline needs only about 7% of the training dataset (about 0.23 km²) to achieve acceptable and robust results, compared with about 33% (about 1.0 km²) for KPConv.

Figure 7: Effect of the amount of input training data on the performance of our baseline method and KPConv.

Video Demo


If you use SUM-Helsinki in a scientific work, we kindly ask you to cite it:

SUM: A Benchmark Dataset of Semantic Urban Meshes. Weixiao Gao, Liangliang Nan, Bas Boom and Hugo Ledoux. arXiv preprint arXiv:2103.00355, 2021.
author = {Weixiao Gao and Liangliang Nan and Bas Boom and Hugo Ledoux},
title = {SUM: A Benchmark Dataset of Semantic Urban Meshes},



This project has received funding from EuroSDR and support from CycloMedia.


Weixiao Gao
PhD Candidate


Hugo Ledoux

Liangliang Nan


Jantien Stoter

Student Helpers

Ziqian Ni, assisted in software development, from 2019-07 to 2019-09.
Mels Smit, assisted in mesh annotation, from 2020-07 to 2020-09.
Charalampos Chatzidiakos, assisted in mesh annotation, from 2020-07 to 2020-09.