
Four popular sources of labelled data for machine learning

Publication date: 15-01-2025, Read time: 5 min

In remote sensing, machine learning models are powerful instruments for classifying land cover and land use, identifying vegetation cover, determining soil types, detecting changes in the landscape, and more.

However, these models need to be fed labelled data as training material to perform well. Labelled data allows a machine learning model to learn the patterns and relationships between input features and their corresponding labels. It can be obtained from several sources; this article briefly describes four important examples.
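As a minimal Python sketch of this idea (the feature values, class names, and use of scikit-learn are illustrative assumptions, not part of the projects described below), a land cover classifier could be trained on labelled samples like this:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical training set: one row of spectral features per labelled sample
    # (e.g. band reflectances), and one land cover label per row.
    features = np.array([[0.12, 0.34, 0.56],
                         [0.45, 0.22, 0.10],
                         [0.13, 0.36, 0.54]])
    labels = np.array(["forest", "cropland", "forest"])

    # The model learns the relationship between input features and labels...
    model = RandomForestClassifier(n_estimators=100)
    model.fit(features, labels)

    # ...and can then predict labels for new, unlabelled pixels.
    new_pixels = np.array([[0.14, 0.33, 0.55]])
    print(model.predict(new_pixels))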

1. In-situ measurements

In-situ measurements are the foundation of remote sensing. The remote sensing community started by collecting measurements on the ground, e.g., to establish the diameters of tree stems and assess land cover categories.

These kinds of surveys are extremely important because they provide first-hand insight into the condition of the environment being measured.

Furthermore, they allow researchers to compare remote sensing estimates with actual measurements. Unfortunately, in-situ measurements are incredibly costly and time-consuming.

An example of a project where in-situ measurements are vital is the Land Use/Cover Area frame Survey (LUCAS) database initiated by the European Statistical Office (EUROSTAT).

LUCAS involves harmonised surveys across all EU member states to gather information on land cover and land use. Estimates of the area occupied by different land use or land cover types are computed based on in-situ measurements taken at a total of 1 million sample points throughout the EU. Changes to land use can be identified by repeating the survey every three years.

2. Ancillary data

Ancillary data is additional information that is supplementary or supportive to the main data being analysed or processed. It often provides context, helps interpret the main data, or assists in its processing.

For example, ancillary data can be retrieved from regional or local administrative databases, such as population census data, or freely accessible database applications like OpenStreetMap.

Although these kinds of databases can contain useful data, it's important to remember that there may be a semantic gap between the data provided and remote sensing data. Using multiple sources usually requires thorough checking and harmonising.
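To sketch what such harmonising can look like in practice (the source tags, class names, and mapping below are purely illustrative), categories from each ancillary source are often remapped onto one common legend before being used as labels:

    # Purely illustrative mapping from two ancillary sources onto one common legend.
    OSM_TO_LEGEND = {
        "landuse=residential": "built-up",
        "landuse=farmland": "cropland",
        "natural=wood": "forest",
    }
    CENSUS_TO_LEGEND = {
        "urban fabric": "built-up",
        "arable land": "cropland",
    }

    def harmonise(source, category):
        """Map a source-specific category to the common legend, or None if unknown."""
        table = OSM_TO_LEGEND if source == "osm" else CENSUS_TO_LEGEND
        return table.get(category)  # unknown categories are flagged for manual checking

    print(harmonise("osm", "landuse=farmland"))   # cropland
    print(harmonise("census", "urban fabric"))    # built-up
    print(harmonise("osm", "amenity=school"))     # None -> needs manual review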

To clarify all this with an example: European researchers tasked with training a deep learning model for crop type mapping had difficulty finding enough labelled crop type samples. Many European countries share databases of farmers' declarations, which are a treasure trove of data. In such a declaration, a farmer reports which crop they are cultivating on their own fields.

However, this information should not be taken at face value, since farmers are free to report whatever they want. Because of this, the researchers needed to put a lot of effort into optimising their training databases to make them suitable for the deep learning model.

The effort paid off, though: in the end, these precious ancillary data enabled the researchers to retrieve over one million labelled samples in an unsupervised way.
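A simple, hypothetical way to clean such declarations before training (not the researchers' actual pipeline; field names, crop list, and thresholds are assumptions) is to keep only parcels with a recognised crop name and a parcel size large enough to avoid mixed pixels:

    # Hypothetical clean-up of farmers' declarations before using them as labels.
    KNOWN_CROPS = {"wheat", "maize", "barley", "rapeseed", "sugar beet"}
    MIN_PARCEL_AREA_HA = 0.5  # very small parcels often cover mixed pixels

    def select_training_parcels(declarations):
        """Keep declarations with a recognised crop and a sufficiently large parcel."""
        selected = []
        for parcel in declarations:
            crop = parcel["declared_crop"].strip().lower()
            if crop in KNOWN_CROPS and parcel["area_ha"] >= MIN_PARCEL_AREA_HA:
                selected.append({"parcel_id": parcel["parcel_id"], "label": crop})
        return selected

    declarations = [
        {"parcel_id": 1, "declared_crop": "Wheat", "area_ha": 3.2},
        {"parcel_id": 2, "declared_crop": "flowers?", "area_ha": 0.1},
    ]
    print(select_training_parcels(declarations))  # only parcel 1 survives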

3. Visual annotation

Visual annotation can be seen as the counterpart of in-situ measurements. Instead of someone going into the field, a photo interpretation expert sits behind their PC and tries to make sense of what they're looking at.

But looking at images is not enough. Proper understanding requires lots of ancillary data, especially when certain land cover types need to be annotated.

Another thing about visual annotation, as opposed to field measurements, is that it's easier to make mistakes simply because the annotator is not actually "in situ." These mistakes result in faulty labelling, which in turn causes noise in the datasets. To reduce this noise, other experts will need to take a second look. This makes visual annotation, however valuable, as painstaking and time-consuming as in-situ measurements.
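One common way to make that second look systematic (sketched here with invented annotations; the agreement threshold is an assumption) is to have several experts label the same sample and keep it only when a clear majority agrees:

    from collections import Counter

    def consensus_label(annotations, min_agreement=2/3):
        """Return the majority label if enough annotators agree, otherwise None."""
        counts = Counter(annotations)
        label, votes = counts.most_common(1)[0]
        if votes / len(annotations) >= min_agreement:
            return label
        return None  # no consensus -> send the sample back for review

    print(consensus_label(["forest", "forest", "shrubland"]))   # forest
    print(consensus_label(["water", "wetland", "grassland"]))   # None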

4. Outdated maps

Outdated maps can also be a prime source of labelled data. Maps are typically aggregated at the polygon level and are never 100% reliable. By definition, there is a semantic gap between a map and the corresponding remote sensing data. Still, outdated maps are a great way to collect many samples at a large scale in an unsupervised way, if only because they are free.

To give an example of how outdated maps can supply labelled data, researchers updated the CORINE land cover map of Italy. This map used to have a raster spatial resolution of 100 metres, which the research team aimed to increase to 10 metres using Sentinel-2 satellite data.

An important aim was to minimise semantic and spatial aggregation. The LUCAS database mentioned earlier proved to be very helpful in validating the 26,000 samples retrieved from the outdated map, and the project was a success. Details that were not large enough to be included in the original map, such as rivers, were duly represented in the updated map.
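To illustrate the kind of validation step mentioned above (with invented sample records, not the actual project data), one can compare the label taken from the outdated map with an independent reference label at the same location and report how often the two agree:

    # Invented example records: label from the outdated map vs. an independent
    # reference label (e.g. a LUCAS-style in-situ observation) at the same location.
    samples = [
        {"map_label": "forest",    "reference_label": "forest"},
        {"map_label": "cropland",  "reference_label": "cropland"},
        {"map_label": "grassland", "reference_label": "shrubland"},
    ]

    agreeing = [s for s in samples if s["map_label"] == s["reference_label"]]
    agreement_rate = len(agreeing) / len(samples)
    print(f"{agreement_rate:.0%} of map-derived samples match the reference")  # 67%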

To learn more, see our other Geoversity article, The Remote Sensing Data Labelling Challenge.

Tags
Artificial Intelligence Big Geodata Remote Sensing
Last edited: 15-01-2025
