Weather Forecast Streaming Data Analysis

How accurate are weather forecasts, really, and which weather model performs best for different places in Switzerland?
To answer these questions, Leo Posva and I developed a streaming data analysis pipeline built on Apache Spark and Hadoop. The pipeline lets us visualize the accuracy of recent weather predictions in real time, in a way that can easily be scaled up to Big Data. The project was completed as part of the fantastic Data Information Systems master's lecture.
Pipeline Architecture
The architecture of the data processing pipeline is shown in the cover image. The pipeline was deployed across multiple AWS EC2 instances.
Data Collection
Data is periodically fetched from the free Open-Meteo API. The processed forecast data
- comes from 4 different weather models,
- covers 10 locations across Switzerland,
- contains hourly predictions for up to 7 days into the future,
- and includes the variables temperature, relative humidity, surface pressure and total cloud cover.
We also record the actual measured values of these variables, which allows us to quantify the forecast accuracy.
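To illustrate the collection step, a single polling cycle could look roughly like the sketch below. The locations, model identifiers and variable names passed to the API are assumptions based on the public Open-Meteo documentation, not necessarily the exact values used in the project.

```python
import requests

# Hypothetical subset of the 10 Swiss locations (name -> latitude, longitude).
LOCATIONS = {
    "Zurich": (47.37, 8.55),
    "Geneva": (46.20, 6.14),
    "Lugano": (46.00, 8.95),
}

# Assumed variable and model identifiers; see the Open-Meteo docs for the full list.
HOURLY_VARS = "temperature_2m,relative_humidity_2m,surface_pressure,cloud_cover"
MODELS = "icon_seamless,gfs_seamless,ecmwf_ifs04,meteofrance_seamless"

def fetch_forecasts(lat, lon):
    """Fetch 7-day hourly forecasts for one location from all configured models."""
    resp = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": lat,
            "longitude": lon,
            "hourly": HOURLY_VARS,
            "models": MODELS,
            "forecast_days": 7,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for name, (lat, lon) in LOCATIONS.items():
        hourly = fetch_forecasts(lat, lon).get("hourly", {})
        print(name, sorted(hourly.keys())[:5])
```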
Data Processing
The collected data is continuously streamed into the Structured Streaming engine of Apache Spark. Spark runs on a YARN cluster on top of 3 Hadoop nodes.
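Below is a minimal sketch of the ingestion step, assuming the collector writes one JSON file per polling cycle into an HDFS directory that Structured Streaming monitors; the path and the tidy schema are illustrative, not the project's exact layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = (SparkSession.builder
         .appName("weather-forecast-accuracy")
         .getOrCreate())

# Assumed tidy layout of the forecast records after the collection step.
forecast_schema = T.StructType([
    T.StructField("location", T.StringType()),
    T.StructField("model", T.StringType()),
    T.StructField("variable", T.StringType()),
    T.StructField("issue_time", T.TimestampType()),   # when the forecast was made
    T.StructField("target_time", T.TimestampType()),  # the hour being predicted
    T.StructField("value", T.DoubleType()),
])

# Structured Streaming picks up newly arriving files from this directory.
forecasts = (spark.readStream
             .schema(forecast_schema)
             .json("hdfs:///weather/forecasts/"))
```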
After the data has been cleaned and mapped to a tidy table, the next step is to pair each forecast with the measurement taken at its target time. This was implemented as a stream-stream join, which requires watermarks on both input streams so that Spark can bound the join state it keeps in memory.
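Continuing the sketch above, the pairing could look roughly like this; the measurement schema, watermark durations and join keys are assumptions.

```python
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Assumed tidy layout of the measured (actual) values.
measurement_schema = T.StructType([
    T.StructField("location", T.StringType()),
    T.StructField("variable", T.StringType()),
    T.StructField("target_time", T.TimestampType()),
    T.StructField("value", T.DoubleType()),
])

measurements = (spark.readStream
                .schema(measurement_schema)
                .json("hdfs:///weather/measurements/"))

# Watermarks bound how long Spark keeps unmatched rows in the join state.
fc = forecasts.withWatermark("target_time", "2 hours").alias("fc")
ms = measurements.withWatermark("target_time", "2 hours").alias("ms")

joined = fc.join(
    ms,
    on=[
        F.col("fc.location") == F.col("ms.location"),
        F.col("fc.variable") == F.col("ms.variable"),
        F.col("fc.target_time") == F.col("ms.target_time"),
    ],
    how="inner",
).select(
    F.col("fc.location").alias("location"),
    F.col("fc.model").alias("model"),
    F.col("fc.variable").alias("variable"),
    F.col("fc.issue_time").alias("issue_time"),
    F.col("fc.target_time").alias("target_time"),
    (F.col("fc.value") - F.col("ms.value")).alias("error"),
)
```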

Next, a windowed grouped aggregation on the target time computes the average error per day for each combination of location, model, variable and forecast length (see the sketch after the list):
- short: forecasts up to 1 day into the future
- medium: 1-3 days
- long: 3-7 days
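Continuing the sketch, the lead-time bucketing and the daily aggregation could look as follows. Note that depending on the Spark version, a streaming aggregation chained directly after a stream-stream join may not be supported, in which case the same aggregation can be performed inside the foreachBatch sink shown in the next step.

```python
from pyspark.sql import functions as F

# Lead time in hours between when the forecast was issued and the hour it predicts.
lead_hours = (F.col("target_time").cast("long") - F.col("issue_time").cast("long")) / 3600

# Bucket forecasts into the short/medium/long categories described above.
bucketed = joined.withColumn(
    "forecast_length",
    F.when(lead_hours <= 24, "short")
     .when(lead_hours <= 72, "medium")
     .otherwise("long"),
)

# Mean absolute error per day, location, model, variable and forecast length.
daily_errors = (bucketed
    .groupBy(
        F.window("target_time", "1 day"),
        "location", "model", "variable", "forecast_length",
    )
    .agg(F.avg(F.abs(F.col("error"))).alias("mean_abs_error"))
    .withColumn("day", F.col("window.start"))
    .drop("window"))
```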
Finally, we add a rescaled version of the errors (the 97th percentile is mapped to 100) to make the different weather variables comparable. The results are incrementally written to a PostgreSQL database.
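A sketch of the rescaling and the incremental write is shown below. The connection settings and table name are placeholders, and the 97th percentile is approximated per micro-batch here, which is a simplification of the rescaling described above.

```python
from pyspark.sql import functions as F

def write_to_postgres(batch_df, batch_id):
    # Per-variable 97th percentile within this micro-batch (simplified rescaling).
    p97 = (batch_df.groupBy("variable")
           .agg(F.percentile_approx("mean_abs_error", 0.97).alias("p97")))
    rescaled = (batch_df.join(p97, "variable")
                .withColumn("scaled_error", F.col("mean_abs_error") / F.col("p97") * 100))
    # Append the micro-batch over JDBC (requires the PostgreSQL driver on the classpath).
    (rescaled.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://db-host:5432/weather")  # placeholder
     .option("dbtable", "forecast_errors")                     # placeholder
     .option("user", "weather")                                # placeholder
     .option("password", "***")                                # placeholder
     .option("driver", "org.postgresql.Driver")
     .mode("append")
     .save())

query = (daily_errors.writeStream
         .outputMode("append")
         .foreachBatch(write_to_postgres)
         .option("checkpointLocation", "hdfs:///weather/checkpoints/errors")
         .start())
```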
Data Visualization
We used the Python library seaborn to visualize the results from within a Jupyter Notebook, and Jupyter Widgets to make the plots interactive.
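An interactive notebook cell could look roughly like the sketch below; the connection string, table and column names are the same placeholders as in the earlier sketches.

```python
import ipywidgets as widgets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Placeholder connection matching the JDBC sink above.
engine = create_engine("postgresql://weather:***@db-host:5432/weather")
df_all = pd.read_sql("SELECT * FROM forecast_errors", engine)

def plot_errors(variable="temperature_2m", forecast_length="short"):
    """Plot the rescaled daily error of every model for one variable and lead-time bucket."""
    df = df_all[(df_all.variable == variable) & (df_all.forecast_length == forecast_length)]
    plt.figure(figsize=(10, 4))
    sns.lineplot(data=df, x="day", y="scaled_error", hue="model")
    plt.ylabel("rescaled error")
    plt.show()

# Dropdown widgets for the variable and forecast length (placeholder option values).
widgets.interact(
    plot_errors,
    variable=["temperature_2m", "relative_humidity_2m", "surface_pressure", "cloud_cover"],
    forecast_length=["short", "medium", "long"],
)
```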
Using Voila behind an nginx web server, we turned the interactive notebook into a standalone application that can be accessed from anywhere with a web browser.
Results

