Sonde Raw Data

The historical maps on this site, such as the heatmaps and calendars are all based on SondeHub data. SondeHub has graciously made their data public. However, it’s not in a very convenient format: an enormous directory of hundreds of thousands of individual JSON files. There’s one file for every sonde ever reported to SondeHub. Each file contains three records: the first, highest and last data points reported to SondeHub for that sonde. It takes many hours to download and parse them all.

I’ve downloaded the data, done some data conversion and cleaning to remove apparently invalid flights, and reuploaded it in the Parquet format. Parquet is a far more efficient format, taking only seconds to download and parse. It is supported by a wide variety of languages and analysis frameworks.

Data

I’ve broken the summary data into files per year for easier downloading:

Example Usage

Python’s popular data science library Pandas is a great way to explore the data. For example, to see how many unique sondes were reported in 2022:

In [1]: import pandas as pd

In [2]: df = pd.read_parquet("sonde-summaries-2022.parquet")

In [3]: df['serial'].nunique()
Out[3]: 140629

Each sonde has 3 rows in the dataset: the first, highest, and last data points reported to SondeHub. The datetime column lets you see how many launches there were broken down by month:

In [4]: launches = df.groupby('serial').first()

In [5]: launches['datetime'].dt.month.value_counts().sort_index()
Out[5]:
datetime
1     10423
2     10550
3     11812
4     10997
5     11658
6     11410
7     11718
8     12479
9     12514
10    12379
11    12561
12    12128
Name: count, dtype: int64

Python Library (optional)

The parquet files above can be used directly with any language or tool that supports parquet. But if you’d like some convenience, the sonde-search repo includes a small Python library that automatically downloads, caches, and concatenates all available years of data for you.

Loading the data

from data.cache import get_sonde_summaries_as_dataframe

# Load all years, all columns
df = get_sonde_summaries_as_dataframe()

The parquet files contain over 50 columns, but most analyses only need a few. Pass a columns parameter to load only what you need — this is much faster and uses far less memory:

# Load only the columns needed for landing analysis
df = get_sonde_summaries_as_dataframe(
    columns=['serial', 'frame', 'lat', 'lon', 'alt', 'datetime']
)

Filtering to real flights

The raw data includes ground-based transmitters, bench tests, and other non-flights. The filter_real_flights function keeps only sondes that reached at least 5,000m altitude and have started descending:

from lib.data_utils import filter_real_flights, get_landing_rows

df = get_sonde_summaries_as_dataframe(
    columns=['serial', 'frame', 'lat', 'lon', 'alt', 'datetime']
)

# Remove ground tests and non-flights
df = filter_real_flights(df)

Getting landing positions

Each sonde has multiple rows. To get just the landing position (the last telemetry point received), use get_landing_rows:

# One row per sonde: the last point received
landings = get_landing_rows(df)

Putting it all together

Here’s a complete example that finds all sondes that landed within 50km of a given location:

from geographiclib.geodesic import Geodesic
from data.cache import get_sonde_summaries_as_dataframe
from lib.data_utils import filter_real_flights, get_landing_rows

df = get_sonde_summaries_as_dataframe(
    columns=['serial', 'frame', 'lat', 'lon', 'alt', 'datetime']
)
df = filter_real_flights(df)
landings = get_landing_rows(df)

# Find landings within 50km of Seattle (47.6, -122.3)
home_lat, home_lon = 47.6, -122.3
landings['dist_km'] = landings.apply(
    lambda r: Geodesic.WGS84.Inverse(home_lat, home_lon, r['lat'], r['lon'])['s12'] / 1000,
    axis=1,
)
nearby = landings[landings['dist_km'] < 50].sort_values('dist_km')
print(f"{len(nearby)} sondes landed within 50km")
print(nearby[['serial', 'datetime', 'lat', 'lon', 'dist_km']].head(10))

For more examples, see the analyzers directory on GitHub, which includes code for generating landing calendars and heatmaps.

Join the Conversation