Writing about Data Engineering problems and SQL
by Brahmanand Singh
Inner and outer joins are common use cases in SQL but one in a while we get a use case where we need to use cross join, one such use case is to find the nearest location from a given lat/long position.
We have store location data and need to find correct weather conditions to predict the logistics delays for supply chain analytics. What we have is the source location (lat and long) but we do not have reliable weather readings. To solve this problem we are going to use the NOAA weather data that is published by the National Oceanic and Atmospheric Administration (NOAA) government department; it contains historical information as well . Now the problem is these 2 tables do not have any keys to join the data, only possible link is to use the location fields and find out which weather station is the nearest to this store and then enrich the data for feature engineering pipeline. To summarise We have two tables:
Our goal is to find the nearest weather station for each location in the source table.
The simplest approach is to perform a cross join between our input table having the source location and referece table having weather stations records, calculate the distance for each pair, and then select the nearest station for each source location.
Here’s the query:
WITH source_data AS (
SELECT
id,
CAST(latitude AS FLOAT64) AS source_lat,
CAST(longitude AS FLOAT64) AS source_lon
FROM `projectid.datasetname.source_table`
),
weather_stations AS (
SELECT
station_id,
CAST(latitude AS FLOAT64) AS station_lat,
CAST(longitude AS FLOAT64) AS station_lon
FROM `projectid.datasetname.weather_stations_table`
),
distance_calc AS (
SELECT
s.id AS source_id,
w.station_id,
ST_DISTANCE(ST_GEOGPOINT(s.source_lon, s.source_lat), ST_GEOGPOINT(w.station_lon, w.station_lat)) AS distance
FROM source_data sd
CROSS JOIN weather_stations ws
),
ranked_stations AS (
SELECT
source_id,
station_id,
distance,
ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY distance) AS rank
FROM distance_calc
)
SELECT
source_id,
station_id AS nearest_station_id,
distance AS distance_to_nearest_station
FROM ranked_stations
WHERE rank = 1
This approach works well for small to medium-sized tables. But, as the number of input records or refereces grows, the cross join can become computationally expensive.
To optimize our query, we can use BigQuery’s geospatial functions to create a bounding box around each source point and only consider weather stations within that box. This significantly reduces the number of distance calculations needed.
Here’s the optimized query:
WITH source_data AS (
SELECT
id,
ST_GEOGPOINT(CAST(longitude AS FLOAT64), CAST(latitude AS FLOAT64)) AS source_point
FROM `projectid.datasetname.source_table`
),
weather_stations AS (
SELECT
station_id,
ST_GEOGPOINT(CAST(longitude AS FLOAT64), CAST(latitude AS FLOAT64)) AS station_point
FROM `projectid.datasetname.weather_stations_table`
),
nearest_station AS (
SELECT
s.id AS source_id,
ARRAY_AGG(
STRUCT(w.station_id,
ST_DISTANCE(s.source_point, w.station_point) AS distance)
ORDER BY ST_DISTANCE(s.source_point, w.station_point)
)[OFFSET(0)] AS nearest
FROM source_data sd
JOIN weather_stations ws
ON ST_DWithin(s.source_point, w.station_point, 100000) -- 100km radius
GROUP BY s.id
)
SELECT
source_id,
nearest.station_id AS nearest_station_id,
nearest.distance AS distance_to_nearest_station
FROM nearest_station
Geospatial Functions: We use ST_GEOGPOINT
to create geography points directly, which simplifies our distance calculations.
Bounding Box: The ST_DWithin
function creates a bounding box around each source point, considering only weather stations within 100km. This drastically reduces the number of distance calculations.
Efficient Aggregation: We use ARRAY_AGG
with ORDER BY
and fetching the 1st record to efficiently select the nearest station for each source point in a single pass.
The optimized query will generally perform much better, especially for larger datasetnames. Here’s why:
When working with geospatial data in BigQuery, it’s crucial to leverage the platform’s specialized functions. While a cross join approach is intuitive and works for smaller datasetnames, using functions like ST_DWithin
and ST_DISTANCE
can significantly improve performance for larger datasetnames.
Remember to adjust the radius in the ST_DWithin
function based on your specific use case and data distribution. A larger radius ensures you don’t miss any potential nearest stations but may increase computation time.
By optimizing your geospatial queries, you can achieve faster results and more efficient use of your BigQuery resources.
tags: BigQuery - BQ - Latitude - Longitude - geospatial