Brahma's Personal Blog

Writing about Data Engineering problems and SQL

View the Project on GitHub brahma19/blog

21 September 2024

Using BigQuery Geospatial functions to find nearest location

by Brahmanand Singh

Finding Nearest location based on lat/long fields in BigQuery

Inner and outer joins are common use cases in SQL but one in a while we get a use case where we need to use cross join, one such use case is to find the nearest location from a given lat/long position.

The Problem

We have store location data and need to find correct weather conditions to predict the logistics delays for supply chain analytics. What we have is the source location (lat and long) but we do not have reliable weather readings. To solve this problem we are going to use the NOAA weather data that is published by the National Oceanic and Atmospheric Administration (NOAA) government department; it contains historical information as well . Now the problem is these 2 tables do not have any keys to join the data, only possible link is to use the location fields and find out which weather station is the nearest to this store and then enrich the data for feature engineering pipeline. To summarise We have two tables:

  1. A source table with locations stored as string (latitude and longitude)
  2. Reference Table with target location (also with latitude and longitude) , let’s call this a “weather station” table.

Our goal is to find the nearest weather station for each location in the source table.

Approach 1: Cross Join and Distance Calculation

The simplest approach is to perform a cross join between our input table having the source location and referece table having weather stations records, calculate the distance for each pair, and then select the nearest station for each source location.

Here’s the query:

WITH source_data AS (
  SELECT 
    id,
    CAST(latitude AS FLOAT64) AS source_lat,
    CAST(longitude AS FLOAT64) AS source_lon
  FROM `projectid.datasetname.source_table`
),

weather_stations AS (
  SELECT 
    station_id,
    CAST(latitude AS FLOAT64) AS station_lat,
    CAST(longitude AS FLOAT64) AS station_lon
  FROM `projectid.datasetname.weather_stations_table`
),

distance_calc AS (
  SELECT 
    s.id AS source_id,
    w.station_id,
    ST_DISTANCE(ST_GEOGPOINT(s.source_lon, s.source_lat), ST_GEOGPOINT(w.station_lon, w.station_lat)) AS distance
  FROM source_data sd
  CROSS JOIN weather_stations ws
),

ranked_stations AS (
  SELECT 
    source_id,
    station_id,
    distance,
    ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY distance) AS rank
  FROM distance_calc
)

SELECT 
  source_id,
  station_id AS nearest_station_id,
  distance AS distance_to_nearest_station
FROM ranked_stations
WHERE rank = 1

This approach works well for small to medium-sized tables. But, as the number of input records or refereces grows, the cross join can become computationally expensive.

Approach 2: Optimized Query with Bounding Box

To optimize our query, we can use BigQuery’s geospatial functions to create a bounding box around each source point and only consider weather stations within that box. This significantly reduces the number of distance calculations needed.

Here’s the optimized query:

WITH source_data AS (
  SELECT 
    id,
    ST_GEOGPOINT(CAST(longitude AS FLOAT64), CAST(latitude AS FLOAT64)) AS source_point
  FROM `projectid.datasetname.source_table`
),

weather_stations AS (
  SELECT 
    station_id,
    ST_GEOGPOINT(CAST(longitude AS FLOAT64), CAST(latitude AS FLOAT64)) AS station_point
  FROM `projectid.datasetname.weather_stations_table`
),

nearest_station AS (
  SELECT 
    s.id AS source_id,
    ARRAY_AGG(
      STRUCT(w.station_id, 
             ST_DISTANCE(s.source_point, w.station_point) AS distance)
      ORDER BY ST_DISTANCE(s.source_point, w.station_point)
    )[OFFSET(0)] AS nearest
  FROM source_data sd
  JOIN weather_stations ws
  ON ST_DWithin(s.source_point, w.station_point, 100000)  -- 100km radius
  GROUP BY s.id
)

SELECT 
  source_id,
  nearest.station_id AS nearest_station_id,
  nearest.distance AS distance_to_nearest_station
FROM nearest_station

Key Optimizations

  1. Geospatial Functions: We use ST_GEOGPOINT to create geography points directly, which simplifies our distance calculations.

  2. Bounding Box: The ST_DWithin function creates a bounding box around each source point, considering only weather stations within 100km. This drastically reduces the number of distance calculations.

  3. Efficient Aggregation: We use ARRAY_AGG with ORDER BY and fetching the 1st record to efficiently select the nearest station for each source point in a single pass.

Performance Comparison

The optimized query will generally perform much better, especially for larger datasetnames. Here’s why:

Conclusion

When working with geospatial data in BigQuery, it’s crucial to leverage the platform’s specialized functions. While a cross join approach is intuitive and works for smaller datasetnames, using functions like ST_DWithin and ST_DISTANCE can significantly improve performance for larger datasetnames.

Remember to adjust the radius in the ST_DWithin function based on your specific use case and data distribution. A larger radius ensures you don’t miss any potential nearest stations but may increase computation time.

By optimizing your geospatial queries, you can achieve faster results and more efficient use of your BigQuery resources.

tags: BigQuery - BQ - Latitude - Longitude - geospatial