Problem statement

My goal is to filter the observations table and find all points that intersect with the grid cells of my area of interest. I'm using the dbplyr package in R so that I can write and send queries to a PostGIS database using R code. This works for the most part, but I'm having trouble with spatial filtering. In my case, I want to filter points/centroids that intersect with my area of interest.

Normally I would use the sf package for this, but it seems you can't use sf functions to directly query a PostGIS database. I checked other posts, but they only use SQL statements directly, and those solutions don't work with a lazy-table setup like mine. About lazy tables/evaluation: the most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, dplyr tries to be as lazy as possible: it never pulls data into R unless you explicitly ask for it. Additionally, it delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step (source: https://smithjd.github.io/sql-pet/chapter-lazy-evaluation-queries.html)
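To illustrate the lazy behaviour, a minimal sketch (the connection con, table name and column are hypothetical):

library(dplyr)
library(dbplyr)

## Build a lazy query; nothing is sent to the database yet
lazy_tbl <- tbl(con, "some_table") |>
  filter(some_column > 10)

## Inspect the SQL that dbplyr would generate
lazy_tbl |> show_query()

## Only now is the query executed and the result pulled into R
result <- lazy_tbl |> collect()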

Code examples

I'm aware that the code below is not reproducible, but hopefully it still provides enough insight into what I'm trying to do. In my case, I first had to create two different connections to finally get the grid cells I'm interested in:

## Connection "con_va" with names of the area of interest and get x-y
    all_grids <-
          tbl(con_va, "all_grids") |> 
          filter(
            area_name == "my_area"
          ) |> 
          distinct(x, y)
  
## Connection 2 'con_new' with postgis-geometry###
    area_of_interest <-
      tbl(con_new, dbplyr::in_schema("geodata", "gridceells")) |> 
      inner_join(
        y = all_grids,
        join_by(am_x == x, am_y == y),
        copy = TRUE
      ) |> 
      select(
        the_geom
      )

The second connection (con_new) also contains the table with all the observations, and I want to filter those observations that fall within my area of interest. The observations table has a column 'epsg28992_centroid', which is of type 'pq_gmtry'.

all_obs <- 
  tbl(con_new, dbplyr::in_schema("observation", "observations")) |> 
  select(
    subject_txn_id, osn_id, epsg28992_centroid
  )

As far as I know, spatial filtering of PostGIS data is currently only possible through SQL statements. But because 'area_of_interest' is a lazy table, it can't be used directly in an SQL statement (as far as I'm aware). I have tried the code below, but obviously that doesn't work:

all_obs |>
  inner_join(
    y = area_of_interest,
    join_by(sql("ST_Intersects(area_of_interest.the_geom, epsg28992_centroid)"))
  )

## Different approach; similar results
all_obs |>
  filter(
    sql("ST_Intersects(area_of_interest.the_geom, epsg28992_centroid)")
  )

Question

My goal is to filter the observations table and find all points that intersect with the grid cells of my area of interest. What is the appropriate way to write these types of queries for spatial filtering from R? Can I still use the lazy table I have with the area of interest? The observations table is massive, so it's not something I want to (or even can) read into memory. The locations tables are much smaller, something I could potentially pull into memory if needed, but I doubt that's the best way to go about it.

  • There's an example in ?st_read that shows reading data from a Postgres database with an SQL query to subset. What's a "lazy-table"? Some dbplyr specific thing? Commented Feb 13 at 11:41
  • Also, could you explain the query you are trying to do, e.g. "I want to select all points in foo that intersect polygons in bar", because it's not that clear from your text or code. Commented Feb 13 at 11:44
  • I added a little explanation on lazy-tables and tried to clarify what I'm trying to achieve. It's indeed finding all points that intersect with my area of interest. Commented Feb 13 at 14:01
  • Why do you have two connections? Are there two different databases (in the PostGIS sense of database) here? Normally you'd have one connection. If you can figure out what SQL the lazy tables would run on evaluation then you could put that in an st_read query parameter, but I'm not sure if that would work across databases. Or even across schemas. Commented Feb 13 at 15:13

1 Answer


Lazy evaluation shouldn't be a problem; it just delays where the error will occur. You can check each step by forcing execution: typically, run nrow() to count the number of rows, which will force execution of the underlying steps and eventually surface any error.

For example, you can run area_of_interest |> nrow() to check that everything is OK with the copy between databases. For information: to build area_of_interest, dbplyr technically downloads the first table from the first database to your machine and copies it to the second one as a temporary table. You can do this yourself once and for all, instead of copying on every run, by using copy_to (I don't think you have to collect() first, and maybe dbplyr is smart enough to download and upload chunk by chunk if the data is big...). Eventually you can create a permanent table there from this temporary table. Otherwise, you have to accept that there will be a delay each time for the download and upload of the data.
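A sketch of that one-time copy, reusing the connections from the question (collecting first keeps it simple, even if it may not be strictly necessary; the target name "aoi_grids" is made up):

library(dplyr)

## Pull the small grid table into R once...
grids_local <- all_grids |> collect()

## ...and copy it into the second database as a persistent table
copy_to(
  con_new,
  grids_local,
  name = "aoi_grids",
  temporary = FALSE
)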

Additionally, note that you can connect the first database to the second one directly using a Foreign Data Wrapper. The foreign table appears like a normal table in Postgres, but it is technically remote, so you can filter, join, etc. with it directly in Postgres, and Postgres is the one handling the eventual transfer of data. Typically, in your case, you could create an FDW on the second database that links to the first one, because the second one is the one with the bigger tables.
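Roughly like this, executed once on the second database (a sketch assuming the postgres_fdw extension is installed on the server; host, database name, user and password are placeholders):

library(DBI)
library(dplyr)

## One-time setup, executed on the second database (con_new)
dbExecute(con_new, "CREATE EXTENSION IF NOT EXISTS postgres_fdw")
dbExecute(con_new, "
  CREATE SERVER first_db FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'first-db-host', dbname 'first_db_name')")
dbExecute(con_new, "
  CREATE USER MAPPING FOR CURRENT_USER SERVER first_db
  OPTIONS (user 'db_user', password 'db_password')")
dbExecute(con_new, "
  IMPORT FOREIGN SCHEMA public LIMIT TO (all_grids)
  FROM SERVER first_db INTO public")

## 'all_grids' now appears as a foreign table and can be queried lazily
all_grids_fdw <- tbl(con_new, "all_grids")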

For the join itself, it should work with dbplyr between two tables. It's just a little different from what you did. 'area_of_interest' is the local name of the lazy table, so it doesn't exist in the SQL that executes on your database. Fortunately, join in dbplyr automatically assigns aliases (LHS and RHS, for the left- and right-hand sides) to your tables that you can use in SQL, and if you want to use SQL directly you need the sql_on argument:

all_obs |>
  inner_join(
    y = area_of_interest,
    sql_on = "ST_Intersects(RHS.the_geom, LHS.epsg28992_centroid)"
  )
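
To debug, you can inspect the SQL this generates before executing anything:

all_obs |>
  inner_join(
    y = area_of_interest,
    sql_on = "ST_Intersects(RHS.the_geom, LHS.epsg28992_centroid)"
  ) |>
  show_query()

Forcing execution with nrow() or collect() afterwards will surface any remaining errors.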
  • Your answer helped me figure out the solution. For some reason, the suggested code did not work, but it did work without specifying 'RHS' and 'LHS'. The key was the parameter 'sql_on' in 'inner_join'. Commented Feb 14 at 10:42
  • Good for you, but weird. It should work with the aliases, but because you don't have conflicting names you effectively don't need them. Maybe I inverted the two tables? To debug more easily, you can look at the generated SQL request with 'show_query()'. Commented Feb 14 at 13:20
