Tuesday, 20 February 2018

Speeding up spatial analyses by integrating `sf` and `data.table`: a test case

The problem

Last week, I replied to this interesting question posted by @Tim_K over stackoverflow. He was seeking efficient solutions to identify all points falling within a maximum distance of xx meters with respect to each single point in a spatial points dataset.
If you have a look at the thread, you will see that a simple solution based on creating a “buffered” polygon dataset beforehand and then intersecting it with the original points is quite fast for “reasonably sized” datasets, thanks to sf spatial indexing capabilities which reduce the number of the required comparisons to be done (See http://r-spatial.org/r/2017/06/22/spatial-index.html). In practice, something like this:
# create test data: 50000 uniformly distributed points on a "square" of 100000
# metres
maxdist <- 500
pts     <- data.frame(x = runif(50000, 0, 100000),
                      y = runif(50000, 0, 100000),
                      id = 1:50000) %>%
  sf::st_as_sf(coords = c("x", "y"))
# create buffered polygons
pts_buf <- sf::st_buffer(pts, maxdist)
# Find points within 500 meters wrt each point
int <- sf::st_intersects(pts_buf, pts)
int
## Sparse geometry binary predicate list of length 50000, where the predicate was `intersects'
## first 10 elements:
##  1: 1, 11046, 21668, 25417
##  2: 2, 8720, 12595, 23620, 26926, 27169, 39484
##  3: 3, 11782, 20058, 27869, 33151, 47864
##  4: 4, 35665, 45691
##  5: 5, 37093, 37989
##  6: 6, 31487
##  7: 7, 38433, 42009, 45597, 49806
##  8: 8, 12129, 31486
##  9: 9, 27840, 35577, 36797, 40906
##  10: 10, 15482, 16615, 26103, 41417
However, this starts to have problems over really large datasets, because the total number of comparisons to be done still rapidly increase besides the use of spatial indexes. A test done by changing the number of points in the above example in the range 25000 - 475000 shows for example this kind of behavior, for two different values of maxdist (500 and 2000 m):


On the test dataset, the relationships are almost perfectly quadratic (due to the uniform distribution of points). Extrapolating them to the 12 Million points dataset of the OP, we would get an execution time of about 14 hours for maxdist = 500, and a staggering 3.5 days formaxdist = 2000. Still doable, but not ideal…
My suggestion to the OP was therefore to “split” the points in chunks based on the x-coordinate and then work on a per-split basis, eventually assigning each chunk to a different core within a parallellized cycle.
In the end, I got curious and decided to give it a go to see what kind of performance improvement it was possible to obtain with that kind of approach. You can find results of some tests below.

A (possible) solution: Speeding up computation by combining data.table and sf_intersect

The idea here is to use a simple divide-and-conquer approach.
We first split the total spatial extent of the dataset in a certain number of regular quadrants. We then iterate over the quadrants and for each one we:
  1. Extract the points contained into the quadrant and apply a buffer to them;
  2. Extract the points contained in a slightly larger area, computed by expanding the quadrant by an amount equal to the maximum distance for which we want to identify the “neighbors”;
  3. Compute and save the intersection between the buffered points and the points contained in the “expanded” quadrant
“Graphically”, this translates to exploring the dataset like this:

, where the points included in the current "quadrant" are shown in green and the additional points needed to perform the analysis for that quadrant are shown in red.

Provided that the subsetting operations do not introduce an excessive overhead (i.e., they are fast enough…) this should provide a performance boost, because it should consistently reduce the total number of comparisons to be done.
Now, every “R” expert will tell you that if you need to perform fast subsetting over large datasets the way to go is to use properly indexeddata.tables, which provide lightning-speed subsetting capabilities.
So, let’s see how we could code this in a functions:
points_in_distance <- function(in_pts,
                               maxdist,
                               ncuts = 10) {

  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <-  data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]

  # divide the extent in quadrants in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "x" to speed-up
  # the subsetting
  range_x  <- range(pts$x)
  limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut =  as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))]  %>%
    setkey(x)

  results <- list()
  count <- 0
  # start cycling over quadrants
  for (cutx in seq_len(ncuts)) {

    # get the points included in a x-slice extended by `maxdist`, and build
    # an index over y to speed-up subsetting in the inner cycle
    min_x_comp    <- ifelse(cutx == 1,
                            limits_x[cutx],
                            (limits_x[cutx] - maxdist))
    max_x_comp    <- ifelse(cutx == ncuts,
                            limits_x[cutx + 1],
                            (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)

    for (cuty in seq_len(ncuts)) {
      count <- count + 1

      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp  <- ifelse(cuty == 1,
                            limits_y[cuty],
                            (limits_y[cuty] - maxdist))
      max_y_comp  <- ifelse(cuty == ncuts,
                            limits_y[cuty + 1],
                            (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]

      # subset over subpts_comp to get the points included in a x/y chunk,
      # which "neighbours" we want to find. Then buffer them by maxdist.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>% 
        sf::st_buffer(maxdist)

      # retransform to sf since data.tables lost the geometric attrributes
      subpts_comp <- sf::st_as_sf(subpts_comp)

      # compute the intersection and save results in a element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "dist"
      inters <- sf::st_intersects(subpts_buf, subpts_comp)

      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
    }
  }
  data.table::rbindlist(results)
}
The function takes as input a points sf object, a target distance and a number of “cuts” to use to divide the extent in quadrants, and provides in output a data frame in which, for each original point, the “ids” of the points within maxdist are reported in the int_ids list column.
Now, let’s see if this works:
pts <- data.frame(x = runif(20000, 0, 100000),
                  y = runif(20000, 0, 100000),
                  id = 1:20000) %>%
  st_as_sf(coords = c("x", "y"), remove = FALSE)
maxdist <- 2000
out <- points_in_distance(pts, maxdist = maxdist, ncut = 10)
out
##           id                              int_ids
##     1: 15119  2054,18031, 9802, 8524, 4107, 7412,
##     2: 14392 12213,10696, 6399,14392, 4610, 1200,
##     3: 14675  2054,18031, 9802, 8524, 4107, 7412,
##     4:  3089 12293,18031, 8524, 4107,12727, 2726,
##     5: 17282       9802,8524,4107,7412,2726,1275,
##    ---                                           
## 19995:   248 16610, 7643, 8059,15998,16680, 1348,
## 19996:  8433 16821,15638,16680, 3876,13851, 1348,
## 19997: 17770 11060, 7643, 8059,19868, 7776,10146,
## 19998: 11963  9948, 9136,15956,18512, 9219, 8925,
## 19999: 15750  5291,18093,14462,15362,12575, 5189,
# get a random point
sel_id  <- sample(pts$id,1)
pt_sel  <- pts[sel_id, ]
pt_buff <- pt_sel %>%  sf::st_buffer(maxdist)
# get ids of points within maxdist
id_inters <- unlist(out[id == sel_id, ]$int_ids)
pt_inters <- pts[id_inters,]

#plot results
ggplot(pt_buff)  + theme_light() +
  geom_point(data = pts, aes(x = x, y = y), size = 1) +
  geom_sf(col = "blue", size = 1.2, fill = "transparent") +
  geom_sf(data = pt_inters, col = "red", size = 1.5) +
  geom_point(data = pt_sel, aes(x = x, y = y), size = 2, col = "green") +
  xlim(st_bbox(pt_buff)[1] - maxdist, st_bbox(pt_buff)[3] + maxdist) +
  ylim(st_bbox(pt_buff)[2] - maxdist, st_bbox(pt_buff)[4] + maxdist) + 
  ggtitle(paste0("id = ", sel_id, " - Number of points within distance = ", length(id_inters)))

So far, so good. Now, let’s do the same exercise with varying number of points to see how it behaves in term of speed:


Already not bad! In particular for the maxdist = 2000 case, we get a quite large speed improvement!
However, a nice thing about the points_in_distance approach is that it is easily parallelizable. All is needed is to change some lines of the function so that the outer loop over the x “chunks” exploits a parallel backend of some kind. (You can find an example implementation exploiting foreachin this gist)
On a not-particularly-fast PC, using a 6-cores parallelization leads to this:


Looking good! Some more skilled programmer could probably squeeze out even more speed from it by some additional data.table magic, but the improvement is very noticeable.
In terms of execution time, extrapolating again to the “infamous” 12 Million points dataset, this would be what we get:

Method Maxdist Expected completion time (hours)
st_intersect 500 15.00
points_in_distance - serial 500 2.50
points_in_distance - parallel 500 0.57
st_intersect 2000 85.00
points_in_distance - serial 2000 15.20
points_in_distance - parallel 2000 3.18

So, we get a 5-6X speed improvement already on the “serial” implementation, and another 5X thanks to parallelization over 6 cores! On themaxdist = 2000 case, this means going from more than 3 days to about 3 hours. And if we had more cores and RAM to throw at it, it would finish in minutes!

Nice!

Final Notes

  • The timings shown here are merely indicative, and related to the particular test-dataset we built. On a less uniformly distributed dataset I would expect a lower speed improvement.
  • Some time is “wasted” because sf does not (yet) extend data.tables, making it necessary to recreate sf objects from thedata.table subsets.
  • The parallel implementation is quick-and-dirty, and it is a bit of a memory-hog! Be careful before throwing at it 25 processors!
  • Speed is influenced in a non-trivial way by the number of “cuts” used to subdivide the spatial extent. There may be a sweet-spot related to points distribution and maxdist allowing reaching maximum speed.
  • A similar approach for parallelization could exploit repeatedly “cropping” the original sf points object over the extent of the chunk/extended chunk. The data.table approach seems however to be faster.

That's all! Hope you liked this (rather long) post!

2 comments:

  1. Great post, very thorough! I ended up implementing something remarkably similar to the grid+buffer approach but with sf::st_make_grid(). Almost like hand-crafting an explicit spatial index. And thanks for the updated answer

    ReplyDelete