Test the performance of alternative methods for S3 file download in R, for possible GDAL configuration tuning or further investigation of implementations #890

@ctoney

This issue is to move the discussion thread in #883 into its own issue topic. Thanks to @pepijn-devries for reporting the original issue (a bug now fixed in dev), and following up on performance questions with tests for comparing some alternative methods (copied below from that original issue thread).

gdalraster::vsi_copy_file() wraps VSICopyFile() in the GDAL Virtual System Interface API (cpl_vsi.h, GDAL >= 3.7 for VSICopyFile()).

gdalraster::VSIFile is a class interface to a GDAL VSIVirtualHandle which abstracts C binary file I/O for GDAL Virtual File Systems.

The R package paws vendors botocore which provides paws.storage::s3_download_file().

Moved from #883:

I've written a test where I download the file using 3 different approaches:

  1. vsi_copy_file
  2. read and write using VSIFile
  3. using the paws package

I've repeatedly (5 times) downloaded the same file with the three methods. I've recorded the duration of each download and summarised the results. On average, method 3 is about 25% faster than method 1. Method 2 also seems to be a bit faster than method 1.

Anyway, some stuff I have to think about. 25% doesn't seem to be a huge difference, but it can be when you need to download a lot of large files 😉. If you have any suggestions to boost the performance of gdalraster I'm all 👂👂

library(gdalraster)
library(CopernicusDataspace)
library(tidyr)
library(dplyr)

uri <- "s3://eodata/Sentinel-1/SAR/IW_GRDH_1S-COG/2024/11/25/S1A_IW_GRDH_1SDV_20241125T055820_20241125T055845_056707_06F55C_12F9_COG.SAFE/measurement/s1a-iw-grd-vh-20241125t055820-20241125t055845-056707-06f55c-002-cog.tiff"

vsi_f <- gsub("s3://", "/vsis3/", uri)

set_config_option("AWS_REGION", "us-east-1")
set_config_option("AWS_ACCESS_KEY_ID", dse_s3_key())
set_config_option("AWS_SECRET_ACCESS_KEY", dse_s3_secret())
set_config_option("AWS_VIRTUAL_HOSTING", "FALSE")
set_config_option("AWS_S3_ENDPOINT", "eodata.dataspace.copernicus.eu")
target <- file.path(tempdir(), "temp.tiff")

n_repeat <- 5
result <- data.frame(vsi_copy       = rep(NA, n_repeat),
                     vsi_read_write = rep(NA, n_repeat),
                     paws           = rep(NA, n_repeat))

for (i in seq_len(n_repeat)) {
  timing1 <- system.time({
    vsi_copy_file(vsi_f, target, TRUE)
  })
  unlink(target)
  result$vsi_copy[[i]] <- timing1[["elapsed"]]
  
  ## Let's try a 10 MiB buffer
  buffer_size <- 10*1024*1024
  timing2 <- system.time({
    vsi_fl <- new(VSIFile, vsi_f)
    con_out <- file(target, "wb")
    while (!vsi_fl$eof()) {
      buffer <- vsi_fl$read(buffer_size)
      ## guard against a NULL result from read() (e.g. nothing left to read)
      if (!is.null(buffer)) writeBin(buffer, con_out)
    }
    vsi_fl$close()
    close(con_out)
  })
  result$vsi_read_write[[i]] <- timing2[["elapsed"]]
  
  timing3 <- system.time({
    fn <- dse_s3_download(uri, tempdir())
  })
  unlink(file.path(tempdir(), fn))
  result$paws[[i]] <- timing3[["elapsed"]]
  
}

result

## # Numbers are duration in seconds
##
##   vsi_copy vsi_read_write   paws
## 1   245.56         214.27 174.80
## 2   206.56         137.31 136.73
## 3   152.70         176.37 168.51
## 4   317.28         239.13 198.46
## 5   190.39         221.09 159.53

result |>
  pivot_longer(everything()) |>
  group_by(name) |>
  summarise(
    mean_duration = mean(value),
    sd = sd(value),
    n = n()
  )

## # A tibble: 3 × 4
##   name           mean_duration    sd     n
##   <chr>                  <dbl> <dbl> <int>
## 1 paws                    168.  22.5     5
## 2 vsi_copy                222.  62.6     5
## 3 vsi_read_write          198.  40.7     5
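
For reference, the "about 25% faster" figure can be checked from the timings alone. The following is only a sketch in base R: it re-enters the five recorded durations by hand (no download is performed), and `pct_faster` and `wins` are names made up here for illustration. With just five runs per method and standard deviations this large, the percentages should be read as rough indications rather than firm results.

```r
# Timings in seconds, copied from the results table above
result <- data.frame(
  vsi_copy       = c(245.56, 206.56, 152.70, 317.28, 190.39),
  vsi_read_write = c(214.27, 137.31, 176.37, 239.13, 221.09),
  paws           = c(174.80, 136.73, 168.51, 198.46, 159.53)
)

# Mean duration per method
means <- colMeans(result)

# Percent reduction in mean duration relative to vsi_copy_file()
pct_faster <- 100 * (1 - means / means[["vsi_copy"]])
print(round(pct_faster, 1))

# Number of runs (out of 5) in which each method was faster than
# vsi_copy_file() in the same loop iteration
wins <- sapply(result, function(x) sum(x < result$vsi_copy))
print(wins)
```

On these numbers the mean duration drops by roughly 25% for paws and roughly 11% for the VSIFile read/write loop, but vsi_copy_file() was still the fastest method in one of the five iterations, so run-to-run variation (network, server load) is a large part of the picture.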
