Description
This issue is to move the discussion thread in #883 into its own issue topic. Thanks to @pepijn-devries for reporting the original issue (a bug now fixed in dev), and following up on performance questions with tests for comparing some alternative methods (copied below from that original issue thread).
gdalraster::vsi_copy_file() wraps VSICopyFile() in the GDAL Virtual System Interface API (cpl_vsi.h, GDAL >= 3.7 for VSICopyFile()).
gdalraster::VSIFile is a class interface to a GDAL VSIVirtualHandle which abstracts C binary file I/O for GDAL Virtual File Systems.
The R package paws vendors botocore which provides paws.storage::s3_download_file().
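As a rough illustration of the chunked binary read/write pattern that a file-handle interface such as VSIFile allows, here is a minimal sketch using only base R connections on local files. It does not use gdalraster or any remote file system, and `copy_in_chunks()` is a hypothetical helper name introduced for this example only.

```r
# Hypothetical helper: copy a file in fixed-size chunks with base R connections.
# This is a local-file analogy, not gdalraster's VSIFile API.
copy_in_chunks <- function(src, dst, chunk_size = 1024L) {
  con_in <- file(src, "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(dst, "wb")
  on.exit(close(con_out), add = TRUE)

  total <- 0
  repeat {
    # readBin() returns fewer than chunk_size bytes at the end of the file
    # and a zero-length vector once nothing is left to read
    buf <- readBin(con_in, what = "raw", n = chunk_size)
    if (length(buf) == 0L) break
    writeBin(buf, con_out)
    total <- total + length(buf)
  }
  total  # number of bytes copied
}

# Usage: copy 2500 bytes in chunks of 1024 and check the result
src <- tempfile()
dst <- tempfile()
writeBin(as.raw(sample(0:255, 2500, replace = TRUE)), src)
n_copied <- copy_in_chunks(src, dst, chunk_size = 1024L)
identical(readBin(src, "raw", 5000), readBin(dst, "raw", 5000))
```

Stopping on a zero-length read rather than on a separate end-of-file check means the loop never tries to write an empty buffer, whatever the file size is relative to the chunk size.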
Moved from #883:
I've written a test where I downloaded the file using 3 different approaches:
- vsi_copy_file
- read and write using VSIFile
- using the paws package
I repeatedly (5 times) downloaded the same file with the three methods. I recorded the duration of each download and summarised the results. Method 3 (paws) takes about 25% less time on average than method 1 (vsi_copy_file). Method 2 (VSIFile) also seems to be a bit faster than method 1.
Anyway, some stuff I have to think about. 25% doesn't seem to be a huge difference, but it can be when you need to download a lot of large files. If you have any suggestions to boost the performance of gdalraster, I'm all ears.

```r
library(gdalraster)
library(CopernicusDataspace)
library(tidyr)
library(dplyr)

uri <- "s3://eodata/Sentinel-1/SAR/IW_GRDH_1S-COG/2024/11/25/S1A_IW_GRDH_1SDV_20241125T055820_20241125T055845_056707_06F55C_12F9_COG.SAFE/measurement/s1a-iw-grd-vh-20241125t055820-20241125t055845-056707-06f55c-002-cog.tiff"
vsi_f <- gsub("s3://", "/vsis3/", uri)

set_config_option("AWS_REGION", "us-east-1")
set_config_option("AWS_ACCESS_KEY_ID", dse_s3_key())
set_config_option("AWS_SECRET_ACCESS_KEY", dse_s3_secret())
set_config_option("AWS_VIRTUAL_HOSTING", "FALSE")
set_config_option("AWS_S3_ENDPOINT", "eodata.dataspace.copernicus.eu")

target <- file.path(tempdir(), "temp.tiff")
n_repeat <- 5
result <- data.frame(vsi_copy       = rep(NA, n_repeat),
                     vsi_read_write = rep(NA, n_repeat),
                     paws           = rep(NA, n_repeat))

for (i in seq_len(n_repeat)) {

  timing1 <- system.time({
    vsi_copy_file(vsi_f, target, TRUE)
  })
  unlink(target)
  result$vsi_copy[[i]] <- timing1[["elapsed"]]

  ## Let's try a 10 Mb buffer
  buffer_size <- 10*1024*1024
  timing2 <- system.time({
    vsi_fl <- new(VSIFile, vsi_f)
    con_out <- file(target, "wb")
    while (!vsi_fl$eof()) {
      buffer <- vsi_fl$read(buffer_size)
      writeBin(buffer, con_out)
    }
    vsi_fl$close()
    close(con_out)
  })
  result$vsi_read_write[[i]] <- timing2[["elapsed"]]

  timing3 <- system.time({
    fn <- dse_s3_download(uri, tempdir())
  })
  unlink(file.path(tempdir(), fn))
  result$paws[[i]] <- timing3[["elapsed"]]
}

result
## # Numbers are duration in seconds
##
##   vsi_copy vsi_read_write   paws
## 1   245.56         214.27 174.80
## 2   206.56         137.31 136.73
## 3   152.70         176.37 168.51
## 4   317.28         239.13 198.46
## 5   190.39         221.09 159.53

result |>
  pivot_longer(everything()) |>
  group_by(name) |>
  summarise(
    mean_duration = mean(value),
    sd = sd(value),
    n = n()
  )
## # A tibble: 3 × 4
##   name           mean_duration    sd     n
##   <chr>                  <dbl> <dbl> <int>
## 1 paws                    168.  22.5     5
## 2 vsi_copy                222.  62.6     5
## 3 vsi_read_write          198.  40.7     5
```
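For reference, the relative differences quoted above can be recomputed from the reported timings with base R alone. This is only a sketch of the arithmetic: the numbers are copied from the results table, and the names `timings`, `mean_duration` and `pct_less_time` are introduced here for illustration.

```r
# Elapsed times in seconds, copied from the results table above
timings <- data.frame(
  vsi_copy       = c(245.56, 206.56, 152.70, 317.28, 190.39),
  vsi_read_write = c(214.27, 137.31, 176.37, 239.13, 221.09),
  paws           = c(174.80, 136.73, 168.51, 198.46, 159.53)
)

# Mean duration per method
mean_duration <- colMeans(timings)

# Percentage less time than vsi_copy, based on the means
pct_less_time <- 100 * (1 - mean_duration / mean_duration[["vsi_copy"]])
round(pct_less_time, 1)
# roughly 24.7 for paws and 11.2 for vsi_read_write
```

With only five runs per method and standard deviations of 20 to 60 seconds, these percentages are indicative rather than conclusive.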