
Weak Documentation Regarding Treatment of Strings and Character Vectors #61

@drag05


There is relatively little documentation on character vectors and strings. I understand the focus on numbers; however, I think it would be beneficial to have some text documenting the differences (however small they may be) between how numbers and characters are treated.

My project focuses on splitting a long string (i.e. nchar > 0 and length = 1) into all possible substrings of various lengths, under the constraint that each substring is still part of the initial string (i.e. can still be located inside it).
For this, I wrote two variants of a custom function for argument FUN inside comboIter() that look like this:

  1. The slow variant
sift = local({
    z = character()
    function(x, chain, ...) {
        y = paste0(x, collapse = '')
        if (isFALSE(y %in% z)) {
            z <<- c(y, z)
            if (isTRUE(grep(y, chain, ...) > 0L)) return(y)
        }
    }
})

Here, y represents each row of the nextNIter() matrix after x has been collapsed with paste0().
z is a container kept by sift that stores each value of y only once. This way the function avoids applying the last condition (the one involving grep) to duplicates, of which there are many in the case of protein chains, for example.

The last (i.e. the grep) condition checks if y is still part of the initial long string. I have tried all variants of match and fmatch (instead of %in%) with no real gain in speed.
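For illustration, the substring check on its own can be sketched in base R as below. This is a minimal sketch, not the code used in the benchmarks: `is_part_of` is a hypothetical helper name, the chain is a made-up example, and `fixed = TRUE` is an assumption about what might be passed through `...` (it makes the pattern a literal string rather than a regular expression).

```r
# Hypothetical helper: TRUE if the collapsed combination occurs in the chain.
is_part_of <- function(x, chain) {
    y <- paste0(x, collapse = "")
    # fixed = TRUE (an assumption) matches y literally, with no regex interpretation
    grepl(y, chain, fixed = TRUE)
}

chain <- "MKVLAAGIVG"                        # made-up example string
hit  <- is_part_of(c("V", "L", "A"), chain)  # "VLA" occurs contiguously in chain
miss <- is_part_of(c("M", "V", "A"), chain)  # "MVA" does not occur in chain
```

Using grepl() rather than grep() returns a logical directly, so the isTRUE(... > 0L) wrapper is not needed in this form.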

  2. The faster variant
sift = function(x, chain, ...) {
    y = paste0(x, collapse = '')
    if (isTRUE(grep(y, chain, ...) > 0L)) return(y)
}

In benchmarks, this proved to be 4-5 times faster than variant 1. It seems that filling and reading container z takes more time than running grep on all combinations of a given nextNIter() matrix, including the duplicates and the ones that are not part of the initial string.
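One possible contributor, which is an assumption and has not been measured here, is that `z <<- c(y, z)` copies the whole vector on every append and `%in%` has to search a vector that keeps growing. A sketch of the same de-duplication idea using an environment as a hashed container is shown below. `sift_env` and `seen` are hypothetical names, the chain is a made-up example, and whether this is actually faster than the plain grep variant is untested.

```r
# Sketch: same logic as the slow variant, but the "seen" container is an
# environment (hashed lookup by name) instead of a growing character vector.
sift_env <- local({
    seen <- new.env(hash = TRUE)
    function(x, chain, ...) {
        y <- paste0(x, collapse = "")   # assumes y is a non-empty string
        if (!exists(y, envir = seen, inherits = FALSE)) {
            assign(y, TRUE, envir = seen)   # remember y so repeats are skipped
            if (isTRUE(grep(y, chain, ...) > 0L)) return(y)
        }
    }
})

chain  <- "MKVLAAGIVG"                                   # made-up example string
first  <- sift_env(c("V", "L", "A"), chain, fixed = TRUE)  # first time "VLA" is seen
second <- sift_env(c("V", "L", "A"), chain, fixed = TRUE)  # duplicate, skipped
```

As in the slow variant, a duplicate or a non-matching combination yields NULL, so the caller still has to drop the NULL entries afterwards.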

Memoising the second variant of sift did not help either, although about 60% of the calls were cache hits.

I wonder if there is a chance to have a clear provision in documentation regarding the treatment of strings and character vectors.
Thanks!
