Optimize Fw::StringUtils::string_length with SWAR algorithm #4789

Open
vsoulgard wants to merge 2 commits into nasa:devel from vsoulgard:fw-types-string-length-swar

Related Issue(s): #4788
Has Unit Tests (y/n): y
Documentation Included (y/n): n
Generative AI was used in this contribution (y/n): y

Change Description

This PR implements the performance optimization proposed in #4788. It replaces the naive byte-by-byte string length calculation with a SWAR (SIMD Within A Register) approach that reads the string one machine word at a time and uses bitwise operations to detect a NUL byte anywhere in that word.
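For reviewers unfamiliar with the technique, the word-level NUL check can be sketched as below. This is the well-known "has zero byte" bit trick, shown for a 64-bit word; the helper names `has_zero_byte` and `has_zero_byte_examples` are hypothetical, and the exact expression used in this PR may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from the PR): returns true if any of the
// eight bytes in `word` is 0x00.
constexpr bool has_zero_byte(std::uint64_t word) {
    // 0x01 in every byte, and 0x80 in every byte.
    // Subtracting 0x01 from a 0x00 byte borrows and sets that byte's high bit.
    // `& ~word` drops bytes whose high bit was already set in the input.
    // `& 0x80...` keeps only the per-byte high bits as the "zero found" flags.
    return ((word - 0x0101010101010101ULL) & ~word & 0x8080808080808080ULL) != 0;
}

// A few spot checks of the helper.
inline void has_zero_byte_examples() {
    assert(!has_zero_byte(0x4141414141414141ULL));  // "AAAAAAAA": no NUL
    assert(has_zero_byte(0x4141410041414141ULL));   // one NUL in the middle
    assert(!has_zero_byte(0xFFFFFFFFFFFFFFFFULL));  // high bits set, no NUL
}
```

One comparison per word replaces eight per-byte comparisons, which is where the reduction in loop iterations comes from.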

Details:

  • Processes the initial unaligned head byte-by-byte to ensure word-alignment before the main loop.
  • Uses standard bitwise operations to test all bytes of a word for NUL in parallel within a register.
  • For short strings (word-sized or smaller) a simple byte-by-byte loop is used to avoid overhead.
  • Added a NO_ASAN macro to suppress AddressSanitizer false positives in trusted functions.
  • Added new edge-case tests and an alignment assertion.
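The steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `swar_string_length` and `SKETCH_NO_ASAN` are hypothetical names, the macro body is a guess at what a NO_ASAN-style macro could expand to on GCC and Clang, and the word type is assumed to be `std::uintptr_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the PR's NO_ASAN macro (its definition is not shown here).
#if defined(__clang__) || defined(__GNUC__)
#define SKETCH_NO_ASAN __attribute__((no_sanitize_address))
#else
#define SKETCH_NO_ASAN
#endif

// Bounded string length: returns the index of the first NUL, or buffer_size if none.
// This sketch only loads words that lie fully inside the buffer, so the attribute
// is illustrative here.
SKETCH_NO_ASAN
inline std::size_t swar_string_length(const char* source, std::size_t buffer_size) {
    using Word = std::uintptr_t;
    constexpr std::size_t kWord = sizeof(Word);
    constexpr Word kOnes = static_cast<Word>(-1) / 0xFF;  // 0x01 in every byte
    constexpr Word kHighs = kOnes << 7;                   // 0x80 in every byte

    assert(source != nullptr);
    std::size_t i = 0;

    if (buffer_size > kWord) {  // short buffers skip straight to the byte loop
        // Head: byte-by-byte until source + i is word-aligned.
        while (reinterpret_cast<std::uintptr_t>(source + i) % kWord != 0) {
            if (source[i] == '\0') {
                return i;
            }
            ++i;
        }
        assert(reinterpret_cast<std::uintptr_t>(source + i) % kWord == 0);

        // Main loop: one aligned word per iteration; stop at a word containing NUL.
        while (buffer_size - i >= kWord) {
            Word w;
            std::memcpy(&w, source + i, kWord);  // avoids aliasing issues
            if (((w - kOnes) & ~w & kHighs) != 0) {
                break;
            }
            i += kWord;
        }
    }

    // Tail (and short strings): locate the exact NUL byte-by-byte.
    while (i < buffer_size && source[i] != '\0') {
        ++i;
    }
    return i;
}
```

The `std::memcpy` load is typically compiled to a single word load on aligned addresses, so the sketch stays within defined behavior without costing the main loop anything.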

Rationale

SWAR reduces the number of loop iterations and memory load instructions, thus improving performance. This technique is widely used in high-performance C standard libraries; the portable strlen implementations in glibc and musl, for example, scan a word at a time.

Benchmarks:

(Tested on Linux x86_64 with GCC 14.2.0)

Benchmark Results (aligned)

[Plots: benchmark32, benchmark1024]

The new version is slightly slower for very short strings (N ≤ 4), but much faster for long ones. Times are old vs. new, in ns:
N=2 - 1.28 vs. 2.01 (ns)
N=4 - 1.76 vs. 2.27 (ns)
N=8 - 3.06 vs. 3.01 (ns)
N=16 - 6.05 vs. 3.43 (ns)
N=32 - 19.30 vs. 4.31 (ns)
N=64 - 27.97 vs. 6.37 (ns)
N=128 - 53.45 vs. 10.36 (ns) (x5 faster)
N=256 - 100.27 vs. 22.29 (ns)
N=512 - 191.22 vs. 42.47 (ns)
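
The speedup at each size follows directly from the figures above as old time divided by new time. A small sketch of that arithmetic, using the aligned results as listed (the `Sample` type and function names are hypothetical, introduced only for this illustration):

```cpp
#include <cassert>
#include <cstdio>

// One row of the aligned results above: string length and both timings in ns.
struct Sample {
    int n;
    double old_ns;
    double new_ns;
};

constexpr Sample kAligned[] = {
    {2, 1.28, 2.01},     {4, 1.76, 2.27},     {8, 3.06, 3.01},
    {16, 6.05, 3.43},    {32, 19.30, 4.31},   {64, 27.97, 6.37},
    {128, 53.45, 10.36}, {256, 100.27, 22.29}, {512, 191.22, 42.47},
};

// Speedup > 1 means the new version is faster; < 1 means it is slower.
constexpr double speedup(const Sample& s) {
    return s.old_ns / s.new_ns;
}

inline void print_speedups() {
    for (const Sample& s : kAligned) {
        std::printf("N=%d: %.2fx\n", s.n, speedup(s));
    }
}
```

For N=128 this gives 53.45 / 10.36, a little over 5x, matching the annotation above; for N=2 the ratio is below 1, which is the short-string slowdown.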

Benchmark Results (unaligned)

[Plots: benchmark32_unaligned, benchmark1024_unaligned]

Unaligned inputs are slower than aligned ones because of the byte-by-byte head loop, but the new version is still faster than the old one for N ≥ 8. Times are old vs. new, in ns:
N=2 - 1.33 vs. 1.91 (ns)
N=4 - 2.03 vs. 2.26 (ns)
N=8 - 3.55 vs. 3.02 (ns)
N=16 - 6.65 vs. 5.18 (ns)
N=32 - 17.12 vs. 6.10 (ns)
N=64 - 30.34 vs. 8.44 (ns)
N=128 - 56.18 vs. 12.60 (ns)
N=256 - 103.61 vs. 24.40 (ns)
N=512 - 199.45 vs. 44.11 (ns)
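
The extra cost for unaligned inputs comes from the head loop, whose length depends only on how far the start address is from the next word boundary. A minimal sketch of that calculation follows; `head_bytes` is a hypothetical helper, and how the unaligned benchmarks offset their input is not shown in this PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes a head loop scans before `p` reaches a multiple of `word_size`.
// Assumes word_size is non-zero (typically 4 or 8).
inline std::size_t head_bytes(const void* p,
                              std::size_t word_size = sizeof(std::uintptr_t)) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return static_cast<std::size_t>((word_size - (addr % word_size)) % word_size);
}
```

With 8-byte words, a pointer one byte past a word boundary needs 7 head iterations before the word loop can start, while an aligned pointer needs none.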

Testing/Review Recommendations

Benchmark Source
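#include <string>

#include <benchmark/benchmark.h>

#include "Fw/Types/StringUtils.hpp"

// Baseline: the existing byte-by-byte implementation.
static void BM_string_length(benchmark::State& state)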
{
    const size_t LEN = state.range(0);
    std::string src_str = std::string(LEN, 'A');
    const char* src = src_str.c_str();

    for (auto _ : state)
    {
        auto result = Fw::StringUtils::string_length(src, LEN);
        benchmark::DoNotOptimize(result);
    }
}

// Candidate: the SWAR implementation, exposed here as string_lengthV2.
static void BM_string_lengthV2(benchmark::State& state)
{
    const size_t LEN = state.range(0);
    std::string src_str = std::string(LEN, 'A');
    const char* src = src_str.c_str();

    for (auto _ : state)
    {
        auto result = Fw::StringUtils::string_lengthV2(src, LEN);
        benchmark::DoNotOptimize(result);
    }
}

BENCHMARK(BM_string_length)->RangeMultiplier(2)->Range(0, 1024);
BENCHMARK(BM_string_lengthV2)->RangeMultiplier(2)->Range(0, 1024);

BENCHMARK_MAIN();

AI Usage

AI was used for research and testing of ideas. All code implementation, algorithm design, benchmarking, and technical decisions were done by me.
