Skip to content

Fix: rate limiter persistence#552

Open
akrainiouk wants to merge 24 commits intoysugimoto:mainfrom
akrainiouk:rate-limiter-fixes
Open

Fix: rate limiter persistence#552
akrainiouk wants to merge 24 commits intoysugimoto:mainfrom
akrainiouk:rate-limiter-fixes

Conversation

@akrainiouk
Copy link
Collaborator

@akrainiouk akrainiouk commented Feb 4, 2026

This PR attempts to fix 2 simulator issues::

  1. currently both ratecounter and penaltybox variables are recreated from scratch for every HTTP request which essentially defeats their purpose.
  2. data collection in rate counter implementation is snapped to the nearest second which is unnecessary and make rate counter behavior hard to predict in testing.

This PR addresses the first issue by copying ratecounter and penaltybox maps from the old context to the newly created one as the context gets re-initialized for the next request. The downside of this approach is that ratecounter and penaltybox instances are never cleaned up even if updated VCL no longer declares them.

@akrainiouk akrainiouk changed the title Rate limiter fixes Fix rate limiter persistence Feb 4, 2026
@akrainiouk akrainiouk marked this pull request as ready for review February 4, 2026 18:31
@akrainiouk akrainiouk requested a review from jedisct1 February 4, 2026 18:31
@akrainiouk akrainiouk changed the title Fix rate limiter persistence Fix: rate limiter persistence Feb 4, 2026
@jedisct1
Copy link
Collaborator

jedisct1 commented Feb 4, 2026

I think snapping should also be removed from the bucket.

The documentation says windows slide by 1 second at whole-second boundaries, not 10-second boundaries.

@jedisct1
Copy link
Collaborator

jedisct1 commented Feb 4, 2026

What about something like this:

  func calculateBucketWithTime(now int64, entries []rateEntry, window
  time.Duration) int64 {
        // Window slides by 1 second at each whole-second boundary.
        // At time T, the window covers [T - window + 1, T] inclusive.
        // See: https://developer.fastly.com/reference/vcl/declarations/ratecounter
        windowSec := int64(window.Seconds())
        from := now - windowSec + 1

        var bucket int64
        for _, entry := range entries {
                if entry.Timestamp >= from && entry.Timestamp <= now {
                        bucket += entry.Count
                }
        }
        return bucket
  }

  func calculateRateWithTime(now int64, entries []rateEntry, window time.Duration)
  float64 {
        bucket := calculateBucketWithTime(now, entries, window)
        if bucket == 0 {
                return 0
        }
        return math.Floor(float64(bucket) / window.Seconds())
  }

(and adjust the bucket tests accordingly)

@akrainiouk
Copy link
Collaborator Author

What about something like this:

  func calculateBucketWithTime(now int64, entries []rateEntry, window
  time.Duration) int64 {
        // Window slides by 1 second at each whole-second boundary.
        // At time T, the window covers [T - window + 1, T] inclusive.
        // See: https://developer.fastly.com/reference/vcl/declarations/ratecounter
        windowSec := int64(window.Seconds())
        from := now - windowSec + 1

        var bucket int64
        for _, entry := range entries {
                if entry.Timestamp >= from && entry.Timestamp <= now {
                        bucket += entry.Count
                }
        }
        return bucket
  }

  func calculateRateWithTime(now int64, entries []rateEntry, window time.Duration)
  float64 {
        bucket := calculateBucketWithTime(now, entries, window)
        if bucket == 0 {
                return 0
        }
        return math.Floor(float64(bucket) / window.Seconds())
  }

(and adjust the bucket tests accordingly)

@jedisct1 Implemented your suggestion and fixed the tests.
I have some doubts about practicality of snapping to the whole second however.
The problem is that in testing scenario this snapping introduces an issue similar to #547
Also from the actual behavior of ratecounter.increment function it looks like the counting windows are snapped to the whole 10 second intervals rather than 1 sec as it is documented.

@jedisct1
Copy link
Collaborator

Oh, indeed, the Fastly documentation doesn't seem to be correct.

Here's something that should be closer:

func calculateBucketWithTime(now int64, entries []rateEntry, window time.Duration) int64 {
        currentEpoch := now / 10
        numSlots := int64(window.Seconds()) / 10

        var bucket int64
        for _, entry := range entries {
                entryEpoch := entry.Timestamp / 10
                if entryEpoch > currentEpoch-numSlots && entryEpoch <= currentEpoch {
                        bucket += entry.Count
                }
        }
        return bucket
}

func calculateRateWithTime(now int64, entries []rateEntry, window time.Duration) float64 {
        windowSec := int64(window.Seconds())
        cutoff := now - windowSec

        var total int64
        for _, entry := range entries {
                if entry.Timestamp > cutoff {
                        total += entry.Count
                }
        }
        if total == 0 {
                return 0
        }
        return math.Floor(float64(total) / float64(windowSec))
}

And the Increment fix:

func (r *Ratecounter) Increment(entry string, delta int64) {
        if _, ok := r.Clients[entry]; !ok {
                r.Clients[entry] = []rateEntry{}
        }
        r.Clients[entry] = append(r.Clients[entry], rateEntry{
                Count:     delta,
                Timestamp: time.Now().Unix(),
        })
        r.IsAccessible = true
}

@akrainiouk
Copy link
Collaborator Author

@jedisct1 sorry for the delay, I am still testing you code behavior.
ratecounter_increment works very closely to Fastly (within the difference incurred by clustering)
ratelimit.check_rate is different.

I will report the details when I finish. Unfortunately other priorities are taking over.

// Increment() increments access entry manually.
// This function should be called via ratelimit.ratecounter_increment() VCL function
func (r *Ratecounter) Increment(entry string, delta int64, window time.Duration) {
func (r *Ratecounter) Increment(entry string, delta int64) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having window value passed to this function doesn't make sense. The effect is that instead of incrementing the latest entry, the code increments older entry which immediately goes away.
I really have hard time to see where window is beneficial

… most recently updated group rather than hardcoded client.ip
@akrainiouk
Copy link
Collaborator Author

@jedisct1 After some testing I see that 10 sec snapping is only applicable to bucket variables (ratecounter.{NAME}.bucketXXs) for these the underlying model is pretty clear and easy to simulate.

There is no evidence of any snapping in rate computation. Current implementation of the rate computation simply
uses sliding window of appropriate size with no snapping other than due to 1ms time precision.
Comparing to this model, Fastly rate values show some time lag the wider the window the more the delay in reaction to changes.
On the other hand what is implemented is easy to understand and predict which simplifies using it for testing.

// Compare the rate and limit
rate := rc.Rate(entry, time.Duration(window)*time.Second)
return rate >= float64(limit), nil
return rate > float64(limit), nil
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fastly behavior is to trigger rate limit when calculated rate exceed specified limit. The rate just reaching the limit is OK

type rateEntry struct {
Count int64
Timestamp int64 // unix time second
Timestamp int64 // unix time millisecond
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After testing with Fastly I found no evidence that their rate counting for check_rate is snapped to the whole second. Picked 1ms precision because it seems sufficient for all practical purposes (1000 times than minimal rate computation window).

IsAccessible bool
// Last incremented entry (used to estimate values ratecounter.{NAME} rate and bucket variables)
// Also LastIncremented != nil serves as IsAccessible flag
LastIncremented *string
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This serves dual purpose:

  1. Keeps track of last updated client key
  2. Replaces IsAccessible flag

@akrainiouk
Copy link
Collaborator Author

akrainiouk commented Feb 24, 2026

Comparison results between falco (left side) and fastly (right side)
The counter was primed with delta=0, then a single delta=100000 was used and afterwards
it was followed with delta=0

Columns:

Test environment:

  • cnt-test - request count on the test side
  • t-test - elapsed time on test side
  • delta - increment value used in check_rate call

Falco simulator:

  • t-falco - absolute time modulus 60sec on falco server
  • cnt-falco - request count as it is seen by falco service
  • 10s-60s - estimated rates

Fastly service:

  • t-fastly - absolute time modulus 60sec on fastly server
  • cnt-fastly - request count as it is seen by fastly service
  • 10s-60s - estimated rates on fastly server

Test results:

cnt-test t-test t-falco delta cnt-falco 1s 10s 60s t-fastly cnt-fastly 1s 10s 60s
1 0.797 14.023 0 1 0 0 0 14.139 1 0 0 0
2 0.939 14.144 0 2 0 0 0 14.213 1 0 0 0
3 1.012 14.220 0 3 0 0 0 14.294 2 0 0 0
...
238 19.833 33.039 0 238 0 0 0 33.116 237 0 0 0
239 19.914 33.120 0 239 0 0 0 33.197 237 0 0 0
240 19.995 33.201 0 240 0 0 0 33.279 239 0 0 0
241 20.152 33.282 100000 241 100000 10000 1666 33.354 240 93132 5820 1455
242 20.152 33.358 0 242 100000 10000 1666 33.433 241 0 0 0
243 20.231 33.437 0 243 100000 10000 1666 33.513 242 93132 5820 1455
244 20.310 33.516 0 244 100000 10000 1666 33.592 243 93132 5820 1455
245 20.389 33.595 0 245 100000 10000 1666 33.666 244 93132 5820 1455
246 20.464 33.671 0 246 100000 10000 1666 33.742 244 93132 5820 1455
247 20.540 33.745 0 247 100000 10000 1666 33.821 246 93132 5820 1455
248 20.619 33.825 0 248 100000 10000 1666 33.899 246 93132 5820 1455
249 20.698 33.905 0 249 100000 10000 1666 33.982 247 93132 5820 1455
250 20.779 33.985 0 250 100000 10000 1666 34.059 248 93132 5820 1455
251 20.857 34.062 0 251 100000 10000 1666 34.137 250 93132 5820 1455
252 20.935 34.141 0 252 100000 10000 1666 34.214 251 93132 5820 1455
253 21.011 34.216 0 253 100000 10000 1666 34.291 251 93132 5820 1455
254 21.088 34.295 0 254 0 10000 1666 34.370 253 93132 5820 1455
255 21.167 34.373 0 255 0 10000 1666 34.449 254 93132 5820 1455
256 21.247 34.454 0 256 0 10000 1666 34.522 254 93132 5820 1455
257 21.321 34.527 0 257 0 10000 1666 34.599 255 0 5820 1455
258 21.397 34.603 0 258 0 10000 1666 34.683 256 0 5820 1455
259 21.482 34.687 0 259 0 10000 1666 34.761 258 0 5820 1455
...
366 29.858 43.064 0 366 0 10000 1666 43.137 365 0 5820 1455
367 29.935 43.140 0 367 0 10000 1666 43.217 365 0 5820 1455
368 30.015 43.221 0 368. 0 10000 1666 43.301 367 0 5820 1455
369 30.099 43.305 0 369 0 0 1666 43.377 368 0 5820 1455
370 30.174 43.380 0 370 0 0 1666 43.458 369 0 5820 1455
371 30.255 43.461 0 371 0 0 1666 43.543 369 0 5820 1455
...
462 37.493 50.699 0 462 0 0 1666 50.775 460 0 5820 1455
463 37.572 50.778 0 463 0 0 1666 50.856 462 0 5820 1455
464 37.654 50.860 0 464 0 0 1666 50.934 462 0 5820 1455
465 37.732 50.938 0 465 0 0 1666 51.013 464 0 0 1455
466 37.811 51.016 0 466 0 0 1666 51.088 465 0 0 1455
467 37.886 51.092 0 467 0 0 1666 51.164 465 0 0 1455

Corresponding VCL code deployed on both Fastly and Falco

ratecounter rcInc {}
ratecounter rc {}
penaltybox pb {}

sub vcl_recv {
  declare local var.delta INTEGER;
  set var.delta = std.atoi(req.http.x-delta);
  declare local var.window INTEGER;
  set var.window = std.atoi(req.http.x-window);
  if (var.window == 0) {
    set var.window = 1;
  }
  declare local var.limit INTEGER;
  set var.limit = std.atoi(req.http.x-limit);
  if (var.limit == 0) {
    set var.limit = 10;
  }
  set req.http.x-time = time.start.msec;
  set req.http.x-req-count = ratelimit.ratecounter_increment(rcInc, "group", 1);
  set req.http.x-outcome = ratelimit.check_rate("group", rc, var.delta, var.window, var.limit, pb, 1m);
  set req.http.x-penalized = ratelimit.penaltybox_has(pb, "group");

  set req.http.x-bucket10s = ratecounter.rc.bucket.10s;
  set req.http.x-bucket20s = ratecounter.rc.bucket.20s;
  set req.http.x-bucket30s = ratecounter.rc.bucket.30s;
  set req.http.x-bucket40s = ratecounter.rc.bucket.40s;
  set req.http.x-bucket50s = ratecounter.rc.bucket.50s;
  set req.http.x-bucket60s = ratecounter.rc.bucket.60s;

  set req.http.x-rate1s = ratecounter.rc.rate.1s;
  set req.http.x-rate10s = ratecounter.rc.rate.10s;
  set req.http.x-rate60s = ratecounter.rc.rate.60s;
  return (error);
}
sub vcl_error {
  set obj.status = 200;
  set obj.http.x-time        = req.http.x-time;
  set obj.http.x-req-count   = req.http.x-req-count;
  set obj.http.x-outcome     = req.http.x-outcome;
  set obj.http.x-penalized   = req.http.x-penalized;
  set obj.http.x-vcl-version = req.vcl.version;

  set obj.http.x-bucket10s = req.http.x-bucket10s;
  set obj.http.x-bucket20s = req.http.x-bucket20s;
  set obj.http.x-bucket30s = req.http.x-bucket30s;
  set obj.http.x-bucket40s = req.http.x-bucket40s;
  set obj.http.x-bucket50s = req.http.x-bucket50s;
  set obj.http.x-bucket60s = req.http.x-bucket60s;

  set obj.http.x-rate1s  = req.http.x-rate1s;
  set obj.http.x-rate10s = req.http.x-rate10s;
  set obj.http.x-rate60s = req.http.x-rate60s;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants