[Opt](cloud) Add rate limit for BE to MS rpc #60344

bobhan1 wants to merge 3 commits into apache:master
Conversation
- fix
- [improvement](be) Hook dynamic MS throttle configs to update callbacks
  - Issue Number: None
  - Related PR: None
  - Problem Summary: Newly added BE configs for per-RPC MS QPS limits and MS backpressure throttle upgrade/downgrade only changed config values at runtime, but did not propagate those changes into the in-memory rate limiter and backpressure handler state. This commit registers DEFINE_ON_UPDATE callbacks for those configs and refreshes the corresponding runtime objects only when the new value differs from the old value.
  - Release note: None
  - Test: No need to test (code change committed without rerunning build in this step)
  - Behavior changed: Yes (runtime config updates now take effect on the corresponding in-memory MS throttling state)
  - Does this need documentation: No
- update
- fix sync rowset retry, and fix that the MSBackpressureHandler state transition is not atomic
- fix wrong substitution
- [fix](be) Log actual throttle ticks on transition
  - Issue Number: None
  - Related PR: None
  - Problem Summary: Capture the actual elapsed tick counters before resetting them, so the ms-throttle upgrade and downgrade logs report real values instead of reset counters.
  - Release note: None
  - Test: No need to test (log-only change; attempted targeted BE UT but sandbox blocked submodule update)
  - Behavior changed: Yes (INFO logs now print the actual elapsed ticks for upgrade and downgrade triggers)
  - Does this need documentation: No
- [fix](be) Disable MS backpressure handling by default
  - Issue Number: None
  - Related PR: None
  - Problem Summary: Change the default value of `enable_ms_backpressure_handling` to false so MS backpressure response handling is opt-in instead of enabled by default.
  - Release note: MS backpressure handling is now disabled by default.
  - Test: No need to test (single default-config change only)
  - Behavior changed: Yes (`enable_ms_backpressure_handling` defaults to false)
  - Does this need documentation: No
- format change
- change `enable_ms_rpc_host_level_rate_limit` default to false
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: ExecEnv forward-declared doris::cloud MS RPC limiter types, which exposed doris::cloud through common include paths and made older headers resolve global cloud protobuf types incorrectly.
### Release note
None
### Check List (For Author)
- Test: Manual test
  - `./build.sh --be -j100`
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Problem Summary:
This PR implements a two-layer MS (Meta Service) RPC rate limiting system for the Doris cloud-mode BE:

1. Host-level rate limiting: each BE applies a per-RPC-type QPS limit to the MS RPCs it sends (token bucket).
2. Table-level adaptive backpressure: when MS returns the `MS_BUSY` error code, BE dynamically identifies and throttles the top-k highest-QPS tables using a state machine, and automatically relaxes limits after the pressure subsides.

#### Part 1: BE Host-Level Rate Limiting
**Problem**
In cloud mode, all BE nodes send RPCs (get_tablet, prepare_rowset, commit_rowset, etc.) to a shared Meta Service. A single BE experiencing load spikes (e.g., large batch imports, compaction storms) can send excessive RPC traffic that overwhelms MS, degrading service for all BEs.
**Solution**

Introduce `HostLevelMSRpcRateLimiters`: a per-BE, per-RPC-type rate limiter built on the token bucket algorithm.

Architecture:
- `MetaServiceRPC` enum identifying each MS RPC type (defined via X-macro for maintainability)
- a `TokenBucketRateLimiterHolder` per RPC type, each with its own QPS limit (see the sketch below)
- `actual_qps = config_value × num_cores`
- `atomic_shared_ptr<RpcRateLimiter>` array for lock-free concurrent access during `limit()` calls
- `bvar::LatencyRecorder` to monitor sleep durations caused by rate limiting
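For illustration, a minimal self-contained sketch of the token-bucket-per-RPC idea. This is not the code added by this PR: `SimpleTokenBucket`, `HostLimitersSketch`, and the enum values are placeholder names, and a plain mutex plus `std::atomic_load` stand in for the real `TokenBucketRateLimiterHolder` and atomic limiter array.

```cpp
// Illustrative sketch only: a tiny token-bucket limiter keyed by an RPC enum.
// The PR's classes add bvar metrics, config reloading, and core-count scaling.
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <thread>

enum class MetaServiceRPC { GET_TABLET, PREPARE_ROWSET, COMMIT_ROWSET, COUNT };

class SimpleTokenBucket {
public:
    explicit SimpleTokenBucket(double qps) : _qps(qps), _tokens(qps), _last(clock::now()) {}

    // Consumes one token; returns how many microseconds the caller should
    // sleep before issuing the RPC (0 if a token was immediately available).
    int64_t acquire() {
        std::lock_guard<std::mutex> lock(_mu);
        auto now = clock::now();
        double elapsed_s = std::chrono::duration<double>(now - _last).count();
        _last = now;
        _tokens += elapsed_s * _qps;
        if (_tokens > _qps) _tokens = _qps;  // burst capacity: one second's worth of tokens
        _tokens -= 1.0;
        if (_tokens >= 0.0) return 0;
        return static_cast<int64_t>(-_tokens / _qps * 1e6);  // time until the debt is repaid
    }

private:
    using clock = std::chrono::steady_clock;
    std::mutex _mu;
    double _qps;
    double _tokens;
    clock::time_point _last;
};

// Per-RPC-type limiters. Readers only load a shared_ptr, so reconfiguration can
// swap in a new limiter without blocking concurrent callers (the PR uses an
// atomic shared_ptr array for the same reason).
class HostLimitersSketch {
public:
    HostLimitersSketch() {
        for (auto& slot : _limiters) slot = std::make_shared<SimpleTokenBucket>(100.0);
    }
    std::shared_ptr<SimpleTokenBucket> get(MetaServiceRPC rpc) const {
        return std::atomic_load(&_limiters[static_cast<std::size_t>(rpc)]);
    }
    void reset(MetaServiceRPC rpc, double qps) {
        std::atomic_store(&_limiters[static_cast<std::size_t>(rpc)],
                          std::make_shared<SimpleTokenBucket>(qps));
    }

private:
    std::array<std::shared_ptr<SimpleTokenBucket>,
               static_cast<std::size_t>(MetaServiceRPC::COUNT)> _limiters;
};

int main() {
    HostLimitersSketch limiters;
    if (int64_t wait_us = limiters.get(MetaServiceRPC::COMMIT_ROWSET)->acquire(); wait_us > 0) {
        // Doris would use bthread_usleep here; a plain sleep stands in for it.
        std::this_thread::sleep_for(std::chrono::microseconds(wait_us));
    }
}
```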
Configuration:

| Config | Default | Notes |
| --- | --- | --- |
| `enable_ms_rpc_host_level_rate_limit` | `true` | Switch for host-level MS RPC rate limiting |
| `ms_rpc_qps_default` | `100` | Default per-RPC-type QPS limit |
| `ms_rpc_qps_<rpc_name>` | `-1` | Per-RPC override (`-1` = use default, `0` = disabled) |

All QPS configs are mutable (`DEFINE_mInt32`), allowing runtime adjustment without restart. `reset_all()` re-reads configs and recreates the rate limiters.
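To make the `-1` / `0` semantics concrete, a small sketch (a hypothetical helper, not a function from the PR):

```cpp
// Illustrative only: how the -1 / 0 semantics of ms_rpc_qps_<rpc_name> could
// resolve to an effective per-RPC limit.
#include <cstdint>
#include <cstdio>

struct ResolvedQps {
    bool enabled;  // false when the per-RPC config is 0 (rate limiting disabled)
    double qps;    // effective QPS when enabled
};

ResolvedQps resolve_qps(int32_t per_rpc_config, int32_t default_config, int num_cores) {
    // -1 means "fall back to ms_rpc_qps_default"; 0 means "disabled for this RPC".
    int32_t value = (per_rpc_config == -1) ? default_config : per_rpc_config;
    if (value == 0) return {false, 0.0};
    // The PR scales the configured value by the core count:
    // actual_qps = config_value * num_cores.
    return {true, static_cast<double>(value) * num_cores};
}

int main() {
    ResolvedQps r = resolve_qps(/*per_rpc_config=*/-1, /*ms_rpc_qps_default=*/100, /*num_cores=*/16);
    std::printf("enabled=%d qps=%.0f\n", r.enabled, r.qps);  // enabled=1 qps=1600
}
```

A `reset_all()`-style refresh would recompute this for every RPC type after a config change and swap in freshly built limiters; the atomic shared_ptr array above is what keeps that swap safe for concurrent `limit()` callers.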
Integration:

Rate limiting is applied inside the `retry_rpc()` template function in `cloud_meta_mgr.cpp`, which wraps all MS RPC calls. The `RpcRateLimitCtx` struct carries the rate limiter reference. Rate limiting executes before each RPC attempt (including retries), with the call to `apply_rate_limit()` performing a `bthread_usleep` if the token bucket requires waiting.
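A simplified sketch of this control flow. The real `retry_rpc()` is templated over request/response types and handles many more cases; `acquire_wait_us`, `call_ms`, and the status enum below are placeholders supplied by the caller.

```cpp
// Sketch of a rate-limited retry loop: the limiter is consulted before every
// attempt, retries included, so a retry storm cannot bypass the host-level
// QPS budget (this is where the PR's apply_rate_limit() + bthread_usleep sit).
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

enum class RpcStatus { OK, MS_BUSY, TRANSIENT_ERROR };

RpcStatus retry_rpc_sketch(const std::function<int64_t()>& acquire_wait_us,  // rate limiter
                           const std::function<RpcStatus()>& call_ms,        // the MS RPC
                           int max_retries) {
    RpcStatus status = RpcStatus::TRANSIENT_ERROR;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        if (int64_t wait_us = acquire_wait_us(); wait_us > 0) {
            std::this_thread::sleep_for(std::chrono::microseconds(wait_us));
        }
        status = call_ms();
        if (status == RpcStatus::OK) return status;
        // On MAX_QPS_LIMIT the PR additionally notifies the backpressure
        // handler described in Part 2 before deciding whether to retry.
    }
    return status;
}
```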
New files:

- `be/src/cloud/cloud_ms_rpc_rate_limiters.h/.cpp`
- `be/test/cloud/cloud_ms_rpc_rate_limiters_test.cpp`

#### Part 2: BE Table-Level Adaptive Backpressure
**Problem**

Host-level rate limiting applies uniformly across all tables. When MS reports overload (`MAX_QPS_LIMIT`), it's often caused by a small number of high-traffic tables (e.g., tables with many concurrent stream load jobs). A uniform rate limit would unnecessarily penalize all tables, while the hot tables continue to dominate the RPC traffic.

**Solution**
Implement table-level adaptive throttling for load-related RPCs. When MS returns `MAX_QPS_LIMIT`, BE identifies the top-k highest-QPS tables and progressively reduces their QPS limits, while leaving other tables unaffected.

Scope: only 5 load-related RPC types participate in table-level throttling:
- `PREPARE_ROWSET`
- `COMMIT_ROWSET`
- `UPDATE_TMP_ROWSET`
- `UPDATE_PACKED_FILE_INFO`
- `UPDATE_DELETE_BITMAP`

Architecture (4 components with clear separation of concerns):
Component details:
- `TableRpcQpsRegistry`: tracks per-(rpc_type, table_id) QPS using `bvar::PerSecond<bvar::Adder>`. Supports efficient top-k queries via a min-heap. The time window is configurable via `ms_rpc_table_qps_window_sec` (immutable, default 10s).
- `RpcThrottleStateMachine`: pure state machine with no time awareness or side effects. Maintains upgrade history as a stack for clean rollback.
  - `on_upgrade(snapshot)`: for each top-k table in the QPS snapshot, calculates `new_limit = current_qps × ratio` (first time) or `current_limit × ratio` (already limited), with a floor of `ms_rpc_table_qps_limit_floor`. Returns `SET_LIMIT` actions.
  - `on_downgrade()`: pops the most recent upgrade from history. If the table had a prior limit, restores it (`SET_LIMIT`); if not, removes the limit (`REMOVE_LIMIT`).
- `RpcThrottleCoordinator`: timing control layer using tick counts (1 tick = 1 ms).
  - `report_ms_busy()`: returns true if enough ticks have passed since the last upgrade (cooldown).
  - `tick(n)`: advances time by n ticks. Returns true if a downgrade should trigger (no MS_BUSY for `downgrade_after_ticks`).
- `TableRpcThrottler`: enforces QPS limits using `StrictQpsLimiter` (strict fixed interval, no burst allowed). Each (rpc_type, table_id) pair has its own limiter. Returns the time point at which the request may execute; the caller sleeps until then.
- `MSBackpressureHandler`: the orchestrator that wires all components together.
  - `on_ms_busy()`: called when `retry_rpc` receives `MAX_QPS_LIMIT`. Consults the coordinator for cooldown, builds a QPS snapshot from the registry, feeds it to the state machine, and applies the resulting actions to the throttler.
  - `before_rpc()` / `after_rpc()`: called around each load-related RPC for throttle enforcement and QPS recording.

Upgrade/Downgrade lifecycle example:
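A hypothetical walk-through of the arithmetic described above, using made-up table IDs and QPS values and the default `ratio`/`floor` from the Configuration table below (illustrative only, not the PR's `RpcThrottleStateMachine`):

```cpp
// Made-up numbers demonstrating the upgrade/downgrade limit math.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <optional>
#include <vector>

int main() {
    const double ratio = 0.75;     // ms_backpressure_throttle_ratio
    const double floor_qps = 1.0;  // ms_rpc_table_qps_limit_floor

    std::map<int64_t, double> limits;  // table_id -> current QPS limit
    // One entry per upgrade: the limit each affected table had *before* it.
    std::vector<std::map<int64_t, std::optional<double>>> history;

    auto upgrade = [&](int64_t table_id, double observed_qps) {
        std::optional<double> prev;
        if (auto it = limits.find(table_id); it != limits.end()) prev = it->second;
        // First upgrade: new_limit = current_qps * ratio; later upgrades: current_limit * ratio.
        double base = prev ? *prev : observed_qps;
        limits[table_id] = std::max(floor_qps, base * ratio);  // SET_LIMIT
        history.push_back({{table_id, prev}});
    };

    auto downgrade = [&]() {
        if (history.empty()) return;
        for (const auto& [table_id, prev] : history.back()) {
            if (prev) limits[table_id] = *prev;  // SET_LIMIT back to the previous value
            else limits.erase(table_id);         // REMOVE_LIMIT: table was not limited before
        }
        history.pop_back();
    };

    upgrade(10001, /*observed_qps=*/40.0);  // MS_BUSY #1: limit = 40 * 0.75 = 30
    upgrade(10001, /*observed_qps=*/28.0);  // MS_BUSY #2 after cooldown: limit = 30 * 0.75 = 22.5
    downgrade();                            // quiet period elapses: limit restored to 30
    downgrade();                            // another quiet period: limit removed entirely
    std::printf("tables still throttled: %zu\n", limits.size());  // prints 0
}
```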
Configuration:
| Config | Default | Notes |
| --- | --- | --- |
| `enable_ms_backpressure_handling` | `false` | Switch for table-level backpressure handling |
| `ms_rpc_table_qps_window_sec` | `3` | QPS tracking window (seconds) |
| `ms_backpressure_upgrade_interval_ms` | `3000` | Minimum interval between upgrades (cooldown) |
| `ms_backpressure_upgrade_top_k` | `2` | Number of top-QPS tables throttled per upgrade |
| `ms_backpressure_throttle_ratio` | `0.75` | Ratio applied to the current QPS/limit on each upgrade |
| `ms_rpc_table_qps_limit_floor` | `1.0` | Lower bound for any table-level QPS limit |
| `ms_backpressure_downgrade_interval_ms` | `3000` | Quiet period without MS_BUSY before a downgrade |
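The two interval configs map directly onto the coordinator's tick counters (1 tick = 1 ms, so 3000 ticks = 3 s by default). A minimal sketch of that timing logic, under the simplifying assumption of a single busy/upgrade counter pair; the PR's `RpcThrottleCoordinator` may differ in detail:

```cpp
// Illustrative tick-based coordinator: cooldown between upgrades plus a
// quiet-period window that triggers downgrades. Not the PR's class.
#include <cstdint>

class TickCoordinatorSketch {
public:
    TickCoordinatorSketch(int64_t upgrade_cooldown_ticks, int64_t downgrade_after_ticks)
            : _upgrade_cooldown(upgrade_cooldown_ticks),
              _downgrade_after(downgrade_after_ticks) {}

    // Called when an MS_BUSY (MAX_QPS_LIMIT) response arrives. Returns true if
    // an upgrade is allowed, i.e. the cooldown since the last upgrade has passed.
    bool report_ms_busy() {
        _ticks_since_busy = 0;
        if (_ticks_since_upgrade < _upgrade_cooldown) {
            return false;  // still cooling down; swallow this busy signal
        }
        _ticks_since_upgrade = 0;
        return true;
    }

    // Called periodically; advances time by n ticks. Returns true when a
    // downgrade should trigger (no MS_BUSY seen for downgrade_after ticks).
    bool tick(int64_t n) {
        _ticks_since_upgrade += n;
        _ticks_since_busy += n;
        if (_ticks_since_busy >= _downgrade_after) {
            _ticks_since_busy = 0;  // restart the quiet-period window
            return true;
        }
        return false;
    }

private:
    const int64_t _upgrade_cooldown;
    const int64_t _downgrade_after;
    int64_t _ticks_since_upgrade = _upgrade_cooldown;  // allow the first upgrade immediately
    int64_t _ticks_since_busy = 0;
};
```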
Observability (bvar metrics):

- `ms_rpc_backpressure_upgrade_count` / `_60s`: upgrade event counts
- `ms_rpc_backpressure_downgrade_count` / `_60s`: downgrade event counts
- `ms_rpc_backpressure_ms_busy_count` / `_60s`: MS_BUSY signal counts
- `ms_rpc_backpressure_throttle_wait_<rpc_name>`: per-RPC-type throttle wait latency
- `ms_rpc_backpressure_throttled_tables_<rpc_name>`: number of throttled tables per RPC type

New files:
- `be/src/cloud/cloud_throttle_state_machine.h/.cpp`
- `be/src/cloud/cloud_ms_backpressure_handler.h/.cpp`
- `be/test/cloud/cloud_throttle_state_machine_test.cpp`
- `be/test/cloud/cloud_ms_backpressure_handler_test.cpp`

Also renamed (not part of the feature, cleanup):
- `common/cpp/s3_rate_limiter.h/.cpp` → `common/cpp/token_bucket_rate_limiter.h/.cpp` (more general naming since it's now used beyond S3)

#### Part 3: System Table for Table-Level Throttler Observability
**Problem**
The table-level backpressure system operates transparently inside BE. When issues arise, users and DBAs have no way to inspect which tables are being throttled, what their QPS limits are, or what their current QPS is — beyond checking raw bvar metrics.
**Solution**

Add a new system table `information_schema.backend_ms_rpc_table_throttlers` that exposes the real-time state of the `TableRpcThrottler` on each BE. This table is a Backend-Partitioned Schema Table, meaning each BE reports its own throttling data, and queries are distributed to all alive BEs and aggregated.
Schema:
- `BE_ID`
- `TABLE_ID`
- `RPC_TYPE` (e.g., `PREPARE_ROWSET`, `COMMIT_ROWSET`)
- `QPS_LIMIT`
- `CURRENT_QPS`

Usage examples:
### Release note

None

### Check List (For Author)
- Test
- Behavior changed:
  - Host-level MS RPC rate limiting (controlled via `enable_ms_rpc_host_level_rate_limit`)
  - Table-level throttling on the MS `MAX_QPS_LIMIT` response (disabled by default via `enable_ms_backpressure_handling`)
- Does this need documentation?
### Check List (For Reviewer who merge this PR)