Skip to content

[fix](streaming) Fix NPE in StreamingInsertJob when MetricRepo is not initialized during replay#61253

Merged
JNSimba merged 3 commits intoapache:masterfrom
JNSimba:fix_s3_metric_restart
Mar 13, 2026
Merged

[fix](streaming) Fix NPE in StreamingInsertJob when MetricRepo is not initialized during replay#61253
JNSimba merged 3 commits intoapache:masterfrom
JNSimba:fix_s3_metric_restart

Conversation

@JNSimba
Copy link
Member

@JNSimba JNSimba commented Mar 12, 2026

What problem does this PR solve?

Problem

StreamingInsertJob.replayOnCommitted() throws a NullPointerException during FE replay:

  java.lang.NullPointerException: Cannot invoke
  "org.apache.doris.metric.LongCounterMetric.increase(java.lang.Long)"
    because "org.apache.doris.metric.MetricRepo.COUNTER_STREAMING_JOB_TOTAL_ROWS" is null
      at StreamingInsertJob.updateJobStatisticAndOffset(StreamingInsertJob.java:634)
      at StreamingInsertJob.replayOnCommitted(StreamingInsertJob.java:1020)
      at TransactionState.replaySetTransactionStatus(TransactionState.java:589)
      at DatabaseTransactionMgr.replayUpsertTransactionState(DatabaseTransactionMgr.java:2636)
      at GlobalTransactionMgr.replayUpsertTransactionState(GlobalTransactionMgr.java:952)

The root cause is that MetricRepo may not be initialized when FE replays transaction logs during startup,
but StreamingInsertJob unconditionally calls metric update methods, leading to NPE.

Fix

Two separate changes are applied to StreamingInsertJob:

  1. Skip metric updates during replay: updateJobStatisticAndOffset and
    updateCloudJobStatisticAndOffset now accept an isReplay boolean parameter. Call sites in
    replayOnCommitted and replayOnCloudMode pass true, while afterCommitted passes false.

  2. Guard all metric calls with MetricRepo.isInit: All remaining MetricRepo.COUNTER_STREAMING_JOB_*
    call sites are wrapped with if (MetricRepo.isInit && !isReplay) or if (MetricRepo.isInit) to prevent NPE
    if MetricRepo has not been fully initialized.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented Mar 12, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Copy link
Member Author

JNSimba commented Mar 12, 2026

/review

@JNSimba
Copy link
Member Author

JNSimba commented Mar 12, 2026

run buildall

@github-actions
Copy link
Contributor

Code Review Summary

PR: fix Fix NPE in StreamingInsertJob when MetricRepo is not initialized during replay

This PR fixes a NullPointerException during FE replay when MetricRepo is not yet initialized. The fix adds MetricRepo.isInit guards and introduces an isReplay boolean parameter to skip metric updates during replay paths. The approach follows the well-established pattern used elsewhere in the codebase (e.g., RoutineLoadJob.java).

Critical Checkpoint Conclusions

  1. Goal and correctness: The goal is to prevent NPE during replay. The fix correctly addresses all metric call sites on the replay path by (a) guarding with MetricRepo.isInit and (b) passing isReplay=true from replay callers. All call sites are updated consistently.

  2. Modification scope: The change is minimal and focused — single file, single concern. +33/-18 lines, all in StreamingInsertJob.java.

  3. Concurrency: No new concurrency concerns. MetricRepo.isInit is declared volatile, ensuring cross-thread visibility. The existing locking patterns are unchanged.

  4. Lifecycle management: No lifecycle changes. N/A.

  5. Configuration items: No new config. N/A.

  6. Incompatible changes / format changes: None. N/A.

  7. Parallel code paths: All 13 MetricRepo.COUNTER_STREAMING_JOB_* usages in the file are now guarded. The two replay paths (replayOnCommitted and replayOnCloudMode) correctly pass isReplay=true. The non-replay path (afterCommitted) correctly passes isReplay=false.

  8. Special conditional checks: The MetricRepo.isInit && !isReplay pattern is clear and follows the convention in RoutineLoadJob.

  9. Test coverage: No new tests added. This is acceptable for a defensive NPE guard — the bug manifests only during FE startup replay when MetricRepo is uninitialized, which is difficult to reproduce in unit tests. The fix is straightforward enough to verify by inspection.

  10. Observability: The change skips metrics during replay, which is correct — replay should not double-count metrics that were already recorded on the master.

  11. EditLog / persistence: No changes to EditLog or persistence logic. The updateJobStatisticAndOffset methods still correctly update in-memory statistics and offsets during replay; only the metric counter updates are skipped.

  12. Performance: No performance impact. The isInit check is a single volatile boolean read.

Minor Observations (Non-blocking)

  • updateCloudJobStatisticAndOffset is only ever called with isReplay=true (from replayOnCloudMode), making the !isReplay guard always evaluate to true (skip metrics). This is not a bug — just slightly redundant defensiveness. If a non-replay cloud call path is added in the future, the guard will correctly kick in.

  • updateJobStatisticAndOffset(CommitOffsetRequest) (line 666) appears to be dead code — no callers exist. Not introduced by this PR.

Verdict: No issues found. The fix is correct, focused, and follows established patterns.

Copy link
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 12, 2026
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@JNSimba
Copy link
Member Author

JNSimba commented Mar 12, 2026

run feut

@JNSimba
Copy link
Member Author

JNSimba commented Mar 12, 2026

run buildall

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/24) 🎉
Increment coverage report
Complete coverage report

@doris-robot
Copy link

TPC-H: Total hot run time: 27769 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit dbb4670ecfb9e8e7f4568aa99daf4e832562e143, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17628	4481	4268	4268
q2	q3	10644	813	519	519
q4	4689	373	254	254
q5	7561	1224	997	997
q6	180	180	146	146
q7	798	850	662	662
q8	9302	1514	1404	1404
q9	5012	4791	4762	4762
q10	6283	1921	1664	1664
q11	474	249	243	243
q12	691	571	470	470
q13	18045	3046	2199	2199
q14	237	235	214	214
q15	925	804	809	804
q16	740	721	689	689
q17	720	907	396	396
q18	6011	5435	5264	5264
q19	1126	1008	608	608
q20	528	500	393	393
q21	4336	2193	1549	1549
q22	440	367	264	264
Total cold run time: 96370 ms
Total hot run time: 27769 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4678	4495	4483	4483
q2	q3	3933	4395	3849	3849
q4	876	1234	785	785
q5	4098	4356	4335	4335
q6	207	181	144	144
q7	1797	1645	1521	1521
q8	2499	2745	2738	2738
q9	7508	7287	7241	7241
q10	3802	4063	3626	3626
q11	530	457	426	426
q12	492	590	449	449
q13	2803	3134	2319	2319
q14	285	297	288	288
q15	852	797	789	789
q16	708	784	698	698
q17	1221	1474	1410	1410
q18	7175	6795	6630	6630
q19	939	903	986	903
q20	2164	2270	2037	2037
q21	4162	3539	3470	3470
q22	440	441	386	386
Total cold run time: 51169 ms
Total hot run time: 48527 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 153672 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit dbb4670ecfb9e8e7f4568aa99daf4e832562e143, data reload: false

query5	4327	670	534	534
query6	336	228	229	228
query7	4218	478	277	277
query8	354	256	241	241
query9	8716	2785	2776	2776
query10	505	386	344	344
query11	7591	5798	5652	5652
query12	188	130	128	128
query13	1278	478	354	354
query14	5754	3872	3563	3563
query14_1	2847	2847	2777	2777
query15	207	193	181	181
query16	992	484	478	478
query17	932	722	615	615
query18	2437	459	354	354
query19	230	212	211	211
query20	133	129	125	125
query21	232	138	124	124
query22	4819	4856	4816	4816
query23	15999	15614	15307	15307
query23_1	15513	16310	15930	15930
query24	7381	1761	1353	1353
query24_1	1297	1296	1302	1296
query25	565	511	448	448
query26	1422	301	160	160
query27	2926	512	305	305
query28	5008	1997	1920	1920
query29	846	579	473	473
query30	314	255	215	215
query31	1354	1285	1253	1253
query32	79	75	70	70
query33	501	337	277	277
query34	959	926	569	569
query35	631	694	593	593
query36	1096	1173	948	948
query37	134	98	84	84
query38	2927	2922	2889	2889
query39	886	874	831	831
query39_1	826	818	832	818
query40	233	156	136	136
query41	61	61	60	60
query42	301	305	294	294
query43	238	253	221	221
query44	
query45	197	191	192	191
query46	909	996	625	625
query47	2159	2161	2084	2084
query48	321	334	231	231
query49	634	480	380	380
query50	704	292	217	217
query51	4273	4082	4238	4082
query52	288	291	279	279
query53	300	354	283	283
query54	295	270	259	259
query55	92	86	84	84
query56	326	325	317	317
query57	1405	1340	1279	1279
query58	292	279	280	279
query59	1360	1441	1260	1260
query60	337	342	346	342
query61	146	148	146	146
query62	628	592	540	540
query63	311	281	277	277
query64	5008	1264	991	991
query65	
query66	1458	462	356	356
query67	16478	16339	16305	16305
query68	
query69	384	311	303	303
query70	1009	1001	951	951
query71	362	311	303	303
query72	2815	2694	2722	2694
query73	559	560	328	328
query74	10000	9949	9820	9820
query75	2904	2777	2460	2460
query76	2278	1049	687	687
query77	391	397	337	337
query78	11167	11328	10671	10671
query79	1150	822	602	602
query80	736	658	588	588
query81	490	281	247	247
query82	1344	159	125	125
query83	386	270	261	261
query84	302	124	116	116
query85	925	597	445	445
query86	384	303	313	303
query87	3227	3169	3005	3005
query88	3585	2685	2684	2684
query89	434	373	343	343
query90	1996	186	185	185
query91	172	165	136	136
query92	87	75	72	72
query93	952	855	506	506
query94	465	322	306	306
query95	597	340	324	324
query96	656	544	231	231
query97	2467	2495	2433	2433
query98	233	229	222	222
query99	1014	996	940	940
Total cold run time: 233026 ms
Total hot run time: 153672 ms

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run p0

@JNSimba JNSimba merged commit 382664f into apache:master Mar 13, 2026
27 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 13, 2026
… initialized during replay (#61253)

### What problem does this PR solve?

#### Problem
                  
`StreamingInsertJob.replayOnCommitted()` throws a `NullPointerException`
during FE replay:
                  
```
  java.lang.NullPointerException: Cannot invoke
  "org.apache.doris.metric.LongCounterMetric.increase(java.lang.Long)"
    because "org.apache.doris.metric.MetricRepo.COUNTER_STREAMING_JOB_TOTAL_ROWS" is null
      at StreamingInsertJob.updateJobStatisticAndOffset(StreamingInsertJob.java:634)
      at StreamingInsertJob.replayOnCommitted(StreamingInsertJob.java:1020)
      at TransactionState.replaySetTransactionStatus(TransactionState.java:589)
      at DatabaseTransactionMgr.replayUpsertTransactionState(DatabaseTransactionMgr.java:2636)
      at GlobalTransactionMgr.replayUpsertTransactionState(GlobalTransactionMgr.java:952)
```

The root cause is that `MetricRepo` may not be initialized when FE
replays transaction logs during startup,
but `StreamingInsertJob` unconditionally calls metric update methods,
leading to NPE.

#### Fix

  Two separate changes are applied to `StreamingInsertJob`:

1. **Skip metric updates during replay**: `updateJobStatisticAndOffset`
and
`updateCloudJobStatisticAndOffset` now accept an `isReplay` boolean
parameter. Call sites in
`replayOnCommitted` and `replayOnCloudMode` pass `true`, while
`afterCommitted` passes `false`.

2. **Guard all metric calls with `MetricRepo.isInit`**: All remaining
`MetricRepo.COUNTER_STREAMING_JOB_*`
call sites are wrapped with `if (MetricRepo.isInit && !isReplay)` or `if
(MetricRepo.isInit)` to prevent NPE
   if `MetricRepo` has not been fully initialized.
yiguolei pushed a commit that referenced this pull request Mar 13, 2026
…cRepo is not initialized during replay #61253 (#61295)

Cherry-picked from #61253

Co-authored-by: wudi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.5-merged dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants