[Fix](agent task) avoid nullptr dereference in create_tablet_callback #61240

Open
uchenily wants to merge 2 commits into apache:master from uchenily:check-tablet

Conversation

@uchenily
Contributor

We get the following error (version 2.1.5):

*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1772767549 (unix time) try "date -d @1772767549" if you are using GNU date ***
*** Current BE git commitID: 654acde ***
*** SIGSEGV address not mapped to object (@0x40) received by PID 79914 (TID 81118 OR 0x7fc5b7fc8700) from PID 64; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 4# 0x00007FC86CD24400 in /lib64/libc.so.6
 5# doris::create_tablet_callback(doris::StorageEngine&, doris::TAgentTaskRequest const&) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/agent/task_worker_pool.cpp:1398
 6# std::_Function_handler<void (), doris::TaskWorkerPool::submit_task(doris::TAgentTaskRequest const&)::$_0::operator()<doris::TAgentTaskRequest const&>(doris::TAgentTaskRequest const&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
 7# doris::ThreadPool::dispatch_thread() in /usr/local/doris-be/lib/doris_be
 8# doris::Thread::supervise_thread(void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/util/thread.cpp:499
 9# start_thread in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
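
The fault address `@0x40` in the trace is a small offset from zero, which is what a member read through a null object pointer usually looks like: the access lands at null plus the offset of the member. The sketch below only illustrates that arithmetic. `FakeTablet` and its layout are hypothetical; the real `Tablet` layout and the member that sits at offset 0x40 are not shown in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout, NOT the real doris::Tablet: 0x40 bytes of leading
// members followed by the field that the crashing code would read.
struct FakeTablet {
    unsigned char leading_members[0x40];
    std::int64_t field_at_0x40;
};

// Address that would be touched when reading `field_at_0x40` through an
// object pointer whose numeric value is `base` (0 for a null pointer).
constexpr std::uintptr_t fault_address(std::uintptr_t base) {
    return base + offsetof(FakeTablet, field_at_0x40);
}

// A null base plus the member offset gives the small address seen in the trace.
static_assert(fault_address(0) == 0x40, "null base + member offset");
```

A low, non-zero fault address is therefore consistent with the null-pointer explanation given in the description, although it does not by itself prove which pointer was null.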

It is not easy to reproduce, but since get_tablet may return nullptr, it is better to check the result before continuing, to avoid a BE core dump.
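
A minimal sketch of the guard described above, assuming simplified stand-in types. `Tablet`, `TabletManager`, `fill_tablet_report`, and the message text are hypothetical names made up for illustration; they are not the code changed in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the BE types; only the shape matters here.
struct Tablet {
    explicit Tablet(std::int64_t id) : tablet_id(id) {}
    std::int64_t tablet_id;
    std::int64_t row_count = 0;
};
using TabletSharedPtr = std::shared_ptr<Tablet>;

class TabletManager {
public:
    void add(std::int64_t id) { _tablets[id] = std::make_shared<Tablet>(id); }
    void drop(std::int64_t id) { _tablets.erase(id); }

    // Like the lookup discussed in this PR, this may return nullptr.
    TabletSharedPtr get_tablet(std::int64_t id) const {
        auto it = _tablets.find(id);
        if (it == _tablets.end()) {
            return nullptr;
        }
        return it->second;
    }

private:
    std::unordered_map<std::int64_t, TabletSharedPtr> _tablets;
};

// Returns an empty string on success, or an error message when the tablet
// cannot be found. The pointer is checked before it is dereferenced.
std::string fill_tablet_report(const TabletManager& mgr, std::int64_t id,
                               std::int64_t* row_count_out) {
    TabletSharedPtr tablet = mgr.get_tablet(id);
    if (tablet == nullptr) {
        return "tablet not found after create, tablet_id=" + std::to_string(id);
    }
    *row_count_out = tablet->row_count;
    return "";
}
```

With this shape, a tablet that disappears between creation and lookup turns into a reported error for that one task instead of a process-wide crash.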

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 12, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@uchenily
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 27674 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 817e2f9a8392d3b70a0593f989b036f87a0f11c1, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17639	4489	4294	4294
q2	q3	10650	796	521	521
q4	4689	367	248	248
q5	7695	1209	1048	1048
q6	173	175	144	144
q7	816	837	674	674
q8	10125	1443	1336	1336
q9	5688	4794	4674	4674
q10	6322	1919	1643	1643
q11	459	271	242	242
q12	774	575	471	471
q13	18037	2971	2192	2192
q14	227	230	215	215
q15	938	802	813	802
q16	746	716	667	667
q17	710	843	435	435
q18	5993	5439	5256	5256
q19	1473	986	647	647
q20	495	492	394	394
q21	4743	1983	1484	1484
q22	393	341	287	287
Total cold run time: 98785 ms
Total hot run time: 27674 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4749	4626	4509	4509
q2	q3	3894	4351	3865	3865
q4	845	1185	762	762
q5	4012	4303	4321	4303
q6	178	171	143	143
q7	1722	1630	1486	1486
q8	2489	2699	2556	2556
q9	7511	7550	7283	7283
q10	3706	4002	3641	3641
q11	508	431	465	431
q12	507	592	444	444
q13	2917	3181	2315	2315
q14	277	288	274	274
q15	854	801	792	792
q16	723	777	719	719
q17	1183	1444	1451	1444
q18	7287	6722	6667	6667
q19	859	857	879	857
q20	2052	2189	2024	2024
q21	4025	3495	3336	3336
q22	479	417	365	365
Total cold run time: 50777 ms
Total hot run time: 48216 ms

@doris-robot

TPC-DS: Total hot run time: 153483 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 817e2f9a8392d3b70a0593f989b036f87a0f11c1, data reload: false

query5	4327	645	517	517
query6	324	240	208	208
query7	4235	469	270	270
query8	344	243	231	231
query9	8701	2746	2743	2743
query10	521	402	360	360
query11	7319	5898	5528	5528
query12	184	129	128	128
query13	1288	460	349	349
query14	5750	3832	3573	3573
query14_1	2813	2817	2778	2778
query15	210	197	176	176
query16	996	454	446	446
query17	1114	721	628	628
query18	2527	434	334	334
query19	211	202	176	176
query20	137	130	129	129
query21	223	143	125	125
query22	4987	4958	4840	4840
query23	16616	16009	15894	15894
query23_1	15908	15863	15825	15825
query24	7818	1736	1260	1260
query24_1	1278	1358	1281	1281
query25	559	480	438	438
query26	2575	275	159	159
query27	2778	476	290	290
query28	4509	1871	1847	1847
query29	852	594	464	464
query30	308	244	218	218
query31	1373	1274	1213	1213
query32	77	75	67	67
query33	510	328	278	278
query34	924	918	568	568
query35	633	663	594	594
query36	1106	1161	1032	1032
query37	140	94	83	83
query38	2938	2916	2869	2869
query39	903	859	860	859
query39_1	826	816	847	816
query40	236	153	135	135
query41	63	58	58	58
query42	297	299	288	288
query43	237	255	210	210
query44	
query45	193	196	187	187
query46	891	968	632	632
query47	2149	2164	2067	2067
query48	306	319	230	230
query49	630	454	385	385
query50	671	273	214	214
query51	4102	4090	4014	4014
query52	286	293	280	280
query53	291	335	287	287
query54	290	290	270	270
query55	93	92	88	88
query56	311	323	340	323
query57	1375	1358	1280	1280
query58	292	280	271	271
query59	1317	1464	1246	1246
query60	331	334	342	334
query61	154	143	154	143
query62	618	580	531	531
query63	310	276	278	276
query64	5118	1262	1013	1013
query65	
query66	1451	469	357	357
query67	16328	16365	16245	16245
query68	
query69	386	308	277	277
query70	1043	1035	981	981
query71	329	309	294	294
query72	2864	2886	2685	2685
query73	554	560	322	322
query74	9957	9906	9775	9775
query75	2889	2777	2481	2481
query76	2318	1034	682	682
query77	365	388	319	319
query78	11184	11241	10652	10652
query79	2695	803	599	599
query80	1807	653	574	574
query81	565	284	248	248
query82	1026	151	122	122
query83	338	276	254	254
query84	251	127	109	109
query85	990	483	455	455
query86	436	308	348	308
query87	3152	3110	3007	3007
query88	3578	2683	2666	2666
query89	424	366	342	342
query90	2014	184	180	180
query91	167	165	140	140
query92	77	72	73	72
query93	1159	808	498	498
query94	645	324	291	291
query95	573	335	330	330
query96	644	517	228	228
query97	2488	2506	2420	2420
query98	247	226	219	219
query99	991	993	936	936
Total cold run time: 237404 ms
Total hot run time: 153483 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.56% (19649/37384)
Line Coverage 36.18% (183318/506709)
Region Coverage 32.31% (141567/438152)
Branch Coverage 33.48% (61781/184543)

yiguolei
yiguolei previously approved these changes Mar 12, 2026
@yiguolei yiguolei added usercase Important user case type label dev/4.0.x dev/4.1.x labels Mar 12, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 12, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

.tag("signature", req.signature)
.tag("tablet_id", create_tablet_req.tablet_id)
.error(status);
}
Contributor

Is there a detailed explanation for why engine.create_tablet succeeds, but the data cannot be retrieved here?

Contributor

Normally this wouldn't happen.

Contributor Author

The log from the incident is no longer available. The following is an analysis of the possible causes, based on the code:

Case 1: the tablet has been deleted.
Another task is executing drop_tablet_callback (possibly triggered by TabletScheduler or ReportHandler), and the tablet is deleted before get_tablet runs.

Case 2: the tablet is in an unavailable state (tablet->is_used() == false).
There are two specific situations for this:

  1. The tablet is already in a bad state (_is_bad == true).
  2. An IO_ERROR occurred during a health check of the data directory (DataDir::health_check).
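
The two cases can be sketched with a toy lookup that returns nullptr either when the tablet is gone or when it is no longer usable. All types here are hypothetical stand-ins, and the `std::string* err` out-parameter and the message strings are assumptions made for illustration; this is not the actual `TabletManager` code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins that mirror the two cases in the comment above.
struct DataDir {
    bool is_used = true;  // cleared when a health check hits an IO error
};

struct Tablet {
    bool is_bad = false;
    std::shared_ptr<DataDir> data_dir = std::make_shared<DataDir>();
    // Usable only if the tablet is not bad and its data directory is healthy.
    bool is_used() const { return !is_bad && data_dir->is_used; }
};

class TabletManager {
public:
    std::unordered_map<std::int64_t, std::shared_ptr<Tablet>> tablets;

    std::shared_ptr<Tablet> get_tablet(std::int64_t id, std::string* err) const {
        auto it = tablets.find(id);
        if (it == tablets.end()) {  // case 1: dropped by another task
            *err = "tablet does not exist";
            return nullptr;
        }
        if (!it->second->is_used()) {  // case 2: bad tablet or unhealthy disk
            *err = "tablet cannot be used";
            return nullptr;
        }
        return it->second;
    }
};
```

In both cases the caller sees the same nullptr, so the error string is the only way to tell afterwards which path was taken.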

Contributor Author

We will log the error message, which makes it easier to identify the root cause if the same issue occurs again.
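
As a rough illustration of what such a log line could carry, the sketch below builds a message that includes the task signature, the tablet id, and the lookup error. `TaggedMessage` and `describe_missing_tablet` are hypothetical helpers, loosely modelled on the `.tag(...).error(...)` chain visible in the review snippet; they are not the Doris logging API, and the output format is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical mini message builder; not the real logging macros.
class TaggedMessage {
public:
    explicit TaggedMessage(const std::string& msg) { _out << msg; }

    // Appends "|key=value" and returns *this so calls can be chained.
    template <typename T>
    TaggedMessage& tag(const std::string& key, const T& value) {
        _out << '|' << key << '=' << value;
        return *this;
    }

    TaggedMessage& error(const std::string& reason) { return tag("error", reason); }

    std::string str() const { return _out.str(); }

private:
    std::ostringstream _out;
};

// Builds the text that would be logged when the lookup returns nullptr.
std::string describe_missing_tablet(std::int64_t signature, std::int64_t tablet_id,
                                    const std::string& lookup_error) {
    return TaggedMessage("failed to get tablet after create")
            .tag("signature", signature)
            .tag("tablet_id", tablet_id)
            .error(lookup_error)
            .str();
}
```

Keeping the lookup error in the message is what allows the two candidate causes discussed in this thread to be told apart the next time the problem appears.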

@uchenily
Contributor Author

run cloud_p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.11% (20907/36606)
Line Coverage 40.17% (202915/505168)
Region Coverage 36.66% (162149/442300)
Branch Coverage 37.50% (69422/185119)

&error_msg);
}

if (tablet) {
Contributor

Why might it be nullptr? Could you add a comment explaining it?

Contributor Author

The reason is not yet clear; the specific cause can only be pinned down once the log has been added. The most likely cause is get_tablet returning nullptr because of an IO_ERROR (DataDir::health_check).

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.11% (20907/36606)
Line Coverage 40.17% (202915/505168)
Region Coverage 36.66% (162149/442300)
Branch Coverage 37.50% (69422/185119)

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 13, 2026
@uchenily
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 27733 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ac5f3ea8d2bc0685540fb2389430fe0a718e4250, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17628	4490	4305	4305
q2	q3	10640	817	523	523
q4	4690	370	258	258
q5	7669	1215	1041	1041
q6	187	179	149	149
q7	831	853	676	676
q8	10118	1478	1377	1377
q9	5388	4789	4744	4744
q10	6356	1932	1653	1653
q11	498	271	233	233
q12	753	568	471	471
q13	18075	2933	2186	2186
q14	234	229	216	216
q15	931	809	829	809
q16	748	727	702	702
q17	714	855	425	425
q18	6033	5457	5170	5170
q19	1469	987	635	635
q20	507	500	399	399
q21	4745	1965	1481	1481
q22	397	349	280	280
Total cold run time: 98611 ms
Total hot run time: 27733 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4730	4617	4680	4617
q2	q3	4152	4334	3838	3838
q4	912	1230	775	775
q5	4111	4432	4413	4413
q6	188	182	147	147
q7	1787	1655	1573	1573
q8	2504	2742	2577	2577
q9	7549	7486	7335	7335
q10	3814	4101	3606	3606
q11	512	437	426	426
q12	494	591	452	452
q13	2904	3442	2321	2321
q14	280	303	279	279
q15	866	849	830	830
q16	739	783	711	711
q17	1179	1477	1401	1401
q18	7106	6818	6575	6575
q19	897	910	900	900
q20	2089	2235	1984	1984
q21	4012	3613	3351	3351
q22	448	444	388	388
Total cold run time: 51273 ms
Total hot run time: 48499 ms

@doris-robot

TPC-DS: Total hot run time: 153574 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ac5f3ea8d2bc0685540fb2389430fe0a718e4250, data reload: false

query5	4322	656	537	537
query6	329	225	204	204
query7	4210	473	276	276
query8	358	238	226	226
query9	8742	2717	2703	2703
query10	522	374	343	343
query11	7349	5838	5646	5646
query12	188	128	128	128
query13	1256	456	355	355
query14	5677	3879	3579	3579
query14_1	2845	2859	2808	2808
query15	212	193	180	180
query16	979	479	473	473
query17	891	720	613	613
query18	2435	456	359	359
query19	213	215	187	187
query20	137	128	131	128
query21	229	144	130	130
query22	4903	5041	4966	4966
query23	16076	15650	15374	15374
query23_1	15484	16412	16038	16038
query24	7519	1906	1296	1296
query24_1	1287	1275	1334	1275
query25	612	559	507	507
query26	1404	334	176	176
query27	3181	515	308	308
query28	4857	1929	1947	1929
query29	1526	627	530	530
query30	325	265	233	233
query31	1426	1432	1273	1273
query32	107	75	89	75
query33	509	345	318	318
query34	963	1013	584	584
query35	667	716	644	644
query36	1189	1253	1086	1086
query37	143	105	92	92
query38	2973	2954	2839	2839
query39	900	851	885	851
query39_1	814	822	845	822
query40	227	148	134	134
query41	61	62	57	57
query42	299	303	307	303
query43	238	252	221	221
query44	
query45	203	194	185	185
query46	892	1031	623	623
query47	2114	2169	2039	2039
query48	308	312	231	231
query49	639	459	375	375
query50	668	284	222	222
query51	4134	4040	4010	4010
query52	288	289	280	280
query53	286	329	319	319
query54	293	282	264	264
query55	91	86	89	86
query56	323	335	316	316
query57	1374	1338	1258	1258
query58	297	276	286	276
query59	1296	1448	1290	1290
query60	339	333	319	319
query61	152	147	152	147
query62	627	590	538	538
query63	309	277	275	275
query64	4962	1275	1015	1015
query65	
query66	1463	456	366	366
query67	16249	16332	16267	16267
query68	
query69	395	335	293	293
query70	964	980	1002	980
query71	353	313	306	306
query72	2741	2667	2389	2389
query73	552	548	330	330
query74	10022	9899	9768	9768
query75	2855	2782	2486	2486
query76	2282	1018	685	685
query77	364	422	318	318
query78	11114	11298	10642	10642
query79	2284	792	609	609
query80	1726	616	545	545
query81	558	282	248	248
query82	973	150	116	116
query83	332	259	243	243
query84	294	126	106	106
query85	886	484	427	427
query86	429	301	304	301
query87	3170	3121	3007	3007
query88	3573	2661	2641	2641
query89	429	374	347	347
query90	2006	184	183	183
query91	162	170	130	130
query92	74	75	76	75
query93	1005	853	509	509
query94	629	326	283	283
query95	600	336	328	328
query96	627	522	233	233
query97	2494	2480	2392	2392
query98	245	229	221	221
query99	1012	1014	907	907
Total cold run time: 235711 ms
Total hot run time: 153574 ms

@uchenily uchenily requested review from yiguolei and zclllyybb March 13, 2026 06:32
@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.66% (19725/37456)
Line Coverage 36.23% (184223/508417)
Region Coverage 32.38% (142309/439530)
Branch Coverage 33.57% (62186/185253)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.30% (26873/36663)
Line Coverage 56.62% (286914/506712)
Region Coverage 53.98% (239468/443592)
Branch Coverage 55.63% (103330/185745)

Labels

dev/4.0.x dev/4.1.x reviewed usercase Important user case type label

7 participants