[Fix](agent task) avoid nullptr dereference in create_tablet_callback #61240
uchenily wants to merge 2 commits into apache:master
We get the following error (version 2.1.5):

```
*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1772767549 (unix time) try "date -d @1772767549" if you are using GNU date ***
*** Current BE git commitID: 654acde ***
*** SIGSEGV address not mapped to object (@0x40) received by PID 79914 (TID 81118 OR 0x7fc5b7fc8700) from PID 64; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 4# 0x00007FC86CD24400 in /lib64/libc.so.6
 5# doris::create_tablet_callback(doris::StorageEngine&, doris::TAgentTaskRequest const&) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/agent/task_worker_pool.cpp:1398
 6# std::_Function_handler<void (), doris::TaskWorkerPool::submit_task(doris::TAgentTaskRequest const&)::$_0::operator()<doris::TAgentTaskRequest const&>(doris::TAgentTaskRequest const&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
 7# doris::ThreadPool::dispatch_thread() in /usr/local/doris-be/lib/doris_be
 8# doris::Thread::supervise_thread(void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/util/thread.cpp:499
 9# start_thread in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
```

It is not easy to reproduce, but since get_tablet may return nullptr, it is better to check the result before continuing, to avoid a BE core dump.
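The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: `Tablet`, `TabletManager`, and `finish_create_tablet` are hypothetical stand-ins invented for the example, and only the idea that `get_tablet` may return nullptr comes from the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for illustration only; these are NOT the Doris types.
struct Tablet {
    int64_t tablet_id;
    std::string path;
};
using TabletSharedPtr = std::shared_ptr<Tablet>;

class TabletManager {
public:
    void add(TabletSharedPtr tablet) {
        int64_t id = tablet->tablet_id;
        _tablets[id] = std::move(tablet);
    }

    void drop(int64_t tablet_id) { _tablets.erase(tablet_id); }

    // Returns nullptr when the tablet is not registered, which is the
    // situation the PR description says callers must be prepared for.
    TabletSharedPtr get_tablet(int64_t tablet_id) const {
        auto it = _tablets.find(tablet_id);
        return it == _tablets.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<int64_t, TabletSharedPtr> _tablets;
};

// Hypothetical follow-up step after tablet creation: look the tablet up again
// and only touch it if the lookup succeeded. On failure, fill error_msg and
// return an empty string instead of dereferencing a null pointer.
std::string finish_create_tablet(const TabletManager& mgr, int64_t tablet_id,
                                 std::string* error_msg) {
    TabletSharedPtr tablet = mgr.get_tablet(tablet_id);
    if (!tablet) {
        *error_msg = "tablet not found after create, tablet_id=" +
                     std::to_string(tablet_id);
        return {};
    }
    return tablet->path;
}
```

With this shape, a failed lookup turns into a reportable error message rather than a SIGSEGV in the worker thread.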
run buildall
TPC-H: Total hot run time: 27674 ms
TPC-DS: Total hot run time: 153483 ms
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
            .tag("signature", req.signature)
            .tag("tablet_id", create_tablet_req.tablet_id)
            .error(status);
    }
Is there a detailed explanation for why engine.create_tablet succeeds, but the data cannot be retrieved here?
Normally this wouldn't happen.
The log for this issue is no longer available. The following analysis of possible causes is based on the code:
Case 1: the tablet has been deleted.
Another task is executing drop_tablet_callback (possibly triggered by TabletScheduler or ReportHandler), and the tablet is deleted before get_tablet runs.
Case 2: the tablet is in an unavailable state (tablet->is_used() == false).
There are two specific situations:
- The tablet is already marked bad (_is_bad == true).
- An IO_ERROR occurred during the data directory health check (DataDir::health_check).
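The two cases above can be modelled with a small toy lookup. This is a sketch under stated assumptions: `DataDirModel`, `TabletModel`, and `TabletMapModel` are hypothetical names made up for illustration, and the rule that a tablet counts as used only when it is not bad and its data directory is usable is inferred from the analysis in this comment, not copied from the Doris source.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Hypothetical model types; NOT the Doris classes.
struct DataDirModel {
    bool is_used = true;  // assumed to flip to false after a failed health check
};

struct TabletModel {
    int64_t id = 0;
    bool is_bad = false;
    std::shared_ptr<DataDirModel> dir;

    // Assumed rule: usable only if not bad and the directory is healthy.
    bool is_used() const { return !is_bad && dir->is_used; }
};

class TabletMapModel {
public:
    void create(int64_t id, std::shared_ptr<DataDirModel> dir) {
        auto tablet = std::make_shared<TabletModel>();
        tablet->id = id;
        tablet->dir = std::move(dir);
        _map[id] = std::move(tablet);
    }

    void drop(int64_t id) { _map.erase(id); }

    void mark_bad(int64_t id) {
        auto it = _map.find(id);
        if (it != _map.end()) it->second->is_bad = true;
    }

    // Lookup that can return nullptr even right after a successful create.
    std::shared_ptr<TabletModel> get(int64_t id) const {
        auto it = _map.find(id);
        if (it == _map.end()) return nullptr;        // case 1: dropped in between
        if (!it->second->is_used()) return nullptr;  // case 2: bad tablet or bad dir
        return it->second;
    }

private:
    std::map<int64_t, std::shared_ptr<TabletModel>> _map;
};
```

In this model, dropping the tablet, marking it bad, or marking its directory unused each makes a later `get` return nullptr, even though the earlier `create` succeeded.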
We will log the error message, which makes it easier to identify the root cause if the same issue occurs again.
run cloud_p0
            &error_msg);
    }

    if (tablet) {
Why might it be nullptr? Could you add a comment?
The reason is not yet clear; the specific cause can only be pinned down after the log is added. The most likely cause is a nullptr resulting from an IO_ERROR (DataDir::health_check).
run buildall
TPC-H: Total hot run time: 27733 ms
TPC-DS: Total hot run time: 153574 ms