TDB-184 : Broken fsync error handling#410
TDB-184 : Broken fsync error handling#410george-lorch wants to merge 1 commit intopercona:masterfrom
Conversation
|
https://jenkins.percona.com/view/TokuDB/job/PerconaFT-param/122/ The second 5.7 jenkins is a matrix reload on 4 failures due to some odd git state weirdness, they failed to clone/checkout my branch. |
portability/file.cc
Outdated
| } | ||
|
|
||
|
|
||
| void report_io_problem(int fd, const char *fmt, ...) __attribute__((format(printf, 2, 3))); |
There was a problem hiding this comment.
Cannot attach attribute to definition to save a declaration (in front?)
There was a problem hiding this comment.
Ohh, good catch, I forgot attribute could be used like that.
portability/file.cc
Outdated
| va_list ap; | ||
| va_start(ap, fmt); | ||
|
|
||
| const int tstr_length = 26; |
There was a problem hiding this comment.
static constexpr
| buf); | ||
| #endif | ||
| fflush(stderr); | ||
| } |
portability/file.cc
Outdated
| report_io_problem( | ||
| fd, "Failed write of [%" PRIu64 "] bytes.", (uint64_t)len); | ||
| int out_of_disk_space = 1; | ||
| assert(!out_of_disk_space); // Give an error message that might |
There was a problem hiding this comment.
Huh. bool at least?
There was a problem hiding this comment.
This was original FT code and I saw no reason to change it, funky as it is.
There was a problem hiding this comment.
oh, and static constexpr yada yada if you decide to touch it
portability/file.cc
Outdated
| report_io_problem(fd, | ||
| "Write of [%" PRIu64 | ||
| "] bytes interrupted. Retrying.", | ||
| (uint64_t)len); |
There was a problem hiding this comment.
similar, slightly modified FT code, but yeah
portability/file.cc
Outdated
| case ENOSPC: { | ||
| if (toku_assert_on_write_enospc) { | ||
| report_io_problem( | ||
| fd, "Failed write of [%" PRIu64 "] bytes.", (uint64_t)len); |
portability/file.cc
Outdated
| // be useful if this is the only one | ||
| // that survives. | ||
| } else { | ||
| toku_sync_fetch_and_add(&toku_write_enospc_total, 1); |
There was a problem hiding this comment.
Would it be too much work to convert to std::atomic?
There was a problem hiding this comment.
probably not as these are pretty local to file.cc
portability/file.cc
Outdated
| // fallthrough | ||
| case EINTR: | ||
| // fallthrough | ||
| case EIO: |
There was a problem hiding this comment.
But the issue being fixed here is that EIO must fail at once, exactly the opposite of EINTR/ENOLCK
There was a problem hiding this comment.
That is not what I came to understand about EIO, which could be incorrect. https://stackoverflow.com/questions/42434872/writing-programs-to-cope-with-i-o-errors-causing-lost-writes-on-linux
Otherwise, what is the point of this exercise at all where FT would generally assert on all errors that were not EINTR or ENOSPC.
There was a problem hiding this comment.
That's what I was referring to with my comment on the TDB issue: "I didn't realise that was a release-build assert too - in which case the bug is not nearly as serious"
There was a problem hiding this comment.
Actually, yeah, the discussion meanders around from retry indefinitely, to retry exactly once, to fail immediately as the write is lost forever. So yeah, now I think the original suggestion reported in 90296 is correct, abort immediately as a write is lost.
portability/file.cc
Outdated
| } | ||
|
|
||
| if (r == -1) { | ||
| int what_errno = get_error_errno(); |
There was a problem hiding this comment.
Mostly duplicated code with the above
- Factored out i/o error reporting to be common and try to obtain actual filename from descriptor. - Factored out write, close and fsync error handling into a common function that handles special cases sch as ENOLCK, EINTR, ENOSPC, and EIO. See upstream MySQL issues 34823 and 90296 for details on ENOLCK and EIO. - Fixed write, close, and fsync retry logic to use new common handler. - Touched up 'slow fsync' tracking and reporting to use new i/o error reporting. - Converted metric trackers to std::atomic where appropriate. - Various bits of code reformatting and conversions to std::atomic and constexpr, #include style fixes.
|
OK, I reworked it a bit more and factored out the write error handing as well as it wasn't really doing what the function names might indicate. Here is a new set of jenkins runs. Please take another look and approach it as if it were a new review and not just the items commented. https://jenkins.percona.com/view/TokuDB/job/PerconaFT-param/123/ |
laurynas-biveinis
left a comment
There was a problem hiding this comment.
Looks much better now!
| va_list ap; | ||
| va_start(ap, fmt); | ||
|
|
||
| constexpr int tstr_length = 26; |
There was a problem hiding this comment.
static constexpr (but only if there are other comments to address)
filename from descriptor.
and upstream mysql issues 34823 and 90296. Retry will print message every 100
attempts in a loop.