chore: add timing logs for file index partition and file listing #18417
suryaprasanna wants to merge 2 commits into apache:master from
Conversation
voonhous left a comment
Main concern here is that the logs are too noisy.
- Are these logs only used for debugging?
- Should we use log.debug(...) for routine timing logs instead of log.info(...)?
- We can consider logging at info level when something is slow, i.e. if (elapsed > threshold) log.info(...), or unexpected (e.g., large cache miss ratio)
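The threshold idea suggested above could look roughly like this. This is a sketch, not Hudi code: the class, helper names, and the SLOW_THRESHOLD_MS constant are all hypothetical, and java.util.logging stands in for whatever logger the codebase uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class TimingLogSketch {
  private static final Logger LOG = Logger.getLogger(TimingLogSketch.class.getName());
  // Hypothetical threshold; a real implementation would likely make this configurable.
  static final long SLOW_THRESHOLD_MS = 1000L;

  // Pick the log level based on how long the operation took.
  static Level levelFor(long elapsedMs) {
    return elapsedMs > SLOW_THRESHOLD_MS ? Level.INFO : Level.FINE;
  }

  static void logTiming(long elapsedMs, int fileCount) {
    // Routine timings go to FINE (debug); only slow operations surface at INFO.
    LOG.log(levelFor(elapsedMs),
        String.format("took %dms to filter %d files into file slices", elapsedMs, fileCount));
  }

  public static void main(String[] args) {
    System.out.println(levelFor(5));    // FINE
    System.out.println(levelFor(2500)); // INFO
  }
}
```

This keeps the routine per-query lines out of INFO while still surfacing genuinely slow queries without any logger reconfiguration.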
| .collect(Collectors.toList()) | ||
| )); | ||
| } finally { | ||
| log.info("On {} with query instant as {}, it took {}ms to filter {} files into file slices across {} partitions", |
My main concern here is that the logs generated here will be too noisy.
This fires on every single query, even if it took 1ms. For a busy Spark application hitting multiple Hudi tables, this alone generates one info line per query.
Moved to debug.
| log.info("On {}, out of {} partition paths, {} are missing from cache. Loading them.", | ||
| metaClient.getTableConfig().getTableName(), partitionPaths.size(), missingPartitionPaths.size()); |
Guard the "missing from cache" log with if (missingPartitionPaths.size() > 0); there is no point logging when everything is cached.
Added a condition to handle this, please review it again.
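The guard being discussed amounts to an emptiness check before logging. A minimal sketch with stand-in types (the real code lives in BaseHoodieTableFileIndex and uses its own fields and logger):

```java
import java.util.List;

public class CacheMissGuard {
  // Only worth logging when at least one partition actually missed the cache;
  // a fully cached lookup should produce no log line at all.
  static boolean shouldLogCacheMiss(List<String> missingPartitionPaths) {
    return !missingPartitionPaths.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(shouldLogCacheMiss(List.of()));             // false
    System.out.println(shouldLogCacheMiss(List.of("2024/01/01"))); // true
  }
}
```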
6b4efb3 to d846af1
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18417 +/- ##
============================================
- Coverage 68.26% 68.25% -0.01%
- Complexity 27761 27769 +8
============================================
Files 2439 2439
Lines 134355 134374 +19
Branches 16210 16209 -1
============================================
+ Hits 91715 91722 +7
- Misses 35543 35554 +11
- Partials 7097 7098 +1
| return result; | ||
| } catch (IOException e) { | ||
| throw new HoodieIOException("Failed to list partition paths", e); | ||
| throw new HoodieIOException("On " + metaClient.getTableConfig().getTableName() + " Failed to list partition paths", e); |
Nit:
| throw new HoodieIOException("On " + metaClient.getTableConfig().getTableName() + " Failed to list partition paths", e); | |
| throw new HoodieIOException("On " + metaClient.getTableConfig().getTableName() + " Failed to list partition paths", e); |
| matchedPartitionPaths = tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths, | ||
| partitionFields, partitionColumnPredicates); | ||
| } catch (IOException e) { | ||
| throw new HoodieIOException("Error fetching partition paths", e); |
Nit: inconsistent error-message enrichment; the debug log has the table name, so we should include the table name here too for better debuggability. (Even though the exception might contain some clues to this / have the partition path.)
Ideally, let's make it the same exception message in listPartitionPathFiles.
| } catch (IOException e) { | ||
| throw new HoodieIOException("Error fetching partition paths", e); | ||
| } finally { | ||
| log.debug("On {}, it took {} ms to list partition paths with {} relativePartitionPaths and partition predicates", |
Did you mean to log partitionColumnPredicates here too, after "and partition predicates"?
| if (queryType == HoodieTableQueryType.INCREMENTAL && incrementalQueryStartTime.isPresent() && !isBeforeTimelineStarts()) { | ||
| HoodieTimeline timelineToQuery = findInstantsInRange(); | ||
| matchedPartitionPaths = TimelineUtils.getWrittenPartitions(timelineToQuery); | ||
| } else { |
Is this intentional? Only the getPartitionPathWithPathPrefixes branch is timed.
The TimelineUtils.getWrittenPartitions(timelineToQuery) branch (incremental queries) gets no timing.
IIUC, the incremental-query partition listing is not likely to be slow?
🤖 The asymmetry is reasonable IMO — TimelineUtils.getWrittenPartitions reads from the already-loaded in-memory timeline, so it's typically O(commits) with no filesystem I/O, while getPartitionPathWithPathPrefixes does actual path listing against the filesystem or metadata table. That said, for very large timelines (many commits, long retention) it could be non-trivial, so adding timing there for completeness wouldn't hurt — but I wouldn't call it a blocker.
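If timing were added to the incremental branch for completeness, a symmetric sketch could wrap the call in the same elapsed-time measurement. Here System.nanoTime stands in for Hudi's timer utility, and the supplier stands in for the TimelineUtils.getWrittenPartitions call:

```java
import java.util.List;
import java.util.function.Supplier;

public class IncrementalTimingSketch {
  // Runs the supplier and returns the elapsed wall-clock time in milliseconds.
  static <T> long timeMs(Supplier<T> body) {
    long start = System.nanoTime();
    body.get();
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    // Stand-in for getWrittenPartitions(timelineToQuery); real code would
    // log the elapsed time at debug level alongside the table name.
    long elapsed = timeMs(() -> List.of("p1", "p2"));
    System.out.println(elapsed >= 0); // true
  }
}
```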
| if (missingPartitionPaths.isEmpty()) { | ||
| return cachedPartitionPaths.values().stream() | ||
| .flatMap(Collection::stream) | ||
| .collect(Collectors.toList()); | ||
| } | ||
Early return is a behavioral change; this should be called out.
Good optimization, but the PR description says "No functional behavior is intended to change".
Please update the description accordingly.
| throw new HoodieIOException("On " + metaClient.getTableConfig().getTableName() + " Failed to list partition paths", e); | ||
| } finally { | ||
| log.debug("On {}, it took {} ms to fetch files for {} uncached partitions via getAllFilesInPartitions", | ||
| metaClient.getTableConfig().getTableName(), timer.endTimer(), missingPartitionPaths.size()); |
Nit (feel free to ignore): the timer includes close time here. I assume closing is cheap, so this doesn't matter.
yihua left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The approach is solid and the observability improvements are useful. CI is passing and the change is low-risk. A couple of small observations in the inline comments — the main open threads from @voonhous around the getPartitionPathWithPathPrefixes branch not being timed and the missing table name in the listPartitionPaths exception message seem worth addressing before merge.
| .collect(Collectors.toList()) | ||
| )); | ||
| } finally { | ||
| log.debug("On {} with query instant as {}, it took {}ms to filter {} files into file slices across {} partitions", |
🤖 The finally here has no corresponding catch, so this log fires with the same wording whether the filter succeeded or threw an exception. During debugging, seeing "it took Xms to filter Y files into file slices" right before a stack trace could be confusing. Would it be worth omitting the log (or using a different message) on the exception path, similar to how you guard the cache-miss log?
| @@ -502,7 +527,10 @@ |
🤖 Same pattern here — on the IOException path the finally logs "it took Xms to fetch files for N uncached partitions via getAllFilesInPartitions", which implies the fetch completed. Since the exception message is also logged by the caller, the timing line could mislead someone into thinking the fetch succeeded. Could this message be written in a way that's unambiguous on both success and failure paths, e.g. "...time spent in getAllFilesInPartitions attempt..."?
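One way to keep a finally-block timing line unambiguous is a completed flag, so the failure path gets different wording. A sketch under the assumption that the message is built in finally (class, method, and message wording here are illustrative, not the PR's code):

```java
public class FinallyTimingSketch {
  static String lastTimingMessage;

  // Times the body and words the timing message differently on the failure
  // path, so a timing line right before a stack trace is not misread as success.
  static void timedFetch(Runnable body) {
    long start = System.nanoTime();
    boolean completed = false;
    try {
      body.run();
      completed = true;
    } finally {
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      lastTimingMessage = completed
          ? "fetched files in " + elapsedMs + "ms"
          : "failed after " + elapsedMs + "ms while fetching files";
    }
  }

  public static void main(String[] args) {
    timedFetch(() -> {});
    System.out.println(lastTimingMessage.startsWith("fetched")); // true
    try {
      timedFetch(() -> { throw new RuntimeException("boom"); });
    } catch (RuntimeException e) {
      // expected; the finally block already recorded the failure wording
    }
    System.out.println(lastTimingMessage.startsWith("failed")); // true
  }
}
```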
Describe the issue this Pull Request addresses
File index operations currently do not provide enough visibility into how much time is spent listing partitions, fetching files from metadata, and filtering files into file slices. This makes it harder to debug slow queries and identify where file-index time is being spent.
Summary and Changelog
Adds timing and context logs in BaseHoodieTableFileIndex for key file-index stages:
- getAllFilesInPartitions
- filterFiles while building file slices
No public API or user-facing behavior change. This improves observability for file-index execution and helps diagnose performance issues in partition and file listing paths.
Risk Level
low
This change only adds logging and slightly updates an internal exception message with table context. No functional behavior is intended to change.
Documentation Update
none
Contributor's checklist