Skip to content

Conversation

@BhargaviGudi
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a race condition in the JSON enricher that causes audit log entries to be missing or incomplete when commands are executed in rapid succession (e.g., touch test1 && touch test2 && touch test3).

Problem:
When short-lived processes execute quickly, audit events arrive at the JSON enricher before the corresponding BPF events have been processed from the ring buffer. This results in:

  • Empty cmdLine fields
  • Missing requestUID
  • Null resource information (pod/namespace/container)

Solution:
Implements a retry mechanism in dispatchSeccompLine() that attempts to fetch missing information multiple times with progressive delays (0ms, 10ms, 50ms):

  1. BPF Cache Retry: Retries cmdLine and requestUID lookup from BPF process cache, giving the BPF ring buffer poller time to process events
  2. Container Info Retry: Retries container information lookup if initially null

Since LogBuckets remain in cache for 60 seconds before being written to the audit log, there's ample time for retries to succeed while BPF events are processed asynchronously.

Testing Results:

  • Before fix: ~33% success rate for rapid consecutive commands
  • After fix: 100% success rate - all commands captured with complete information

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-62269

Does this PR have test?

Yes - The fix was validated through:

  • Manual testing with rapid consecutive commands (oc exec and oc rsh)
  • Existing unit tests continue to pass
  • Build verification completed successfully

No new automated tests added as this is a timing-dependent race condition that's difficult to reliably reproduce in unit tests.

Special notes for your reviewer:

  • This fix is scoped exclusively to jsonenricher.go - no modifications to BPF components or ring buffer configuration
  • The retry mechanism only activates when information is missing, avoiding unnecessary delays for normal cases
  • Progressive delays (0ms → 10ms → 50ms) balance responsiveness with reliability
  • Early break on success minimizes latency impact

Does this PR introduce a user-facing change?

Fixed race condition in JSON enricher causing audit log entries to be missing or incomplete when commands are executed in rapid succession. Audit logs now reliably capture cmdLine, requestUID, and resource
information for all executed commands, even during bursts of activity.

@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 27, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BhargaviGudi
Once this PR has been reviewed and has the lgtm label, please assign ccojocar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @BhargaviGudi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 27, 2025
@codecov-commenter
Copy link

codecov-commenter commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 28.88889% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.18%. Comparing base (11d77f4) to head (613d628).
⚠️ Report is 1055 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3052       +/-   ##
===========================================
- Coverage   45.50%   24.18%   -21.32%     
===========================================
  Files          79      125       +46     
  Lines        7782    17818    +10036     
===========================================
+ Hits         3541     4309      +768     
- Misses       4099    13225     +9126     
- Partials      142      284      +142     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@saschagrunert
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 27, 2025
@BhargaviGudi BhargaviGudi force-pushed the OCPBUGS-62269-fix branch 2 times, most recently from 96a2136 to 69ba88b Compare November 28, 2025 07:29
@BhargaviGudi BhargaviGudi marked this pull request as ready for review December 1, 2025 06:07
@BhargaviGudi BhargaviGudi changed the title WIP OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands Dec 1, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2025
@BhargaviGudi
Copy link
Contributor Author

/retest-required

@saschagrunert
Copy link
Member

@BhargaviGudi do you mind a rebase?

e.retryBpfRequestUID(logBucket)

// Retry container info lookup if it's still missing
e.retryContainerInfo(ctx, logBucket, processID, nodeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BhargaviGudi Thanks for the PR. Can you try to test if e.retryContainerInfo(ctx, logBucket, processID, nodeName) alone can fix the issue without the retryBpfCmdLine and retryBpfRequestUID? I feel missing container info is the cause of the issue.

…for rapid commands

OCPBUGS-62269 - Fix GitHub Actions linting issues

OCPBUGS-62269 - Fix GitHub Actions linting issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants