Skip to content

zebra: Fix incorrect update of 'nhe_received' in route_entry_update_nhe()#231

Open
mwinter-osr wants to merge 1 commit intoopensourcerouting:masterfrom
mwinter-osr:PR21104
Open

zebra: Fix incorrect update of 'nhe_received' in route_entry_update_nhe()#231
mwinter-osr wants to merge 1 commit intoopensourcerouting:masterfrom
mwinter-osr:PR21104

Conversation

@mwinter-osr
Copy link
Copy Markdown
Member

The variable 'nhe_received' is used to store the NHG received from the upper-level protocol.
It should not be updated once it's set during early route processing, as we want to preserve the original one.

Updating it in 'route_entry_update_nhe()' causes the loss of the original NHG.
Let's fix it.

Original PR21104 by GaladrielZhao

…he()

The variable 'nhe_received' is used to store the NHG
received from the upper-level protocol. It should not
be updated once it's set during early route processing,
as we want to preserve the original one.
Updating it in 'route_entry_update_nhe()' causes the loss
of the original NHG.

Remove the nhe_received update from route_entry_update_nhe().
Instead, handle it in rib_handle_nhg_replace() where old_entry
is about to be force-freed: migrate nhe_received to new_entry
only when it still points to old_entry, preventing a dangling
pointer. When nhe_received already points to a different NHG
(the true original received NHG), it is left untouched.

Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR fixes a refcount and pointer-correctness bug in route_entry_update_nhe() where re->nhe_received was unconditionally overwritten whenever re->nhe was updated, causing loss of the original received NHG and a ref-count leak on it. The fix guards the update behind if (re->nhe_received == old_nhg), adds a pre-migration step in rib_handle_nhg_replace() to avoid dangling pointers, and adds a safety cleanup in rib_re_nhg_free() for the case where nhe_received and nhe diverge. Topotest refcount expectations are doubled to account for the dual re->nhe / re->nhe_received refs now properly held, and a new test_zebra_received_nhe_kept_nexthop_change test verifies correct ref release on nexthop change.

Confidence Score: 5/5

  • Safe to merge; the C fix is correct and all remaining findings are P2 naming/style issues in the topotest.
  • The core ref-count fix in zebra_rib.c is logically sound across all three affected call paths (route_entry_update_nhe NULL-clear, non-NULL replace, and rib_handle_nhg_replace pre-migration). The topotest updates correctly double the expected refcounts to reflect the dual nhe/nhe_received refs now held. Only P2 findings remain: two helper functions have misleading suffixes that don't match the expected values they verify, and a redundant loop with a confusingly-worded comment was added. None of these affect correctness or CI outcomes.
  • tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py — helper function names check_nhg_refcount_2 and check_nhg_refcount_10 should be renamed to match their actual expected values (4 and 20).

Important Files Changed

Filename Overview
zebra/zebra_rib.c Core fix: restricts nhe_received updates in route_entry_update_nhe() to only when nhe_received == old_nhg; adds pre-migration in rib_handle_nhg_replace() and a safety cleanup in rib_re_nhg_free(). Ref-count arithmetic is correct in all three code paths.
tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py Updates refcount expectations to 2× (5→10, 2→4, 10→20) to account for dual nhe/nhe_received refs; adds new nexthop-change test. Helper function names check_nhg_refcount_2 and check_nhg_refcount_10 are misleadingly named (actual expected values are 4 and 20 respectively). Logic is correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["route_entry_update_nhe(re, new_nhghe)"] --> B{new_nhghe == NULL?}

    B -- yes --> C["old_nhg = re->nhe"]
    C --> D{re->nhe_received\n== re->nhe?}
    D -- yes --> E["decrement_ref(nhe_received)\nnhe_received = NULL"]
    D -- no --> F["leave nhe_received untouched\n(different NHG preserved)"]
    E --> G["re->nhe = NULL, goto done"]
    F --> G

    B -- no --> H{re->nhe != new_nhghe\nand re->nhe != NULL?}
    H -- yes --> I["old_nhg = re->nhe\nattach_ref(re, new_nhghe)"]
    H -- no\n(first attach) --> J["attach_ref(re, new_nhghe)\nold_nhg = NULL"]

    I --> K["done: old_nhg != NULL"]
    G --> K

    K --> L{re->nhe_received\n== old_nhg?}
    L -- yes --> M["decrement_ref(old_nhg)\nif new_nhghe: increment_ref(new_nhghe)\nnhe_received = new_nhghe"]
    L -- no\n(original NHG preserved!) --> N["nhe_received unchanged"]
    M --> O["decrement_ref(old_nhg)\n(for re->nhe ref)"]
    N --> O

    J --> P["done: old_nhg == NULL\n(no cleanup needed)"]

    style F fill:#90EE90
    style N fill:#90EE90
    style M fill:#FFE4B5
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 173-178

Comment:
**Misleading helper function name**

The function is renamed to `check_nhg_refcount_2` but the expected value passed to `run_and_expect` is still `4`. The suffix `_2` implies the test is checking for a refcount of 2, which is incorrect. The old name `check_nhg_refcount_4` was accurate. Rename to `check_nhg_refcount_4` (or better, `check_nhg_refcount`) to avoid misleading future readers.

```suggestion
    def check_nhg_refcount_4():
        nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
        nhg_json = json.loads(nhg_info)
        return nhg_json.get(str(nhg_id), {}).get("refCount")

    _, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
    assert result == 4, "NHG refcount is not 4"
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 239-244

Comment:
**Misleading helper function name**

Same pattern as the remove-routes test: the function is renamed to `check_nhg_refcount_10` but the expected refcount being checked is `20`. The suffix `_10` implies it checks for 10, but the comment above correctly states "10 routes × 2 refs = 20". The old name `check_nhg_refcount_20` was accurate.

```suggestion
    def check_nhg_refcount_20():
        nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
        nhg_json = json.loads(nhg_info)
        return nhg_json.get(str(nhg_id), {}).get("refCount")

    _, result = topotest.run_and_expect(check_nhg_refcount_20, 20, count=30, wait=1)
    assert result == 20, "NHG refcount is not 20"
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 103-114

Comment:
**Redundant loop and contradictory comment**

The loop at lines 109–114 re-checks `receivedNexthopGroupId` existence for all 5 prefixes, but this is already asserted (more strictly) in the loop at lines 84–93 which checks both existence AND that the value equals `nhg_id`. The second loop adds no new coverage.

More importantly, the comment at lines 104–108 states `receivedNexthopGroupId may equal nexthopGroupId after resolution` and calls it "expected behavior." This appears to contradict the PR description, which states `nhe_received` should NOT be updated after initial route processing. It is worth clarifying: for routes whose nexthop is already directly connected (no recursive resolution), `nhe` and `nhe_received` legitimately share the same NHG at all times. The comment should say this explicitly to avoid implying the fix is weaker than it is.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "zebra: Fix incorrect update of 'nhe_rece..." | Re-trigger Greptile

Comment on lines +173 to +178
def check_nhg_refcount_2():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")

_, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
_, result = topotest.run_and_expect(check_nhg_refcount_2, 4, count=30, wait=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading helper function name

The function is renamed to check_nhg_refcount_2 but the expected value passed to run_and_expect is still 4. The suffix _2 implies the test is checking for a refcount of 2, which is incorrect. The old name check_nhg_refcount_4 was accurate. Rename to check_nhg_refcount_4 (or better, check_nhg_refcount) to avoid misleading future readers.

Suggested change
def check_nhg_refcount_2():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
_, result = topotest.run_and_expect(check_nhg_refcount_2, 4, count=30, wait=1)
def check_nhg_refcount_4():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
assert result == 4, "NHG refcount is not 4"
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 173-178

Comment:
**Misleading helper function name**

The function is renamed to `check_nhg_refcount_2` but the expected value passed to `run_and_expect` is still `4`. The suffix `_2` implies the test is checking for a refcount of 2, which is incorrect. The old name `check_nhg_refcount_4` was accurate. Rename to `check_nhg_refcount_4` (or better, `check_nhg_refcount`) to avoid misleading future readers.

```suggestion
    def check_nhg_refcount_4():
        nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
        nhg_json = json.loads(nhg_info)
        return nhg_json.get(str(nhg_id), {}).get("refCount")

    _, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
    assert result == 4, "NHG refcount is not 4"
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 239 to 244
route["receivedNexthopGroupId"] == nhg_id
), f"Route {prefix} has different NHG ID"

step("Verify NHG {} has refcount of 10".format(nhg_id))
step("Verify NHG {} has refcount of 20".format(nhg_id))

# Get the NHG information
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading helper function name

Same pattern as the remove-routes test: the function is renamed to check_nhg_refcount_10 but the expected refcount being checked is 20. The suffix _10 implies it checks for 10, but the comment above correctly states "10 routes × 2 refs = 20". The old name check_nhg_refcount_20 was accurate.

Suggested change
route["receivedNexthopGroupId"] == nhg_id
), f"Route {prefix} has different NHG ID"
step("Verify NHG {} has refcount of 10".format(nhg_id))
step("Verify NHG {} has refcount of 20".format(nhg_id))
# Get the NHG information
def check_nhg_refcount_20():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_20, 20, count=30, wait=1)
assert result == 20, "NHG refcount is not 20"
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 239-244

Comment:
**Misleading helper function name**

Same pattern as the remove-routes test: the function is renamed to `check_nhg_refcount_10` but the expected refcount being checked is `20`. The suffix `_10` implies it checks for 10, but the comment above correctly states "10 routes × 2 refs = 20". The old name `check_nhg_refcount_20` was accurate.

```suggestion
    def check_nhg_refcount_20():
        nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
        nhg_json = json.loads(nhg_info)
        return nhg_json.get(str(nhg_id), {}).get("refCount")

    _, result = topotest.run_and_expect(check_nhg_refcount_20, 20, count=30, wait=1)
    assert result == 20, "NHG refcount is not 20"
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +103 to +114
# Verify that receivedNexthopGroupId exists for all routes.
#
# Note: The current implementation updates nhe_received during route
# modification (including route resolution), so receivedNexthopGroupId
# may equal nexthopGroupId after resolution. This is expected behavior.
# We only verify that receivedNexthopGroupId exists and is shared correctly.
for i in range(1, 6):
prefix = f"10.0.0.{i}/32"
route = route_json[prefix][0]
assert (
"receivedNexthopGroupId" in route
), f"Route {prefix} missing receivedNexthopGroupId"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant loop and contradictory comment

The loop at lines 109–114 re-checks receivedNexthopGroupId existence for all 5 prefixes, but this is already asserted (more strictly) in the loop at lines 84–93 which checks both existence AND that the value equals nhg_id. The second loop adds no new coverage.

More importantly, the comment at lines 104–108 states receivedNexthopGroupId may equal nexthopGroupId after resolution and calls it "expected behavior." This appears to contradict the PR description, which states nhe_received should NOT be updated after initial route processing. It is worth clarifying: for routes whose nexthop is already directly connected (no recursive resolution), nhe and nhe_received legitimately share the same NHG at all times. The comment should say this explicitly to avoid implying the fix is weaker than it is.

Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 103-114

Comment:
**Redundant loop and contradictory comment**

The loop at lines 109–114 re-checks `receivedNexthopGroupId` existence for all 5 prefixes, but this is already asserted (more strictly) in the loop at lines 84–93 which checks both existence AND that the value equals `nhg_id`. The second loop adds no new coverage.

More importantly, the comment at lines 104–108 states `receivedNexthopGroupId may equal nexthopGroupId after resolution` and calls it "expected behavior." This appears to contradict the PR description, which states `nhe_received` should NOT be updated after initial route processing. It is worth clarifying: for routes whose nexthop is already directly connected (no recursive resolution), `nhe` and `nhe_received` legitimately share the same NHG at all times. The comment should say this explicitly to avoid implying the fix is weaker than it is.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants