zebra: Fix incorrect update of 'nhe_received' in route_entry_update_nhe()#231
zebra: Fix incorrect update of 'nhe_received' in route_entry_update_nhe()#231mwinter-osr wants to merge 1 commit intoopensourcerouting:masterfrom
Conversation
…he() The variable 'nhe_received' is used to store the NHG received from the upper-level protocol. It should not be updated once it's set during early route processing, as we want to preserve the original one. Updating it in 'route_entry_update_nhe()' causes the loss of the original NHG. Remove the nhe_received update from route_entry_update_nhe(). Instead, handle it in rib_handle_nhg_replace() where old_entry is about to be force-freed: migrate nhe_received to new_entry only when it still points to old_entry, preventing a dangling pointer. When nhe_received already points to a different NHG (the true original received NHG), it is left untouched. Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>
Greptile SummaryThis PR fixes a refcount and pointer-correctness bug in
|
| Filename | Overview |
|---|---|
| zebra/zebra_rib.c | Core fix: restricts nhe_received updates in route_entry_update_nhe() to only when nhe_received == old_nhg; adds pre-migration in rib_handle_nhg_replace() and a safety cleanup in rib_re_nhg_free(). Ref-count arithmetic is correct in all three code paths. |
| tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py | Updates refcount expectations to 2× (5→10, 2→4, 10→20) to account for dual nhe/nhe_received refs; adds new nexthop-change test. Helper function names check_nhg_refcount_2 and check_nhg_refcount_10 are misleadingly named (actual expected values are 4 and 20 respectively). Logic is correct. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["route_entry_update_nhe(re, new_nhghe)"] --> B{new_nhghe == NULL?}
B -- yes --> C["old_nhg = re->nhe"]
C --> D{re->nhe_received\n== re->nhe?}
D -- yes --> E["decrement_ref(nhe_received)\nnhe_received = NULL"]
D -- no --> F["leave nhe_received untouched\n(different NHG preserved)"]
E --> G["re->nhe = NULL, goto done"]
F --> G
B -- no --> H{re->nhe != new_nhghe\nand re->nhe != NULL?}
H -- yes --> I["old_nhg = re->nhe\nattach_ref(re, new_nhghe)"]
H -- no\n(first attach) --> J["attach_ref(re, new_nhghe)\nold_nhg = NULL"]
I --> K["done: old_nhg != NULL"]
G --> K
K --> L{re->nhe_received\n== old_nhg?}
L -- yes --> M["decrement_ref(old_nhg)\nif new_nhghe: increment_ref(new_nhghe)\nnhe_received = new_nhghe"]
L -- no\n(original NHG preserved!) --> N["nhe_received unchanged"]
M --> O["decrement_ref(old_nhg)\n(for re->nhe ref)"]
N --> O
J --> P["done: old_nhg == NULL\n(no cleanup needed)"]
style F fill:#90EE90
style N fill:#90EE90
style M fill:#FFE4B5
Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 173-178
Comment:
**Misleading helper function name**
The function is renamed to `check_nhg_refcount_2` but the expected value passed to `run_and_expect` is still `4`. The suffix `_2` implies the test is checking for a refcount of 2, which is incorrect. The old name `check_nhg_refcount_4` was accurate. Rename to `check_nhg_refcount_4` (or better, `check_nhg_refcount`) to avoid misleading future readers.
```suggestion
def check_nhg_refcount_4():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
assert result == 4, "NHG refcount is not 4"
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 239-244
Comment:
**Misleading helper function name**
Same pattern as the remove-routes test: the function is renamed to `check_nhg_refcount_10` but the expected refcount being checked is `20`. The suffix `_10` implies it checks for 10, but the comment above correctly states "10 routes × 2 refs = 20". The old name `check_nhg_refcount_20` was accurate.
```suggestion
def check_nhg_refcount_20():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_20, 20, count=30, wait=1)
assert result == 20, "NHG refcount is not 20"
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 103-114
Comment:
**Redundant loop and contradictory comment**
The loop at lines 109–114 re-checks `receivedNexthopGroupId` existence for all 5 prefixes, but this is already asserted (more strictly) in the loop at lines 84–93 which checks both existence AND that the value equals `nhg_id`. The second loop adds no new coverage.
More importantly, the comment at lines 104–108 states `receivedNexthopGroupId may equal nexthopGroupId after resolution` and calls it "expected behavior." This appears to contradict the PR description, which states `nhe_received` should NOT be updated after initial route processing. It is worth clarifying: for routes whose nexthop is already directly connected (no recursive resolution), `nhe` and `nhe_received` legitimately share the same NHG at all times. The comment should say this explicitly to avoid implying the fix is weaker than it is.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "zebra: Fix incorrect update of 'nhe_rece..." | Re-trigger Greptile
| def check_nhg_refcount_2(): | ||
| nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id)) | ||
| nhg_json = json.loads(nhg_info) | ||
| return nhg_json.get(str(nhg_id), {}).get("refCount") | ||
|
|
||
| _, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1) | ||
| _, result = topotest.run_and_expect(check_nhg_refcount_2, 4, count=30, wait=1) |
There was a problem hiding this comment.
Misleading helper function name
The function is renamed to check_nhg_refcount_2 but the expected value passed to run_and_expect is still 4. The suffix _2 implies the test is checking for a refcount of 2, which is incorrect. The old name check_nhg_refcount_4 was accurate. Rename to check_nhg_refcount_4 (or better, check_nhg_refcount) to avoid misleading future readers.
| def check_nhg_refcount_2(): | |
| nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id)) | |
| nhg_json = json.loads(nhg_info) | |
| return nhg_json.get(str(nhg_id), {}).get("refCount") | |
| _, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1) | |
| _, result = topotest.run_and_expect(check_nhg_refcount_2, 4, count=30, wait=1) | |
| def check_nhg_refcount_4(): | |
| nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id)) | |
| nhg_json = json.loads(nhg_info) | |
| return nhg_json.get(str(nhg_id), {}).get("refCount") | |
| _, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1) | |
| assert result == 4, "NHG refcount is not 4" |
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 173-178
Comment:
**Misleading helper function name**
The function is renamed to `check_nhg_refcount_2` but the expected value passed to `run_and_expect` is still `4`. The suffix `_2` implies the test is checking for a refcount of 2, which is incorrect. The old name `check_nhg_refcount_4` was accurate. Rename to `check_nhg_refcount_4` (or better, `check_nhg_refcount`) to avoid misleading future readers.
```suggestion
def check_nhg_refcount_4():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_4, 4, count=30, wait=1)
assert result == 4, "NHG refcount is not 4"
```
How can I resolve this? If you propose a fix, please make it concise.| route["receivedNexthopGroupId"] == nhg_id | ||
| ), f"Route {prefix} has different NHG ID" | ||
|
|
||
| step("Verify NHG {} has refcount of 10".format(nhg_id)) | ||
| step("Verify NHG {} has refcount of 20".format(nhg_id)) | ||
|
|
||
| # Get the NHG information |
There was a problem hiding this comment.
Misleading helper function name
Same pattern as the remove-routes test: the function is renamed to check_nhg_refcount_10 but the expected refcount being checked is 20. The suffix _10 implies it checks for 10, but the comment above correctly states "10 routes × 2 refs = 20". The old name check_nhg_refcount_20 was accurate.
| route["receivedNexthopGroupId"] == nhg_id | |
| ), f"Route {prefix} has different NHG ID" | |
| step("Verify NHG {} has refcount of 10".format(nhg_id)) | |
| step("Verify NHG {} has refcount of 20".format(nhg_id)) | |
| # Get the NHG information | |
| def check_nhg_refcount_20(): | |
| nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id)) | |
| nhg_json = json.loads(nhg_info) | |
| return nhg_json.get(str(nhg_id), {}).get("refCount") | |
| _, result = topotest.run_and_expect(check_nhg_refcount_20, 20, count=30, wait=1) | |
| assert result == 20, "NHG refcount is not 20" |
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 239-244
Comment:
**Misleading helper function name**
Same pattern as the remove-routes test: the function is renamed to `check_nhg_refcount_10` but the expected refcount being checked is `20`. The suffix `_10` implies it checks for 10, but the comment above correctly states "10 routes × 2 refs = 20". The old name `check_nhg_refcount_20` was accurate.
```suggestion
def check_nhg_refcount_20():
nhg_info = r1.vtysh_cmd("show nexthop-group rib {} json".format(nhg_id))
nhg_json = json.loads(nhg_info)
return nhg_json.get(str(nhg_id), {}).get("refCount")
_, result = topotest.run_and_expect(check_nhg_refcount_20, 20, count=30, wait=1)
assert result == 20, "NHG refcount is not 20"
```
How can I resolve this? If you propose a fix, please make it concise.| # Verify that receivedNexthopGroupId exists for all routes. | ||
| # | ||
| # Note: The current implementation updates nhe_received during route | ||
| # modification (including route resolution), so receivedNexthopGroupId | ||
| # may equal nexthopGroupId after resolution. This is expected behavior. | ||
| # We only verify that receivedNexthopGroupId exists and is shared correctly. | ||
| for i in range(1, 6): | ||
| prefix = f"10.0.0.{i}/32" | ||
| route = route_json[prefix][0] | ||
| assert ( | ||
| "receivedNexthopGroupId" in route | ||
| ), f"Route {prefix} missing receivedNexthopGroupId" |
There was a problem hiding this comment.
Redundant loop and contradictory comment
The loop at lines 109–114 re-checks receivedNexthopGroupId existence for all 5 prefixes, but this is already asserted (more strictly) in the loop at lines 84–93 which checks both existence AND that the value equals nhg_id. The second loop adds no new coverage.
More importantly, the comment at lines 104–108 states receivedNexthopGroupId may equal nexthopGroupId after resolution and calls it "expected behavior." This appears to contradict the PR description, which states nhe_received should NOT be updated after initial route processing. It is worth clarifying: for routes whose nexthop is already directly connected (no recursive resolution), nhe and nhe_received legitimately share the same NHG at all times. The comment should say this explicitly to avoid implying the fix is weaker than it is.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_received_nhe_kept/test_zebra_received_nhe_kept.py
Line: 103-114
Comment:
**Redundant loop and contradictory comment**
The loop at lines 109–114 re-checks `receivedNexthopGroupId` existence for all 5 prefixes, but this is already asserted (more strictly) in the loop at lines 84–93 which checks both existence AND that the value equals `nhg_id`. The second loop adds no new coverage.
More importantly, the comment at lines 104–108 states `receivedNexthopGroupId may equal nexthopGroupId after resolution` and calls it "expected behavior." This appears to contradict the PR description, which states `nhe_received` should NOT be updated after initial route processing. It is worth clarifying: for routes whose nexthop is already directly connected (no recursive resolution), `nhe` and `nhe_received` legitimately share the same NHG at all times. The comment should say this explicitly to avoid implying the fix is weaker than it is.
How can I resolve this? If you propose a fix, please make it concise.
The variable 'nhe_received' is used to store the NHG received from the upper-level protocol.
It should not be updated once it's set during early route processing, as we want to preserve the original one.
Updating it in 'route_entry_update_nhe()' causes the loss of the original NHG.
Let's fix it.
Original PR21104 by GaladrielZhao