Prefer most recent start time on duplicate Pod metrics#1778
dippynark wants to merge 1 commit into kubernetes-sigs:master
Conversation
/assign @RainbowMango

/assign @stevehipwell

/unassign @RainbowMango
What this PR does / why we need it:
There are situations where Metrics Server can scrape duplicate Pod metrics from different nodes (e.g. when kubelet metrics have become stale on a particular node due to a garbage collection bug). In this case Metrics Server ignores the duplicate:
metrics-server/pkg/scraper/scraper.go, lines 171 to 174 at 78192ed
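The referenced check is not reproduced here; for orientation, a minimal sketch of order-dependent "first result wins" deduplication, using hypothetical names (`podKey`, `podMetrics`, `mergeFirstWins`) rather than the actual metrics-server types:

```go
package dedup

// podKey and podMetrics are hypothetical stand-ins for the identifiers
// and per-pod data that metrics-server aggregates across nodes; they are
// not the actual metrics-server types.
type podKey struct{ namespace, name string }

type podMetrics struct {
	key      podKey
	cpuNanos uint64
}

// mergeFirstWins keeps the first copy of each pod's metrics it sees and
// silently drops later duplicates, so which copy survives depends on the
// order in which the parallel per-node scrapes happen to complete.
func mergeFirstWins(scraped []podMetrics) map[podKey]podMetrics {
	out := make(map[podKey]podMetrics, len(scraped))
	for _, pm := range scraped {
		if _, found := out[pm.key]; found {
			continue // duplicate of a pod already scraped from another node: ignored
		}
		out[pm.key] = pm
	}
	return out
}
```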
However, because Metrics Server scrapes Pod metrics from each node in parallel, which copy of the metrics ends up being used for calculating utilisation is fairly random and can change from scrape to scrape due to small differences in request latency. This can make utilisation swing dramatically between scrapes (e.g. for CPU, we report a utilisation of 0 if we scrape the stale node first twice in a row, or no metrics at all if we scrape the current node and then the stale node, due to this PR).
Instead, this PR uses the metrics corresponding to the Pod with the latest container start time.
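A minimal sketch of the idea, again with hypothetical types and names (`mergeLatestStartWins`, `startTime`) rather than the PR's actual change:

```go
package dedup

import "time"

// podKey and podMetrics are hypothetical stand-ins, not the actual
// metrics-server types; startTime is the latest container start time
// reported for the pod in this scrape result.
type podKey struct{ namespace, name string }

type podMetrics struct {
	key       podKey
	startTime time.Time
	cpuNanos  uint64
}

// mergeLatestStartWins deduplicates metrics for the same pod scraped from
// different nodes, preferring the copy whose containers started most
// recently, on the assumption that it comes from the node currently
// running the pod rather than from stale kubelet data.
func mergeLatestStartWins(scraped []podMetrics) map[podKey]podMetrics {
	out := make(map[podKey]podMetrics, len(scraped))
	for _, pm := range scraped {
		existing, found := out[pm.key]
		if !found || pm.startTime.After(existing.startTime) {
			out[pm.key] = pm
		}
	}
	return out
}
```

With this rule the outcome is deterministic regardless of scrape order: scraping the stale node before or after the current node yields the same surviving metrics.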
I am running this PR in our dev environment, where it allows Metrics Server to carry on providing accurate metrics despite kubelet serving stale ones.