TRT-2260: Create High CPU intervals for nodes #30152
@xueqzhan: This pull request references TRT-2260 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 025845d
```go
	Step: 30 * time.Second, // Sample every 30 seconds for better granularity
}

// Query for CPU usage percentage per instance
```
We have an existing monitor test that detects if metrics are down.
I think generally when we hit high CPU load, the metrics become unreliable or even down.
Should we consider emitting an interval here when the metrics are down?
I'm off VPN, but I'd link the bug I am slowly looking at around MetricsEndPointDown, which is flaky on over 75% of our tests. I think this happens quite often, so I wanted to bring it up.
Thanks, Kevin, for the review. We want to clearly correlate high CPU with e2e tests, so having this interval created directly from the metric is preferred.
Drilling into one of these jobs: I think this is working correctly. It took me a little time to figure out that it is in NodeMonitor.
We decided to put this in its own source section.
Job Failure Risk Analysis for sha: 51fc0e8
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 51fc0e8
/hold cancel
I looked through this to see what I could learn and I think I understand it :)
Just some questions and comments
/lgtm
/hold but fire away if you're ready
```go
if apierrors.IsNotFound(err) {
	return []monitorapi.Interval{}, nil
}
```
no need to handle any other errors?
```go
intervals, err := prometheus.EnsureThanosQueriersConnectedToPromSidecars(ctx, prometheusClient)
if err != nil {
	return intervals, err
}
```
It would be clearer to me to show that the intervals only matter if there is an error:
```diff
-intervals, err := prometheus.EnsureThanosQueriersConnectedToPromSidecars(ctx, prometheusClient)
-if err != nil {
-	return intervals, err
-}
+if intervals, err := prometheus.EnsureThanosQueriersConnectedToPromSidecars(ctx, prometheusClient); err != nil {
+	return intervals, err
+}
```
```go
msg := monitorapi.NewMessage().
	Reason(monitorapi.IntervalReason("HighCPUUsage")).
	HumanMessage(fmt.Sprintf("CPU usage above %.1f%% threshold on instance %s", w.cpuThreshold, instance)).
	WithAnnotation("cpu_threshold", fmt.Sprintf("%.1f", w.cpuThreshold))

intervalTmpl := monitorapi.NewInterval(monitorapi.SourceNodeMonitor, monitorapi.Warning).
	Locator(lb).
	Message(msg).
	Display()
```
It seems a little convoluted to build this up when the only values needed in createCPUInterval are the instance name and locator. It would be useful if you could directly use it to create an interval, but it seems you can't do that? I would just pass the values directly...
Blame any convoluted logic on cursor. Anything smart is from me. :)
```go
// Filter for high CPU intervals
highCPUIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
	return eventInterval.Source == monitorapi.SourceNodeMonitor &&
		eventInterval.Message.Reason == monitorapi.IntervalReason("HighCPUUsage")
})

logger.Infof("collected %d high CPU intervals for analysis", len(highCPUIntervals))
```
Do we need to do this here because the collection might be called more than once? Otherwise it seems like it would be simpler to log at the time of collection.
It would be clearer to me if this could just be combined into highcpumetriccollector, but I guess that may be more trouble than it's worth.
This will be too much for this story.
/test okd-scos-images
/lgtm
@xueqzhan: The following tests failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: ceb078d
Please include these in the spyglass intervals files we render by default; I think it's in the BelongsInSpyglass function. My vision for this was always that you could create new intervals and they'd just be visible. That's true if the user knows to load the huge "all" events file, but everyone typically works with the spyglass files. I almost wonder if that function should include everything with Display true, but that should be a separate PR for another day. My weak attempt to correlate high CPU tests is picking up almost too much to be useful, but maybe it will help find something. I'm curious to see how this goes with #30171 just merged. Anyhow, if we can get these into the default files I think this looks good to go. Super high value, Ken, thank you, and I love the inclusion of the peak CPU.
That function is indeed allowing most by default, but it is not the function to add a section to the spyglass view. Let's hope my latest commit works. :)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, sosiouxme, xueqzhan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Assisted by cursor