TRT-2260: Create High CPU intervals for nodes #30152
@xueqzhan: This pull request references TRT-2260 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 025845d
```go
	Step: 30 * time.Second, // Sample every 30 seconds for better granularity
}

// Query for CPU usage percentage per instance
```
We have an existing monitor test that detects if metrics are down.
I think generally when we hit high CPU load, the metrics become unreliable or even down.
Should we consider emitting an interval here when the metrics are down?
I'm off VPN, but I'd link the bug I am slowly looking at around MetricsEndPointDown, which is flaky on over 75% of our tests. I think this happens quite often, so I wanted to bring it up.
Thanks, Kevin, for the review. We want to clearly correlate high CPU with e2e tests, so having this interval created directly from the metric is preferred.
Drilling into one of these jobs: I think this is working correctly. It took me a little time to figure out that it is in NodeMonitor.
We decided to put this in its own source section.
Job Failure Risk Analysis for sha: 51fc0e8
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 51fc0e8
/hold cancel
I looked through this to see what I could learn and I think I understand it :)
Just some questions and comments
/lgtm
/hold but fire away if you're ready
```go
if apierrors.IsNotFound(err) {
	return []monitorapi.Interval{}, nil
}
```
no need to handle any other errors?
```go
intervals, err := prometheus.EnsureThanosQueriersConnectedToPromSidecars(ctx, prometheusClient)
if err != nil {
	return intervals, err
}
```
It would be clearer to me to show that the intervals only matter if there is an error:
```diff
-intervals, err := prometheus.EnsureThanosQueriersConnectedToPromSidecars(ctx, prometheusClient)
-if err != nil {
-	return intervals, err
-}
+if intervals, err := prometheus.EnsureThanosQueriersConnectedToPromSidecars(ctx, prometheusClient); err != nil {
+	return intervals, err
+}
```
```go
msg := monitorapi.NewMessage().
	Reason(monitorapi.IntervalReason("HighCPUUsage")).
	HumanMessage(fmt.Sprintf("CPU usage above %.1f%% threshold on instance %s", w.cpuThreshold, instance)).
	WithAnnotation("cpu_threshold", fmt.Sprintf("%.1f", w.cpuThreshold))

intervalTmpl := monitorapi.NewInterval(monitorapi.SourceNodeMonitor, monitorapi.Warning).
	Locator(lb).
	Message(msg).
	Display()
```
It seems a little convoluted to build this up when the only values needed in createCPUInterval are the instance name and locator. It would be useful if you could directly use it to create an interval, but it seems you can't do that? I would just pass the values directly...
Blame any convoluted logic on cursor. Anything smart is from me. :)
```go
// Filter for high CPU intervals
highCPUIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
	return eventInterval.Source == monitorapi.SourceNodeMonitor &&
		eventInterval.Message.Reason == monitorapi.IntervalReason("HighCPUUsage")
})

logger.Infof("collected %d high CPU intervals for analysis", len(highCPUIntervals))
```
Do we need to do this here because the collection might be called more than once? Otherwise it seems like it would be simpler to log at the time of collection.
It would be clearer to me if this could just be combined into highcpumetriccollector, but I guess that may be more trouble than it's worth.
This will be too much for this story.
/test okd-scos-images
/lgtm
@xueqzhan: The following tests failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: ceb078d
Please include these in the spyglass intervals files we render by default; I think it's in the BelongsInSpyglass function. My vision for this was always that you could create new intervals and they'd just be visible. That's true if the user knows to load the huge "all" events file, but everyone typically works with the spyglass files. I almost wonder if that function should include everything with Display true, but that should be a separate PR for another day. My weak attempt to correlate high CPU tests is picking up almost too much to be useful, but maybe it will help find something. I'm curious to see how this goes with #30171 just merged. Anyhow, if we can get these into the default files I think this looks good to go. Super high value, Ken, thank you, and I love the inclusion of the peak CPU.
That function is indeed allowing most by default, but it is not the function to add a section to the spyglass view. Let's hope my latest commit works. :)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, sosiouxme, xueqzhan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Assisted by cursor