
Add backoff mechanism for ProvReq retry #7182

Open · wants to merge 5 commits into master from provreq-retry

Conversation

yaroslava-serdiuk
Contributor

What type of PR is this?

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 19, 2024
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 19, 2024
@yaroslava-serdiuk
Contributor Author

/cc @aleksandra-malinowska

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 19, 2024
-	defaultRetryTime = 10 * time.Minute
+	defaultRetryTime = 1 * time.Minute
 	maxBackoffTime   = 10 * time.Minute
 	maxCacheSize     = 1000
Contributor

I'm not 100% sure if that'll be sufficient long-term. It doesn't sound like it'll scale very well, since it effectively disables any backoff when there are more than 1k failing ProvReqs - i.e., precisely when a backoff would be most useful for preventing starvation of other requests.

Can you add a todo to clean up ProvReqs when they're resolved (succeed, time out, or are deleted)?

Contributor

To be clear, I'm OK with merging it as is, but I think we should leave a note that it's a stopgap solution.

Contributor Author

Done. I also removed elements that are provisioned or have failed.

@yaroslava-serdiuk yaroslava-serdiuk force-pushed the provreq-retry branch 2 times, most recently from c7dbebc to 242edc5 on August 20, 2024 08:58
@yaroslava-serdiuk
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 20, 2024
@aleksandra-malinowska (Contributor) left a comment

/lgtm
/approve

Please add a test case for this.

@@ -117,7 +117,8 @@ func TestProvisioningRequestPodsInjector(t *testing.T) {
}
for _, tc := range testCases {
client := provreqclient.NewFakeProvisioningRequestClient(context.Background(), t, tc.provReqs...)
injector := ProvisioningRequestPodsInjector{client, clock.NewFakePassiveClock(now)}
backoffTime := map[string]time.Duration{key(notProvisionedRecentlyProvReqB): 2 * time.Minute}
Contributor

Maybe add a new test case for backed off request scenario?

Contributor Author

I put the backed-off request scenario in a separate test case.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 20, 2024
@aleksandra-malinowska
Contributor

/approve cancel

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 20, 2024
"k8s.io/client-go/rest"
"k8s.io/klog/v2"
"k8s.io/utils/clock"
)

const (
-	defaultRetryTime = 10 * time.Minute
+	defaultRetryTime = 1 * time.Minute
Contributor

Please make it a flag.

Contributor Author

This is the initial retry time. What's the reasoning for making it configurable?

Contributor

In this case let's make maxBackoffTime a flag too. And possibly maxCacheSize as well, so the cluster admin can override it as a mitigation if they run into issues with overflowing cache.

Not that I like adding lots of flags to an already impressive collection, but Kueue depends on this feature and CA has a much higher fix-to-release latency and cost (or so it seems), so let's not risk getting stuck with hardcoded values.

Member

There is context here: kubernetes-sigs/kueue#2931. Kueue has received user requests about this retry time.

@aleksandra-malinowska (Contributor) · Sep 5, 2024

@yaroslava-serdiuk I believe it's about kubernetes-sigs/kueue#2931 (reply in thread). But my previous comment stands, let's make it configurable and easy to mitigate. It's not like we have any data supporting this choice of hardcoded values AFAIK.

Contributor

The "right" initial value depends on (at least) 2 factors:

  • size of the cluster and, so, the number of incoming provisioning requests.
  • the performance/throughput of provrequ processing.

We have no control over the first item, and the second item is being improved right now. So there is no easy way of knowing what value is good now, and, more importantly what value will be OK in 3 months. Having it hardcoded makes any adjustment/finetuning much harder.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 9, 2024
@yaroslava-serdiuk
Contributor Author

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 10, 2024
for _, pr := range provReqs {
if !isSupportedClass(pr) {
klog.Warningf("Provisioning Class %s is not supported for ProvReq %s/%s", pr.Spec.ProvisioningClassName, pr.Namespace, pr.Name)
Contributor

I don't think we want to log the warning here. This method will be used in batch processing of check-capacity requests with a custom isSupportedClass function, and we'd end up logging it for best-effort-atomic-scale-up requests.

Can we move logging this to the isSupportedClass function defined in Process() instead?

Contributor

I can't anchor a comment there, but what I mean is:

		func(pr *provreqwrapper.ProvisioningRequest) bool {
			_, found := provisioningrequest.SupportedProvisioningClasses[pr.Spec.ProvisioningClassName]
			if !found {
				klog.Warningf("Provisioning Class %s is not supported for ProvReq %s/%s", pr.Spec.ProvisioningClassName, pr.Namespace, pr.Name)
			}
			return found
		})

Contributor Author

Done

@aleksandra-malinowska
Contributor

Thanks for moving this to use an LRU cache, it looks much better! One small comment, otherwise it's good to go. Feel free to unhold when you're ready to merge this.

/lgtm
/approve

/hold

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2024
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2024
@yaroslava-serdiuk
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2024
@yaroslava-serdiuk
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2024
@yaroslava-serdiuk
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/cluster-autoscaler
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/feature Categorizes issue or PR as related to a new feature.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
5 participants