Worker: report a good clear error if a websocket message times out #764

Open
josephjclark opened this issue Sep 5, 2024 · 2 comments

Comments

@josephjclark
Collaborator

If any worker -> lightning message times out on the websocket (i.e. because it took 10 seconds to reply), the run is currently lost.

We can do better than this! We should surely be able to report the timeout somewhere, or keep retrying.

We may need help on the lightning side to recognise that message responses are slow.

Everyone will understand if the system is under load and running slow - so long as the work does get done eventually.

Probably the answer here is just to retry the message, or back off and retry.
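
As a rough illustration of "back off and retry", here is a minimal TypeScript sketch. Every name in it (`backoffDelay`, `sendWithRetry`, the `onRetry` hook, the default delays) is hypothetical and not taken from the worker codebase; `send` stands in for whatever function pushes a message over the websocket and rejects on timeout.

```typescript
// Hypothetical helper: exponential backoff delay for a 0-based attempt,
// capped at maxMs so the wait never grows without bound.
function backoffDelay(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

interface RetryOptions {
  baseMs?: number;
  maxMs?: number;
  maxAttempts?: number; // assumption: defaults to Infinity, i.e. keep retrying
  onRetry?: (attempt: number, err: unknown) => void; // e.g. log the timeout
}

// Hypothetical wrapper: call `send` until it resolves, waiting longer
// after each failure. `send` is assumed to reject when the message times out.
async function sendWithRetry<T>(send: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const baseMs = opts.baseMs ?? 1000;
  const maxMs = opts.maxMs ?? 30000;
  const maxAttempts = opts.maxAttempts ?? Infinity;
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      // Give up only if a finite attempt limit was configured and reached
      if (attempt + 1 >= maxAttempts) throw err;
      // Report the failure somewhere visible before waiting
      if (opts.onRetry) opts.onRetry(attempt, err);
      await sleep(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

The `onRetry` hook is where a clear timeout error could be reported, so a slow reply is visible rather than silently costing the run.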

@josephjclark
Collaborator Author

TD +1 on retry forever

@josephjclark
Collaborator Author

TD maybe bump the claim queue backoff or something when messages start timing out.

So if you only have 1 job in progress, stop sending claim requests to lightning. Because lightning is busy! So back off and let the work finish.

I wonder if this is something like: take the average lightning reply time, and if that exceeds some threshold, multiply the claim backoff by it. In a trivial case, if the average message round trip is 9 seconds, then your backoff is +9 seconds.

That would help decrease load when Lightning is struggling and reduce the chance of lost runs.
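
A minimal TypeScript sketch of that idea, under some assumptions: the names `ReplyTimeTracker` and `claimBackoff`, the rolling window size, and the threshold are all hypothetical, and the sketch follows the "+9 seconds" example by adding the average reply time to the base backoff rather than multiplying by it.

```typescript
// Hypothetical helper: rolling average of recent Lightning reply times.
class ReplyTimeTracker {
  private samples: number[] = [];
  private windowSize: number;

  constructor(windowSize: number = 20) {
    this.windowSize = windowSize;
  }

  // Record one message round trip, dropping the oldest sample when full
  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  // Mean of the samples currently in the window (0 when empty)
  average(): number {
    if (this.samples.length === 0) return 0;
    const total = this.samples.reduce((a, b) => a + b, 0);
    return total / this.samples.length;
  }
}

// Hypothetical policy: once the average reply time exceeds the threshold,
// extend the claim backoff by that average; otherwise leave it alone.
function claimBackoff(baseMs: number, avgReplyMs: number, thresholdMs: number): number {
  return avgReplyMs > thresholdMs ? baseMs + avgReplyMs : baseMs;
}

// Usage: round trips averaging 9 seconds against a 5 second threshold
const tracker = new ReplyTimeTracker();
[8000, 9000, 10000].forEach((ms) => tracker.record(ms));
const nextClaimDelay = claimBackoff(1000, tracker.average(), 5000);
```

With a 1 second base backoff, the usage above waits 10 seconds before the next claim, and the delay drops back to the base once the rolling average falls under the threshold.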
