-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is probably due to a discrepancy between bazel and cargo tests.
Can you try changing:
nativelink-metric/Cargo.toml
:
async-lock = { version = "3.3.0", default-features = false }
to:
async-lock = { version = "3.3.0", features = ["std"], default-features = false }
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved
nativelink-service/src/bytestream_server.rs
line 325 at r1 (raw file):
// EOF. // Wait for get_part_fut to finish match state.get_part_fut.await{ // Once get_part_fut finishes, we check for errors
We can't quite do it this way. If you look below we call:
state.get_part_fut = Box::pin(pending());
If this was to happen, this future would never finish. What we want to do instead is check to see if that call was made and only await the future if it was not called (you should be able to detect this with state.maybe_get_part_result
).
nativelink-service/src/bytestream_server.rs
line 337 at r1 (raw file):
} event!(Level::ERROR, response = ?e); return Some((Err(e.into()), None));
We probably want to merge these two responses together (like done below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't see this error at all when I run cargo test
. What OS are you using?
Also try using cargo test --all
, this is the way to run all packaged tests.
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved
Thanks Blaise for this info and sorry for the delay in getting back to you (today was extra wild ...). I'm using MacOS on Apple silicon. I was able to run the tests with To ensure the tests were successful with the original code, I reverted my code and ran I'll start working on your suggestion but before I amend my commit and ask for another review, I'm curious – are the tests supposed to succeed despite some errors? A sample error statement that got printed was:
|
@Evaan2001 in which context did you receive this error: This also could be a good question for our Slack with a bit more context. The error can happen for a lot of reasons. This error would typically occur when a worker fails to run an action that was routed to it, as defined here. I'm not sure what about your change should cause this error just yet. |
@MarcusSorealheis, I was in the root directory and ran This error popped up on my screen while running
Let me know if you want me to provide more context from the terminal. I want to point out that this wasn't the only error message that got printed on my screen. There were in fact about 30 error messages, yet not one test failed. If it'll be helpful, I can also log the results of running |
I wonder if this is related to the recent scheduler changes cc @adam-singer We recently introduced the ability to run with a distributed scheduler or a single scheduler, but I don't believe these changes should have added more errors. I need to run the tests on Mac. I will do it tonight. For the last couple weeks I have been on Linux for development. |
@MarcusSorealheis, let me know if you see the error messages while testing on Mac. I saw 29 errors; sharing 5 I randomly selected here (ignore if not helpful):
|
@MarcusSorealheis @Evaan2001 the printing of error messages isn't always an indication of a failed test passing, its an indication that the test executed a line of code that logs an event as error, that could be by the design of the tests intention. The most obvious case is (5) where the test for empty storage and a fetch is tried, we expect an error message to be printed. Unfortunately rust doesn't have the jvm level of log message interception and validation to verify error messages in tests (at least I'm not aware of any reflective test features like that), guessing we could try and hook the event listeners and verify we expect the message to be error level depending on the test context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, the cargo test
errors you were getting I was finally able to reproduce. A PR has already landed with the fix:
#1236
So if you rebase you can use it like normal now.
As @adam-singer said, those error messages are not a problem unless the test fails. We have some tests that check to ensure NativeLink handles failures properly and some of these are what you are seeing, but the test actually passes.
As long as the tests pass, you are good.
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved
d9055e3
to
da9a4a4
Compare
@adam-singer, that makes sense. Thanks for clarifying. @allada, I used rebase to incorporate your changes to the toml files. However, running
The good thing is that |
86b62d2
to
826c00b
Compare
Hey @allada, just finished updating the commit. Your feedback helped me realize that my understanding of the code was slightly wrong, so I thank you for that :).
I had a question regarding one of your earlier suggestions. Regarding error handling, you mentioned "merging the 2 responses". I'm afraid I didn't quite understand that. If 1 response is the error message, what's the other response we're talking about? For now, I just returned the error:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the code again, I realized response
is not a Result
, so we don't need to merge the two potential error cases.
Just a few more minor issues and it should be ready.
Thanks and great job!
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved
nativelink-service/src/bytestream_server.rs
line 331 at r2 (raw file):
// right before setting state.get_part_fut to Box::pin(pending()) // In other words, the only time we wait for get_part_fut to complete is // when maybe_get_part_result = None
nit: Can we change this comment to be more direct and short... something like:
// If `maybe_get_part_result` has a value, it means `get_part_fut` will never complete,
// so only await it, if it is not `pending()`.
nativelink-service/src/bytestream_server.rs
line 333 at r2 (raw file):
// when maybe_get_part_result = None let get_part_result = if let Some(result) = state.maybe_get_part_result { // if maybe_get_part_result is not None, get_part_fut either is done or is set to Box::pin(pending())
nit: I suggest removing this comment and the one below.
nativelink-service/src/bytestream_server.rs
line 341 at r2 (raw file):
}; // Now, get_part_result has a future that resulted in either Result<()> or Error.
nit: This comment is just describing the code. Can we remove this comment, as the code is clear enough to understand.
826c00b
to
74c14f6
Compare
@allada, shortened the comments; have a look! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to wait for Blaise here but
Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i personally like more comments
Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 3 discussions need to be resolved
@allada although I have 2 LGTMs, I still need your final good-to-go as the PR needs you to approve the requested changes. Think you can have a look soon? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's good, but just that one last comment and it's ready.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 1 discussions need to be resolved
74c14f6
to
b21ffd9
Compare
@allada, done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 1 discussions need to be resolved
nativelink-service/src/bytestream_server.rs
line 341 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: This comment is just describing the code. Can we remove this comment, as the code is clear enough to understand.
can we remove this comment? This is the only blocker. The comment is just describing the line below it, which is self-explanatory.
Issue TraceMachina#490 mentions that ByteStreamServer::inner_read drops future a bit too aggressively when EOF is received. This commit modifies the ByteStreamServer to wait for get_part_fut to finish once EOF is received. The code also checks if get_part_fut returns an error and handles it appropriately. Edit – this commit updates the code commited earlier based on Blaise's feedback on the PR.
b21ffd9
to
4ed726d
Compare
@allada it's done now. I thought you wanted me to delete just |
Hey @allada, pinging you in case you missed my last review request. The extra comments are removed. Think you can give your LGTM and merge this PR? |
Description
Issue #490 mentions that ByteStreamServer::inner_read drops the future a bit too aggressively when EOF is received. This commit modifies the ByteStreamServer to wait for get_part_fut to finish once EOF is received. The code also checks if get_part_fut returns an error and handles it appropriately.
Currently, the code simply waits for a Result enum to be returned by get_part_fut to finish and checks if there's an error in Result.Err(E). In case there is no error, the code does NOT do anything with Result.Ok(T). I made an assumption that get_part_fut does important work in the background and updates accordingly. If we want to do something with the return value of get_part_fut, let me know how to proceed. :)
Fixes # 490
Type of change
Please delete options that aren't relevant.
How Has This Been Tested?
The bytestream_server.rs file is located under nativelink/nativelink-service. This directory also has a test sub-directory. I wanted to run these tests; however,
cargo check
&cargo test
both give the following output:This is what Rust had to say about E0599:
I went to the root directory, where
cargo check
worked butcargo test
had 0 tests to conduct. I expected this behavior as folks usually avoid creating tests in the root directory of a Rust project. Since I couldn't conduct the tests in nativelink/nativelink-service/tests, this code has NOT been tested.This change is