Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Evaan2001
Copy link

@Evaan2001 Evaan2001 commented Aug 5, 2024

Description

Issue #490 mentions that ByteStreamServer::inner_read drops the future a bit too aggressively when EOF is received. This commit modifies the ByteStreamServer to wait for get_part_fut to finish once EOF is received. The code also checks if get_part_fut returns an error and handles it appropriately.

Currently, the code simply waits for a Result enum to be returned by get_part_fut to finish and checks if there's an error in Result.Err(E). In case there is no error, the code does NOT do anything with Result.Ok(T). I made an assumption that get_part_fut does important work in the background and updates accordingly. If we want to do something with the return value of get_part_fut, let me know how to proceed. :)

Fixes # 490

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

The bytestream_server.rs file is located under nativelink/nativelink-service. This directory also has a test sub-directory. I wanted to run these tests; however, cargo check & cargo test both give the following output:

error[E0599]: no method named `lock_blocking` found for reference `&async_lock::Mutex<T>` in the current scope
   --> nativelink-metric/src/lib.rs:445:25
    |
445 |         let lock = self.lock_blocking();
    |                         ^^^^^^^^^^^^^ method not found in `&Mutex<T>`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `nativelink-metric` (lib) due to 1 previous error

This is what Rust had to say about E0599:

`~ rustc --explain E0599`
This error occurs when a method is used on a type which doesn't implement it:

Erroneous code example:

struct Mouth;

let x = Mouth;
x.chocolate(); // error: no method named `chocolate` found for type `Mouth`
               //        in the current scope

I went to the root directory, where cargo check worked but cargo test had 0 tests to conduct. I expected this behavior as folks usually avoid creating tests in the root directory of a Rust project. Since I couldn't conduct the tests in nativelink/nativelink-service/tests, this code has NOT been tested.


This change is Reviewable

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is probably due to a discrepancy between bazel and cargo tests.

Can you try changing:
nativelink-metric/Cargo.toml:

async-lock = { version = "3.3.0", default-features = false }

to:

async-lock = { version = "3.3.0", features = ["std"], default-features = false }

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved


nativelink-service/src/bytestream_server.rs line 325 at r1 (raw file):

                                        // EOF.
                                        // Wait for get_part_fut to finish
                                        match state.get_part_fut.await{ // Once get_part_fut finishes, we check for errors

We can't quite do it this way. If you look below we call:

state.get_part_fut = Box::pin(pending());

If this was to happen, this future would never finish. What we want to do instead is check to see if that call was made and only await the future if it was not called (you should be able to detect this with state.maybe_get_part_result).


nativelink-service/src/bytestream_server.rs line 337 at r1 (raw file):

                                                }
                                                event!(Level::ERROR, response = ?e);
                                                return Some((Err(e.into()), None));

We probably want to merge these two responses together (like done below).

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't see this error at all when I run cargo test. What OS are you using?

Also try using cargo test --all, this is the way to run all packaged tests.

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved

@Evaan2001
Copy link
Author

Evaan2001 commented Aug 6, 2024

Thanks Blaise for this info and sorry for the delay in getting back to you (today was extra wild ...). I'm using MacOS on Apple silicon. I was able to run the tests with cargo test --all. You're right, the execution of the tests with my commit never terminated, and I'm guessing the future in question never finished.

To ensure the tests were successful with the original code, I reverted my code and ran cargo test --all. I was surprised that even though some error statements were printed, almost all of the tests passed; only 3 were ignored with 0 failures/
measured/filtered-out.

I'll start working on your suggestion but before I amend my commit and ask for another review, I'm curious – are the tests supposed to succeed despite some errors? A sample error statement that got printed was:

"ERROR nativelink_scheduler::simple_scheduler: Error while running do_try_match, err: Error { code: Internal, messages: ["Worker command failed, removing worker 54af2725-8f83-438b-8919-a14d0039f2a0 -- Err(Error { code: Internal, messages: [\"Worker disconnected\"] })", "Failed to run worker_notify_run_action in SimpleScheduler::do_try_match"] }"

@MarcusSorealheis
Copy link
Collaborator

@Evaan2001 in which context did you receive this error:
"ERROR nativelink_scheduler::simple_scheduler: Error while running do_try_match, err: Error { code: Internal, messages: ["Worker command failed, removing worker 54af2725-8f83-438b-8919-a14d0039f2a0 -- Err(Error { code: Internal, messages: ["Worker disconnected"] })", "Failed to run worker_notify_run_action in SimpleScheduler::do_try_match"] }"

This also could be a good question for our Slack with a bit more context. The error can happen for a lot of reasons.

This error would typically occur when a worker fails to run an action that was routed to it, as defined here. I'm not sure what about your change should cause this error just yet.

@Evaan2001
Copy link
Author

Evaan2001 commented Aug 6, 2024

@MarcusSorealheis, I was in the root directory and ran cargo test --all. Given that @allada pointed out a flaw in my coding logic, I wanted to eliminate the possibility of my code breaking anything and simply see what the correct test results looked like. Consequently, I reverted all of my changes before running cargo test --all. In other words, I ran the tests with my branch being the same as upstream.

This error popped up on my screen while running tests/simple_scheduler_test.rs (target/debug/deps/simple_scheduler_test-c3cf20b07e504040). There were 18 tests to be conducted and this is a snippet from my terminal:

`test worker_retries_on_internal_error_and_fails_test ... ok
  2024-08-06T01:15:06.972790Z ERROR nativelink_scheduler::simple_scheduler: Error while running do_try_match, err: Error { code: Internal, messages: ["Worker command failed, removing worker 54af2725-8f83-438b-8919-a14d0039f2a0 -- Err(Error { code: Internal, messages: [\"Worker disconnected\"] })", "Failed to run worker_notify_run_action in SimpleScheduler::do_try_match"] }
    at nativelink-scheduler/src/simple_scheduler.rs:339
    in nativelink_scheduler::simple_scheduler::simple_scheduler_task_worker_matching

test worker_timesout_reschedules_running_job_test ... ok`

Let me know if you want me to provide more context from the terminal. I want to point out that this wasn't the only error message that got printed on my screen. There were in fact about 30 error messages, yet not one test failed. If it'll be helpful, I can also log the results of running cargo test --all on my computer and share it with you all.

@MarcusSorealheis
Copy link
Collaborator

I wonder if this is related to the recent scheduler changes cc @adam-singer

We recently introduced the ability to run with a distributed scheduler or a single scheduler, but I don't believe these changes should have added more errors. I need to run the tests on Mac. I will do it tonight. For the last couple weeks I have been on Linux for development.

@Evaan2001
Copy link
Author

@MarcusSorealheis, let me know if you see the error messages while testing on Mac. I saw 29 errors; sharing 5 I randomly selected here (ignore if not helpful):

  1. Running tests/running_actions_manager_test.rs (target/debug/deps/running_actions_manager_test-daacda8377e1371d)
.
.
test unix_executable_file_test ... ok
  2024-08-06T01:15:10.651632Z ERROR nativelink_worker::running_actions_manager: Command returned non-zero exit code, exit_code: 9, stdout: "", stderr: ""
    at nativelink-worker/src/running_actions_manager.rs:1160

test worker_times_out ... ok
.
.
  1. Running tests/local_worker_test.rs (target/debug/deps/local_worker_test-52765e08b4ec9177)
  2024-08-06T01:15:09.938890Z ERROR nativelink_worker::local_worker: Error, err: Error { code: Internal, messages: ["Got EOF expected UpdateForWorker"] }
    at nativelink-worker/src/local_worker.rs:534
    in local_worker_test::utils::local_worker_test_utils::local_worker_spawn
  1. Running tests/bytestream_server_test.rs (target/debug/deps/bytestream_server_test-8a4014f213431079)
.
.
test out_of_sequence_write ... ok
  2024-08-06T01:15:07.001937Z ERROR nativelink_service::bytestream_server: error: status: InvalidArgument, message: "Expected WriteRequest struct in stream : In ByteStreamServer::write", details: [], metadata: MetadataMap { headers: {} }
    at nativelink-service/src/bytestream_server.rs:640
.
.
  1. Running tests/filesystem_store_test.rs (target/debug/deps/filesystem_store_test-1f7e97c7e6febd9e)
.
.
test file_continues_to_stream_on_content_replace_test ... ok
  2024-08-06T01:15:07.394663Z ERROR nativelink_store::filesystem_store: Failed to touch file, err: Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/var/folders/rh/bz_crfln23q5qc4h149dz_r40000gn/T/nix-shell.F4XFEp/11411840072976465328/content_path/0123456789abcdef000000000000000000010000000000000123456789abcdef-10\""] }
    at nativelink-store/src/filesystem_store.rs:342
.
.
  1. Running tests/cas_server_test.rs (target/debug/deps/cas_server_test-f85a530c68a87e66)
.
.
test empty_store ... ok
  2024-08-06T01:15:07.070854Z ERROR nativelink_service::cas_server: error: status: InvalidArgument, message: "Invalid string length : Invalid sha256 hash: BAD_HASH : Failed on find_missing_blobs() command", details: [], metadata: MetadataMap { headers: {} }
    at nativelink-service/src/cas_server.rs:302
.
.

@adam-singer
Copy link
Member

adam-singer commented Aug 6, 2024

@MarcusSorealheis @Evaan2001 the printing of error messages isn't always an indication of a failed test passing, its an indication that the test executed a line of code that logs an event as error, that could be by the design of the tests intention. The most obvious case is (5) where the test for empty storage and a fetch is tried, we expect an error message to be printed. Unfortunately rust doesn't have the jvm level of log message interception and validation to verify error messages in tests (at least I'm not aware of any reflective test features like that), guessing we could try and hook the event listeners and verify we expect the message to be error level depending on the test context.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the cargo test errors you were getting I was finally able to reproduce. A PR has already landed with the fix:
#1236

So if you rebase you can use it like normal now.

As @adam-singer said, those error messages are not a problem unless the test fails. We have some tests that check to ensure NativeLink handles failures properly and some of these are what you are seeing, but the test actually passes.

As long as the tests pass, you are good.

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved

@Evaan2001
Copy link
Author

@adam-singer, that makes sense. Thanks for clarifying.

@allada, I used rebase to incorporate your changes to the toml files. However, running cargo test inside the nativelink-service directory failed as the code couldn't compile. There was just one error that showed up 13 times, here:

Compiling nativelink-service v0.4.0 (/Users/evaanahmed/Documents/GitHub/nativelink/nativelink-service)
error[E0599]: no variant or associated item named `Gzip` found for enum `CompressionEncoding` in the current scope
   --> nativelink-service/tests/bytestream_server_test.rs:197:75
    |
197 |         make_stream_and_writer_spawn(bs_server, Some(CompressionEncoding::Gzip));
    |                                                                           ^^^^ variant or associated item not found in `CompressionEncoding`
    |
note: if you're trying to build a new `CompressionEncoding`, consider using `CompressionEncoding::from_accept_encoding_header` which returns `Option<CompressionEncoding>`
   --> /Users/evaanahmed/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tonic-0.12.1/src/codec/compression.rs:101:5
    |
101 | /     pub(crate) fn from_accept_encoding_header(
102 | |         map: &http::HeaderMap,
103 | |         enabled_encodings: EnabledCompressionEncodings,
104 | |     ) -> Option<Self> {
    | |_____________________^
help: there is a method `vzip` with a similar name
    |
197 |         make_stream_and_writer_spawn(bs_server, Some(CompressionEncoding::vzip));
    |                                                                           ~~~~

The good thing is that cargo test --all works – the code compiles and the tests run – so I can be on my way to work on your feedback, testing my code by running all the tests.

@Evaan2001 Evaan2001 force-pushed the work_after_EOF branch 2 times, most recently from 86b62d2 to 826c00b Compare August 7, 2024 01:32
@Evaan2001
Copy link
Author

Evaan2001 commented Aug 7, 2024

Hey @allada, just finished updating the commit. Your feedback helped me realize that my understanding of the code was slightly wrong, so I thank you for that :).

cargo test --all conducted 316 tests and my current code passed all of them.

I had a question regarding one of your earlier suggestions. Regarding error handling, you mentioned "merging the 2 responses". I'm afraid I didn't quite understand that. If 1 response is the error message, what's the other response we're talking about? For now, I just returned the error:

return Some((Err(err.into()), None));

@Evaan2001 Evaan2001 requested a review from allada August 7, 2024 01:41
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the code again, I realized response is not a Result, so we don't need to merge the two potential error cases.

Just a few more minor issues and it should be ready.

Thanks and great job!

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved


nativelink-service/src/bytestream_server.rs line 331 at r2 (raw file):

                                        // right before setting state.get_part_fut to Box::pin(pending())
                                        // In other words, the only time we wait for get_part_fut to complete is
                                        // when maybe_get_part_result = None

nit: Can we change this comment to be more direct and short... something like:

// If `maybe_get_part_result` has a value, it means `get_part_fut` will never complete,
// so only await it, if it is not `pending()`.

nativelink-service/src/bytestream_server.rs line 333 at r2 (raw file):

                                        // when maybe_get_part_result = None
                                        let get_part_result = if let Some(result) = state.maybe_get_part_result {
                                            // if maybe_get_part_result is not None, get_part_fut either is done or is set to Box::pin(pending())

nit: I suggest removing this comment and the one below.


nativelink-service/src/bytestream_server.rs line 341 at r2 (raw file):

                                        };

                                        // Now, get_part_result has a future that resulted in either Result<()> or Error.

nit: This comment is just describing the code. Can we remove this comment, as the code is clear enough to understand.

@Evaan2001
Copy link
Author

@allada, shortened the comments; have a look!

@Evaan2001 Evaan2001 requested a review from allada August 7, 2024 16:17
Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to wait for Blaise here but
:lgtm:

Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i personally like more comments

Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 3 discussions need to be resolved

@Evaan2001
Copy link
Author

@allada although I have 2 LGTMs, I still need your final good-to-go as the PR needs you to approve the requested changes. Think you can have a look soon?

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's good, but just that one last comment and it's ready.

Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 1 discussions need to be resolved

@Evaan2001
Copy link
Author

@allada, done!

@Evaan2001 Evaan2001 requested a review from allada August 9, 2024 17:24
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 1 discussions need to be resolved


nativelink-service/src/bytestream_server.rs line 341 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This comment is just describing the code. Can we remove this comment, as the code is clear enough to understand.

can we remove this comment? This is the only blocker. The comment is just describing the line below it, which is self-explanatory.

Issue TraceMachina#490 mentions that ByteStreamServer::inner_read drops future a bit too aggressively when EOF is
received. This commit modifies the ByteStreamServer to wait for get_part_fut to finish once EOF is
received. The code also checks if get_part_fut returns an error and handles it appropriately.

Edit – this commit updates the code commited earlier based on Blaise's feedback on the PR.
@Evaan2001
Copy link
Author

@allada it's done now. I thought you wanted me to delete just // Now, get_part_result has a future that resulted in either Result<()> or Error. but I guess the entire comment block needed to go; my bad for misunderstanding you!

@Evaan2001 Evaan2001 requested a review from allada August 12, 2024 19:18
@Evaan2001
Copy link
Author

Hey @allada, pinging you in case you missed my last review request. The extra comments are removed. Think you can give your LGTM and merge this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants