Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234

Evaan2001 · 2024-08-05T15:09:16Z

Description

Issue #490 mentions that ByteStreamServer::inner_read drops the future a bit too aggressively when EOF is received. This commit modifies the ByteStreamServer to wait for get_part_fut to finish once EOF is received. The code also checks if get_part_fut returns an error and handles it appropriately.

Currently, the code simply waits for a Result enum to be returned by get_part_fut to finish and checks if there's an error in Result.Err(E). In case there is no error, the code does NOT do anything with Result.Ok(T). I made an assumption that get_part_fut does important work in the background and updates accordingly. If we want to do something with the return value of get_part_fut, let me know how to proceed. :)

Fixes # 490

Type of change

Please delete options that aren't relevant.

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

The bytestream_server.rs file is located under nativelink/nativelink-service. This directory also has a test sub-directory. I wanted to run these tests; however, cargo check & cargo test both give the following output:

error[E0599]: no method named `lock_blocking` found for reference `&async_lock::Mutex<T>` in the current scope
   --> nativelink-metric/src/lib.rs:445:25
    |
445 |         let lock = self.lock_blocking();
    |                         ^^^^^^^^^^^^^ method not found in `&Mutex<T>`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `nativelink-metric` (lib) due to 1 previous error

This is what Rust had to say about E0599:

`~ rustc --explain E0599`
This error occurs when a method is used on a type which doesn't implement it:

Erroneous code example:

struct Mouth;

let x = Mouth;
x.chocolate(); // error: no method named `chocolate` found for type `Mouth`
               //        in the current scope

I went to the root directory, where cargo check worked but cargo test had 0 tests to conduct. I expected this behavior as folks usually avoid creating tests in the root directory of a Rust project. Since I couldn't conduct the tests in nativelink/nativelink-service/tests, this code has NOT been tested.

This change is

allada

Oh, this is probably due to a discrepancy between bazel and cargo tests.

Can you try changing:
nativelink-metric/Cargo.toml:

async-lock = { version = "3.3.0", default-features = false }

to:

async-lock = { version = "3.3.0", features = ["std"], default-features = false }

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved

nativelink-service/src/bytestream_server.rs line 325 at r1 (raw file):

                                        // EOF.
                                        // Wait for get_part_fut to finish
                                        match state.get_part_fut.await{ // Once get_part_fut finishes, we check for errors

We can't quite do it this way. If you look below we call:

state.get_part_fut = Box::pin(pending());

If this was to happen, this future would never finish. What we want to do instead is check to see if that call was made and only await the future if it was not called (you should be able to detect this with state.maybe_get_part_result).

nativelink-service/src/bytestream_server.rs line 337 at r1 (raw file):

                                                }
                                                event!(Level::ERROR, response = ?e);
                                                return Some((Err(e.into()), None));

We probably want to merge these two responses together (like done below).

allada

Actually, I don't see this error at all when I run cargo test. What OS are you using?

Also try using cargo test --all, this is the way to run all packaged tests.

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved

Evaan2001 · 2024-08-06T01:30:49Z

Thanks Blaise for this info and sorry for the delay in getting back to you (today was extra wild ...). I'm using MacOS on Apple silicon. I was able to run the tests with cargo test --all. You're right, the execution of the tests with my commit never terminated, and I'm guessing the future in question never finished.

To ensure the tests were successful with the original code, I reverted my code and ran cargo test --all. I was surprised that even though some error statements were printed, almost all of the tests passed; only 3 were ignored with 0 failures/
measured/filtered-out.

I'll start working on your suggestion but before I amend my commit and ask for another review, I'm curious – are the tests supposed to succeed despite some errors? A sample error statement that got printed was:

"ERROR nativelink_scheduler::simple_scheduler: Error while running do_try_match, err: Error { code: Internal, messages: ["Worker command failed, removing worker 54af2725-8f83-438b-8919-a14d0039f2a0 -- Err(Error { code: Internal, messages: [\"Worker disconnected\"] })", "Failed to run worker_notify_run_action in SimpleScheduler::do_try_match"] }"

MarcusSorealheis · 2024-08-06T03:08:14Z

@Evaan2001 in which context did you receive this error:
"ERROR nativelink_scheduler::simple_scheduler: Error while running do_try_match, err: Error { code: Internal, messages: ["Worker command failed, removing worker 54af2725-8f83-438b-8919-a14d0039f2a0 -- Err(Error { code: Internal, messages: ["Worker disconnected"] })", "Failed to run worker_notify_run_action in SimpleScheduler::do_try_match"] }"

This also could be a good question for our Slack with a bit more context. The error can happen for a lot of reasons.

This error would typically occur when a worker fails to run an action that was routed to it, as defined here. I'm not sure what about your change should cause this error just yet.

Evaan2001 · 2024-08-06T04:05:55Z

@MarcusSorealheis, I was in the root directory and ran cargo test --all. Given that @allada pointed out a flaw in my coding logic, I wanted to eliminate the possibility of my code breaking anything and simply see what the correct test results looked like. Consequently, I reverted all of my changes before running cargo test --all. In other words, I ran the tests with my branch being the same as upstream.

This error popped up on my screen while running tests/simple_scheduler_test.rs (target/debug/deps/simple_scheduler_test-c3cf20b07e504040). There were 18 tests to be conducted and this is a snippet from my terminal:

`test worker_retries_on_internal_error_and_fails_test ... ok
  2024-08-06T01:15:06.972790Z ERROR nativelink_scheduler::simple_scheduler: Error while running do_try_match, err: Error { code: Internal, messages: ["Worker command failed, removing worker 54af2725-8f83-438b-8919-a14d0039f2a0 -- Err(Error { code: Internal, messages: [\"Worker disconnected\"] })", "Failed to run worker_notify_run_action in SimpleScheduler::do_try_match"] }
    at nativelink-scheduler/src/simple_scheduler.rs:339
    in nativelink_scheduler::simple_scheduler::simple_scheduler_task_worker_matching

test worker_timesout_reschedules_running_job_test ... ok`

Let me know if you want me to provide more context from the terminal. I want to point out that this wasn't the only error message that got printed on my screen. There were in fact about 30 error messages, yet not one test failed. If it'll be helpful, I can also log the results of running cargo test --all on my computer and share it with you all.

MarcusSorealheis · 2024-08-06T04:14:22Z

I wonder if this is related to the recent scheduler changes cc @adam-singer

We recently introduced the ability to run with a distributed scheduler or a single scheduler, but I don't believe these changes should have added more errors. I need to run the tests on Mac. I will do it tonight. For the last couple weeks I have been on Linux for development.

Evaan2001 · 2024-08-06T05:02:37Z

@MarcusSorealheis, let me know if you see the error messages while testing on Mac. I saw 29 errors; sharing 5 I randomly selected here (ignore if not helpful):

Running tests/running_actions_manager_test.rs (target/debug/deps/running_actions_manager_test-daacda8377e1371d)

.
.
test unix_executable_file_test ... ok
  2024-08-06T01:15:10.651632Z ERROR nativelink_worker::running_actions_manager: Command returned non-zero exit code, exit_code: 9, stdout: "", stderr: ""
    at nativelink-worker/src/running_actions_manager.rs:1160

test worker_times_out ... ok
.
.

Running tests/local_worker_test.rs (target/debug/deps/local_worker_test-52765e08b4ec9177)

  2024-08-06T01:15:09.938890Z ERROR nativelink_worker::local_worker: Error, err: Error { code: Internal, messages: ["Got EOF expected UpdateForWorker"] }
    at nativelink-worker/src/local_worker.rs:534
    in local_worker_test::utils::local_worker_test_utils::local_worker_spawn

Running tests/bytestream_server_test.rs (target/debug/deps/bytestream_server_test-8a4014f213431079)

.
.
test out_of_sequence_write ... ok
  2024-08-06T01:15:07.001937Z ERROR nativelink_service::bytestream_server: error: status: InvalidArgument, message: "Expected WriteRequest struct in stream : In ByteStreamServer::write", details: [], metadata: MetadataMap { headers: {} }
    at nativelink-service/src/bytestream_server.rs:640
.
.

Running tests/filesystem_store_test.rs (target/debug/deps/filesystem_store_test-1f7e97c7e6febd9e)

.
.
test file_continues_to_stream_on_content_replace_test ... ok
  2024-08-06T01:15:07.394663Z ERROR nativelink_store::filesystem_store: Failed to touch file, err: Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/var/folders/rh/bz_crfln23q5qc4h149dz_r40000gn/T/nix-shell.F4XFEp/11411840072976465328/content_path/0123456789abcdef000000000000000000010000000000000123456789abcdef-10\""] }
    at nativelink-store/src/filesystem_store.rs:342
.
.

Running tests/cas_server_test.rs (target/debug/deps/cas_server_test-f85a530c68a87e66)

.
.
test empty_store ... ok
  2024-08-06T01:15:07.070854Z ERROR nativelink_service::cas_server: error: status: InvalidArgument, message: "Invalid string length : Invalid sha256 hash: BAD_HASH : Failed on find_missing_blobs() command", details: [], metadata: MetadataMap { headers: {} }
    at nativelink-service/src/cas_server.rs:302
.
.

adam-singer · 2024-08-06T05:13:51Z

@MarcusSorealheis @Evaan2001 the printing of error messages isn't always an indication of a failed test passing, its an indication that the test executed a line of code that logs an event as error, that could be by the design of the tests intention. The most obvious case is (5) where the test for empty storage and a fetch is tried, we expect an error message to be printed. Unfortunately rust doesn't have the jvm level of log message interception and validation to verify error messages in tests (at least I'm not aware of any reflective test features like that), guessing we could try and hook the event listeners and verify we expect the message to be error level depending on the test context.

allada

FYI, the cargo test errors you were getting I was finally able to reproduce. A PR has already landed with the fix:
#1236

So if you rebase you can use it like normal now.

As @adam-singer said, those error messages are not a problem unless the test fails. We have some tests that check to ensure NativeLink handles failures properly and some of these are what you are seeing, but the test actually passes.

As long as the tests pass, you are good.

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, and 2 discussions need to be resolved

Evaan2001 · 2024-08-06T16:08:04Z

@adam-singer, that makes sense. Thanks for clarifying.

@allada, I used rebase to incorporate your changes to the toml files. However, running cargo test inside the nativelink-service directory failed as the code couldn't compile. There was just one error that showed up 13 times, here:

Compiling nativelink-service v0.4.0 (/Users/evaanahmed/Documents/GitHub/nativelink/nativelink-service)
error[E0599]: no variant or associated item named `Gzip` found for enum `CompressionEncoding` in the current scope
   --> nativelink-service/tests/bytestream_server_test.rs:197:75
    |
197 |         make_stream_and_writer_spawn(bs_server, Some(CompressionEncoding::Gzip));
    |                                                                           ^^^^ variant or associated item not found in `CompressionEncoding`
    |
note: if you're trying to build a new `CompressionEncoding`, consider using `CompressionEncoding::from_accept_encoding_header` which returns `Option<CompressionEncoding>`
   --> /Users/evaanahmed/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tonic-0.12.1/src/codec/compression.rs:101:5
    |
101 | /     pub(crate) fn from_accept_encoding_header(
102 | |         map: &http::HeaderMap,
103 | |         enabled_encodings: EnabledCompressionEncodings,
104 | |     ) -> Option<Self> {
    | |_____________________^
help: there is a method `vzip` with a similar name
    |
197 |         make_stream_and_writer_spawn(bs_server, Some(CompressionEncoding::vzip));
    |                                                                           ~~~~

The good thing is that cargo test --all works – the code compiles and the tests run – so I can be on my way to work on your feedback, testing my code by running all the tests.

Evaan2001 · 2024-08-07T01:39:15Z

Hey @allada, just finished updating the commit. Your feedback helped me realize that my understanding of the code was slightly wrong, so I thank you for that :).

cargo test --all conducted 316 tests and my current code passed all of them.

I had a question regarding one of your earlier suggestions. Regarding error handling, you mentioned "merging the 2 responses". I'm afraid I didn't quite understand that. If 1 response is the error message, what's the other response we're talking about? For now, I just returned the error:

return Some((Err(err.into()), None));

allada

After reading the code again, I realized response is not a Result, so we don't need to merge the two potential error cases.

Just a few more minor issues and it should be ready.

Thanks and great job!

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved

nativelink-service/src/bytestream_server.rs line 331 at r2 (raw file):

                                        // right before setting state.get_part_fut to Box::pin(pending())
                                        // In other words, the only time we wait for get_part_fut to complete is
                                        // when maybe_get_part_result = None

nit: Can we change this comment to be more direct and short... something like:

// If `maybe_get_part_result` has a value, it means `get_part_fut` will never complete,
// so only await it, if it is not `pending()`.

nativelink-service/src/bytestream_server.rs line 333 at r2 (raw file):

                                        // when maybe_get_part_result = None
                                        let get_part_result = if let Some(result) = state.maybe_get_part_result {
                                            // if maybe_get_part_result is not None, get_part_fut either is done or is set to Box::pin(pending())

nit: I suggest removing this comment and the one below.

nativelink-service/src/bytestream_server.rs line 341 at r2 (raw file):

                                        };

                                        // Now, get_part_result has a future that resulted in either Result<()> or Error.

nit: This comment is just describing the code. Can we remove this comment, as the code is clear enough to understand.

Evaan2001 · 2024-08-07T16:17:44Z

@allada, shortened the comments; have a look!

MarcusSorealheis

you need to wait for Blaise here but

Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved

MarcusSorealheis

i personally like more comments

Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 3 discussions need to be resolved

adam-singer

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 3 discussions need to be resolved

Evaan2001 · 2024-08-08T20:59:07Z

@allada although I have 2 LGTMs, I still need your final good-to-go as the PR needs you to approve the requested changes. Think you can have a look soon?

allada

Yes, it's good, but just that one last comment and it's ready.

Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 1 discussions need to be resolved

Evaan2001 · 2024-08-09T17:24:53Z

@allada, done!

allada

Reviewed all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and 0 of 1 files reviewed, and 1 discussions need to be resolved

nativelink-service/src/bytestream_server.rs line 341 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This comment is just describing the code. Can we remove this comment, as the code is clear enough to understand.

can we remove this comment? This is the only blocker. The comment is just describing the line below it, which is self-explanatory.

Issue TraceMachina#490 mentions that ByteStreamServer::inner_read drops future a bit too aggressively when EOF is received. This commit modifies the ByteStreamServer to wait for get_part_fut to finish once EOF is received. The code also checks if get_part_fut returns an error and handles it appropriately. Edit – this commit updates the code commited earlier based on Blaise's feedback on the PR.

Evaan2001 · 2024-08-12T19:17:29Z

@allada it's done now. I thought you wanted me to delete just // Now, get_part_result has a future that resulted in either Result<()> or Error. but I guess the entire comment block needed to go; my bad for misunderstanding you!

Evaan2001 · 2024-08-26T23:45:54Z

Hey @allada, pinging you in case you missed my last review request. The extra comments are removed. Think you can give your LGTM and merge this PR?

Evaan2001 mentioned this pull request Aug 5, 2024

ByteStreamServer::inner_read drops future a bit too aggressively when EOF is received #490

Closed

allada requested changes Aug 5, 2024

View reviewed changes

allada reviewed Aug 5, 2024

View reviewed changes

allada reviewed Aug 6, 2024

View reviewed changes

Evaan2001 force-pushed the work_after_EOF branch from d9055e3 to da9a4a4 Compare August 6, 2024 15:53

Evaan2001 force-pushed the work_after_EOF branch 2 times, most recently from 86b62d2 to 826c00b Compare August 7, 2024 01:32

Evaan2001 requested a review from allada August 7, 2024 01:41

allada reviewed Aug 7, 2024

View reviewed changes

Evaan2001 force-pushed the work_after_EOF branch from 826c00b to 74c14f6 Compare August 7, 2024 16:16

Evaan2001 requested a review from allada August 7, 2024 16:17

MarcusSorealheis approved these changes Aug 8, 2024

View reviewed changes

MarcusSorealheis reviewed Aug 8, 2024

View reviewed changes

adam-singer approved these changes Aug 8, 2024

View reviewed changes

allada reviewed Aug 9, 2024

View reviewed changes

Evaan2001 force-pushed the work_after_EOF branch from 74c14f6 to b21ffd9 Compare August 9, 2024 17:20

Evaan2001 requested a review from allada August 9, 2024 17:24

allada reviewed Aug 12, 2024

View reviewed changes

Evaan2001 force-pushed the work_after_EOF branch from b21ffd9 to 4ed726d Compare August 12, 2024 19:12

Evaan2001 requested a review from allada August 12, 2024 19:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234

Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234

Evaan2001 commented Aug 5, 2024 •

edited

Loading

allada left a comment

allada left a comment

Evaan2001 commented Aug 6, 2024 •

edited

Loading

MarcusSorealheis commented Aug 6, 2024

Evaan2001 commented Aug 6, 2024 •

edited

Loading

MarcusSorealheis commented Aug 6, 2024

Evaan2001 commented Aug 6, 2024

adam-singer commented Aug 6, 2024 •

edited

Loading

allada left a comment

Evaan2001 commented Aug 6, 2024

Evaan2001 commented Aug 7, 2024 •

edited

Loading

allada left a comment

Evaan2001 commented Aug 7, 2024

MarcusSorealheis left a comment

MarcusSorealheis left a comment

adam-singer left a comment

Evaan2001 commented Aug 8, 2024

allada left a comment

Evaan2001 commented Aug 9, 2024

allada left a comment

Evaan2001 commented Aug 12, 2024

Evaan2001 commented Aug 26, 2024

Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234

Are you sure you want to change the base?

Wait For get_part_fut To Finish After Receiving EOF In byteStream_server.rs #1234

Conversation

Evaan2001 commented Aug 5, 2024 • edited Loading

Description

Type of change

How Has This Been Tested?

allada left a comment

Choose a reason for hiding this comment

allada left a comment

Choose a reason for hiding this comment

Evaan2001 commented Aug 6, 2024 • edited Loading

MarcusSorealheis commented Aug 6, 2024

Evaan2001 commented Aug 6, 2024 • edited Loading

MarcusSorealheis commented Aug 6, 2024

Evaan2001 commented Aug 6, 2024

adam-singer commented Aug 6, 2024 • edited Loading

allada left a comment

Choose a reason for hiding this comment

Evaan2001 commented Aug 6, 2024

Evaan2001 commented Aug 7, 2024 • edited Loading

allada left a comment

Choose a reason for hiding this comment

Evaan2001 commented Aug 7, 2024

MarcusSorealheis left a comment

Choose a reason for hiding this comment

MarcusSorealheis left a comment

Choose a reason for hiding this comment

adam-singer left a comment

Choose a reason for hiding this comment

Evaan2001 commented Aug 8, 2024

allada left a comment

Choose a reason for hiding this comment

Evaan2001 commented Aug 9, 2024

allada left a comment

Choose a reason for hiding this comment

Evaan2001 commented Aug 12, 2024

Evaan2001 commented Aug 26, 2024

Evaan2001 commented Aug 5, 2024 •

edited

Loading

Evaan2001 commented Aug 6, 2024 •

edited

Loading

Evaan2001 commented Aug 6, 2024 •

edited

Loading

adam-singer commented Aug 6, 2024 •

edited

Loading

Evaan2001 commented Aug 7, 2024 •

edited

Loading