CB improvements #769

Open
wants to merge 1 commit into master

Conversation

ilya-lavrenov (Contributor)

No description provided.

auto running_sequences = sequence_group->get_running_sequences();
// TODO: ilavrenov - why is the beam search case not handled here?
// In the beam search case, 'blocks_num' is not equally distributed between sequences,
// because some of them share the same blocks
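
For illustration, a standalone sketch (not code from this PR; the 'block_tables' data is hypothetical, standing in for the per-sequence KV-cache block tables) of why a per-sequence block count is misleading under beam search: beams forked from the same prompt share prefix blocks, so the number of physically used blocks is the number of unique block indices, not the per-sequence counts summed up.

// Illustrative sketch only: 'block_tables' is made-up data, not the block manager's API.
#include <cstddef>
#include <iostream>
#include <set>
#include <vector>

int main() {
    // Three beam-search sequences of one group; values are physical KV-cache block indices.
    // Blocks 0 and 1 hold the shared prompt prefix and are referenced by every beam.
    std::vector<std::vector<std::size_t>> block_tables = {
        {0, 1, 2},
        {0, 1, 3},
        {0, 1, 4},
    };

    std::size_t per_sequence_sum = 0;
    std::set<std::size_t> unique_blocks;
    for (const auto& table : block_tables) {
        per_sequence_sum += table.size();
        unique_blocks.insert(table.begin(), table.end());
    }

    std::cout << "sum of per-sequence blocks: " << per_sequence_sum << "\n";   // 9
    std::cout << "physically used blocks:     " << unique_blocks.size() << "\n";  // 5
    return 0;
}
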
// preempt prompt fully to not leave partially generated prompt
preempted_tokens = processed_tokens;
preempted_tokens = context_len;
// TODO: ilavrenov - what if we have multiple sequences within a group? we need to drop them all
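
A hedged sketch of the intent behind this block and its TODO (the SequenceGroup/Sequence types below are simplified stand-ins, not the real classes): when a request is preempted while still in the prompt phase, roll back the full processed context for every running sequence of the group, so no partially processed prompt is left behind.

// Simplified stand-ins; only the "preempt the prompt fully, for all running sequences" idea is shown.
#include <cstddef>
#include <memory>
#include <vector>

struct Sequence {
    std::size_t context_len = 0;  // tokens whose KV-cache entries currently exist for this sequence
};

struct SequenceGroup {
    std::vector<std::shared_ptr<Sequence>> running;
    std::vector<std::shared_ptr<Sequence>>& get_running_sequences() { return running; }
};

// Returns how many tokens were rolled back for the whole group.
std::size_t preempt_group_fully(SequenceGroup& group) {
    std::size_t preempted_tokens = 0;
    // Address the TODO: walk over *all* running sequences, not just one,
    // and drop their full context instead of a partial prefix.
    for (auto& sequence : group.get_running_sequences()) {
        preempted_tokens += sequence->context_len;
        sequence->context_len = 0;
    }
    return preempted_tokens;
}

int main() {
    SequenceGroup group;
    group.running = { std::make_shared<Sequence>(Sequence{16}), std::make_shared<Sequence>(Sequence{16}) };
    return preempt_group_fully(group) == 32 ? 0 : 1;
}
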

ilya-lavrenov added this to the 2024.4 milestone on Aug 13, 2024
Comment on lines +58 to +59
// token by token. This structure represents a vector of already generated tokens so far
// for a given prompt.
Collaborator:

That's not exactly true. I would rephrase that to: "... a vector of tokens generated since the last read...".
To be more specific, it's always a vector of one element if N == 1 and we are not using beam search.
For N > 1 and/or when beam search is used, the vector contains all generated tokens.
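
A minimal sketch of the semantics described above (the members of GenerationOutput are assumptions for illustration, not taken from this header): with N == 1 and no beam search, each map entry typically holds the single token produced since the previous read; with N > 1 and/or beam search, each entry holds all tokens generated so far for that sequence.

// Illustration only; member names are assumed.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct GenerationOutput {
    std::vector<int64_t> generated_token_ids;
};
using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;

// Count how many tokens one read delivered across all sequences of a request:
// usually 1 for greedy (N == 1, no beam search), everything generated so far otherwise.
std::size_t tokens_delivered(const GenerationOutputs& outputs) {
    std::size_t total = 0;
    for (const auto& entry : outputs)
        total += entry.second.generated_token_ids.size();
    return total;
}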

using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;

class GenerationStream;

class OPENVINO_GENAI_EXPORTS GenerationHandleImpl {
class OPENVINO_GENAI_EXPORTS GenerationHandle {
Collaborator:

Why do we need this change?

std::shared_ptr<GenerationStream> m_generation_stream;
ov::genai::GenerationConfig m_sampling_params;

bool is_dropped();
// whether client ha dropped session with pipeline
Collaborator:

Suggested change
// whether client ha dropped session with pipeline
// whether client has dropped session with pipeline


bool can_read();
// whether new tokens are available
Collaborator:

Suggested change
// whether new tokens are available
// whether read() is possible (new tokens are available and handle has not been dropped)
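
Putting is_dropped() and can_read() together, a hedged client-side usage sketch (read() returning GenerationOutputs is inferred from the comments above; the stub below is not the real GenerationHandle API):

// Sketch of a polling loop; GenerationHandleStub stands in for the real handle.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct GenerationOutput {
    std::vector<int64_t> generated_token_ids;
};
using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;

struct GenerationHandleStub {
    bool is_dropped() const { return dropped; }                        // client closed the session
    bool can_read() const { return !dropped && next < pending.size(); } // read() would succeed
    // read() hands out everything generated since the previous read.
    GenerationOutputs read() {
        GenerationOutputs out;
        out[0].generated_token_ids.assign(pending.begin() + next, pending.end());
        next = pending.size();
        return out;
    }
    std::vector<int64_t> pending{11, 22, 33};
    std::size_t next = 0;
    bool dropped = false;
};

int main() {
    GenerationHandleStub handle;
    std::vector<int64_t> collected;
    while (!handle.is_dropped() && handle.can_read()) {
        for (auto& entry : handle.read()) {
            auto& tokens = entry.second.generated_token_ids;
            collected.insert(collected.end(), tokens.begin(), tokens.end());
        }
    }
    return collected.size() == 3 ? 0 : 1;  // collected now holds {11, 22, 33}
}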

Comment on lines 25 to +30
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schdules requests in generation phase first, then
// schdules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled only next iterations
// - vLLM mode priorities requests in prompt phase over requests on generation phase
Collaborator:

Suggested change
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schdules requests in generation phase first, then
// schdules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled only next iterations
// - vLLM mode priorities requests in prompt phase over requests on generation phase
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schedules requests in generation phase first, then
// schedules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled in next iterations
// - vLLM mode prioritizes requests in prompt phase over requests in generation phase
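
For context, a hedged sketch of the option this comment documents (the field names below, e.g. dynamic_split_fuse and max_num_batched_tokens, follow the comment text; the exact SchedulerConfig layout may differ from the real header):

// Sketch only: structure simplified, defaults invented for illustration.
#include <cstddef>

struct SchedulerConfig {
    // total token budget the scheduler may batch per iteration
    std::size_t max_num_batched_tokens = 256;
    // true  -> dynamic split fuse: generation-phase requests are scheduled first and
    //          prompt-phase requests fill the remaining budget, possibly only partially
    // false -> vLLM-like mode: prompt-phase requests are prioritized over generation-phase ones
    bool dynamic_split_fuse = true;
};

int main() {
    SchedulerConfig config;
    config.dynamic_split_fuse = false;    // switch to vLLM-like prompt-first scheduling
    config.max_num_batched_tokens = 512;  // enlarge the per-iteration token budget
    return 0;
}
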

andrei-kochin removed this from the 2024.4 milestone on Sep 9, 2024
Labels: None yet
Projects: None yet
4 participants