feat: [SKU modularization] remove sku_config from v1alpha1 and implement skuHandler interface #601

Closed
wants to merge 7 commits

Conversation

smritidahal653 (Contributor)

Reason for Change:

  • sku_config is Azure cloud specific and is replaced by the skuHandler interface as part of the effort to modularize GPU SKUs; it is deleted from the v1alpha1 package.
  • Implement skuHandler to get GPU configs wherever SupportedGPUConfigs from the sku_config file was previously used (see the sketch below).
  • Fix test cases to use the updated SKUs after the skuHandler implementation.
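
A minimal sketch of what such a SKU handler abstraction could look like, assuming a single GPU-config lookup method. All type, method, and constructor names below are hypothetical and may not match the actual code in this PR:

```go
// Hypothetical sketch of the skuHandler idea; names are illustrative only.
package main

import "fmt"

// GPUConfig captures the per-SKU GPU properties that callers previously
// looked up in the Azure-specific SupportedGPUConfigs map.
type GPUConfig struct {
	SKU      string
	GPUCount int
	GPUMemGB int
	GPUModel string
}

// SKUHandler hides SKU lookups behind an interface so each cloud provider
// can supply its own implementation.
type SKUHandler interface {
	// GetGPUConfigs returns all GPU SKU configurations known to this provider.
	GetGPUConfigs() map[string]GPUConfig
}

// azureSKUHandler is an illustrative Azure-backed implementation using a
// static map (a stand-in for the data that used to live in sku_config).
type azureSKUHandler struct {
	configs map[string]GPUConfig
}

func NewAzureSKUHandler() SKUHandler {
	return &azureSKUHandler{
		configs: map[string]GPUConfig{
			// One well-known Azure GPU SKU as an example entry.
			"Standard_NC6s_v3": {SKU: "Standard_NC6s_v3", GPUCount: 1, GPUMemGB: 16, GPUModel: "NVIDIA V100"},
		},
	}
}

func (a *azureSKUHandler) GetGPUConfigs() map[string]GPUConfig {
	return a.configs
}

func main() {
	handler := NewAzureSKUHandler()
	if cfg, ok := handler.GetGPUConfigs()["Standard_NC6s_v3"]; ok {
		fmt.Printf("%s: %dx %s, %d GiB GPU memory\n", cfg.SKU, cfg.GPUCount, cfg.GPUModel, cfg.GPUMemGB)
	}
}
```

With an interface like this, the cloud-specific SKU data stays inside the provider implementation, and code that previously read SupportedGPUConfigs from sku_config only depends on the interface.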

Requirements

  • added unit tests and e2e tests (if applicable).

Notes for Reviewers:

smritidahal653 and others added 7 commits September 19, 2024 13:34
This PR adds the initial draft for the RAGEngine CRD in Kaito.

A RAGEngine CRD defines all resources needed to run RAG on top of an LLM inference service. Upon creating a RAGEngine CR, a new controller will create a deployment which runs a RAG engine instance. The instance provides HTTP endpoints for both `index` and `query` services. The instance can optionally use a public model embedding service or run a local embedding model on a GPU to convert the input index data to vectors. The instance can also connect to a vector DB instance to persist the vectors, or it uses an in-memory vector DB by default. The instance uses the `llamaIndex` library to orchestrate the workflow. When the RAGEngine instance is up and running, users should send questions to the `query` endpoint of the RAG instance instead of the normal `chat` endpoint in the inference service.

The RAGEngine is intended to be "standalone". It can use any public inference service or inference services hosted by a Kaito workspace.

The RAG engine instance is designed to help retrieve prompts from unstructured data (arbitrary index data provided by the users). Retrieving from structured data or a search engine is out of scope for now.
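
To make the pieces described above concrete, here is a rough sketch of how a RAGEngine spec could be modeled as Go types. This is not taken from the PR; the package, type, and field names are assumptions for illustration only, and the actual CRD schema may differ:

```go
// Hypothetical Go types for a RAGEngine spec, derived only from the commit
// message above; all package, type, and field names are assumptions.
package api

// RemoteEmbedding points the engine at a public model embedding service.
type RemoteEmbedding struct {
	Endpoint string `json:"endpoint"`
}

// LocalEmbedding runs an embedding model on a GPU inside the RAG instance.
type LocalEmbedding struct {
	Model string `json:"model"`
}

// EmbeddingSpec selects either a remote embedding service or a local model.
type EmbeddingSpec struct {
	Remote *RemoteEmbedding `json:"remote,omitempty"`
	Local  *LocalEmbedding  `json:"local,omitempty"`
}

// RAGEngineSpec describes what is needed to run RAG on top of an LLM
// inference service. An empty VectorDBEndpoint means the instance falls
// back to its in-memory vector DB.
type RAGEngineSpec struct {
	// InferenceServiceURL is the LLM inference service the engine queries.
	InferenceServiceURL string `json:"inferenceServiceURL"`
	// Embedding configures how index data is converted to vectors.
	Embedding EmbeddingSpec `json:"embedding"`
	// VectorDBEndpoint optionally persists vectors in an external vector DB.
	VectorDBEndpoint string `json:"vectorDBEndpoint,omitempty"`
}
```

A RAGEngine CR would then pick exactly one embedding option and, optionally, an external vector DB endpoint; omitting the endpoint keeps the default in-memory store.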