feat: preload textual model #12729

Draft · martabal wants to merge 1 commit into main
Conversation

@martabal (Member)

I recently changed my CLIP model and noticed that the first search is terribly slow, because the textual model is only loaded when that first search happens.

This PR adds a new setting that allows loading the text model when a client connects over the websocket, which improves the time the first search takes (how much depends on many factors, such as CPU power and model size). To differentiate a client connection used for background upload from a "normal" session, the model is only loaded if the websocket URL has the query parameter background=false.
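
A minimal sketch of the client side of that handshake, assuming a plain socket.io-client connection (the actual connection code in the Immich apps may look different):

import { io } from 'socket.io-client';

// Hypothetical illustration: a "normal" (foreground) session advertises itself
// via the query string, so the server preloads the textual model on connect.
const socket = io('https://immich.example.com', {
  path: '/api/socket.io',
  query: { background: 'false' },
});

// A background-upload client would connect with background=true (or without the
// parameter), and the server would skip preloading.
socket.on('connect', () => console.log('connected as', socket.id));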

Todo:

  • Improve the error handling for machine-learning.repository

@mertalev (Contributor)

I was thinking of handling this with the MACHINE_LEARNING_PRELOAD envs, like with a MACHINE_LEARNING_PRELOAD__CLIP__TYPE env that you can set to textual.

I think it'd be simpler and more generic, although the advantage of this PR is that the model can still be unloaded when not in use.

What do you think?

@martabal (Member, Author)

> I think it'd be simpler and more generic, although the advantage of this PR is that the model can still be unloaded when not in use.

Yes, that's exactly why I worked on it. I don't want a CLIP model loaded all the time; I want it loaded only when I (probably) need it. From my tests with the model I use (XLM-Roberta-Large-Vit-B-16Plus), it adds almost 2 GB to my RAM, so it's nice to have it unloaded when I'm not using Immich 😅

@mertalev (Contributor) left a comment

Thanks for working on this! This is a great way to lower latency for the first search without committing to 24/7 memory usage. That being said, there are some refinements I'd like to see.

Comment on lines +112 to +120
class LoadModelEntry(InferenceEntry):
ttl: int

def __init__(self, name: str, task: ModelTask, type: ModelType, options: dict[str, Any], ttl: int):
super().__init__(name=name, task=task, type=type, options=options)

if ttl <= 0:
raise ValueError("ttl must be a positive integer")
self.ttl = ttl
@mertalev (Contributor) commented on Sep 16, 2024

InferenceEntry is just a type for a dict, so this __init__ shouldn't exist.

Also, I have some reservations about the ttl field here. TTL is currently set by an env and assumed to be the same for all models. At minimum, you would need to change the idle shutdown conditions to not rely on the env anymore and instead (synchronously) check if the model cache is empty.

Moreover, I'm not sure if this is something that the caller should be able to decide. The relationship between server and ML isn't necessarily one-to-one, so the settings should be designed with that in mind. If you have multiple servers that share a machine learning service, the effective TTL now depends on the order of requests. TTL is also relevant to whoever is deploying ML and may not be something they want a caller to be able to configure.

@Type(() => LoadTextualModelOnConnection)
@ValidateNested()
@IsObject()
loadTextualModelOnConnection!: LoadTextualModelOnConnection;
@mertalev (Contributor)

Can this be less wordy? Maybe preloadTextualModel?

Also, I imagine there are a lot of sessions where nothing gets searched and the model is loaded unnecessarily. If this were instead done when the user clicks the search bar, it'd almost always be a positive and could be enabled by default.
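
A minimal sketch of that shorter shape, assuming a nested class named PreloadTextualModel (both names are suggestions rather than code from this PR, and the enclosing CLIPConfig shown here is simplified):

import { Type } from 'class-transformer';
import { IsBoolean, IsObject, ValidateNested } from 'class-validator';

// Hypothetical, shorter replacement for LoadTextualModelOnConnection.
class PreloadTextualModel {
  @IsBoolean()
  enabled!: boolean;
}

export class CLIPConfig {
  @Type(() => PreloadTextualModel)
  @ValidateNested()
  @IsObject()
  preloadTextualModel!: PreloadTextualModel;
}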

@@ -68,6 +75,16 @@ export class EventRepository implements OnGatewayConnection, OnGatewayDisconnect
queryParams: {},
metadata: { adminRoute: false, sharedLinkRoute: false, uri: '/api/socket.io' },
});
if ('background' in client.handshake.query && client.handshake.query.background === 'false') {
const { machineLearning } = await this.configCore.getConfig({ withCache: true });
if (machineLearning.clip.loadTextualModelOnConnection.enabled) {
@mertalev (Contributor)

This should check if smart search is enabled.
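
A hedged sketch of how that guard could look, extending the diff above (the exact smart-search config fields are assumptions and may not match the real SystemConfig shape):

// Hypothetical: preload only when smart search itself is enabled, on top of
// the per-connection opt-in. Field names follow the diff above.
if ('background' in client.handshake.query && client.handshake.query.background === 'false') {
  const { machineLearning } = await this.configCore.getConfig({ withCache: true });
  const { clip } = machineLearning;
  if (machineLearning.enabled && clip.enabled && clip.loadTextualModelOnConnection.enabled) {
    // ask the machine-learning service to load the textual CLIP model
  }
}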

@martabal (Member, Author)

> Thanks for working on this! This is a great way to lower latency for the first search without committing to 24/7 memory usage. That being said, there are some refinements I'd like to see.

Do we want to offer multiple strategies for when to load the textual model (on client connection, on search bar click, ...)? I guess model loading time may vary a lot from one instance to another.

@mertalev (Contributor)

> Do we want to offer multiple strategies for when to load the textual model (on client connection, on search bar click, ...)? I guess model loading time may vary a lot from one instance to another.

The connection approach is better if the model takes more than a few seconds to load and the user searches right away, but it's worse outside of that. The TTL aspect makes it hard to optimize: the model can be unloaded before they search, or it can stay loaded longer than needed because you don't know whether it'll get used. I think the search bar approach is a safer bet for now, but I'm open to having different options if it turns out that it isn't enough.

@martabal (Member, Author)

> The TTL aspect makes it hard to optimize: the model can be unloaded before they search, or it can stay loaded longer than needed because you don't know whether it'll get used.

We can use the handleDisconnect event to make sure no clients are still connected, but yeah, that's not 100% perfect.
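
A rough sketch of that idea, assuming the gateway keeps its own count of foreground sessions (none of these names exist in the PR):

import { Socket } from 'socket.io';

// Hypothetical bookkeeping: track foreground sessions and treat the textual
// model as unused only once the last one disconnects.
class ForegroundSessionTracker {
  private foregroundClients = new Set<string>();

  handleConnection(client: Socket) {
    if (client.handshake.query.background === 'false') {
      this.foregroundClients.add(client.id);
    }
  }

  handleDisconnect(client: Socket) {
    this.foregroundClients.delete(client.id);
    if (this.foregroundClients.size === 0) {
      // Last foreground session is gone; the textual model's TTL could start
      // counting from here instead of from the moment it was loaded.
    }
  }
}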
