Tutorial 09: Update to EmbeddingRetriever Training #35
Q1. I think doing a 9A and 9B tutorial would be good; I'd let them coexist. I could also be convinced to replace the existing (DPR) one, but it's always a little painful to "throw away" useful information! 😄 Q2. The data format strikes me as somewhat difficult, also for users.
I think 9A and 9B make total sense. Before we dive into EmbeddingRetriever training, does it even make sense now with these V3 models from sentence-transformers? Perhaps one would potentially need to GPL-adapt only for particular ___domain data. @mkkuemmel and the team might have better insights from the field...
There is this dataset with cross-encoder scores for MarginMSELoss: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives. I'd vote for implementing
Oh, and training definitely makes sense. If you have labeled data, you will get much better results with training than with the out-of-the-box models.
Ok, cool, good to know @mathislucka. It's a rough naming scheme Nils used for his msmarco and likely other models. So the latest models we want are likely these V5, trained on MarginMSE loss?
So to take stock:

- Does training EmbeddingRetrievers make sense?
- Which sentence-transformer model(s) do we suggest for out-of-the-box use?
- What procedure do we suggest for fine-tuning? What format must the data be in?
- Which do we go for?
As for the options, my higher-level opinion is to find a good trade-off between "scientific correctness" and feasibility for the users. How feasible is this for users? Can we guide them on how to do it? About "scientific correctness":
My 2 cents:
The v5 models are only for msmarco. We have seen with clients that
Go with Opt-2, as it is simpler.
Hi, so based on the discussions above, I am pivoting to adding that change first. I can get back to this Tutorial rework once it is resolved/completed.
Hi, can we expect the tutorial for fine-tuning the embedding retriever (using GPL train data, maybe) soon?
Hi @sinchanabhat, ya, the tutorial is coming soon-ish. I can't commit to a time frame, but a median estimate could be the end of next week. 😅 In the meantime, you can check out this notebook showcasing GPL training or this one with MultipleNegativesRankingLoss. The latter was a recent change following the discussions above (details in the PR: deepset-ai/haystack#3164). Both notebooks are not Tutorials per se (so not as polished) but might be helpful still.
Thanks a lot for directing me to the notebooks. I have gone through them, and pardon me for asking this (even if my question might sound stupid): when we talk about adapting the retriever to GPL data, doesn't the training/fine-tuning involve early stopping, or taking the best model as the one with the best validation metric? Or is it just running for 5 to 10 epochs and evaluating how good the retriever is?
Ah yes, generally it would involve monitoring and acting on the validation metric, as you mentioned (for instance, performance might plateau after some steps, as in the GPL paper, Figure 2 + Section 6.1). The tutorial is more of a demonstration of how to set up and perform the training.
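A minimal sketch of what "acting on the validation metric" could look like, in plain Python. The function and the Recall@10 values below are illustrative stand-ins, not part of any library; in practice you would compute the metric with your retriever on a held-out query set after each epoch or evaluation step.

```python
# Hedged sketch: early stopping on a validation retrieval metric (e.g. Recall@10).
# `should_stop` and the metric values are made up for illustration.

def should_stop(history, patience=2, min_delta=0.001):
    """Stop when the metric hasn't improved by min_delta over the last `patience` evals."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    recent_best = max(history[-patience:])
    return recent_best < best_before + min_delta

recall_at_10 = [0.41, 0.47, 0.50, 0.498, 0.499]  # plateaued after the 3rd eval
print(should_stop(recall_at_10))
```

Keeping a checkpoint of the best-scoring model alongside a check like this gives the "take the best model by validation metric" behavior mentioned above.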
Got it! Thank you so much for the quick reply, @bglearning!
This is a related question, @bglearning. So how do I go about this scenario?
Hey @vibha0411, if you use MultipleNegativesRankingLoss (`train_loss='mnrl'`), you don't need margin scores[1]. In fact, for MNRL even having the negative docs is optional, because it considers all other docs in a batch as negatives. The MNRL example colab notebook linked above might be useful. It only uses positive docs, but of course it's good if you already have negatives, and you can pass those too. Roughly it would be something like:

```python
embedding_retriever = EmbeddingRetriever(...)
training_data = [
    {"question": ..., "pos_doc": ..., "neg_doc": ...},
    ...
]
embedding_retriever.train(training_data=training_data, train_loss='mnrl')
```

[1] Margin scoring would be required when training with `MarginMSELoss`.
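For intuition, the in-batch-negatives idea behind MNRL can be sketched in plain Python. This is a toy illustration with made-up 2-d embeddings, not the sentence-transformers implementation: each query's own positive doc is the target, and every other doc in the batch acts as a negative.

```python
import math

def dot(a, b):
    """Plain dot product over two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mnrl_loss(query_embs, pos_doc_embs):
    """Mean cross-entropy of each query against all in-batch docs.

    Doc j is assumed to be the positive for query j; all other docs in
    the batch serve as negatives (the MNRL idea).
    """
    total = 0.0
    for i, q in enumerate(query_embs):
        scores = [dot(q, d) for d in pos_doc_embs]  # similarity to every doc in batch
        # log-softmax of the score for the correct doc (index i)
        log_softmax_i = scores[i] - math.log(sum(math.exp(s) for s in scores))
        total += -log_softmax_i
    return total / len(query_embs)

queries = [[1.0, 0.0], [0.0, 1.0]]
docs    = [[0.9, 0.1], [0.1, 0.9]]  # doc j is the positive for query j
print(mnrl_loss(queries, docs))
```

This is why negatives are optional for MNRL: a batch of (question, pos_doc) pairs already supplies negatives for free, though explicit hard negatives generally help.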
Thank you for your reply. Hence, I thought it might be more sensible to go with margin scoring, but I'm not sure what the best way to go about it is.
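For intuition, the margin-scoring idea behind `MarginMSELoss` can be sketched as follows. All numbers are illustrative: in practice, the teacher margins come from a cross-encoder scoring (question, pos_doc) and (question, neg_doc) pairs, and the student margins from the bi-encoder being trained.

```python
# Hedged sketch of margin scoring for MarginMSELoss: the training target is the
# *margin* between the teacher's pos and neg scores, and the student is trained
# so its own score margin matches it via mean squared error.

def margin(pos_score, neg_score):
    """Score margin between the positive and negative doc for one question."""
    return pos_score - neg_score

def margin_mse(student_margins, teacher_margins):
    """Mean squared error between student and teacher margins."""
    n = len(student_margins)
    return sum((s - t) ** 2 for s, t in zip(student_margins, teacher_margins)) / n

teacher = [margin(8.2, 1.5), margin(6.0, 4.5)]      # e.g. cross-encoder outputs
student = [margin(0.71, 0.20), margin(0.55, 0.40)]  # e.g. bi-encoder similarities
print(margin_mse(student, teacher))
```

So "going with margin scoring" means running a cross-encoder over each (question, pos_doc, neg_doc) triple once, up front, and storing the resulting margin as the `score` field of the training data.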
I think I got your point.
Also, is there any way to monitor the training?
Overview
With deepset-ai/haystack#2887, we replaced DPR with `EmbeddingRetriever` in Tutorial 06. Now, we might want to do the same for Tutorial 09, which covers training (or fine-tuning) a DPR Retriever model.

Q1. Should we go ahead with this switch? Any reason keeping DPR might be better? Alternatively, we could create one tutorial for each. I guess it depends on which we want to demonstrate, plus what we think might be valuable for users.
Training EmbeddingRetriever
Only the sentence-encoder variant of `EmbeddingRetriever` can be trained. Its `train` method does some data setup and then calls the `fit` method on `SentenceTransformer` (from the `sentence_transformers` package). Input data format is:
It uses `MarginMSELoss` (as part of the GPL procedure).

Q2. If we were to demonstrate its training, which data would be best to use? GPL et al. seem to use MSMARCO, but then we need cross-encoder scores for the `score` field above, right? So there doesn't seem to be a download-and-use form of the dataset available?

RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary)
cc: @mkkuemmel