Tutorial 09: Update to EmbeddingRetriever Training #35

Open · bglearning opened this issue Aug 15, 2022 · 19 comments

@bglearning (Contributor) commented Aug 15, 2022

Overview

With deepset-ai/haystack#2887, we replaced DPR with EmbeddingRetriever in Tutorial 06.

Now, we might want to do the same for Tutorial 09, which covers training (or fine-tuning) a DPR Retriever model.

Q1. Should we go ahead with this switch? Any reason keeping DPR might be better?

Alternatively, we could create one tutorial for each. I guess it depends on which retriever we want to demonstrate, plus what we think might be valuable for users.

Training EmbeddingRetriever

Only the sentence-encoder variant of EmbeddingRetriever can be trained.

Its train method does some data setup and then calls the fit method on SentenceTransformer (from the sentence_transformers package).

The input data format is:

[
    {"question": ..., "pos_doc": ..., "neg_doc": ..., "score": ...},
    ...
]

It uses MarginMSELoss (as part of the GPL procedure).
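
For concreteness, a minimal sketch of what that train call could look like (the model name, constructor arguments, and field values below are illustrative assumptions, not taken from the tutorial):

from haystack.nodes import EmbeddingRetriever

embedding_retriever = EmbeddingRetriever(
    document_store=document_store,  # an existing, initialized DocumentStore
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)

training_data = [
    # "score" is a teacher margin: cross_encoder(q, pos_doc) - cross_encoder(q, neg_doc)
    {
        "question": "who wrote hamlet",
        "pos_doc": "Hamlet is a tragedy written by William Shakespeare ...",
        "neg_doc": "Hamlet is a small town in North Carolina ...",
        "score": 7.2,
    },
]

embedding_retriever.train(training_data=training_data)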

Q2. If we were to demonstrate its training, which dataset would be best to use? GPL et al. seem to use MS MARCO, but then we need cross-encoder scores for the score field above, right? So there doesn't seem to be a ready, download-and-use dataset available?

RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary)
cc: @mkkuemmel

bglearning changed the title from "Tutorial 11: Update to EmbeddingRetriever Training" to "Tutorial 09: Update to EmbeddingRetriever Training" on Aug 17, 2022

@mkkuemmel (Member) commented Aug 17, 2022

Q1. I think doing a 9A and 9B tutorial would be good? I'd let them coexist. I could also be convinced to replace the existing (DPR) one, but it's always a little painful to "throw away" useful information! 😄

Q2. The data format strikes me as somewhat difficult, also for users, due to the score field. Is there a different loss we could implement so we could use an "easier to create", more accessible data format? @vblagoje @julian-risch do you have any experience with this? What are your suggestions?

@vblagoje (Member) commented

I think 9A and 9B make total sense. But before we dive into EmbeddingRetriever training: does it even make sense now with these v3 models from sentence-transformers? Perhaps one would only need GPL adaptation for particular ___domain data. @mkkuemmel and the team might have better insights from the field...

@mathislucka (Member) commented

There is this dataset with cross-encoder scores for MarginMSELoss: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives

I'd vote for implementing MultipleNegativesRankingLoss: MarginMSE is already used in GPL, and MNRL also yields very good results. What do you mean by v3 models, @vblagoje?

@mathislucka (Member) commented

Oh and training definitely makes sense. If you have labeled data, you will get much better results with training than with the out-of-the-box models.

@vblagoje (Member) commented

Ok, cool, good to know @mathislucka. It's a rough naming scheme Nils used for his MS MARCO (and likely other) models. So the latest models we want are likely the v5 ones trained with MarginMSE loss?

@bglearning (Contributor, Author) commented

So to take stock:

1. Does training EmbeddingRetrievers make sense?
Yes; it definitely helps if labeled data is available.

2. Which sentence-transformer model(s) do we suggest for out-of-the-box use?
It now makes sense to use and promote the v5 models (?)

3. What procedure do we suggest for fine-tuning? What format must the data be in?
Possible options:
- Opt-1: Suggest users convert/collect data into the format with teacher encoder scores. Use MarginMSE loss.
- Opt-2: Add support for MultipleNegativesRankingLoss and suggest its usage (as users wouldn't need teacher encoder scores).
- Opt-3: ...?

Which do we go for?

@mkkuemmel (Member) commented

As for the options, my higher-level opinion is that we should find a good trade-off between "scientific correctness" and feasibility for the users.

About the latter:

> convert/collect data into the format with teacher encoder scores

How feasible is this for users? Can we guide them on how to do it?

About "scientific correctness": which of the losses seems the more sensible one for this task?

@mathislucka (Member) commented Aug 24, 2022

My 2 cents:

> Which sentence-transformer model(s) do we suggest for out-of-the-box use?
> It now makes sense to use and promote the v5 models (?)

The v5 models are only for msmarco. We have seen with clients that all-mpnet-base-v2 or multi-qa-mpnet-base-dot-v1 usually perform best. So I think we should recommend these models. Maybe add multi-qa-MiniLM-L6-cos-v1 as an option for a small and fast model and paraphrase-multilingual-mpnet-base-v2 as a multi-lingual model.
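
A minimal sketch of plugging one of these recommended models in (assuming the Haystack v1 import paths, with an in-memory store just for illustration):

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

# multi-qa-mpnet-base-dot-v1 is trained for dot-product similarity
document_store = InMemoryDocumentStore(embedding_dim=768, similarity="dot_product")
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)  # (re)compute document embeddings with the new model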

> What procedure do we suggest for fine-tuning? What format must the data be in?

Go with Opt-2 as it is simpler than MarginMSE and existing datasets can be re-used.
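
To illustrate the simplicity: with MNRL, the training dicts could presumably be just question/positive pairs, optionally with a hard negative, and no cross-encoder score (placeholder values below):

training_data = [
    {"question": "What is the capital of France?", "pos_doc": "Paris is the capital and largest city of France."},
    {"question": "...", "pos_doc": "...", "neg_doc": "..."},  # hard negative is optional for MNRL
]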

@bglearning (Contributor, Author) commented

Hi,

So based on the discussions above, I'm pivoting to adding MultipleNegativesRankingLoss support to the training of EmbeddingRetriever. Opened an issue for it here: deepset-ai/haystack#3136

I can get back to this tutorial rework once that is resolved/completed.

@sinchanabhat commented

Hi,

Can we expect the tutorial for fine-tuning with EmbeddingRetriever soon, maybe using GPL training data?

@bglearning (Contributor, Author) commented Sep 12, 2022

Hi @sinchanabhat,

Ya, the tutorial is coming soon-ish. Can't commit to a time frame but a median estimate could be end of next week. 😅

In the meantime, you can check out this notebook showcasing GPL training or this one with MultipleNegativesRankingLoss.
(Edit: there is already a tutorial for GPL training based on the first notebook. Please check out that one 😄.)

The latter was a recent change following the discussions above (see details in the PR: deepset-ai/haystack#3164).

Neither notebook is a Tutorial per se (so not as polished), but they might still be helpful.

@sinchanabhat commented

> In the meantime, you can check out this notebook showcasing GPL training or this one with MultipleNegativesRankingLoss. [...]

Thanks a lot for directing me to the notebooks. I have gone through them, and pardon me for asking this (even if my question might sound stupid): when we talk about adapting the retriever to GPL data, doesn't the training/fine-tuning involve early stopping, or taking the best model as the one with the best validation metric? Or is it just running for 5 to 10 epochs and then evaluating how good the retriever is?

@bglearning (Contributor, Author) commented

> doesn't the training/fine-tuning involve early stopping, or taking the best model as the one with the best validation metric? Or is it just running for 5 to 10 epochs and then evaluating how good the retriever is?

Ah yes, it would generally involve monitoring and acting on the validation metric, as you mentioned (for instance, performance might plateau after some steps, as in the GPL paper, Figure 2 + Section 6.1). The tutorial is more of a demonstration of how to set up and perform the training.
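
A sketch of one way to do that monitoring on a held-out dev split: the InformationRetrievalEvaluator is standard sentence-transformers, while the embedding_encoder.embedding_model attribute access is an assumption about Haystack v1 internals:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Held-out dev data: query id -> text, doc id -> text, query id -> set of relevant doc ids
queries = {"q1": "how can I fine-tune a dense retriever?"}
corpus = {"d1": "Fine-tuning a dense retriever on labeled data ...", "d2": "Unrelated text ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dev")

# Evaluate the sentence-transformers model behind the (trained) EmbeddingRetriever
score = evaluator(embedding_retriever.embedding_encoder.embedding_model)
print(score)  # main score (MAP by default); track it across epochs/checkpoints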

@sinchanabhat commented

Got it! Thank you so much for the quick reply! @bglearning

masci transferred this issue from deepset-ai/haystack on Sep 26, 2022
@vibha0411 commented

This is a related question, @bglearning.

I see in https://colab.research.google.com/drive/1Tz9GSzre7JfvXDDKe7sCnO0FMuDViMnN?usp=sharing#scrollTo=TD2PZuuNTpQ3 that PseudoLabelGenerator basically runs these steps:

1. Question generation (optional step)
2. Negative mining
3. Pseudo labeling (margin scoring)

But in a case where I already have the positive and negative documents for a question, I would not require the negative mining step. However, as per my understanding, I would still require the margin scoring to perform the training.

So how do I go about this scenario?
Kindly help.

@bglearning (Contributor, Author) commented

Hey @vibha0411,

If you use MultipleNegativesRankingLoss (train_loss='mnrl', currently the default), the scores aren't required. [1]

In fact, for MNRL even having the negative docs is optional, because it considers all other docs in a batch as negatives. The MNRL example colab notebook linked above might be useful: it only uses positive docs, but of course it's good if you already have negatives, and you can pass those too.

Roughly, it would be something like:

embedding_retriever = EmbeddingRetriever(...)
training_data = [
    {"question": ..., "pos_doc": ..., "neg_doc": ...}, 
    ...
]
embedding_retriever.train(training_data=training_data, train_loss='mnrl')  

[1] Margin scoring would be required when training with margin_mse.
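
For completeness: if you did want to train with margin_mse using your handpicked negatives, the scores could presumably be generated with an off-the-shelf cross-encoder, roughly like this (the model choice is just an example; field names follow the sketch above):

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def add_margin_score(example):
    # GPL-style margin: CE(question, pos_doc) - CE(question, neg_doc)
    pos_score, neg_score = cross_encoder.predict([
        (example["question"], example["pos_doc"]),
        (example["question"], example["neg_doc"]),
    ])
    example["score"] = float(pos_score - neg_score)
    return example

training_data = [add_margin_score(ex) for ex in training_data]
embedding_retriever.train(training_data=training_data, train_loss='margin_mse')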

@vibha0411 commented Mar 16, 2023

Thank you for your reply. Yes, I did have a look at MNRL as well. But in my case, since the customer feedback contains explicit negative feedback, I am looking for a loss function that can leverage these handpicked negatives.

Hence I thought it might be more sensible to go with margin scoring, but I'm not sure what the best way to go about it is.

@vibha0411 commented

I think I got your point. You mean that MNRL will consider the negative that we explicitly provide to be the hard negative, so that will help.
Got it! Thanks.

@vibha0411 commented

Also, is there any way to monitor the training?
