Tutorial 09: Update to EmbeddingRetriever Training #35

Open · bglearning opened this issue Aug 15, 2022 · 19 comments

@bglearning (Contributor) commented Aug 15, 2022

Overview

With deepset-ai/haystack#2887, we replaced DPR with EmbeddingRetriever in Tutorial 06.

Now, we might want to do the same for Tutorial 09, which covers training (or fine-tuning) a DPR Retriever model.

Q1. Should we go ahead with this switch? Any reason keeping DPR might be better?

Alternatively, we could create one tutorial for each. I guess it depends on which retriever we want to demonstrate, plus what we think might be valuable for users.

Training EmbeddingRetriever

Only the sentence-encoder variant of EmbeddingRetriever can be trained.

Its train method does some data setup and then calls the fit method on SentenceTransformer (from the sentence_transformers package).

The input data format is:

[
    {"question": ..., "pos_doc": ..., "neg_doc": ..., "score": ...},
    ...
]

It uses MarginMSELoss (as part of the GPL procedure).
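
For concreteness, a minimal sketch of what that train call could look like (the model name, constructor arguments, and field values below are illustrative assumptions, not taken from the tutorial):

from haystack.nodes import EmbeddingRetriever

embedding_retriever = EmbeddingRetriever(
    document_store=document_store,  # an existing, initialized DocumentStore
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)

training_data = [
    # "score" is a teacher margin: cross_encoder(q, pos_doc) - cross_encoder(q, neg_doc)
    {
        "question": "who wrote hamlet",
        "pos_doc": "Hamlet is a tragedy written by William Shakespeare ...",
        "neg_doc": "Hamlet is a small town in North Carolina ...",
        "score": 7.2,
    },
]

embedding_retriever.train(training_data=training_data)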

Q2. If we were to demonstrate its training, which dataset would be best to use? GPL et al. seem to use MS MARCO, but then we need cross-encoder scores for the score field above, right? So there doesn't seem to be a ready, download-and-use dataset available?

RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary)
cc: @mkkuemmel

bglearning changed the title from "Tutorial 11: Update to EmbeddingRetriever Training" to "Tutorial 09: Update to EmbeddingRetriever Training" on Aug 17, 2022

@mkkuemmel (Member) commented Aug 17, 2022

Q1. I think doing a 9A and 9B tutorial would be good? I'd let them coexist. I could also be convinced to replace the existing (DPR) one, but it's always a little painful to "throw away" useful information! 😄

Q2. The data format strikes me as somewhat difficult, also for users, due to the score field. Is there a different loss we could implement so we could use an "easier to create", more accessible data format? @vblagoje @julian-risch do you have any experience with this? What are your suggestions?

@vblagoje (Member) commented

I think 9A and 9B make total sense. But before we dive into EmbeddingRetriever training: does it even make sense now with these v3 models from sentence-transformers? Perhaps one would only need GPL adaptation for particular ___domain data. @mkkuemmel and the team might have better insights from the field...

@mathislucka (Member) commented

There is this dataset with cross-encoder scores for MarginMSELoss: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives

I'd vote for implementing MultipleNegativesRankingLoss: MarginMSE is already used in GPL, and MNRL also yields very good results. What do you mean by v3 models, @vblagoje?

@mathislucka (Member) commented

Oh and training definitely makes sense. If you have labeled data, you will get much better results with training than with the out-of-the-box models.

@vblagoje (Member) commented

Ok, cool, good to know @mathislucka. It's a rough naming scheme Nils used for his MS MARCO (and likely other) models. So the latest models we want are likely the v5 ones trained with MarginMSE loss?

@bglearning (Contributor, Author) commented

So to take stock:

1. Does training EmbeddingRetrievers make sense?
Yes; it definitely helps if labeled data is available.

2. Which sentence-transformer model(s) do we suggest for out-of-the-box use?
It now makes sense to use and promote the v5 models (?)

3. What procedure do we suggest for fine-tuning? What format must the data be in?
Possible options:
- Opt-1: Suggest users convert/collect data into the format with teacher encoder scores. Use MarginMSE loss.
- Opt-2: Add support for MultipleNegativesRankingLoss and suggest its usage (as users wouldn't need teacher encoder scores).
- Opt-3: ...?

Which do we go for?

@mkkuemmel (Member) commented

As for the options, my higher-level opinion is that we should find a good trade-off between "scientific correctness" and feasibility for the users.

About the latter:

> convert/collect data into the format with teacher encoder scores

How feasible is this for users? Can we guide them on how to do it?

About "scientific correctness": which of the losses seems the more sensible one for this task?

@mathislucka (Member) commented Aug 24, 2022

My 2 cents:

> Which sentence-transformer model(s) do we suggest for out-of-the-box use?
> It now makes sense to use and promote the v5 models (?)

The v5 models are only for msmarco. We have seen with clients that all-mpnet-base-v2 or multi-qa-mpnet-base-dot-v1 usually perform best. So I think we should recommend these models. Maybe add multi-qa-MiniLM-L6-cos-v1 as an option for a small and fast model and paraphrase-multilingual-mpnet-base-v2 as a multi-lingual model.
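
A minimal sketch of plugging one of these recommended models in (assuming the Haystack v1 import paths, with an in-memory store just for illustration):

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

# multi-qa-mpnet-base-dot-v1 is trained for dot-product similarity
document_store = InMemoryDocumentStore(embedding_dim=768, similarity="dot_product")
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)  # (re)compute document embeddings with the new model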

> What procedure do we suggest for fine-tuning? What format must the data be in?

Go with Opt-2 as it is simpler than MarginMSE and existing datasets can be re-used.
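
To illustrate the simplicity: with MNRL, the training dicts could presumably be just question/positive pairs, optionally with a hard negative, and no cross-encoder score (placeholder values below):

training_data = [
    {"question": "What is the capital of France?", "pos_doc": "Paris is the capital and largest city of France."},
    {"question": "...", "pos_doc": "...", "neg_doc": "..."},  # hard negative is optional for MNRL
]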

@bglearning (Contributor, Author) commented

Hi,

So based on the discussions above, I'm pivoting to adding MultipleNegativesRankingLoss support to the training of EmbeddingRetriever. Opened an issue for it here: deepset-ai/haystack#3136

I can get back to this tutorial rework once that is resolved/completed.

@sinchanabhat commented

Hi,

Can we expect the tutorial for fine-tuning with EmbeddingRetriever soon, maybe using GPL training data?

@bglearning (Contributor, Author) commented Sep 12, 2022

Hi @sinchanabhat,

Ya, the tutorial is coming soon-ish. Can't commit to a time frame but a median estimate could be end of next week. 😅

In the meantime, you can check out this notebook showcasing GPL training or this one with MultipleNegativesRankingLoss.
(Edit: there is already a tutorial for GPL training based on the first notebook. Please check out that one 😄.)

The latter was a recent change following the discussions above (see details in the PR: deepset-ai/haystack#3164).

Neither notebook is a Tutorial per se (so not as polished), but they might still be helpful.

@sinchanabhat commented

> In the meantime, you can check out this notebook showcasing GPL training or this one with MultipleNegativesRankingLoss. [...]

Thanks a lot for directing me to the notebooks. I have gone through them, and pardon me for asking this (even if my question might sound stupid): when we talk about adapting the retriever to GPL data, doesn't the training/fine-tuning involve early stopping, or taking the best model as the one with the best validation metric? Or is it just running for 5 to 10 epochs and then evaluating how good the retriever is?

@bglearning (Contributor, Author) commented

> doesn't the training/fine-tuning involve early stopping, or taking the best model as the one with the best validation metric? Or is it just running for 5 to 10 epochs and then evaluating how good the retriever is?

Ah yes, it would generally involve monitoring and acting on the validation metric, as you mentioned (for instance, performance might plateau after some steps, as in the GPL paper, Figure 2 + Section 6.1). The tutorial is more of a demonstration of how to set up and perform the training.
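
A sketch of one way to do that monitoring on a held-out dev split: the InformationRetrievalEvaluator is standard sentence-transformers, while the embedding_encoder.embedding_model attribute access is an assumption about Haystack v1 internals:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Held-out dev data: query id -> text, doc id -> text, query id -> set of relevant doc ids
queries = {"q1": "how can I fine-tune a dense retriever?"}
corpus = {"d1": "Fine-tuning a dense retriever on labeled data ...", "d2": "Unrelated text ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dev")

# Evaluate the sentence-transformers model behind the (trained) EmbeddingRetriever
score = evaluator(embedding_retriever.embedding_encoder.embedding_model)
print(score)  # main score (MAP by default); track it across epochs/checkpoints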

@sinchanabhat commented

Got it! Thank you so much for the quick reply! @bglearning

masci transferred this issue from deepset-ai/haystack on Sep 26, 2022
@vibha0411 commented

This is a related question, @bglearning.

I see in https://colab.research.google.com/drive/1Tz9GSzre7JfvXDDKe7sCnO0FMuDViMnN?usp=sharing#scrollTo=TD2PZuuNTpQ3 that PseudoLabelGenerator basically runs these steps:

1. Question generation (optional step)
2. Negative mining
3. Pseudo labeling (margin scoring)

But in a case where I already have the positive and negative documents for a question, I would not require the negative mining step. However, as per my understanding, I would still require the margin scoring to perform the training.

So how do I go about this scenario?
Kindly help.

@bglearning (Contributor, Author) commented

Hey @vibha0411,

If you use MultipleNegativesRankingLoss (train_loss='mnrl', currently the default), the scores aren't required. [1]

In fact, for MNRL even having the negative docs is optional, because it considers all other docs in a batch as negatives. The MNRL example colab notebook linked above might be useful: it only uses positive docs, but of course it's good if you already have negatives, and you can pass those too.

Roughly, it would be something like:

embedding_retriever = EmbeddingRetriever(...)
training_data = [
    {"question": ..., "pos_doc": ..., "neg_doc": ...}, 
    ...
]
embedding_retriever.train(training_data=training_data, train_loss='mnrl')  

[1] Margin scoring would be required when training with margin_mse.
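
For completeness: if you did want to train with margin_mse using your handpicked negatives, the scores could presumably be generated with an off-the-shelf cross-encoder, roughly like this (the model choice is just an example; field names follow the sketch above):

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def add_margin_score(example):
    # GPL-style margin: CE(question, pos_doc) - CE(question, neg_doc)
    pos_score, neg_score = cross_encoder.predict([
        (example["question"], example["pos_doc"]),
        (example["question"], example["neg_doc"]),
    ])
    example["score"] = float(pos_score - neg_score)
    return example

training_data = [add_margin_score(ex) for ex in training_data]
embedding_retriever.train(training_data=training_data, train_loss='margin_mse')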

@vibha0411 commented Mar 16, 2023

Thank you for your reply. Yes, I did have a look at MNRL as well. But in my case, since the customer feedback contains explicit negative feedback, I am looking for a loss function that can leverage these handpicked negatives.

Hence I thought it might be more sensible to go with margin scoring, but I'm not sure what the best way to go about it is.

@vibha0411 commented

I think I got your point. You mean that MNRL will consider the negative that we explicitly provide to be the hard negative, so that will help.
Got it! Thanks.

@vibha0411 commented

Also, is there any way to monitor the training?
