
I am using a pre-trained LLM to generate a representative embedding for an input text, but it is weird that the output embeddings are all the same regardless of the input text.

The code:

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    print("Tokenized inputs:", inputs)
    with torch.no_grad():
        outputs = model(**inputs)
    embedding = outputs.last_hidden_state[0, 0, :].numpy()  # take only the first token's hidden state
    print("Generated embedding:", embedding)
    return embedding

text1 = "this is a test"
text2 = "this is another test"
text3 = "there are other tests"

embedding1 = generate_embedding(text1)
embedding2 = generate_embedding(text2)
embedding3 = generate_embedding(text3)

are_equal = np.array_equal(embedding1, embedding2) and np.array_equal(embedding2, embedding3)

if are_equal:
    print("The embeddings are the same.")
else:
    print("The embeddings are not the same.")

The printed tokens are different, but the printed embeddings are the same. The outputs:

Tokenized inputs: {'input_ids': tensor([[   1,  456,  349,  264, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  456,  349, 1698, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  736,  460,  799, 8079]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
The embeddings are the same.

Does anyone know where the problem is? Many thanks!

1 Answer

You're not slicing the dimensions right at

outputs.last_hidden_state[0, 0, :].numpy()

Q: What is the 0th token in all inputs?

A: Beginning of sentence token (BOS)

Q: So the "embedding" I'm slicing is just the BOS token's hidden state?

A: Try this:

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

model(**tokenizer("", return_tensors='pt')).last_hidden_state

[out]:

tensor([[[-1.7763,  1.9293, -2.2413,  ...,  2.6380, -3.1049,  4.8060]]],
       grad_fn=<MulBackward0>)
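
That is the empty string, i.e. just the BOS token, and its hidden state matches the vector you printed for every input. Because attention in a decoder-only model is causal, position 0 only ever sees the BOS embedding, so its hidden state cannot depend on the rest of the text. A quick check (assuming the same tokenizer as above):

# Position 0 is always the BOS token, regardless of the text.
for text in ["this is a test", "this is another test", "there are other tests"]:
    ids = tokenizer(text, return_tensors='pt')['input_ids']
    print(ids[0, 0].item() == tokenizer.bos_token_id)  # True for every input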

Q: Then, how do I get the embeddings from a decoder-only model?

A: Can you really get a single "embedding" from a decoder-only model? The model outputs one hidden state per token it regresses through, so different texts produce output tensors of different sizes.
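
For example, the sequence dimension of last_hidden_state tracks the number of input tokens (a quick illustration, reusing the model and tokenizer loaded above):

import torch

for text in ["this is a test", "there are other, much longer tests with many more tokens"]:
    with torch.no_grad():
        out = model(**tokenizer(text, return_tensors='pt'))
    # shape is (1, num_tokens, hidden_size) -- the sequence length changes with the text
    print(out.last_hidden_state.shape)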

Q: How do you make it into a single fixed-size vector then?

A: Most probably by pooling the per-token hidden states, e.g. a masked mean over the sequence.
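
For instance, a minimal sketch of masked mean pooling, assuming the same model and tokenizer loaded above (the helper name embed_mean is just for illustration):

import torch

def embed_mean(document):
    inputs = tokenizer(document, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state         # (1, seq_len, hidden_size)
    mask = inputs['attention_mask'].unsqueeze(-1).float()  # (1, seq_len, 1)
    # Average only over real (non-padding) tokens -> one fixed-size vector per text
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled[0].numpy()

print(embed_mean("this is a test")[:5])
print(embed_mean("this is another test")[:5])  # now differs from the first

Mean pooling is only one option; with decoder-only models the last non-padding token's hidden state is another common choice, since it is the only position that has attended to the whole sequence.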


2 Comments

Thank you so much for your detailed answer @alvas. I thought I could use the embedding of the special token at the beginning to represent the whole sequence, just like using the CLS token's embedding to represent a sequence for text classification. It turns out that the BOS token's embedding in this model stays almost the same across different input texts. I will need to at least pool the embeddings of the tokens in the sequence. Thanks!
@Howie: you might want to look at this answer for fetching sentence embeddings from decoder models.
