Pinned content
NLP admins have deemed these posts noteworthy.
Natural Language Processing FAQ
Can you answer these questions?
These questions still don't have an answer.
Vertex AI TEI Deployment Fails for Private Hugging Face Model - "Could not download model artifacts"
Language Model Evaluation with Custom Task - Hugging Face Lighteval
Fine-tuned LLaMA 2–7B with QLoRA, but reloading fails: missing 4bit metadata. Likely saved after LoRA+resize. Need proper 4bit save method
ChunkedEncodingError: ('Connection broken: IncompleteRead(6182 bytes read, 4058 more expected)', IncompleteRead(6182 bytes read, 4058 more expected))
MarianMT fine-tuning for translation: Seq2SeqTrainer stuck
Recommended answers
These answers have been recommended.
Removing strange/special characters from Llama 3.1 model outputs
TL;DR: Use tokenizer.batch_decode(input_ids) instead of rolling your own detokenizer. In long: the official Llama 3.1 has an approval process that might take some time, so this answer will ...
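A minimal sketch of the batch_decode route; the checkpoint name is illustrative (Llama 3.1 itself is gated), and the pad-token line is an assumption since Llama tokenizers ship without one:

    from transformers import AutoTokenizer

    # Illustrative checkpoint; swap in whichever model you were approved for.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

    input_ids = tokenizer(
        ["Hello world", "Bonjour"], padding=True, return_tensors="pt"
    ).input_ids

    # skip_special_tokens=True drops BOS/EOS/pad markers, which are the usual
    # source of "strange" characters in hand-stitched outputs.
    texts = tokenizer.batch_decode(input_ids, skip_special_tokens=True)
    print(texts)  # ['Hello world', 'Bonjour']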
Error while converting the Google Flan-T5 model to ONNX
Use https://huggingface.co/datasets/bakks/flan-t5-onnx instead. To convert google/flan-t5 yourself, see https://huggingface.co/datasets/bakks/flan-t5-onnx/blob/main/exportt5.py: from pathlib import ...
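For a self-contained starting point, one common route is the optimum ONNX exporter; this is a hedged sketch, not necessarily what the linked exportt5.py does, and the output directory name is illustrative:

    from pathlib import Path
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    # export=True converts the PyTorch checkpoint to ONNX on load.
    model = ORTModelForSeq2SeqLM.from_pretrained("google/flan-t5-small", export=True)
    model.save_pretrained(Path("flan-t5-small-onnx"))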
Why does my fine-tuned T5-Base model for a sequence-to-sequence task produce short, incomplete generations?
Because of: labels = tokenizer(targets, max_length=32, padding="max_length", truncation=True) Most probably your model has learned to generate outputs of roughly 32 tokens. Try: ...
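A sketch of the fix, reusing the tokenizer/targets/model names from the answer's training script; 128 is an illustrative budget, pick one that covers your real target lengths:

    # Give the labels enough room so the model does not learn to truncate.
    labels = tokenizer(
        targets,
        max_length=128,        # was 32, which taught the model to stop early
        padding="max_length",
        truncation=True,
    )

    # Allow matching room at inference time as well.
    outputs = model.generate(input_ids, max_new_tokens=128)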
How to save the LLM2Vec model as a HuggingFace PreTrainedModel object?
Wrapping the LLM2Vec object like in https://stackoverflow.com/a/74109727/610569, we can try this: import torch.nn as nn from transformers import PreTrainedModel, PretrainedConfig from ...
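A hedged sketch of that wrapping pattern; the class and attribute names here are illustrative, and it assumes the LLM2Vec object is an nn.Module exposing encode():

    import torch.nn as nn
    from transformers import PreTrainedModel, PretrainedConfig

    class LLM2VecConfig(PretrainedConfig):
        model_type = "llm2vec_wrapper"  # illustrative name

    class LLM2VecWrapper(PreTrainedModel):
        config_class = LLM2VecConfig

        def __init__(self, config, llm2vec_model=None):
            super().__init__(config)
            # llm2vec_model is an already-constructed LLM2Vec object.
            self.l2v = llm2vec_model

        def forward(self, sentences):
            # Delegate to LLM2Vec's encode() to produce sentence embeddings.
            return self.l2v.encode(sentences)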
Mistral model generates the same embeddings for different input texts
You're not slicing the dimensions right at outputs.last_hidden_state[0, 0, :].numpy(). Q: What is the 0th token in all inputs? A: The beginning-of-sentence (BOS) token. Q: So that's the "embeddings ...
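A minimal sketch of pooling over real tokens instead of reading position 0 (the BOS token, which is identical for every input); the checkpoint name and pad-token line are assumptions:

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModel.from_pretrained(name)

    batch = tokenizer(["first text", "a very different text"],
                      padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

    # Mean-pool over non-padding tokens; this varies with the input,
    # unlike the BOS slot at position 0.
    mask = batch["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    print(embeddings.shape)                        # torch.Size([2, dim])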
See what's trending
These are the most active questions in the NLP Collective.
How does the Google "Did you mean?" algorithm work? [closed]
spaCy: Can't find model 'en_core_web_sm' on windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)
What is "entropy and information gain"? [closed]
How to compute the similarity between two text documents?
googletrans stopped working with error 'NoneType' object has no attribute 'group'