Add missing tokenizer test files [still need 2 first time contributors] #16627
Comments
|
Hi, I would like to add tests for |
|
@SaulLu I would like to add tests for Flaubert |
|
Hey I would like to contribute for |
|
Thank you all for offering your help! @Rajathbharadwaj, sure! What do you need help with? Do you need more details on any of the steps listed in the main post? |
|
Hi, first time contributor here! Could I add tests for |
|
@faiazrahman, thank you so much for working on this! Regarding your issue: if you're in the right location, you should have a newly created folder. If that's the case, you'll then need to do something like the following. Keep me posted! |
|
Thanks so much @SaulLu, turns out it was due to not recognizing my installed cookiecutter, so I sorted it out there. |
|
Hi @anmolsjoshi, @tgadeliya, @Rajathbharadwaj and @farahdian, Just a quick message to see how things are going for you and if you have any problems. If you do, please share them! |
|
Thanks @SaulLu! I've been exploring the tokenization test files in the repo, trying to figure out which ones would be a good basis for writing a tokenization test for Splinter... if you have any guidance on this it would be super helpful! |
|
Hey @SaulLu, my apologies, I've been a bit busy. I'll get started ASAP; however, I still didn't understand where exactly I should run the command. Help on this would be appreciated. |
|
Hi @farahdian , Thank you very much for the update! To know where you stand, have you done step 3)? Is it for step 4) that you are looking for a similar tokenizer? |
|
Hi @Rajathbharadwaj, thank you for the update too!
You can run the cookiecutter command from any location. When you run it, it will create a new folder where you are; in that folder you will find a base for the Python test file, which you then need to move into the `tests` folder. Below is an example of the sequence of bash commands I would personally use:

```
(base) username@hostname:~$ cd ~/repos
(base) username@hostname:~/repos$ git clone git@github.com:huggingface/transformers.git
[Install my development setup]
(transformers-dev) username@hostname:~/repos$ cookiecutter transformers/templates/adding_a_missing_tokenization_test/
[Answer the questionnaire]
(transformers-dev) username@hostname:~/repos$ mv cookiecutter-template-Electra/test_tokenization_electra.py transformers/tests/Electra
(transformers-dev) username@hostname:~/repos$ rm -r cookiecutter-template-Electra/
```

Hope that'll help you |
|
Appreciate your patience @SaulLu! Yup, I've done step 3 and generated a test tokenization file with cookiecutter. Now onto working on the setUp method. |
|
@farahdian, this is indeed a very good question: finding the closest tokenizer to draw inspiration from, and identifying the important differences from that tokenizer, is the most interesting part. There are several ways to start.
For the model you're in charge of, @farahdian:
Given these mentions, it seems that Splinter's tokenizer is very similar to Bert's. It would be interesting to confirm this impression and to understand all the differences between SplinterTokenizer and BertTokenizer, so that they are well reflected in the test. |
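One way to confirm such an impression empirically is to run both tokenizers over a handful of sample sentences and look for the first disagreement. The sketch below only assumes each tokenizer is a callable returning a list of tokens; toy tokenizers are used so the snippet is self-contained, but in practice you would pass SplinterTokenizer and BertTokenizer instances (their names here are taken from the discussion above, and the helper itself is hypothetical):

```python
# Sketch: find the first input on which two tokenizers disagree.
# Each tokenizer is assumed to be a callable: str -> list of token strings.

def first_divergence(tokenize_a, tokenize_b, samples):
    """Return (sample, tokens_a, tokens_b) for the first sample where the
    two tokenizers disagree, or None if they agree on every sample."""
    for text in samples:
        a, b = tokenize_a(text), tokenize_b(text)
        if a != b:
            return (text, a, b)
    return None

if __name__ == "__main__":
    # Toy stand-ins: a lowercasing whitespace tokenizer vs. a case-preserving one.
    lower = lambda s: s.lower().split()
    keep = lambda s: s.split()
    print(first_divergence(lower, keep, ["hello world", "Hello world"]))
    # -> ('Hello world', ['hello', 'world'], ['Hello', 'world'])
```

The same loop, fed with real sentences (accents, punctuation, casing, special tokens), gives a quick map of where the two tokenizers actually differ.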
Thank you so much @SaulLu
So did I select correctly, or should I set it to False? Apologies for asking so many questions. In any case, I've now started adding the tests. Thanks for helping once again! |
|
Hi @SaulLu, |
|
@Rajathbharadwaj, I'm happy to help! Especially as your questions will surely be useful for other people.
Some tokenizers use a sentencepiece backend, and you can see that the backend is instantiated here: `src/transformers/models/t5/tokenization_t5.py`, lines 151 to 152 (at commit 3dd57b1).
BertTokenizer, on the contrary, does not use a sentencepiece backend. I hope this helped you! |
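As a rough heuristic (a convention in the slow tokenizers, not a guaranteed API), sentencepiece-backed tokenizers such as T5Tokenizer keep their SentencePieceProcessor in an `sp_model` attribute, so you can probe for it. Dummy stand-in objects are used below so the sketch runs without transformers installed:

```python
# Heuristic check: sentencepiece-backed slow tokenizers in transformers
# conventionally hold a SentencePieceProcessor in an `sp_model` attribute.
# This is a convention, not a guaranteed API, so treat the result as a hint.
from types import SimpleNamespace

def looks_sentencepiece_backed(tokenizer):
    return hasattr(tokenizer, "sp_model")

# Dummy objects standing in for e.g. a T5-like vs. a Bert-like tokenizer:
t5_like = SimpleNamespace(sp_model=object())
bert_like = SimpleNamespace(vocab={})
print(looks_sentencepiece_backed(t5_like))   # True
print(looks_sentencepiece_backed(bert_like)) # False
```

The authoritative answer is still the tokenizer's source file, as the linked `tokenization_t5.py` lines show.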
|
Hi @tgadeliya, thanks for the update!
In your case, I wouldn't be surprised if Longformer uses the same tokenizer as RoBERTa, in which case it seems legitimate to use the same toy tokenizer. Maybe the only check you can do to confirm this hypothesis is to compare the vocabularies of the "main" checkpoints of both models:

```
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/merges.txt
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/vocab.json
!wget https://huggingface.co/roberta-base/raw/main/merges.txt
!wget https://huggingface.co/roberta-base/raw/main/vocab.json
!diff merges.txt merges.txt.1
!diff vocab.json vocab.json.1
```

Turns out the result confirms it! |
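If you prefer to do the comparison in Python rather than with `diff`, a small helper can also tell you *how* two vocabularies differ, not just whether they do. This is a sketch with toy dicts so it is self-contained; with the files downloaded above you would `json.load` the two `vocab.json` files instead:

```python
import json  # used with real vocab.json files, see the comment below

def vocab_diff(vocab_a, vocab_b):
    """Summarize differences between two token -> id vocabularies.

    Returns (tokens only in a, tokens only in b, shared tokens with different ids).
    """
    only_a = set(vocab_a) - set(vocab_b)
    only_b = set(vocab_b) - set(vocab_a)
    remapped = {t for t in set(vocab_a) & set(vocab_b) if vocab_a[t] != vocab_b[t]}
    return only_a, only_b, remapped

# With the downloaded files you would do something like:
#   vocab_a = json.load(open("vocab.json"))
#   vocab_b = json.load(open("vocab.json.1"))
# Toy data for illustration:
longformer_like = {"<s>": 0, "hello": 1}
roberta_like = {"<s>": 0, "hello": 1}
print(vocab_diff(longformer_like, roberta_like))  # (set(), set(), set()) -> identical
```

Three empty sets means the vocabularies are identical, which is the result the `diff` commands above gave for Longformer and RoBERTa.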
|
Hi, I'm happy to take |
|
I'd like to work on ConvBert. |
@SaulLu I'm having trouble identifying ConvBert's 'reference' checkpoint files on the hub. Would you kindly provide more guidance on this? |
|
Hi @elusenji, YituTech/conv-bert-base in particular is a reference checkpoint for ConvBert. Is this what you were having trouble with? |
|
Yes, this helps! |
|
Hi @SaulLu, I am happy to write tests for |



Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.
Tokenizers concerned

Not yet claimed:
- LED
- RetriBert

Claimed:

With an ongoing PR:

With an accepted PR:
- none
How to contribute?

1. Claim a tokenizer

   a. Choose a tokenizer from the list of "not yet claimed" tokenizers.
   b. Check that no one in the messages of this issue has indicated that they are taking care of this tokenizer.
   c. Post a message in the issue saying that you are handling this tokenizer.

2. Create a local development setup (if you have not already done so). See the "start-contributing-pull-requests" section of the Contributing guidelines, where everything is explained. Don't be afraid of step 5: for this contribution you will only need to run locally the tests you add.

3. Follow the instructions in the README inside the `templates/adding_a_missing_tokenization_test` folder to generate the template with cookiecutter for the new test file you will be adding. Don't forget, at the end of the template generation, to move the new test file into the sub-folder of the `tests` folder named after the model for which you are adding the test file. Some details about the questionnaire, assuming that the lowercase model name is `brand_new_bert`:
   - answer True if there is a `tokenization_brand_new_bert.py` file in the folder `src/transformers/models/brand_new_bert`;
   - answer True if there is a `tokenization_brand_new_bert_fast.py` file in the folder `src/transformers/models/brand_new_bert`;
   - answer True if the `tokenization_brand_new_bert.py` file uses sentencepiece. If this tokenizer doesn't have a `tokenization_brand_new_bert.py` file, set it to False.

4. Complete the `setUp` method in the generated test file; you can take inspiration from how it is done for the other tokenizers.

5. Try to run all the added tests. It is possible that some tests will not pass, and it is important to understand why: sometimes the common test is not suited to a tokenizer, and sometimes a tokenizer can have a bug. You can also look at what is done in similar tokenizer tests. If there are big problems or you don't know what to do, we can discuss it in the PR (step 7).

6. (Bonus) Try to get a good understanding of the tokenizer and add custom tests for it.

7. Open a PR with the new test file. Remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.
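As an illustration of step 4, the `setUp` methods of existing tokenization tests typically build a tiny toy vocabulary, write it to a temporary directory, and later point the tokenizer class at it. The sketch below shows only that pattern; the file name, vocab content, and the `BrandNewBertTokenizer` mentioned in the comment are illustrative, and a real test additionally inherits from `TokenizerTesterMixin`:

```python
import os
import tempfile
import unittest

class ToyTokenizationTest(unittest.TestCase):
    """Illustrative pattern only: real transformers tests inherit from
    TokenizerTesterMixin and instantiate the model's actual tokenizer class."""

    def setUp(self):
        super().setUp()
        self.tmpdirname = tempfile.mkdtemp()
        # A tiny WordPiece-style toy vocab, as BERT-like tests use.
        vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed",
                        "wa", "un", "runn", "##ing"]
        self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as f:
            f.write("\n".join(vocab_tokens) + "\n")
        # A real test would now do something like (hypothetical class name):
        #   self.tokenizer = BrandNewBertTokenizer(self.vocab_file)
```

Keeping the vocab tiny makes the expected tokenizations easy to write out by hand in the individual test methods.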
Tips
Do not hesitate to read the questions and answers in this issue.