The Wayback Machine - https://web.archive.org/web/20201018095835/https://github.com/pytorch/text/pull/707
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: custom UNK token conversion error #707

Open
wants to merge 9 commits into
base: master
from

Conversation

@ohke
Copy link

@ohke ohke commented Mar 13, 2020

Bugfix: #618, #706

Newly, this changes adds unk_token argument to build_vocab method for set by Field.
Also, for backward compatibility, this PR leaves Vocab.UNK as default token.

@@ -34,7 +34,8 @@ class Vocab(object):
UNK = '<unk>'

def __init__(self, counter, max_size=None, min_freq=1, specials=['<unk>', '<pad>'],
vectors=None, unk_init=None, vectors_cache=None, specials_first=True):
vectors=None, unk_init=None, vectors_cache=None, specials_first=True,
unk_token=None):

This comment has been minimized.

@zhangguanheng66

zhangguanheng66 Mar 13, 2020
Collaborator

Not sure if we want to add extra argument to the API. Ideally we should put all the special tokens in specials. stoi should return '<unk>' id if the token is not found.

This comment has been minimized.

@bentrevett

bentrevett Mar 13, 2020
Contributor

How will we know which of the special tokens is the unk token without explicitly passing it as an argument?

This comment has been minimized.

@zhangguanheng66

zhangguanheng66 Mar 13, 2020
Collaborator

I'm not against the hard-code unk token and assign it to the vocab class. I just don't like the idea to make the API longer and longer.

This comment has been minimized.

@ohke

ohke Mar 14, 2020
Author

Thanks for your comments.

I don't want to increase the arguments to maintain if possible, but Field class accespts custom unk_token.
Field class adds unk_token to specials in build_vocab method, so there is no problem when accessing via Field object.

We can include arguments check if unk_token exists in specials , in Vocab initialize method.

This comment has been minimized.

@ohke

ohke Mar 16, 2020
Author

I added one argument validation and its test

ohke added 4 commits Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.