The Wayback Machine - https://web.archive.org/web/20200521144507/https://github.com/uber/ludwig/issues/648

[Feature Request] Allow using custom languages/models for spaCy NLP #648

Open
ZeroAurora opened this issue Mar 6, 2020 · 1 comment

ZeroAurora commented Mar 6, 2020

Is your feature request related to a problem? Please describe.
Other related issues: #408 #251
I trained a Chinese model for spaCy, linked it to [spacy's package folder]/data/zh (using spacy link), and wanted to use it with Ludwig. However, when I set the tokenizer in Ludwig's config, I received an error telling me there is no way to load the Chinese model:

ValueError: Key chinese_tokenizer not supported, available options: dict_keys(['characters', 'space', 'space_punct', 'underscore', 'comma', 'untokenized' (...) 'bert'])

Describe the use case
Allowing custom languages/models for spaCy would let users working in other languages process their text more quickly and easily.

Describe the solution you'd like
Here's the current solution...

input_features:
  -
    name: input
    type: text
    preprocessing:
      word_tokenizer: english_tokenize

...which I think could be changed to this...

input_features:
  -
    name: input
    type: text
    preprocessing:
      word_tokenizer: spacy_tokenize
      spacy_model: zh  # or en, xx, etc.

Describe alternatives you've considered
I've considered not using spaCy and instead using a custom script that simply splits sentences into words with a segmenter such as jieba. However, with that method I would lose nearly all the benefits of NLP.
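For reference, the custom-script alternative might look like the minimal sketch below. A real script would call a proper segmenter such as jieba; this stand-in (the function name `naive_chinese_tokenize` is made up for illustration) just emits one token per CJK character and keeps runs of other characters together, which is roughly what you get without real word segmentation:

```python
# Minimal sketch of the "custom script" alternative: tokenize text
# without spaCy. A real implementation would call a segmenter such as
# jieba; this fallback emits one token per CJK character and keeps
# runs of non-CJK, non-space characters (e.g. Latin words) together.

def naive_chinese_tokenize(text):
    tokens = []
    word = ''
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':  # CJK Unified Ideographs block
            if word:
                tokens.append(word)
                word = ''
            tokens.append(ch)  # each CJK character becomes its own token
        elif ch.isspace():
            if word:
                tokens.append(word)
                word = ''
        else:
            word += ch  # accumulate non-CJK characters into a word
    if word:
        tokens.append(word)
    return tokens
```

This shows why the approach loses most NLP benefits: there is no lemmatization, POS tagging, or real word segmentation, only character splitting.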

Additional context
I think that's all :)
I don't know whether this suggestion will be accepted, but if it gets implemented I would be very thankful.
BTW, since I'm not a native English speaker, there may be some mistakes. Please don't mind them :p

w4nderlust (Collaborator) commented Mar 6, 2020

We are thinking about changing the way you define the tokenizer to be more flexible, and that would allow you to do what you are looking for.

In the meantime, if you are using the API, you can do the following:

from ludwig.utils.nlp_utils import language_module_registry
from ludwig.utils.nlp_utils import load_nlp_pipeline, process_text
from ludwig.utils.strings_utils import tokenizer_registry
from ludwig.utils.strings_utils import BaseTokenizer

# Map the 'zh' language code to the spaCy model to load.
language_module_registry['zh'] = 'your_model_name'  # for example 'en_core_web_sm'

class ChineseTokenizer(BaseTokenizer):
    def __call__(self, text):
        # Run the text through the spaCy pipeline registered for 'zh'.
        return process_text(text, load_nlp_pipeline('zh'))

# Register the tokenizer under the key used in the model definition.
tokenizer_registry['chinese_tokenizer'] = ChineseTokenizer

After you do this, you can refer to chinese_tokenizer in your model configuration within the same script where you run the code above.
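With that registration in place, the model definition could then reference the new key in the same way as the built-in tokenizers (a sketch; the feature name `input` is arbitrary):

```yaml
input_features:
  -
    name: input
    type: text
    preprocessing:
      word_tokenizer: chinese_tokenizer
```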
