The Wayback Machine - https://web.archive.org/web/20200918200828/https://github.com/google/sentencepiece/issues/385
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Python Executables for training, encoding and decoding #385

Open
bricksdont opened this issue Aug 26, 2019 · 5 comments
Open

Comments

@bricksdont
Copy link

@bricksdont bricksdont commented Aug 26, 2019

It would be helpful if installing the Python package via pip would also install Python binaries. What I mean is that the following should be possible:

pip install sentencepiece
spm_train --input data.txt --model-prefix sp

This would make the package much more usable out of the box.

Compiling the source code leads to such binaries, but has many dependencies that are trickier to install.

Thanks for considering!

@bricksdont bricksdont changed the title Python Executables for training, encoding and decoding Feature Request: Python Executables for training, encoding and decoding Aug 26, 2019
@taku910
Copy link
Collaborator

@taku910 taku910 commented Aug 27, 2019

Isn't it enough to use Python API? They support all the features of spm_train and spm_encode.

>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
@bricksdont
Copy link
Author

@bricksdont bricksdont commented Aug 27, 2019

That is also useful, I am just saying that a Python executable would make the library even more accessible, because there would be no need to check carefully which functions in

https://github.com/google/sentencepiece/blob/master/python/sentencepiece.py

are equivalent to a call to, for instance, spm_encode

@taku910
Copy link
Collaborator

@taku910 taku910 commented Aug 28, 2019

spm_{train,endode} are pure C++ program/binary so we should not release them as Python packages. If we really want to have them, we would like to re-implement these tools in python and sentencepiece python wrapper.

@taku910 taku910 closed this Aug 28, 2019
@taku910 taku910 reopened this Aug 28, 2019
@bricksdont
Copy link
Author

@bricksdont bricksdont commented Aug 28, 2019

Thank you for your answer - and feel free to close the issue if this is not feasible or reasonable.

@taku910
Copy link
Collaborator

@taku910 taku910 commented Sep 1, 2019

If we decide to release these command line tools, we want to introduce py_spm_train or py_spm_encode, which are re-implementation of C++ library and installed under python site-package directory.

We don't have enough bandwidht for working it. Any contributions are appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.