TokenEstimator

TokenEstimator is a Rails gem that provides functionality to count tokens in various file formats and text inputs using different tokenizers.

Installation

Add this line to your application's Gemfile:

gem "token_counter"

And then execute:

bundle install

Methods

`count_tokens_from_text`

Count tokens from a given text.

    require "token_estimator"

    tokenizer_name = "gpt2"
    estimator = TokenEstimator::Estimator.new(tokenizer_name)

    text = "Your sample text here."
    token_estimation = estimator.count_tokens_from_text(text)

    puts "Token estimation: #{token_estimation}"

`count_tokens_from_file`

Count tokens from a file. The file type is determined by the file extension.

    require "token_estimator"

    file_path = "spec/fixtures/files/lorem.pdf"
    tokenizer_name = "gpt2"
    estimator = TokenEstimator::Estimator.new(tokenizer_name)

    token_estimation = estimator.count_tokens_from_file(file_path)

    puts "Token estimation: #{token_estimation}"

`count_tokens_from_excel_file`

Counts tokens from an Excel (.xlsx) file.

`count_tokens_from_csv_file`

Counts tokens from a CSV file.

`count_tokens_from_pdf_file`

Counts tokens from a PDF file.

`count_tokens_from_txt_file`

Counts tokens from a plain text (.txt) file.

`count_tokens_from_markdown_file`

Counts tokens from a Markdown (.md) file.

`count_tokens_from_json_file`

Counts tokens from a JSON file.

`count_tokens_from_html_file`

Counts tokens from an HTML file.

`count_tokens_from_json`

Counts tokens from a JSON object.

`count_tokens_from_html`

Counts tokens from an HTML string.

`TokenEstimator::Estimator::SUPPORTED_FILE_TYPES`

Return the supported file types.

Roadmap

Here is a checklist of the formats we currently support for token counting and those we plan to support in the future:

PDF
Markdown (.md)
CSV
Excel (XLSX)
JSON
Plain Text
HTML
DOCX (Word Documents)
XML
RTF (Rich Text Format)
PNG
JPG

Error Handling

If you try to count tokens from an unsupported file type, the gem will raise an UnsupportedFileTypeError

begin
  token_count = estimator.count_tokens_from_file("path/to/your/file.unsupported")
rescue TokenEstimator::UnsupportedFileTypeError => e
  puts e.message
end

Contributing

Contribution directions go here. You can fork the repository, create a new branch, and submit a pull request for review. Please make sure to write tests for your contributions and follow the coding standards set in the project.

License

The gem is available as open source under the terms of the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
bin		bin
lib		lib
sig		sig
spec		spec
.gitignore		.gitignore
.rspec		.rspec
.standard.yml		.standard.yml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE.txt		LICENSE.txt
README.md		README.md
Rakefile		Rakefile
token_estimator-0.1.0.gem		token_estimator-0.1.0.gem
token_estimator-0.1.1.gem		token_estimator-0.1.1.gem
token_estimator-0.1.2.gem		token_estimator-0.1.2.gem
token_estimator.gemspec		token_estimator.gemspec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

TokenEstimator

Installation

Methods

`count_tokens_from_text`

`count_tokens_from_file`

`count_tokens_from_excel_file`

`count_tokens_from_csv_file`

`count_tokens_from_pdf_file`

`count_tokens_from_txt_file`

`count_tokens_from_markdown_file`

`count_tokens_from_json_file`

`count_tokens_from_html_file`

`count_tokens_from_json`

`count_tokens_from_html`

`TokenEstimator::Estimator::SUPPORTED_FILE_TYPES`

Roadmap

Error Handling

Contributing

License

About

Uh oh!

Releases

Packages

Languages

License

aemabit/token_estimator

Folders and files

Latest commit

History

Repository files navigation

TokenEstimator

Installation

Methods

count_tokens_from_text

count_tokens_from_file

count_tokens_from_excel_file

count_tokens_from_csv_file

count_tokens_from_pdf_file

count_tokens_from_txt_file

count_tokens_from_markdown_file

count_tokens_from_json_file

count_tokens_from_html_file

count_tokens_from_json

count_tokens_from_html

TokenEstimator::Estimator::SUPPORTED_FILE_TYPES

Roadmap

Error Handling

Contributing

License

About

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`count_tokens_from_text`

`count_tokens_from_file`

`count_tokens_from_excel_file`

`count_tokens_from_csv_file`

`count_tokens_from_pdf_file`

`count_tokens_from_txt_file`

`count_tokens_from_markdown_file`

`count_tokens_from_json_file`

`count_tokens_from_html_file`

`count_tokens_from_json`

`count_tokens_from_html`

`TokenEstimator::Estimator::SUPPORTED_FILE_TYPES`

Packages