I'm curious how many tokens the models have seen during training. The repo mentions the dataset, but not the token totals.
For reference, the repo says: "This checkpoint is trained on the stricter permissive subset of the deduplicated version of the Stack dataset (v1.1). Supported languages (and frameworks) are as follows: c, c++, c-sharp, dart, go, java, javascript, kotlin, lua, php, python, ruby, rust, scala, shell, sql, swift, typescript, vue."
Thanks!