sentencepiece
https://github.com/google/sentencepiece
C++
Unsupervised text tokenizer for Neural Network-based text generation.
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
C++ not yet supported
0 Subscribers
Help out
- Issues
  - reject out-of-bounds root unit in DoubleArray::validate
  - [Security] Out-of-bounds read when loading a malicious tokenizer model via the precompiled_charsmap trie
  - Migrate SentencePiece Python wrapper from SWIG to pybind11
  - Migrate from SWIG to pybind11
  - Deprecates unused methods
  - OSS-Fuzz issue 518026690
  - add --split_digits mode
  - riscv64 distribution?
  - Use upb instead of protobuf
  - Deprecate differential privacy features.
- Docs
  - C++ not yet supported