sentencepiece
https://github.com/google/sentencepiece
C++
Unsupervised text tokenizer for Neural Network-based text generation.
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
C++ not yet supported
0 Subscribers
Help out
- Issues
  - add faster BPE learning method: O(N) -> O(log N) per merge. 10x - 20x speedup or more (on large settings)
  - Pass explicit key lengths in BuildTrie and add bounds checks to piece accessors
  - Bump the build-time-deps group across 1 directory with 3 updates
  - OSS-Fuzz issue 497277591
  - OSS-Fuzz issue 497277654
  - OSS-Fuzz issue 497277546
  - Bump cryptography from 46.0.5 to 46.0.6 in /.github/workflows/requirements
  - Bump requests from 2.32.4 to 2.33.0 in /.github/workflows/requirements
  - SentencePieceProcessor::SampleEncodeAndScore(...) segfaults for a single quote with wor=true and include_best=true
  - Add riscv64 (linux_riscv64) wheel to PyPI releases
- Docs
  - C++ not yet supported