sentencepiece
https://github.com/google/sentencepiece
C++
Unsupervised text tokenizer for Neural Network-based text generation.
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
C++ not yet supported
0 Subscribers
Help out
- Issues
  - Allow whitespace-only pieces
  - High frequency token segmented into letter sequence when input is a tsv file
  - Evaluate Profile-Guided Optimization (PGO)
  - Tips for Termux installation
  - A library that conflicts with the use of protobuf in vcpkg
  - A recent EMNLP work to share about task-adaptive tokenization with variable segmentation
  - Unexpected behavior with sampling of repeated character sequence.
  - Duplicate tokens in BPE vocabulary
  - Patches carried by conda-forge for packaging sentencepiece
  - Tokens Chunking to respect Language Word Boundaries
- Docs
  - C++ not yet supported