A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports encoding and decoding with UTF-8.
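As a hedged illustration of the encode/decode round-trip such a tokenizer supports (a minimal Python sketch with an assumed `merges` mapping of byte-pair to token id, not the C++ repository's actual API):

```python
def encode(text, merges):
    # Start from raw UTF-8 bytes, then apply learned merges in the
    # order they were learned (lowest merge id first).
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                out.append(new_id)  # collapse the pair into one token
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, merges):
    # Expand each merged id back to its constituent bytes, then to text.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

For example, with a single learned merge `{(104, 101): 256}` (the bytes for "he"), `encode("hello", merges)` yields `[256, 108, 108, 111]`, and `decode` inverts it exactly.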
Teaching transformer-based architectures
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
BPE tokenizer for LLMs in Pure Zig
🐍 A fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
Build a light-weight Llama from scratch, based on course Stanford CS336 2025.
Transformer Models for Humorous Text Generation. Fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.
This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.
NATural LANguage processing
C89, single-header, nostdlib byte pair encoding algorithm
Tokenizer Chopper is an implementation of a text tokenizer and detokenizer using Byte Pair Encoding (BPE) for modern LLM systems.
A parallel and minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
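As a hedged sketch of what such a minimal from-scratch BPE trainer looks like (illustrative Python only, not this repository's code — the function names here are assumptions), the core loop counts adjacent byte pairs and greedily merges the most frequent one:

```python
from collections import Counter

def get_pair_counts(ids):
    # Count adjacent token-id pairs in the sequence.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Start from raw UTF-8 bytes; repeatedly merge the most frequent pair,
    # assigning new token ids from 256 upward.
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids
```

On the classic example string "aaabdaaabac", three merges first fuse "aa" into a single token, then build longer units on top of it, shrinking the sequence with each pass.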
Byte Pair Encoding Implementation From Scratch (Rust)
in dev ...
A lightweight, educational implementation of Byte-Pair Encoding (BPE) tokenization with integrated Skip-Gram style embeddings, built from scratch in Python. Built for educational purposes and to explore how an LLM works using Shakespearian text for training.