A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports encoding and decoding with UTF-8.
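As a hedged illustration of the encode/decode round-trip such a tokenizer supports (a minimal Python sketch with an assumed `merges` mapping of byte-pair to token id, not the C++ repository's actual API):

```python
def encode(text, merges):
    # Start from raw UTF-8 bytes, then apply learned merges in the
    # order they were learned (lowest merge id first).
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                out.append(new_id)  # collapse the pair into one token
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, merges):
    # Expand each merged id back to its constituent bytes, then to text.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

For example, with a single learned merge `{(104, 101): 256}` (the bytes for "he"), `encode("hello", merges)` yields `[256, 108, 108, 111]`, and `decode` inverts it exactly.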
Teaching transformer-based architectures
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
BPE tokenizer for LLMs in Pure Zig
🐍 A fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
Build a light-weight Llama from scratch, based on course Stanford CS336 2025.
Transformer Models for Humorous Text Generation. Fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.
This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.
NATural LANguage processing
C89, single-header, nostdlib byte pair encoding algorithm
Tokenizer Chopper is an implementation of a text tokenizer and detokenizer using Byte Pair Encoding (BPE) for modern LLM systems.
A parallel and minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
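As a hedged sketch of what such a minimal from-scratch BPE trainer looks like (illustrative Python only, not this repository's code — the function names here are assumptions), the core loop counts adjacent byte pairs and greedily merges the most frequent one:

```python
from collections import Counter

def get_pair_counts(ids):
    # Count adjacent token-id pairs in the sequence.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Start from raw UTF-8 bytes; repeatedly merge the most frequent pair,
    # assigning new token ids from 256 upward.
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids
```

On the classic example string "aaabdaaabac", three merges first fuse "aa" into a single token, then build longer units on top of it, shrinking the sequence with each pass.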
Byte Pair Encoding Implementation From Scratch (Rust)
in dev ...
A lightweight, educational implementation of Byte-Pair Encoding (BPE) tokenization with integrated Skip-Gram style embeddings, built from scratch in Python. Built for educational purposes and to explore how an LLM works using Shakespearian text for training.