Introduction
JTokkit is a fast and efficient tokenizer for natural language processing tasks that use OpenAI models. It provides an easy-to-use interface for tokenizing input text, for example to count the tokens a request to the gpt-3.5-turbo model will require. The library grew out of the need for the JVM ecosystem to have capabilities similar to those the tiktoken library provides for Python.
Features
✅ Implements encoding and decoding via r50k_base, p50k_base, p50k_edit and cl100k_base
✅ Easy-to-use API
✅ Easy extensibility for custom encoding algorithms
✅ Zero Dependencies
✅ Supports Java 8 and above
✅ Fast and efficient performance
Performance
JTokkit reaches 2-3 times the throughput of a comparable tokenizer. See the benchmarks for details.
Installation
You can install JTokkit by adding the following dependency to your Maven project:
<dependency>
    <groupId>com.knuddels</groupId>
    <artifactId>jtokkit</artifactId>
    <version>1.1.0</version>
</dependency>
Or alternatively using Gradle:
dependencies {
    implementation 'com.knuddels:jtokkit:1.1.0'
}
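Once the dependency is on the classpath, counting the tokens for a gpt-3.5-turbo request might look roughly like the sketch below. It assumes the 1.x API (Encodings.newDefaultEncodingRegistry, getEncodingForModel, countTokens, encode and decode); the class name TokenCountExample and its helper methods are hypothetical and only serve the illustration.

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

// Hypothetical example class; not part of JTokkit itself.
public class TokenCountExample {

    // Create the registry once and reuse it for all lookups.
    private static final EncodingRegistry REGISTRY = Encodings.newDefaultEncodingRegistry();

    // Look up the encoding that the gpt-3.5-turbo model uses.
    private static final Encoding ENCODING = REGISTRY.getEncodingForModel(ModelType.GPT_3_5_TURBO);

    // Returns the number of tokens the given text encodes to.
    static int countTokens(String text) {
        return ENCODING.countTokens(text);
    }

    // Encodes the text to token ids and decodes them back to a string.
    static String roundTrip(String text) {
        return ENCODING.decode(ENCODING.encode(text));
    }

    public static void main(String[] args) {
        String prompt = "This is a sample sentence.";
        System.out.println("Token count: " + countTokens(prompt));
        System.out.println("Round trip:  " + roundTrip(prompt));
    }
}
```

The registry and the encoding are held in static fields here so that they are created once rather than on every call. Note that countTokens covers only the text itself; any per-message overhead a chat request adds is not included.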