Skip to main content

Extending JTokkit

You may want to extend JTokkit and re-use its registry to support additional byte pair encodings or even completely custom encodings. To do so you have two options.

Implementing the Encoding interface

Implement the Encoding interface and register it with the EncodingRegistry. Make sure that the name you return from Encoding#getName is unique and that your implementation is thread-safe. It will be cached and reused by the EncodingRegistry.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding customEncoding = new CustomEncoding();
registry.register(customEncoding);

// Get the encoding from the registry
Encoding encodingFromRegistry = registry.getEncoding("custom-name");

Adding a new byte pair encoding

You can add a new byte pair encoding by specifying the necessary parameters.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
GptBytePairEncodingParams params = new GptBytePairEncodingParams(
"custom-name",
Pattern.compile("some custom pattern"),
encodingMap,
specialTokenEncodingMap
);
registry.registerGptBytePairEncoding(params);

// Get the encoding from the registry
Encoding encodingFromRegistry = registry.getEncoding("custom-name");

Reference EncodingFactory for more details on the parameters and examples on how those parameters are used for the pre-defined encodings.