Class: WordPieceTokenizer
text/WordpieceTokenizer.WordPieceTokenizer
Constructors
constructor
• new WordPieceTokenizer(config)
Constructs a tokenizer from a WordPieceTokenizerConfig object.
Parameters
Name | Type | Description |
---|---|---|
config | WordPieceTokenizerConfig | A tokenizer configuration object that specifies the vocabulary, special tokens, etc. |
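A minimal construction sketch. The import path and the `vocab` field (a newline-delimited vocabulary string) are assumptions about how the class and WordPieceTokenizerConfig are exposed in your setup; check the config type for the exact field names and special-token options.

```typescript
// Hypothetical import path; adjust to wherever WordPieceTokenizer is exported in your project.
import {text} from 'react-native-pytorch-core';

// The `vocab` field and its newline-delimited format are assumptions about
// WordPieceTokenizerConfig; consult the config type for the exact fields.
const tokenizer = new text.WordPieceTokenizer({
  vocab: '[PAD]\n[UNK]\n[CLS]\n[SEP]\nplay\n##ing\nthe\ngame',
});
```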
Methods
decode
▸ decode(tokenIds): string
Decodes an array of token IDs into a string using the vocabulary.
Parameters
Name | Type | Description |
---|---|---|
tokenIds | number[] | An array of token IDs derived from the model output. |
Returns
string
A string decoded from the model output.
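A hedged usage sketch, reusing the tokenizer constructed above. The token IDs are illustrative placeholders keyed to the toy vocabulary in the constructor example, not values from a real model.

```typescript
// Illustrative token IDs standing in for a model's output; real IDs depend
// on the vocabulary and the model.
const outputIds: number[] = [4, 5, 6, 7];

// decode maps the IDs back through the vocabulary into a string.
const decoded: string = tokenizer.decode(outputIds);
console.log(decoded); // e.g. 'playing the game' with the toy vocabulary above
```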
encode
▸ encode(text): number[]
Encodes the raw text input to an NLP model into an array of numbers, which can be converted into a tensor.
Parameters
Name | Type | Description |
---|---|---|
text | string | The raw text input to the model. |
Returns
number[]
An array of numbers, which can then be used to create a model input tensor with the torch.tensor API.
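A sketch of turning encoded IDs into a model input, reusing the tokenizer from the constructor example and assuming torch is exported from the same package. The outer array adds a batch dimension; whether a specific integer dtype must be passed via the tensor options depends on the model.

```typescript
import {torch} from 'react-native-pytorch-core'; // assumed export

// Encode raw text into token IDs...
const inputIds: number[] = tokenizer.encode('playing the game');

// ...and wrap them in a tensor for the model. The outer [ ] adds a batch
// dimension; pass a dtype option here if your model expects integer inputs
// of a specific width.
const inputTensor = torch.tensor([inputIds]);
```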
tokenize
▸ tokenize(text): string[]
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
Parameters
Name | Type | Description |
---|---|---|
text | string | The raw text input to the model. |
Returns
string[]
An array of tokens from the vocabulary representing the input text.
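A short tokenize sketch, again reusing the tokenizer and toy vocabulary assumed in the constructor example; the exact pieces depend entirely on the vocabulary you load.

```typescript
// Greedy longest-match-first splitting into word pieces; fragments missing
// from the vocabulary map to the unknown token (commonly '[UNK]').
const pieces: string[] = tokenizer.tokenize('playing the game');
// e.g. ['play', '##ing', 'the', 'game'] with the toy vocabulary above
```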