Class: WordPieceTokenizer
text/WordpieceTokenizer.WordPieceTokenizer
Constructors
constructor
• new WordPieceTokenizer(config)
Constructs a tokenizer from a WordPieceTokenizerConfig object.
Parameters
Name | Type | Description |
---|---|---|
config | WordPieceTokenizerConfig | A tokenizer configuration object that specifies the vocabulary, special tokens, etc. |
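A minimal construction sketch. The import path and the `vocab` field (a newline-delimited vocabulary string) are assumptions about how the class and WordPieceTokenizerConfig are exposed in your setup; check the config type for the exact field names and special-token options.

```typescript
// Hypothetical import path; adjust to wherever WordPieceTokenizer is exported in your project.
import {text} from 'react-native-pytorch-core';

// The `vocab` field and its newline-delimited format are assumptions about
// WordPieceTokenizerConfig; consult the config type for the exact fields.
const tokenizer = new text.WordPieceTokenizer({
  vocab: '[PAD]\n[UNK]\n[CLS]\n[SEP]\nplay\n##ing\nthe\ngame',
});
```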
Methods
decode
▸ decode(tokenIds): string
Decodes an array of token IDs into a string using the vocabulary.
Parameters
Name | Type | Description |
---|---|---|
tokenIds | number[] | An array of token IDs derived from the model output. |
Returns
string
A string decoded from the model output.
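A hedged usage sketch, reusing the tokenizer constructed above. The token IDs are illustrative placeholders keyed to the toy vocabulary in the constructor example, not values from a real model.

```typescript
// Illustrative token IDs standing in for a model's output; real IDs depend
// on the vocabulary and the model.
const outputIds: number[] = [4, 5, 6, 7];

// decode maps the IDs back through the vocabulary into a string.
const decoded: string = tokenizer.decode(outputIds);
console.log(decoded); // e.g. 'playing the game' with the toy vocabulary above
```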
encode
▸ encode(text): number[]
Encodes the raw text input to an NLP model into an array of numbers, which can be converted into a tensor.
Parameters
Name | Type | Description |
---|---|---|
text | string | The raw text input to the model. |
Returns
number[]
An array of numbers, which can then be used to create a model input tensor with the torch.tensor API.
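A sketch of turning encoded IDs into a model input, reusing the tokenizer from the constructor example and assuming torch is exported from the same package. The outer array adds a batch dimension; whether a specific integer dtype must be passed via the tensor options depends on the model.

```typescript
import {torch} from 'react-native-pytorch-core'; // assumed export

// Encode raw text into token IDs...
const inputIds: number[] = tokenizer.encode('playing the game');

// ...and wrap them in a tensor for the model. The outer [ ] adds a batch
// dimension; pass a dtype option here if your model expects integer inputs
// of a specific width.
const inputTensor = torch.tensor([inputIds]);
```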
tokenize
▸ tokenize(text): string[]
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
Parameters
Name | Type | Description |
---|---|---|
text | string | The raw text input to the model. |
Returns
string[]
An array of tokens from the vocabulary representing the input text.
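A short tokenize sketch, again reusing the tokenizer and toy vocabulary assumed in the constructor example; the exact pieces depend entirely on the vocabulary you load.

```typescript
// Greedy longest-match-first splitting into word pieces; fragments missing
// from the vocabulary map to the unknown token (commonly '[UNK]').
const pieces: string[] = tokenizer.tokenize('playing the game');
// e.g. ['play', '##ing', 'the', 'game'] with the toy vocabulary above
```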