textattack.models.tokenizers package

Tokenizers for Model Wrapper

Glove Tokenizer

class textattack.models.tokenizers.glove_tokenizer.GloveTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, max_length=256)

Bases: textattack.models.tokenizers.glove_tokenizer.WordLevelTokenizer

A word-level tokenizer with GloVe 200-dimensional vectors.

Input text is lowercased, since GloVe vectors are lowercased.

batch_encode(input_text_list)

The batch equivalent of encode.

convert_ids_to_tokens(ids)

encode(text)

Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

Parameters
  • sequence – InputSequence: The sequence to encode. It can be either raw text or pre-tokenized, according to the is_pretokenized argument:

    • If is_pretokenized=False: InputSequence is expected to be str

    • If is_pretokenized=True: InputSequence is expected to be Union[List[str], Tuple[str]]

  • is_pretokenized – bool: Whether the input is already pre-tokenized.

  • add_special_tokens – bool: Whether to add the special tokens while encoding.

Returns

An Encoding
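The word-level behavior described above (lowercasing, vocabulary lookup, a fixed max_length) can be illustrated with a minimal, self-contained sketch. This is not TextAttack's implementation: the class name ToyWordTokenizer, the whitespace splitting, the toy vocabulary, and the choice to return plain lists of IDs are all assumptions made for illustration.

```python
from typing import Dict, List


class ToyWordTokenizer:
    """Illustrative stand-in (hypothetical); not TextAttack's GloveTokenizer."""

    def __init__(self, word_id_map: Dict[str, int], pad_token_id: int,
                 unk_token_id: int, max_length: int = 256) -> None:
        self.word_id_map = dict(word_id_map)
        # Reverse map for turning IDs back into tokens.
        self.id_word_map = {i: w for w, i in self.word_id_map.items()}
        self.id_word_map[pad_token_id] = "[PAD]"
        self.id_word_map[unk_token_id] = "[UNK]"
        self.pad_token_id = pad_token_id
        self.unk_token_id = unk_token_id
        self.max_length = max_length

    def encode(self, text: str) -> List[int]:
        # Lowercase first, because the vocabulary is lowercased.
        words = text.lower().split()
        # Unknown words fall back to the unk ID.
        ids = [self.word_id_map.get(w, self.unk_token_id) for w in words]
        # Truncate, then pad, so every output has exactly max_length IDs.
        ids = ids[: self.max_length]
        return ids + [self.pad_token_id] * (self.max_length - len(ids))

    def batch_encode(self, input_text_list: List[str]) -> List[List[int]]:
        # The batch equivalent of encode.
        return [self.encode(text) for text in input_text_list]

    def convert_ids_to_tokens(self, ids: List[int]) -> List[str]:
        return [self.id_word_map.get(i, "[UNK]") for i in ids]


vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
tok = ToyWordTokenizer(vocab, pad_token_id=4, unk_token_id=5, max_length=6)
ids = tok.encode("The movie was GREAT fun")
print(ids)                             # [0, 1, 2, 3, 5, 4]
print(tok.convert_ids_to_tokens(ids))  # ['the', 'movie', 'was', 'great', '[UNK]', '[PAD]']
```

Padding to a fixed length keeps every encoded sequence the same size, which makes it straightforward to stack a batch into a single tensor.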

class textattack.models.tokenizers.glove_tokenizer.WordLevelTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, unk_token='[UNK]', sep_token='[SEP]', cls_token='[CLS]', pad_token='[PAD]', lowercase: bool = False, unicode_normalizer=None)

Bases: tokenizers.implementations.base_tokenizer.BaseTokenizer

WordLevelTokenizer.

Represents simple word-level tokenization using the internals of BERT’s tokenizer.

Based on the BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers/blob/704cf3fdd2f607ead58a561b892b510b49c301db/bindings/python/tokenizers/implementations/bert_wordpiece.py).

T5 Tokenizer

class textattack.models.tokenizers.t5_tokenizer.T5Tokenizer(mode='english_to_german', max_length=64)

Bases: object

Uses the T5 tokenizer to convert an input for processing.

For more information, please see the T5 paper, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. Appendix D contains information about the various tasks supported by T5.

Supports the following modes:

  • summarization: summarize English text

  • english_to_german: translate English to German

  • english_to_french: translate English to French

  • english_to_romanian: translate English to Romanian
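T5 selects a task by prepending a plain-text prefix to the input, as described in the T5 paper. The sketch below shows one way the modes listed above could map to such prefixes. The names T5_TASK_PREFIXES and add_task_prefix are hypothetical, and whether this class applies the prefixes in exactly this form is an assumption.

```python
# Hypothetical mapping from mode name to T5 task prefix (assumed, following
# the prefix convention described in the T5 paper).
T5_TASK_PREFIXES = {
    "summarization": "summarize: ",
    "english_to_german": "translate English to German: ",
    "english_to_french": "translate English to French: ",
    "english_to_romanian": "translate English to Romanian: ",
}


def add_task_prefix(text: str, mode: str = "english_to_german") -> str:
    """Prepend the task prefix for `mode` to `text` (illustrative helper)."""
    if mode not in T5_TASK_PREFIXES:
        supported = ", ".join(sorted(T5_TASK_PREFIXES))
        raise ValueError(f"Unsupported mode {mode!r}; expected one of: {supported}")
    return T5_TASK_PREFIXES[mode] + text


print(add_task_prefix("The house is small.", mode="english_to_french"))
# translate English to French: The house is small.
```

The prefixed string is what would then be passed to the underlying subword tokenizer and truncated or padded to max_length.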

decode(ids)

Converts IDs (typically generated by the model) back to a string.