textattack.models.tokenizers package

Tokenizers for Model Wrapper

Glove Tokenizer

class textattack.models.tokenizers.glove_tokenizer.GloveTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, max_length=256)[source]

Bases: WordLevelTokenizer

A word-level tokenizer with GloVe 200-dimensional vectors.

Input text is lowercased, since the GloVe vocabulary is lowercased.


The batch equivalent of encode.


Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

  • sequence – InputSequence: The sequence to encode. This can be either raw text or pre-tokenized, according to the is_pretokenized argument:

    • If is_pretokenized=False: InputSequence is expected to be str

    • If is_pretokenized=True: InputSequence is expected to be Union[List[str], Tuple[str]]

  • is_pretokenized – bool: Whether the input is already pre-tokenized.

  • add_special_tokens – bool: Whether to add the special tokens while encoding.


Returns: An Encoding
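
The word-level encoding described above can be illustrated with a small, library-free sketch. The helper names `toy_encode` and `toy_batch_encode`, the whitespace splitting, and the toy vocabulary are assumptions made for illustration only. This is not TextAttack's implementation, which returns an Encoding object rather than a plain list; the sketch also assumes that sequences are truncated and right-padded to max_length.

```python
from typing import Dict, List


def toy_encode(text: str, word_id_map: Dict[str, int], pad_token_id: int,
               unk_token_id: int, max_length: int = 256) -> List[int]:
    """Hypothetical stand-in for word-level encoding (not the library code)."""
    # Lowercase first, because the GloVe vocabulary is lowercased.
    words = text.lower().split()
    # Words missing from the vocabulary fall back to the unknown-token id.
    ids = [word_id_map.get(word, unk_token_id) for word in words]
    # Truncate to max_length, then right-pad with the padding id.
    ids = ids[:max_length]
    return ids + [pad_token_id] * (max_length - len(ids))


def toy_batch_encode(texts: List[str], word_id_map: Dict[str, int],
                     pad_token_id: int, unk_token_id: int,
                     max_length: int = 256) -> List[List[int]]:
    """Batch equivalent: apply toy_encode to each text in turn."""
    return [toy_encode(t, word_id_map, pad_token_id, unk_token_id, max_length)
            for t in texts]


# Toy vocabulary and ids, chosen only for this example.
vocab = {"the": 0, "cat": 1, "sat": 2}
PAD_ID, UNK_ID = 3, 4

print(toy_encode("The cat slept", vocab, PAD_ID, UNK_ID, max_length=5))
# -> [0, 1, 4, 3, 3]
```

Here "slept" is not in the toy vocabulary, so it maps to the unknown id, and the result is padded to the requested length of 5.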

class textattack.models.tokenizers.glove_tokenizer.WordLevelTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, unk_token='[UNK]', sep_token='[SEP]', cls_token='[CLS]', pad_token='[PAD]', lowercase: bool = False, unicode_normalizer=None)[source]

Bases: BaseTokenizer


Represents a simple word-level tokenization using the internals of BERT’s tokenizer.

Based on BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers/blob/704cf3fdd2f607ead58a561b892b510b49c301db/bindings/python/tokenizers/implementations/bert_wordpiece.py).

T5 Tokenizer

class textattack.models.tokenizers.t5_tokenizer.T5Tokenizer(mode='english_to_german', max_length=64)[source]

Bases: object

Uses the T5 tokenizer to convert an input for processing.

For more information, please see the T5 paper, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. Appendix D contains information about the various tasks supported by T5.

Supports the following modes:

  • summarization: summarize English text

  • english_to_german: translate English to German

  • english_to_french: translate English to French

  • english_to_romanian: translate English to Romanian
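
The modes listed above correspond to the task prefixes described in the T5 paper, where a short text prefix tells the model which task to perform. A minimal sketch of that mapping follows. The names `TASK_PREFIXES` and `add_task_prefix` are hypothetical and chosen for illustration; they are not taken from the TextAttack source, and the actual class also tokenizes the prefixed text.

```python
# Task prefixes as given in the T5 paper; the mode names mirror the list above.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "english_to_german": "translate English to German: ",
    "english_to_french": "translate English to French: ",
    "english_to_romanian": "translate English to Romanian: ",
}


def add_task_prefix(text: str, mode: str = "english_to_german") -> str:
    """Hypothetical helper: prepend the T5 task prefix for the given mode."""
    if mode not in TASK_PREFIXES:
        raise ValueError(f"Unsupported mode: {mode!r}")
    return TASK_PREFIXES[mode] + text


print(add_task_prefix("That is good."))
# -> translate English to German: That is good.
```

Keeping the prefixes in a single dictionary makes an unsupported mode fail early with a clear error instead of silently producing unprefixed input.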


Converts IDs (typically generated by the model) back to a string.