textattack.models.tokenizers package
Tokenizers for Model Wrapper
Glove Tokenizer
- class textattack.models.tokenizers.glove_tokenizer.GloveTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, max_length=256)[source]
Bases: WordLevelTokenizer
A word-level tokenizer with GloVe 200-dimensional vectors.
Input text is lowercased, since the GloVe vocabulary is lowercase.
- encode(text)[source]
Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
- Parameters:
sequence – InputSequence: The sequence to encode. It can be either raw text or pre-tokenized, according to the is_pretokenized argument:
- If is_pretokenized=False: InputSequence is expected to be str
- If is_pretokenized=True: InputSequence is expected to be Union[List[str], Tuple[str]]
is_pretokenized – bool: Whether the input is already pre-tokenized.
add_special_tokens – bool: Whether to add the special tokens while encoding.
- Returns:
An Encoding
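As a rough illustration of what word-level encoding against a word-to-ID map involves, the sketch below lowercases the text, splits it on whitespace, maps unknown words to an unknown-token ID, and pads or truncates to a fixed length. The helper `encode_words` and the toy vocabulary are hypothetical and are not part of TextAttack; the real tokenizer may split and pad differently.

```python
def encode_words(text, word_id_map, unk_token_id, pad_token_id, max_length=256):
    """Hypothetical helper (not TextAttack's API): word-level encoding sketch."""
    # Lowercase first, mirroring the lowercase GloVe vocabulary.
    words = text.lower().split()
    # Words missing from the map fall back to the unknown-token ID.
    ids = [word_id_map.get(word, unk_token_id) for word in words]
    # Truncate to max_length, then pad on the right up to max_length.
    ids = ids[:max_length]
    ids += [pad_token_id] * (max_length - len(ids))
    return ids


# Toy vocabulary, assumed purely for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2}
ids = encode_words("The cat SAT down", vocab, unk_token_id=3, pad_token_id=4, max_length=6)
print(ids)  # [0, 1, 2, 3, 4, 4]
```

Here "down" is not in the toy vocabulary, so it maps to the unknown-token ID, and the last two positions are padding.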
- class textattack.models.tokenizers.glove_tokenizer.WordLevelTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, unk_token='[UNK]', sep_token='[SEP]', cls_token='[CLS]', pad_token='[PAD]', lowercase: bool = False, unicode_normalizer=None)[source]
Bases: BaseTokenizer
WordLevelTokenizer.
Represents simple word-level tokenization using the internals of BERT’s tokenizer.
Based on the BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers/blob/704cf3fdd2f607ead58a561b892b510b49c301db/bindings/python/tokenizers/implementations/bert_wordpiece.py).
T5 Tokenizer
- class textattack.models.tokenizers.t5_tokenizer.T5Tokenizer(mode='english_to_german', max_length=64)[source]
Bases: object
Uses the T5 tokenizer to convert an input for processing.
For more information, please see the T5 paper, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. Appendix D contains information about the various tasks supported by T5.
Supports the following modes:
summarization: summarize English text
english_to_german: translate English to German
english_to_french: translate English to French
english_to_romanian: translate English to Romanian
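T5 selects a task by prepending a short text prefix to the input, as described in the T5 paper. The sketch below shows how the modes listed above could map to such prefixes. The prefix strings follow the T5 paper's convention; the names `TASK_PREFIXES` and `add_task_prefix` are hypothetical, and whether this class uses exactly these strings internally is an assumption.

```python
# Hypothetical mapping from mode name to T5 task prefix (not TextAttack's API).
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "english_to_german": "translate English to German: ",
    "english_to_french": "translate English to French: ",
    "english_to_romanian": "translate English to Romanian: ",
}


def add_task_prefix(text, mode="english_to_german"):
    """Prepend the task prefix for the given mode to the input text."""
    if mode not in TASK_PREFIXES:
        raise ValueError(f"Unsupported mode: {mode!r}")
    return TASK_PREFIXES[mode] + text


prefixed = add_task_prefix("The house is small.", mode="english_to_french")
print(prefixed)  # translate English to French: The house is small.
```

The prefixed string is what would then be passed to the underlying T5 tokenizer and truncated to max_length.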