textattack.models.tokenizers package
Tokenizers for Model Wrapper
Glove Tokenizer
- class textattack.models.tokenizers.glove_tokenizer.GloveTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, max_length=256)
Bases: textattack.models.tokenizers.glove_tokenizer.WordLevelTokenizer
A word-level tokenizer for use with 200-dimensional GloVe vectors.
Input text is lowercased, since the GloVe vectors are keyed by lowercased words.
- encode(text)
Encode the given sequence. This method accepts raw text as well as pre-tokenized sequences.
- Parameters
sequence – InputSequence: The sequence to encode. It can be either raw text or pre-tokenized, according to the is_pretokenized argument:
- If is_pretokenized=False: InputSequence is expected to be str
- If is_pretokenized=True: InputSequence is expected to be Union[List[str], Tuple[str]]
is_pretokenized – bool: Whether the input is already pre-tokenized.
add_special_tokens – bool: Whether to add the special tokens while encoding.
- Returns
An Encoding
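As a rough illustration of what word-level encoding involves, the sketch below lowercases the text, splits it on whitespace, looks each word up in a word-to-id map, and then truncates and pads to a fixed length. It is not the library's implementation: the helper name encode_words and the toy vocabulary are hypothetical, and the exact splitting, truncation, and padding behavior of GloveTokenizer is assumed here rather than taken from its source.

```python
from typing import Dict, List


def encode_words(
    text: str,
    word_id_map: Dict[str, int],
    unk_token_id: int,
    pad_token_id: int,
    max_length: int,
) -> List[int]:
    """Hypothetical helper: map whitespace-separated words to ids.

    Assumed behavior, for illustration only:
    lowercase -> split -> id lookup -> truncate -> pad.
    """
    # Lowercase first, mirroring the lowercased GloVe vocabulary.
    words = text.lower().split()
    # Words missing from the map fall back to the unknown-token id.
    ids = [word_id_map.get(word, unk_token_id) for word in words]
    # Truncate to max_length, then right-pad up to max_length.
    ids = ids[:max_length]
    ids += [pad_token_id] * (max_length - len(ids))
    return ids


# Toy vocabulary (made up for this example).
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
ids = encode_words(
    "The movie was GREAT fun", vocab, unk_token_id=4, pad_token_id=5, max_length=8
)
print(ids)  # [0, 1, 2, 3, 4, 5, 5, 5]
```

Here "fun" is not in the toy vocabulary, so it maps to the unknown-token id, and the remaining three positions are filled with the padding id.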
- class textattack.models.tokenizers.glove_tokenizer.WordLevelTokenizer(word_id_map={}, pad_token_id=None, unk_token_id=None, unk_token='[UNK]', sep_token='[SEP]', cls_token='[CLS]', pad_token='[PAD]', lowercase: bool = False, unicode_normalizer=None)
Bases: tokenizers.implementations.base_tokenizer.BaseTokenizer
A simple word-level tokenizer built on the internals of BERT’s tokenizer.
Based on the BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers/blob/704cf3fdd2f607ead58a561b892b510b49c301db/bindings/python/tokenizers/implementations/bert_wordpiece.py).
T5 Tokenizer
- class textattack.models.tokenizers.t5_tokenizer.T5Tokenizer(mode='english_to_german', max_length=64)
Bases: object
Uses the T5 tokenizer to convert an input for processing.
For more information, please see the T5 paper, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. Appendix D contains information about the various tasks supported by T5.
Supports the following modes:
- summarization: summarize English text
- english_to_german: translate English to German
- english_to_french: translate English to French
- english_to_romanian: translate English to Romanian
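T5 frames every task as text-to-text and selects the task with a short prefix prepended to the input, as described in the T5 paper. The sketch below shows that convention for the four modes listed above. The prefix strings follow the paper; the names TASK_PREFIXES and add_task_prefix are hypothetical, and it is an assumption that T5Tokenizer applies these exact strings internally.

```python
# Task prefixes following the convention in the T5 paper.
# Assumption: T5Tokenizer maps its modes to prefixes along these lines.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "english_to_german": "translate English to German: ",
    "english_to_french": "translate English to French: ",
    "english_to_romanian": "translate English to Romanian: ",
}


def add_task_prefix(text: str, mode: str = "english_to_german") -> str:
    """Hypothetical helper: prepend the T5 task prefix for the given mode."""
    if mode not in TASK_PREFIXES:
        raise ValueError(
            f"Unsupported mode {mode!r}; expected one of {sorted(TASK_PREFIXES)}"
        )
    return TASK_PREFIXES[mode] + text


prefixed = add_task_prefix("The house is wonderful.", mode="english_to_french")
print(prefixed)  # translate English to French: The house is wonderful.
```

The prefixed string is what would then be passed to the underlying T5 tokenizer and truncated to max_length.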