textattack.transformations.word_merges package

Word Merge

Word Merge transformations take two adjacent words and “merge” them into one word by deleting one word and replacing the other. For example, we can merge the words “the” and “movie” in the text “I like the movie” to get the following text: “I like film”. When we choose to “merge” the word at index i, we merge it with the next word, at index i+1.
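The merge operation described above can be sketched on a plain list of words. This is only an illustration of the idea, not TextAttack's implementation: the helper name merge_words and its arguments are hypothetical, and the replacement word is supplied by hand rather than predicted by a model.

```python
def merge_words(words, i, replacement):
    """Merge words[i] and words[i + 1] into a single replacement word.

    Hypothetical helper for illustration only; not part of TextAttack.
    """
    if not 0 <= i < len(words) - 1:
        raise IndexError("the word at index i must have a following word to merge with")
    # Replace the word at index i and delete the word at index i + 1.
    return words[:i] + [replacement] + words[i + 2:]


words = "I like the movie".split()
merged = merge_words(words, 2, "film")
print(" ".join(merged))
```

Here the word at index 2 (“the”) is merged with the word at index 3 (“movie”), so the result is one word shorter than the input.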

class textattack.transformations.word_merges.word_merge.WordMerge[source]

Bases: textattack.transformations.transformation.Transformation

An abstract class for word merges.

class textattack.transformations.word_merges.word_merge_masked_lm.WordMergeMaskedLM(masked_language_model='bert-base-uncased', tokenizer=None, max_length=512, window_size=inf, max_candidates=50, min_confidence=0.0005, batch_size=16)[source]

Bases: textattack.transformations.transformation.Transformation

Generate potential merges of adjacent words using a masked language model.

Based on:

“CLARE: Contextualized Perturbation for Textual Adversarial Attack” (Li et al., 2020) https://arxiv.org/abs/2009.07502

  • masked_language_model (Union[str|transformers.AutoModelForMaskedLM]) – Either the name of a pretrained masked language model from the transformers model hub or the actual model. Default is bert-base-uncased.

  • tokenizer (obj) – The tokenizer of the corresponding model. If you pass the name of a pretrained model for masked_language_model, you can skip this argument, as the correct tokenizer can be inferred from the name. However, if you pass the actual model, you must provide a tokenizer.

  • max_length (int) – The max sequence length the masked language model is designed to work with. Default is 512.

  • window_size (int) – The number of surrounding words to include when predicting the top words. For each position to merge, we take window_size // 2 words to the left and window_size // 2 words to the right and pass the text within the window to the masked language model. Default is float(“inf”), which is equivalent to using the whole text.

  • max_candidates (int) – Maximum number of candidates to consider as replacements for each word. Replacements are ranked by the model’s confidence. Default is 50.

  • min_confidence (float) – Minimum confidence threshold each replacement word must pass. Default is 0.0005.
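The following sketch shows, under stated assumptions, how window_size, max_candidates and min_confidence could interact. It is not the library's code. The helpers window_around and filter_candidates are hypothetical, the window is centred on a single index for simplicity, the candidate scores are made up, and treating a score equal to the threshold as passing is an assumption.

```python
import math


def window_around(words, index, window_size=math.inf):
    """Return the words within window_size // 2 of `index` on each side.

    Hypothetical helper; an infinite window returns the whole text.
    """
    if window_size == math.inf:
        return list(words)
    half = int(window_size) // 2
    start = max(0, index - half)
    end = min(len(words), index + half + 1)
    return words[start:end]


def filter_candidates(scored, max_candidates=50, min_confidence=0.0005):
    """Rank (word, confidence) pairs and keep the top ones above the threshold.

    Hypothetical helper; assumes a score equal to min_confidence passes.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(w, p) for w, p in ranked[:max_candidates] if p >= min_confidence]


words = "the quick brown fox jumps over the lazy dog".split()
print(window_around(words, 4, window_size=4))

# Made-up confidence scores, for illustration only.
scored = [("film", 0.6), ("it", 0.0001), ("this", 0.2)]
print(filter_candidates(scored, max_candidates=2))
```

A smaller window_size gives the masked language model less context but shorter inputs. A higher min_confidence or a lower max_candidates reduces the number of merged texts produced for each position.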

textattack.transformations.word_merges.word_merge_masked_lm.find_merge_index(token_tags, indices=None)[source]