textattack.shared package
Shared TextAttack Functions
This package includes functions shared across packages.
- textattack.shared.utils package
LazyLoader
load_module_from_file()
download_from_s3()
download_from_url()
http_get()
path_in_cache()
s3_url()
set_cache_dir()
unzip_file()
get_textattack_model_num_labels()
hashable()
html_style_from_dict()
html_table_from_rows()
load_textattack_model_from_path()
set_seed()
sigmoid()
ANSI_ESCAPE_CODES
ANSI_ESCAPE_CODES.BOLD
ANSI_ESCAPE_CODES.BROWN
ANSI_ESCAPE_CODES.CYAN
ANSI_ESCAPE_CODES.FAIL
ANSI_ESCAPE_CODES.GRAY
ANSI_ESCAPE_CODES.HEADER
ANSI_ESCAPE_CODES.OKBLUE
ANSI_ESCAPE_CODES.OKGREEN
ANSI_ESCAPE_CODES.ORANGE
ANSI_ESCAPE_CODES.PINK
ANSI_ESCAPE_CODES.PURPLE
ANSI_ESCAPE_CODES.STOP
ANSI_ESCAPE_CODES.UNDERLINE
ANSI_ESCAPE_CODES.WARNING
ANSI_ESCAPE_CODES.YELLOW
ReprMixin
TextAttackFlairTokenizer
add_indent()
check_if_punctuations()
check_if_subword()
color_from_label()
color_from_output()
color_text()
default_class_repr()
flair_tag()
has_letter()
is_one_word()
process_label_name()
strip_BPE_artifacts()
words_from_text()
zip_flair_result()
zip_stanza_result()
batch_model_predict()
Attacked Text Class
A helper class that represents a string that can be attacked.
- class textattack.shared.attacked_text.AttackedText(text_input, attack_attrs=None)[source]
Bases:
object
A helper class that represents a string that can be attacked.
Models that take multiple sentences as input separate them by SPLIT_TOKEN. Attacks “see” the entire input, joined into one string, without the split token.
AttackedText instances that were perturbed from other AttackedText objects contain a pointer to the previous text (attack_attrs["previous_attacked_text"]), so that the full chain of perturbations might be reconstructed by using this key to form a linked list.
- Parameters:
text (string) – The string that this AttackedText represents
attack_attrs (dict) – Dictionary of various attributes stored during the course of an attack.
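The linked-list idea described above can be illustrated with a small sketch. It does not import TextAttack: `ToyText` and `perturbation_chain` are hypothetical stand-ins that only mimic the `attack_attrs["previous_attacked_text"]` pointer, so treat it as an illustration of the chain walk rather than the library's own code.

```python
class ToyText:
    """Hypothetical stand-in for AttackedText; it only mimics attack_attrs."""

    def __init__(self, text, attack_attrs=None):
        self.text = text
        self.attack_attrs = attack_attrs if attack_attrs is not None else {}


def perturbation_chain(attacked_text):
    """Follow the previous-text pointers back to the original input.

    Returns the texts ordered from the original input to ``attacked_text``.
    """
    chain = []
    current = attacked_text
    while current is not None:
        chain.append(current.text)
        # The key name comes from the documentation above; a missing key
        # means we have reached the unperturbed original.
        current = current.attack_attrs.get("previous_attacked_text")
    chain.reverse()
    return chain


original = ToyText("the movie was great")
step1 = ToyText("the film was great", {"previous_attacked_text": original})
step2 = ToyText("the film was fine", {"previous_attacked_text": step1})
history = perturbation_chain(step2)
```

Here `history` lists the three texts starting with the unperturbed input.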
- align_with_model_tokens(model_wrapper: ModelWrapper) → Dict[int, Iterable[int]][source]
Align AttackedText’s words with the target model’s tokenization scheme (e.g. word, character, subword). Specifically, we map each word to the list of indices of the tokens that compose the word (e.g. embedding -> [“em”, “##bed”, “##ding”]).
- Parameters:
model_wrapper (textattack.models.wrappers.ModelWrapper) – ModelWrapper of the target model
- Returns:
Dictionary that maps i-th word to list of indices.
- Return type:
word2token_mapping (dict[int, list[int]])
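As a rough illustration of the word-to-token mapping, the sketch below aligns words with WordPiece-style tokens. The function name `align_words_to_tokens` and the `##` continuation prefix handling are assumptions made for this example; the real method works through a ModelWrapper and its tokenizer, which this sketch does not reproduce.

```python
def align_words_to_tokens(words, tokens, subword_prefix="##"):
    """Map each word index to the indices of the tokens that spell it.

    Assumes tokens appear in the same order as the words and that
    continuation pieces carry ``subword_prefix`` (an assumption of this sketch).
    """
    mapping = {}
    token_idx = 0
    for word_idx, word in enumerate(words):
        indices = []
        built = ""
        target = word.lower()
        # Consume tokens until their concatenation spells the word.
        # If the tokens never match, the remaining tokens are all consumed.
        while token_idx < len(tokens) and built != target:
            piece = tokens[token_idx]
            if piece.startswith(subword_prefix):
                piece = piece[len(subword_prefix):]
            built += piece.lower()
            indices.append(token_idx)
            token_idx += 1
        mapping[word_idx] = indices
    return mapping


word2token = align_words_to_tokens(
    ["embedding", "layer"], ["em", "##bed", "##ding", "layer"]
)
```

With these inputs, word 0 maps to the first three token positions and word 1 to the last one.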
- all_words_diff(other_attacked_text: AttackedText) → Set[int][source]
Returns the set of indices for which this and other_attacked_text have different words.
- convert_from_original_idxs(idxs: Iterable[int]) → List[int][source]
Takes indices of words from original string and converts them to indices of the same words in the current string.
Uses information from
self.attack_attrs['original_index_map']
, which maps word indices from the original to perturbed text.
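One way to picture an original-to-current index map is sketched below. The list layout, the helper names, and the use of `-1` for a deleted word are all assumptions of this example and are not taken from the documentation above.

```python
def delete_and_update_map(index_map, deleted_idx):
    """Return a new map after the word at current index ``deleted_idx`` is removed.

    ``index_map[i]`` holds the current position of original word ``i``.
    """
    updated = []
    for current in index_map:
        if current == deleted_idx:
            updated.append(-1)  # assumption: -1 marks a word that no longer exists
        elif current > deleted_idx:
            updated.append(current - 1)  # later words shift left by one
        else:
            updated.append(current)
    return updated


def convert_from_original(index_map, idxs):
    """Translate original word indices into indices in the current text."""
    return [index_map[i] for i in idxs]


index_map = [0, 1, 2, 3]  # four words, nothing perturbed yet
index_map = delete_and_update_map(index_map, 1)
converted = convert_from_original(index_map, [0, 2, 3])
```

After deleting the second word, the original words 2 and 3 are found one position earlier in the current text.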
- delete_word_at_index(index: int) → AttackedText[source]
Returns a new AttackedText object where the word at
index
is removed.
- first_word_diff(other_attacked_text: AttackedText) → str | None[source]
Returns the first word in self.words that differs from other_attacked_text, or None if all words are the same.
Useful for word swap strategies.
- first_word_diff_index(other_attacked_text: AttackedText) → int | None[source]
Returns the index of the first word in self.words that differs from other_attacked_text.
Useful for word swap strategies.
- free_memory()[source]
Delete items that take up memory.
Can be called once the AttackedText is only needed to display.
- generate_new_attacked_text(new_words: Iterable[str]) → AttackedText[source]
Returns a new AttackedText object and replaces old list of words with a new list of words, but preserves the punctuation and spacing of the original message.
self.words
is a list of the words in the current text with punctuation removed. However, each “word” in
new_words
could be an empty string, representing a word deletion, or a string with multiple space-separated words, representing an insertion of one or more words.
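The behaviour described above (same number of entries in the new word list, with empty strings as deletions and multi-word strings as insertions) can be sketched as follows. `rebuild_text` is a hypothetical helper written for this example, and its spacing cleanup is deliberately simpler than whatever the library does.

```python
def rebuild_text(text, words, new_words):
    """Swap each word in ``text`` for its counterpart in ``new_words``.

    Punctuation and spacing between words are copied from the original text.
    """
    pieces = []
    rest = text
    for old, new in zip(words, new_words):
        start = rest.index(old)        # locate the next original word
        pieces.append(rest[:start])    # keep punctuation/spacing before it
        pieces.append(new)
        rest = rest[start + len(old):]
        if new == "":
            # Simplification: drop the space that followed a deleted word.
            rest = rest.lstrip(" ")
    pieces.append(rest)
    return "".join(pieces)


text = "Hello, big world!"
words = ["Hello", "big", "world"]
deleted = rebuild_text(text, words, ["Hello", "", "world"])
inserted = rebuild_text(text, words, ["Hello", "very big", "world"])
```

The comma and exclamation mark survive in both results because only the word spans are rewritten.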
- insert_text_after_word_index(index: int, text: str) → AttackedText[source]
Inserts a string after word at index
index
and attempts to add appropriate spacing.
- insert_text_before_word_index(index: int, text: str) → AttackedText[source]
Inserts a string before word at index
index
and attempts to add appropriate spacing.
- ith_word_diff(other_attacked_text: AttackedText, i: int) → bool[source]
Returns bool representing whether the word at index i differs from other_attacked_text.
- ner_of_word_index(desired_word_idx: int, model_name='ner') → str[source]
Returns the NER tag of the word at index desired_word_idx.
Uses FLAIR NER tagger.
Throws: ValueError, if no NER tag found for index.
- pos_of_word_index(desired_word_idx: int) → str[source]
Returns the part-of-speech of the word at index desired_word_idx.
Uses FLAIR part-of-speech tagger.
Throws: ValueError, if no POS tag found for index.
- printable_text(key_color='bold', key_color_method=None) → str[source]
Represents full text input. Adds field descriptions.
- For example, entailment inputs look like:
` premise: ... hypothesis: ... `
- replace_word_at_index(index: int, new_word: str) → AttackedText[source]
Returns a new AttackedText object where the word at
index
is replaced with a new word.
- replace_words_at_indices(indices: Iterable[int], new_words: Iterable[str]) → AttackedText[source]
Returns a new AttackedText object where the words at
indices
are replaced with the corresponding new words.
- text_until_word_index(i: int) → str[source]
Returns the text before the beginning of word at index
i
.
- text_window_around_index(index: int, window_size: int) → str[source]
The text window of
window_size
words centered around
index
.
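A window of words centered on an index might be computed as in the sketch below. It works on a plain list of words and clamps the window at both ends; the helper name `window_around` and the exact clamping rule are assumptions of this example, and the real method returns a slice of the original text rather than re-joined words.

```python
def window_around(words, index, window_size):
    """Return ``window_size`` words centered on ``index``, clamped to the text."""
    half = (window_size - 1) // 2
    start = max(0, index - half)
    end = min(len(words), start + window_size)
    # If the window ran past the end, shift it left to keep its size.
    start = max(0, end - window_size)
    return " ".join(words[start:end])


words = "the quick brown fox jumps over the lazy dog".split()
middle = window_around(words, 4, 3)
left_edge = window_around(words, 0, 3)
right_edge = window_around(words, 8, 3)
```

Near either end of the text the window is shifted inward instead of being truncated.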
- words_diff_num(other_attacked_text: AttackedText) → int[source]
The number of words different between two AttackedText objects.
- words_diff_ratio(x: AttackedText) → float[source]
Get the ratio of differing words between the current text and x.
Note that the current text and x must have the same number of words.
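The position-wise comparisons behind all_words_diff and words_diff_ratio can be sketched on plain word lists. These standalone functions are illustrative only: they borrow the method names but assume a simple index-by-index comparison, which may differ from how the library counts differences.

```python
def all_words_diff(words_a, words_b):
    """Indices at which the two word lists disagree (compared position-wise)."""
    return {i for i, (a, b) in enumerate(zip(words_a, words_b)) if a != b}


def words_diff_ratio(words_a, words_b):
    """Fraction of positions that differ; both lists must have the same length."""
    if len(words_a) != len(words_b):
        raise ValueError("both texts must have the same number of words")
    if not words_a:
        return 0.0
    return len(all_words_diff(words_a, words_b)) / len(words_a)


current = ["the", "movie", "was", "great"]
other = ["the", "film", "was", "fine"]
diff = all_words_diff(current, other)
ratio = words_diff_ratio(current, other)
```

Two of the four positions differ here, so the ratio is one half.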
- SPLIT_TOKEN = '<SPLIT>'
- property column_labels: List[str]
Returns the labels for this text’s columns.
For single-sequence inputs, this simply returns [‘text’].
- property newly_swapped_words: List[str]
- property num_words: int
Returns the number of words in the sequence.
- property text: str
Represents full text input.
Multiple inputs are joined with a line break.
- property tokenizer_input: Tuple[str]
The tuple of inputs to be passed to the tokenizer.
- property words: List[str]
- property words_per_input: List[List[str]]
Returns a list of lists of words corresponding to each input.
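To make the multi-input properties concrete, the sketch below splits a raw string on SPLIT_TOKEN and derives a display text and per-input word lists. It assumes a raw string in which the inputs are separated by the token, and it uses naive whitespace splitting for words; the helper names are hypothetical and the real class handles punctuation and column labels in ways not shown here.

```python
SPLIT_TOKEN = "<SPLIT>"  # value as documented for AttackedText.SPLIT_TOKEN


def split_inputs(raw):
    """Tuple of the individual inputs, in order."""
    return tuple(raw.split(SPLIT_TOKEN))


def display_text(raw):
    """Join the inputs with a line break, without the split token."""
    return "\n".join(split_inputs(raw))


def words_per_input(raw):
    """One list of words per input (naive whitespace tokenization)."""
    return [part.split() for part in split_inputs(raw)]


raw = "A man is sleeping<SPLIT>A person rests"
inputs = split_inputs(raw)
shown = display_text(raw)
per_input = words_per_input(raw)
```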
Misc Checkpoints
The AttackCheckpoint
class saves in-progress attacks and loads saved attacks from disk.
- class textattack.shared.checkpoint.AttackCheckpoint(attack_args, attack_log_manager, worklist, worklist_candidates, chkpt_time=None)[source]
Bases:
object
An object that stores necessary information for saving and loading checkpoints.
- Parameters:
attack_args (textattack.AttackArgs) – Arguments of the original attack
attack_log_manager (textattack.loggers.AttackLogManager) – Object for storing attack results
worklist (deque[int]) – List of examples that will be attacked. Examples are represented by their indices within the dataset.
worklist_candidates (int) – List of other available examples we can attack. Used to get the next dataset element when attack_n=True.
chkpt_time (float) – epoch time representing when checkpoint was made
- property dataset_offset
Calculate offset into the dataset to start from.
- property datetime
- property num_failed_attacks
- property num_maximized_attacks
- property num_remaining_attacks
- property num_skipped_attacks
- property num_successful_attacks
- property results_count
Return number of attacks made so far.
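The interplay between a worklist and its candidates can be pictured with the sketch below. It is an interpretation, not the library's loop: `run_worklist` and `should_skip` are hypothetical, and the rule that a skipped example is replaced by the next candidate is an assumption about what attack_n=True implies.

```python
from collections import deque


def run_worklist(worklist, worklist_candidates, should_skip):
    """Process dataset indices, pulling a replacement whenever one is skipped."""
    attacked, skipped = [], []
    while worklist:
        idx = worklist.popleft()
        if should_skip(idx):
            skipped.append(idx)
            # Assumed attack_n behaviour: keep the number of attacked
            # examples constant by drawing the next available candidate.
            if worklist_candidates:
                worklist.append(worklist_candidates.popleft())
        else:
            attacked.append(idx)
    return attacked, skipped


attacked, skipped = run_worklist(
    deque([0, 1, 2]), deque([3, 4]), should_skip=lambda idx: idx == 1
)
```

Because both collections are plain deques of integers, the remaining state at any point is small and easy to serialize, which is what a checkpoint needs.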
Shared data fields
Lists of named entities: countries, nationalities, cities.
Lists of person names, first and last.
Misc Validators
Validators ensure compatibility between search methods, transformations, constraints, and goal functions.
- textattack.shared.validators.transformation_consists_of(transformation, transformation_classes)[source]
Determines if
transformation
is or consists only of instances of a class in
transformation_classes
.
- textattack.shared.validators.transformation_consists_of_word_swaps(transformation)[source]
Determines if
transformation
is a word swap or consists of only word swaps.
- textattack.shared.validators.transformation_consists_of_word_swaps_and_deletions(transformation)[source]
Determines if
transformation
is a word swap or deletion, or consists of only word swaps and deletions.
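The "is or consists only of" check used by these three validators can be sketched with toy classes. `ToySwap`, `ToyDeletion`, `ToyComposite` and its `transformations` attribute are hypothetical stand-ins invented for this example; they are not the TextAttack transformation classes.

```python
class ToySwap:
    """Hypothetical stand-in for a word-swap transformation."""


class ToyDeletion:
    """Hypothetical stand-in for a word-deletion transformation."""


class ToyComposite:
    """Hypothetical container holding several transformations."""

    def __init__(self, transformations):
        self.transformations = transformations


def consists_of(transformation, transformation_classes):
    """True if the transformation is, or is composed only of, the given classes."""
    if isinstance(transformation, ToyComposite):
        return all(
            consists_of(t, transformation_classes)
            for t in transformation.transformations
        )
    return isinstance(transformation, tuple(transformation_classes))


mixed = ToyComposite([ToySwap(), ToyDeletion()])
only_swaps = consists_of(mixed, [ToySwap])
swaps_and_deletions = consists_of(mixed, [ToySwap, ToyDeletion])
```

A composite passes only when every member passes, which is why the mixed composite fails the swap-only check.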
- textattack.shared.validators.validate_model_goal_function_compatibility(goal_function_class, model_class)[source]
Determines if
model_class
is task-compatible with
goal_function_class
.
For example, a text-generative model like one intended for translation or summarization would not be compatible with a goal function that requires probability scores, like the UntargetedGoalFunction.
- textattack.shared.validators.validate_model_gradient_word_swap_compatibility(model)[source]
Determines if
model
is task-compatible with
GradientBasedWordSwap
.
We can only take the gradient with respect to an individual word if the model uses a word-based tokenizer.
Shared loads word embeddings and related distances
- class textattack.shared.word_embeddings.AbstractWordEmbedding[source]
Bases:
ReprMixin
,ABC
Abstract class representing word embedding used by TextAttack.
This class specifies all the methods that are required to be defined so that it can be used for transformations and constraints. For a custom word embedding not supported by TextAttack, please create a class that inherits from this class and implements the required methods. However, please first check if you can use the WordEmbedding class, which has a lot of internal methods implemented.
- abstract get_cos_sim(a, b)[source]
Return cosine similarity between vector for word a and vector for word b.
Since cosine similarity is symmetric, get_cos_sim(a,b) and get_cos_sim(b,a) should return the same value.
- Parameters:
a (Union[str|int]) – Either word or integer representing the id of the word
b (Union[str|int]) – Either word or integer representing the id of the word
- Returns:
cosine similarity
- Return type:
distance (float)
- abstract get_mse_dist(a, b)[source]
Return MSE distance between vector for word a and vector for word b.
Since this is a metric, get_mse_dist(a,b) and get_mse_dist(b,a) should return the same value.
- Parameters:
a (Union[str|int]) – Either word or integer representing the id of the word
b (Union[str|int]) – Either word or integer representing the id of the word
- Returns:
MSE (L2) distance
- Return type:
distance (float)
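The two quantities above can be written out in a few lines of plain Python. This sketch operates directly on vectors rather than on words or ids, and it averages the squared differences for the MSE; whether the library averages or sums them is not stated here, so that choice is an assumption.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mse_dist(u, v):
    """Mean of the squared component-wise differences (assumed averaging)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


u = [1.0, 0.0]
v = [1.0, 1.0]
similarity = cos_sim(u, v)
distance = mse_dist(u, v)
```

Both functions give the same value when their arguments are swapped, which is the symmetry the method descriptions rely on.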
- abstract index2word(index)[source]
Convert index to corresponding word.
- Parameters:
index (int)
- Returns:
word (str)
- abstract nearest_neighbours(index, topn)[source]
Get top-N nearest neighbours for a word.
- Parameters:
index (int) – ID of the word for which we’re finding the nearest neighbours
topn (int) – Used for specifying N nearest neighbours
- Returns:
List of indices of the nearest neighbours
- Return type:
neighbours (list[int])
- class textattack.shared.word_embeddings.GensimWordEmbedding(keyed_vectors)[source]
Bases:
AbstractWordEmbedding
Wraps Gensim’s models.keyedvectors module (https://radimrehurek.com/gensim/models/keyedvectors.html)
- get_cos_sim(a, b)[source]
Return cosine similarity between vector for word a and vector for word b.
Since cosine similarity is symmetric, get_cos_sim(a,b) and get_cos_sim(b,a) should return the same value.
- Parameters:
a (Union[str|int]) – Either word or integer representing the id of the word
b (Union[str|int]) – Either word or integer representing the id of the word
- Returns:
cosine similarity
- Return type:
distance (float)
- get_mse_dist(a, b)[source]
Return MSE distance between vector for word a and vector for word b.
Since this is a metric, get_mse_dist(a,b) and get_mse_dist(b,a) should return the same value.
- Parameters:
a (Union[str|int]) – Either word or integer representing the id of the word
b (Union[str|int]) – Either word or integer representing the id of the word
- Returns:
MSE (L2) distance
- Return type:
distance (float)
- index2word(index)[source]
Convert index to corresponding word.
- Parameters:
index (int)
- Returns:
word (str)
- nearest_neighbours(index, topn, return_words=True)[source]
Get top-N nearest neighbours for a word.
- Parameters:
index (int) – ID of the word for which we’re finding the nearest neighbours
topn (int) – Used for specifying N nearest neighbours
- Returns:
List of indices of the nearest neighbours
- Return type:
neighbours (list[int])
- class textattack.shared.word_embeddings.WordEmbedding(embedding_matrix, word2index, index2word, nn_matrix=None)[source]
Bases:
AbstractWordEmbedding
Object for loading word embeddings and related distances for TextAttack. This class has a lot of internal components (e.g. get cosine similarity) implemented. Consider using this class if you can provide the appropriate input data to create the object.
- Parameters:
embedding_matrix (ndarray) – 2-D array of shape N x D where N represents size of vocab and D is the dimension of embedding vectors.
word2index (Union[dict|object]) – dictionary (or a similar object) that maps word to its index within the embedding matrix.
index2word (Union[dict|object]) – dictionary (or a similar object) that maps index to its word.
nn_matrix (ndarray) – Matrix for precomputed nearest neighbours. It should be a 2-D integer array of shape N x K where N represents size of vocab and K is the top-K nearest neighbours. If this is set to None, we have to compute nearest neighbours on the fly for nearest_neighbours method, which is costly.
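The trade-off described for nn_matrix (pay once to precompute, then answer neighbour queries by lookup) is sketched below with plain lists instead of ndarrays. The helper names are hypothetical, squared Euclidean distance is an assumed choice of metric, and this sketch leaves each word out of its own neighbour row, which may not match the layout of the library's stored matrix.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def build_nn_matrix(embedding_matrix, k):
    """Brute-force N x K table of nearest-neighbour indices (O(N^2) work)."""
    nn_matrix = []
    for i, vec in enumerate(embedding_matrix):
        candidates = [
            (squared_distance(vec, other), j)
            for j, other in enumerate(embedding_matrix)
            if j != i  # assumption: a word is not listed as its own neighbour
        ]
        candidates.sort()  # ties are broken by the smaller index
        nn_matrix.append([j for _, j in candidates[:k]])
    return nn_matrix


def nearest_neighbours(nn_matrix, index, topn):
    """Constant-time lookup once the table exists."""
    return nn_matrix[index][:topn]


embedding_matrix = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
nn_matrix = build_nn_matrix(embedding_matrix, k=2)
closest_to_2 = nearest_neighbours(nn_matrix, index=2, topn=1)
```

Without such a table every query repeats the inner loop over the whole vocabulary, which is the cost the parameter description warns about.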
- static counterfitted_GLOVE_embedding()[source]
Returns a prebuilt counter-fitted GLOVE word embedding proposed by “Counter-fitting Word Vectors to Linguistic Constraints” (Mrkšić et al., 2016)
- get_cos_sim(a, b)[source]
Return cosine similarity between vector for word a and vector for word b.
Since cosine similarity is symmetric, get_cos_sim(a,b) and get_cos_sim(b,a) should return the same value.
- Parameters:
a (Union[str|int]) – Either word or integer representing the id of the word
b (Union[str|int]) – Either word or integer representing the id of the word
- Returns:
cosine similarity
- Return type:
distance (float)
- get_mse_dist(a, b)[source]
Return MSE distance between vector for word a and vector for word b.
Since this is a metric, get_mse_dist(a,b) and get_mse_dist(b,a) should return the same value.
- Parameters:
a (Union[str|int]) – Either word or integer representing the id of the word
b (Union[str|int]) – Either word or integer representing the id of the word
- Returns:
MSE (L2) distance
- Return type:
distance (float)
- index2word(index)[source]
Convert index to corresponding word.
- Parameters:
index (int)
- Returns:
word (str)
- nearest_neighbours(index, topn)[source]
Get top-N nearest neighbours for a word.
- Parameters:
index (int) – ID of the word for which we’re finding the nearest neighbours
topn (int) – Used for specifying N nearest neighbours
- Returns:
List of indices of the nearest neighbours
- Return type:
neighbours (list[int])
- word2index(word)[source]
Convert a word to its id (i.e. index of the word in the embedding matrix).
- Parameters:
word (str)
- Returns:
index (int)
- PATH = 'word_embeddings'