TextAttack Augmentation


Please remember to run the following in your notebook environment before running the tutorial code:

pip3 install textattack[tensorflow]

Augmenting a dataset using TextAttack requires only a few lines of code. The Augmenter class exists for this purpose: it generates augmentations of a string or a list of strings. Augmentation can be done either in a Python script or from the command line.

Creating an Augmenter

The Augmenter class is essential for performing data augmentation using TextAttack. It takes in four parameters in the following order:

  1. transformation: any transformation implemented by TextAttack can be used to create an Augmenter. Note that to apply multiple transformations at the same time, they must first be wrapped in a CompositeTransformation.

  2. constraints: constraints determine whether or not a given augmentation is valid, consequently enhancing the quality of the augmentations. The default augmenter does not have any constraints, but constraints can be supplied as a list to the Augmenter.

  3. pct_words_to_swap: percentage of words to swap per augmented example. The default is set to 0.1 (10%).

  4. transformations_per_example: maximum number of augmentations per input. The default is set to 1 (one augmented sentence given one original input).
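To make these four parameters concrete, here is a minimal, dependency-free sketch of the augmentation loop. All names here are illustrative; TextAttack’s actual implementation is more involved:

```python
import math
import random

def sketch_augment(text, transform_word, swap_is_valid,
                   pct_words_to_swap=0.1, transformations_per_example=1,
                   seed=0):
    """Toy version of an Augmenter: perturb a fraction of the words in
    `text`, keeping only the swaps that the constraint accepts."""
    rng = random.Random(seed)
    words = text.split()
    n_swaps = max(1, math.ceil(pct_words_to_swap * len(words)))
    augmentations = []
    for _ in range(transformations_per_example):
        new_words = list(words)
        # pick which word positions to perturb
        for i in rng.sample(range(len(words)), n_swaps):
            candidate = transform_word(words[i])
            # the "constraint" decides whether this swap is kept
            if swap_is_valid(words[i], candidate):
                new_words[i] = candidate
        augmentations.append(" ".join(new_words))
    return augmentations

# Uppercasing stands in for a transformation; the constraint plays the
# role of a stopword constraint by refusing to touch "the".
outputs = sketch_augment(
    "the cat sat on the mat",
    transform_word=str.upper,
    swap_is_valid=lambda orig, new: orig != "the",
    pct_words_to_swap=0.5,
    transformations_per_example=2,
)
```

With pct_words_to_swap=0.5, three of the six words are selected per example, and transformations_per_example=2 yields two augmented sentences.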

An example of creating one’s own augmenter is shown below. In this case, we are creating an augmenter with the WordSwapRandomCharacterDeletion and WordSwapQWERTY transformations and the RepeatModification and StopwordModification constraints. A maximum of 50% of the words can be perturbed, and 10 augmentations will be generated from each input sentence.

[ ]:
# import transformations, constraints, and the Augmenter
from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.transformations import WordSwapQWERTY
from textattack.transformations import CompositeTransformation

from textattack.constraints.pre_transformation import RepeatModification
from textattack.constraints.pre_transformation import StopwordModification

from textattack.augmentation import Augmenter
[ ]:
# Set up transformation using CompositeTransformation()
transformation = CompositeTransformation(
    [WordSwapRandomCharacterDeletion(), WordSwapQWERTY()]
)
# Set up constraints
constraints = [RepeatModification(), StopwordModification()]
# Create augmenter with specified parameters
augmenter = Augmenter(
    transformation=transformation,
    constraints=constraints,
    pct_words_to_swap=0.5,
    transformations_per_example=10,
)
s = "What I cannot create, I do not understand."
# Augment!
augmenter.augment(s)
['Ahat I camnot reate, I do not unerstand.',
 'Ahat I cwnnot crewte, I do not undefstand.',
 'Wat I camnot vreate, I do not undefstand.',
 'Wha I annot crate, I do not unerstand.',
 'Whaf I canno creatr, I do not ynderstand.',
 'Wtat I cannor dreate, I do not understwnd.',
 'Wuat I canno ceate, I do not unferstand.',
 'hat I cnnot ceate, I do not undersand.',
 'hat I cnnot cfeate, I do not undfrstand.',
 'hat I cwnnot crfate, I do not ujderstand.']

Pre-built Augmentation Recipes

In addition to creating our own augmenter, we can also use pre-built augmentation recipes to perturb datasets. These recipes are implemented from published papers and are very convenient to use. The list of available recipes can be found here.

In the following example, we will use the CheckListAugmenter to showcase our augmentation recipes. The CheckListAugmenter augments words by using the transformation methods provided by CheckList INV testing, which combines Name Replacement, Location Replacement, Number Alteration, and Contraction/Extension. The original paper can be found here: “Beyond Accuracy: Behavioral Testing of NLP models with CheckList” (Ribeiro et al., 2020)

[ ]:
# import the CheckListAugmenter
from textattack.augmentation import CheckListAugmenter

# Alter default values if desired
augmenter = CheckListAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)
s = "I'd love to go to Japan but the tickets are 500 dollars"
# Augment
augmenter.augment(s)
2021-06-09 16:58:41,816 --------------------------------------------------------------------------------
2021-06-09 16:58:41,817 The model key 'ner' now maps to 'https://huggingface.co/flair/ner-english' on the HuggingFace ModelHub
2021-06-09 16:58:41,817  - The most current version of the model is automatically downloaded from there.
2021-06-09 16:58:41,818  - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner/en-ner-conll03-v0.4.pt)
2021-06-09 16:58:41,818 --------------------------------------------------------------------------------
2021-06-09 16:58:41,906 loading file /u/lab/jy2ma/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
['I would love to go to Chile but the tickets are 500 dollars',
 'I would love to go to Japan but the tickets are 500 dollars',
 'I would love to go to Japan but the tickets are 75 dollars',
 "I'd love to go to Oman but the tickets are 373 dollars",
 "I'd love to go to Vietnam but the tickets are 613 dollars"]

Note that the previous snippet of code is equivalent to running

textattack augment --recipe checklist --pct-words-to-swap .2 --transformations-per-example 5 --exclude-original --interactive

on the command line.
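One of the CheckList perturbations used above, contraction/extension, can be sketched with a simple lookup table (a toy table, not TextAttack’s actual word list):

```python
# Toy contraction table; the real CheckList transformation covers
# many more forms and also works in the opposite direction
# (extension back to contraction).
EXPANSIONS = {"I'd": "I would", "don't": "do not", "it's": "it is"}

def expand_contractions(sentence):
    """Replace each contraction with its expanded form, word by word."""
    return " ".join(EXPANSIONS.get(w, w) for w in sentence.split())

expanded = expand_contractions("I'd love to go but it's far")
# "I'd" -> "I would" and "it's" -> "it is", mirroring the
# "I'd" -> "I would" swaps seen in the CheckListAugmenter output
```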

Here’s another example, this time using the WordNetAugmenter. In this scenario, we enable enable_advanced_metrics to obtain perplexity and USE scores, and high_yield to generate more examples in the same running time:

[9]:
from textattack.augmentation import WordNetAugmenter

augmenter = WordNetAugmenter(
    pct_words_to_swap=0.4,
    transformations_per_example=5,
    high_yield=True,
    enable_advanced_metrics=True,
)
s = "I'd love to go to Japan but the tickets are 500 dollars"
results = augmenter.augment(s)
print(f"Average Original Perplexity Score: {results[1]['avg_original_perplexity']}\n")
print(f"Average Augment Perplexity Score: {results[1]['avg_attack_perplexity']}\n")
print(f"Average Augment USE Score: {results[2]['avg_attack_use_score']}\n")
print("Augmentations:")
results[0]
Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 1024). Running this sequence through the model will result in indexing errors
Average Original Perplexity Score: 1.09

Average Augment Perplexity Score: 3.17

Average Augment USE Score: 0.72

Augmentations:
[9]:
["I'd bang to operate to Japan but the ticket are 500 buck",
 "I'd bang to plump to Nihon but the tickets are 500 clam",
 "I'd bed to operate to Japan but the ticket are 500 buck",
 "I'd bed to plump to Nihon but the tickets are 500 clam",
 "I'd beloved to operate to Japan but the ticket are 500 buck",
 "I'd beloved to plump to Nihon but the tickets are 500 clam",
 "I'd bonk to operate to Japan but the ticket are 500 buck",
 "I'd bonk to plump to Nihon but the tickets are 500 clam",
 "I'd bonk to travel to Japan but the tag are 500 buck",
 "I'd bonk to travel to Japan but the tag are 500 clam",
 "I'd bonk to travel to Japan but the tag are 500 dollar",
 "I'd bonk to travel to Japan but the tag are 500 dollars",
 "I'd bonk to travel to Japan but the tag are D dollars",
 "I'd bonk to travel to Japan but the tag are d dollars",
 "I'd bonk to travel to Nihon but the tag are 500 dollars",
 "I'd bonk to travel to Nippon but the tag are 500 dollars",
 "I'd bonk to travel to japan but the tag are 500 dollars",
 "I'd dear to operate to Japan but the ticket are 500 buck",
 "I'd dear to plump to Nihon but the tickets are 500 clam",
 "I'd dearest to operate to Japan but the ticket are 500 buck",
 "I'd dearest to plump to Nihon but the tickets are 500 clam",
 "I'd eff to operate to Japan but the ticket are 500 buck",
 "I'd eff to plump to Nihon but the tickets are 500 clam",
 "I'd enjoy to exit to Japan but the fine are 500 buck",
 "I'd enjoy to exit to Japan but the slate are 500 buck",
 "I'd enjoy to exit to Japan but the tag are 500 buck",
 "I'd enjoy to exit to Japan but the ticket are 500 buck",
 "I'd enjoy to exit to Japan but the tickets are 500 buck",
 "I'd enjoy to exit to Japan but the tickets are D buck",
 "I'd enjoy to exit to Japan but the tickets are d buck",
 "I'd enjoy to exit to Nihon but the tickets are 500 buck",
 "I'd enjoy to exit to Nippon but the tickets are 500 buck",
 "I'd enjoy to exit to japan but the tickets are 500 buck",
 "I'd enjoy to operate to Japan but the ticket are 500 buck",
 "I'd enjoy to plump to Nihon but the tickets are 500 clam",
 "I'd fuck to operate to Japan but the ticket are 500 buck",
 "I'd fuck to plump to Nihon but the tickets are 500 clam",
 "I'd honey to operate to Japan but the ticket are 500 buck",
 "I'd honey to plump to Nihon but the tickets are 500 clam",
 "I'd hump to operate to Japan but the ticket are 500 buck",
 "I'd hump to plump to Nihon but the tickets are 500 clam",
 "I'd jazz to operate to Japan but the ticket are 500 buck",
 "I'd jazz to plump to Nihon but the tickets are 500 clam",
 "I'd know to operate to Japan but the ticket are 500 buck",
 "I'd know to plump to Nihon but the tickets are 500 clam",
 "I'd love to operate to Japan but the ticket are 500 buck",
 "I'd love to operate to Japan but the ticket are D buck",
 "I'd love to operate to Japan but the ticket are d buck",
 "I'd love to operate to Nihon but the ticket are 500 buck",
 "I'd love to operate to Nippon but the ticket are 500 buck",
 "I'd love to operate to japan but the ticket are 500 buck",
 "I'd love to plump to Nihon but the fine are 500 clam",
 "I'd love to plump to Nihon but the slate are 500 clam",
 "I'd love to plump to Nihon but the tag are 500 clam",
 "I'd love to plump to Nihon but the ticket are 500 clam",
 "I'd love to plump to Nihon but the tickets are 500 clam",
 "I'd love to plump to Nihon but the tickets are D clam",
 "I'd love to plump to Nihon but the tickets are d clam",
 "I'd lovemaking to operate to Japan but the ticket are 500 buck",
 "I'd lovemaking to plump to Nihon but the tickets are 500 clam",
 "I'd passion to fit to Japan but the fine are 500 buck",
 "I'd passion to fit to Japan but the fine are 500 clam",
 "I'd passion to fit to Japan but the fine are 500 dollar",
 "I'd passion to fit to Japan but the fine are 500 dollars",
 "I'd passion to fit to Japan but the fine are D dollars",
 "I'd passion to fit to Japan but the fine are d dollars",
 "I'd passion to fit to Nihon but the fine are 500 dollars",
 "I'd passion to fit to Nippon but the fine are 500 dollars",
 "I'd passion to fit to japan but the fine are 500 dollars",
 "I'd passion to operate to Japan but the ticket are 500 buck",
 "I'd passion to plump to Nihon but the tickets are 500 clam",
 "I'd screw to operate to Japan but the ticket are 500 buck",
 "I'd screw to plump to Nihon but the tickets are 500 clam"]
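The odd swaps above (“love” → “bonk”, “tickets” → “tag”) happen because WordNet pools synonyms across every sense of a word, with no context check. A toy version of that behavior, using a small hand-written synonym table in place of the real WordNet:

```python
import random

# Hand-picked stand-in for WordNet synsets: each entry pools synonyms
# from all senses of the word, which is exactly why context-free swaps
# like "love" -> "bonk" show up in the real recipe.
SYNONYMS = {
    "love": ["enjoy", "bonk", "passion"],
    "go": ["travel", "operate", "exit"],
    "tickets": ["slate", "fine", "tag"],
    "dollars": ["buck", "clam"],
}

def wordnet_style_swap(sentence, seed=0):
    """Replace every word that has a table entry with a random synonym."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
        for w in sentence.split()
    )

swapped = wordnet_style_swap("I love to go to Japan")
```

Words without an entry (here, “I” and “Japan”) pass through unchanged, while “love” and “go” are replaced sense-blindly, just as in the WordNetAugmenter output above.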

Conclusion

We have now gone through the basics of running an Augmenter, either by creating one from scratch or by using a pre-built recipe. This can be done in as few as four lines of code, so please give it a try if you haven’t already! 🐙