TextAttack Augmentation

Augmenting a dataset with TextAttack takes only a few lines of code. The Augmenter class exists for this purpose: it generates augmentations of a string or a list of strings. Augmentation can be done either in a Python script or from the command line.

Creating an Augmenter

The Augmenter class is essential for performing data augmentation with TextAttack. It takes four parameters, in the following order:

  1. transformation: any transformation implemented by TextAttack can be used to create an Augmenter. Note that if we want to apply multiple transformations at the same time, they first need to be wrapped in a CompositeTransformation class.

  2. constraints: constraints determine whether or not a given augmentation is valid, thereby improving the quality of the augmentations. The default augmenter has no constraints, but constraints can be supplied as a list to the Augmenter.

  3. pct_words_to_swap: percentage of words to swap per augmented example. The default is set to 0.1 (10%).

  4. transformations_per_example: maximum number of augmentations per input. The default is set to 1 (one augmented sentence per original input).
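
For instance, a minimal augmenter can be built from a single transformation alone, leaving all the other parameters at their defaults (no constraints, 10% of words swapped, one augmentation per input); a minimal sketch:

# Minimal sketch: an Augmenter with a single transformation and default settings
from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.augmentation import Augmenter

default_augmenter = Augmenter(transformation=WordSwapRandomCharacterDeletion())
# returns a list containing one augmented version of the input string
default_augmenter.augment('What I cannot create, I do not understand.')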

An example of creating one’s own augmenter is shown below. In this case, we create an augmenter with the WordSwapRandomCharacterDeletion and WordSwapQWERTY transformations and the RepeatModification and StopwordModification constraints. A maximum of 50% of the words can be perturbed, and 10 augmentations will be generated from each input sentence.

[1]:
# import transformations, constraints, and the Augmenter
from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.transformations import WordSwapQWERTY
from textattack.transformations import CompositeTransformation

from textattack.constraints.pre_transformation import RepeatModification
from textattack.constraints.pre_transformation import StopwordModification

from textattack.augmentation import Augmenter
[2]:
# Set up transformation using CompositeTransformation()
transformation = CompositeTransformation([WordSwapRandomCharacterDeletion(), WordSwapQWERTY()])
# Set up constraints
constraints = [RepeatModification(), StopwordModification()]
# Create augmenter with specified parameters
augmenter = Augmenter(transformation=transformation, constraints=constraints, pct_words_to_swap=0.5, transformations_per_example=10)
s = 'What I cannot create, I do not understand.'
# Augment!
augmenter.augment(s)
[2]:
['Ahat I camnot reate, I do not unerstand.',
 'Ahat I cwnnot crewte, I do not undefstand.',
 'Wat I camnot vreate, I do not undefstand.',
 'Wha I annot crate, I do not unerstand.',
 'Whaf I canno creatr, I do not ynderstand.',
 'Wtat I cannor dreate, I do not understwnd.',
 'Wuat I canno ceate, I do not unferstand.',
 'hat I cnnot ceate, I do not undersand.',
 'hat I cnnot cfeate, I do not undfrstand.',
 'hat I cwnnot crfate, I do not ujderstand.']
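
Since an augmenter accepts a string or a list of strings, a whole batch can be augmented at once. A small sketch, assuming the augment_many helper on Augmenter (a plain loop over augment would give the same result):

# Sketch: augment several sentences with the augmenter created above
sentences = ['What I cannot create, I do not understand.',
             'Simplicity is the ultimate sophistication.']
# augment_many is assumed here; it returns one list of augmentations per input
batched = augmenter.augment_many(sentences)
# equivalently: batched = [augmenter.augment(x) for x in sentences]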

Pre-built Augmentation Recipes

In addition to creating our own augmenter, we can also use pre-built augmentation recipes to perturb datasets. These recipes are implemented from published papers and are very convenient to use. The list of available recipes can be found here.
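
All of the recipe augmenters accept the same pct_words_to_swap and transformations_per_example arguments as the base Augmenter. As a quick sketch (the EmbeddingAugmenter recipe, which swaps words for neighbours in embedding space, is assumed here):

# Sketch: any recipe can be created and used like the custom augmenter above
from textattack.augmentation import EmbeddingAugmenter

embed_augmenter = EmbeddingAugmenter(pct_words_to_swap=0.2, transformations_per_example=3)
embed_augmenter.augment("I'd love to go to Japan but the tickets are 500 dollars")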

In the following example, we will use the CheckListAugmenter to showcase our augmentation recipes. The CheckListAugmenter augments words by using the transformation methods provided by CheckList INV testing, which combines Name Replacement, Location Replacement, Number Alteration, and Contraction/Extension. The original paper can be found here: “Beyond Accuracy: Behavioral Testing of NLP models with CheckList” (Ribeiro et al., 2020)

[3]:
# import the CheckListAugmenter
from textattack.augmentation import CheckListAugmenter
# Alter default values if desired
augmenter = CheckListAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)
s = "I'd love to go to Japan but the tickets are 500 dollars"
# Augment
augmenter.augment(s)
2021-06-09 16:58:41,816 --------------------------------------------------------------------------------
2021-06-09 16:58:41,817 The model key 'ner' now maps to 'https://huggingface.co/flair/ner-english' on the HuggingFace ModelHub
2021-06-09 16:58:41,817  - The most current version of the model is automatically downloaded from there.
2021-06-09 16:58:41,818  - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner/en-ner-conll03-v0.4.pt)
2021-06-09 16:58:41,818 --------------------------------------------------------------------------------
2021-06-09 16:58:41,906 loading file /u/lab/jy2ma/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
[3]:
['I would love to go to Chile but the tickets are 500 dollars',
 'I would love to go to Japan but the tickets are 500 dollars',
 'I would love to go to Japan but the tickets are 75 dollars',
 "I'd love to go to Oman but the tickets are 373 dollars",
 "I'd love to go to Vietnam but the tickets are 613 dollars"]

Note that the previous snippet of code is equivalent to running

textattack augment --recipe checklist --pct-words-to-swap .2 --transformations-per-example 5 --exclude-original --interactive

on the command line.

Here’s another example of using WordNetAugmenter:

[4]:
from textattack.augmentation import WordNetAugmenter
augmenter = WordNetAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)
s = "I'd love to go to Japan but the tickets are 500 dollars"
augmenter.augment(s)
[4]:
["I'd fuck to fit to Japan but the tickets are 500 dollars",
 "I'd know to cristal to Japan but the tickets are 500 dollars",
 "I'd love to depart to Japan but the tickets are D dollars",
 "I'd love to get to Nihon but the tickets are 500 dollars",
 "I'd love to work to Japan but the tickets are 500 buck"]

Conclusion

We have now gone through the basics of running Augmenter, either by creating a new augmenter from scratch or by using a pre-built augmentation recipe. This can be done in as few as four lines of code, so please give it a try if you haven’t already! 🐙