Augmenting a dataset using TextAttack requires only a few lines of code. The
Augmenter class exists for this purpose: it generates augmentations of a string or a list of strings. Augmentation can be done either in a Python script or from the command line.
Creating an Augmenter¶
The Augmenter class is essential for performing data augmentation using TextAttack. It takes four parameters, in the following order:
transformation: any transformation implemented by TextAttack can be used to create an
Augmenter. Note that if we want to apply multiple transformations at the same time, they first need to be combined into a CompositeTransformation.
constraints: constraints determine whether or not a given augmentation is valid, which improves the quality of the augmentations. The default augmenter has no constraints, but constraints can be supplied as a list to the Augmenter.
pct_words_to_swap: percentage of words to swap per augmented example. The default is set to 0.1 (10%).
transformations_per_example: maximum number of augmentations per input. The default is set to 1 (one augmented sentence given one original input).
An example of creating one’s own augmenter is shown below. In this case, we create an augmenter with the WordSwapRandomCharacterDeletion and WordSwapQWERTY transformations and the RepeatModification and StopwordModification constraints. A maximum of 50% of the words can be perturbed, and 10 augmentations will be generated from each input sentence.
```python
# import transformations, constraints, and the Augmenter
from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.transformations import WordSwapQWERTY
from textattack.transformations import CompositeTransformation
from textattack.constraints.pre_transformation import RepeatModification
from textattack.constraints.pre_transformation import StopwordModification
from textattack.augmentation import Augmenter

# Set up transformation using CompositeTransformation()
transformation = CompositeTransformation([WordSwapRandomCharacterDeletion(), WordSwapQWERTY()])
# Set up constraints
constraints = [RepeatModification(), StopwordModification()]
# Create augmenter with specified parameters
augmenter = Augmenter(transformation=transformation, constraints=constraints, pct_words_to_swap=0.5, transformations_per_example=10)
s = 'What I cannot create, I do not understand.'
# Augment!
augmenter.augment(s)
```
```
['Ahat I camnot reate, I do not unerstand.',
 'Ahat I cwnnot crewte, I do not undefstand.',
 'Wat I camnot vreate, I do not undefstand.',
 'Wha I annot crate, I do not unerstand.',
 'Whaf I canno creatr, I do not ynderstand.',
 'Wtat I cannor dreate, I do not understwnd.',
 'Wuat I canno ceate, I do not unferstand.',
 'hat I cnnot ceate, I do not undersand.',
 'hat I cnnot cfeate, I do not undfrstand.',
 'hat I cwnnot crfate, I do not ujderstand.']
```
Pre-built Augmentation Recipes¶
In addition to creating our own augmenter, we can also use pre-built augmentation recipes to perturb datasets. These recipes are implemented from published papers and are very convenient to use. The list of available recipes can be found here.
In the following example, we will use the CheckListAugmenter to showcase our augmentation recipes. The CheckListAugmenter augments words using the transformation methods provided by CheckList INV testing, which combines Name Replacement, Location Replacement, Number Alteration, and Contraction/Extension. The original paper can be found here: “Beyond Accuracy: Behavioral Testing of NLP models with CheckList” (Ribeiro et al., 2020)
```python
# import the CheckListAugmenter
from textattack.augmentation import CheckListAugmenter

# Alter default values if desired
augmenter = CheckListAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)
s = "I'd love to go to Japan but the tickets are 500 dollars"
# Augment
augmenter.augment(s)
```
```
['I would love to go to Chile but the tickets are 500 dollars',
 'I would love to go to Japan but the tickets are 500 dollars',
 'I would love to go to Japan but the tickets are 75 dollars',
 "I'd love to go to Oman but the tickets are 373 dollars",
 "I'd love to go to Vietnam but the tickets are 613 dollars"]
```
Note that the previous snippet of code is equivalent to running

textattack augment --recipe checklist --pct-words-to-swap .2 --transformations-per-example 5 --exclude-original --interactive

on the command line.
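Beyond interactive mode, the same CLI can augment an entire CSV dataset in one pass. The sketch below uses the standard textattack augment flags; the file names examples.csv and output.csv and the column name text are placeholders for your own dataset.

```shell
# Augment the "text" column of examples.csv with the CheckList recipe,
# writing only the augmented rows (not the originals) to output.csv.
textattack augment \
  --input-csv examples.csv --output-csv output.csv \
  --input-column text \
  --recipe checklist \
  --pct-words-to-swap .2 \
  --transformations-per-example 5 \
  --exclude-original
```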
Here’s another example, using the WordNetAugmenter:

```python
from textattack.augmentation import WordNetAugmenter

augmenter = WordNetAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)
s = "I'd love to go to Japan but the tickets are 500 dollars"
augmenter.augment(s)
```
```
["I'd fuck to fit to Japan but the tickets are 500 dollars",
 "I'd know to cristal to Japan but the tickets are 500 dollars",
 "I'd love to depart to Japan but the tickets are D dollars",
 "I'd love to get to Nihon but the tickets are 500 dollars",
 "I'd love to work to Japan but the tickets are 500 buck"]
```
We have now gone through the basics of running
Augmenter, either by creating a new augmenter from scratch or by using a pre-built augmentation recipe. This can be done in as few as four lines of code, so please give it a try if you haven’t already! 🐙