{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "m83IiqVREJ96" }, "source": [ "# TextAttack Augmentation" ] }, { "cell_type": "markdown", "metadata": { "id": "6UZ0d84hEJ98" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/3_Augmentations.ipynb)\n", "\n", "[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/3_Augmentations.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "tjqc2c5_7YaX" }, "source": [ "Please remember to run the following in your notebook environment before running the tutorial code:\n", "\n", "```\n", "pip3 install textattack[tensorflow]\n", "```\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qZ5xnoevEJ99" }, "source": [ "Augmenting a dataset using TextAttack requires only a few lines of code. The `Augmenter` class exists for this purpose: it generates augmentations of a string or a list of strings. Augmentation can be done either from a Python script or from the command line.\n", "\n", "### Creating an Augmenter\n", "\n", "The **Augmenter** class is essential for performing data augmentation using TextAttack. It takes four parameters, in the following order:\n", "\n", "\n", "1. **transformation**: any of the [transformations](https://textattack.readthedocs.io/en/latest/apidoc/textattack.transformations.html) implemented by TextAttack can be used to create an `Augmenter`. Note that to apply multiple transformations at the same time, they first need to be combined in a `CompositeTransformation`.\n", "2. **constraints**: [constraints](https://textattack.readthedocs.io/en/latest/apidoc/textattack.constraints.html#) determine whether or not a given augmentation is valid, which improves the quality of the augmentations. 
The default augmenter does not have any constraints, but constraints can be supplied as a list to the Augmenter.\n", "3. **pct_words_to_swap**: percentage of words to swap per augmented example. The default is 0.1 (10%).\n", "4. **transformations_per_example**: maximum number of augmentations per input. The default is 1 (one augmented sentence per original input).\n", "\n", "An example of creating one's own augmenter is shown below. In this case, we create an augmenter with the **WordSwapRandomCharacterDeletion** and **WordSwapQWERTY** transformations and the **RepeatModification** and **StopwordModification** constraints. A maximum of **50%** of the words can be perturbed, and 10 augmentations will be generated from each input sentence.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5AXyxiLD4X93" }, "outputs": [], "source": [ "# import transformations, constraints, and the Augmenter\n", "from textattack.transformations import WordSwapRandomCharacterDeletion\n", "from textattack.transformations import WordSwapQWERTY\n", "from textattack.transformations import CompositeTransformation\n", "\n", "from textattack.constraints.pre_transformation import RepeatModification\n", "from textattack.constraints.pre_transformation import StopwordModification\n", "\n", "from textattack.augmentation import Augmenter" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wFeXF_OL-vyw", "outputId": "c041e77e-accd-4a58-88be-9b140dd0cd56" }, "outputs": [ { "data": { "text/plain": [ "['Ahat I camnot reate, I do not unerstand.',\n", " 'Ahat I cwnnot crewte, I do not undefstand.',\n", " 'Wat I camnot vreate, I do not undefstand.',\n", " 'Wha I annot crate, I do not unerstand.',\n", " 'Whaf I canno creatr, I do not ynderstand.',\n", " 'Wtat I cannor dreate, I do not understwnd.',\n", " 'Wuat I canno ceate, I do not unferstand.',\n", " 'hat I cnnot ceate, I do not undersand.',\n", " 
'hat I cnnot cfeate, I do not undfrstand.',\n", " 'hat I cwnnot crfate, I do not ujderstand.']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set up transformation using CompositeTransformation()\n", "transformation = CompositeTransformation(\n", " [WordSwapRandomCharacterDeletion(), WordSwapQWERTY()]\n", ")\n", "# Set up constraints\n", "constraints = [RepeatModification(), StopwordModification()]\n", "# Create augmenter with specified parameters\n", "augmenter = Augmenter(\n", " transformation=transformation,\n", " constraints=constraints,\n", " pct_words_to_swap=0.5,\n", " transformations_per_example=10,\n", ")\n", "s = \"What I cannot create, I do not understand.\"\n", "# Augment!\n", "augmenter.augment(s)" ] }, { "cell_type": "markdown", "metadata": { "id": "b7020KtvEJ9-" }, "source": [ "### Pre-built Augmentation Recipes\n", "\n", "In addition to creating our own augmenter, we can also use pre-built augmentation recipes to perturb datasets. These recipes are implemented from published papers and are very convenient to use. The list of available recipes can be found [here](https://textattack.readthedocs.io/en/latest/3recipes/augmenter_recipes.html).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pkBqK5wYQKZu" }, "source": [ "In the following example, we will use the `CheckListAugmenter` to showcase our augmentation recipes. The `CheckListAugmenter` augments words by using the transformation methods provided by CheckList INV testing, which combines **Name Replacement**, **Location Replacement**, **Number Alteration**, and **Contraction/Extension**. 
The original paper can be found here: [\"Beyond Accuracy: Behavioral Testing of NLP models with CheckList\" (Ribeiro et al., 2020)](https://arxiv.org/abs/2005.04118)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WkYiVH6lQedu", "outputId": "cd5ffc65-ca80-45cd-b3bb-d023bcad09a4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2021-06-09 16:58:41,816 --------------------------------------------------------------------------------\n", "2021-06-09 16:58:41,817 The model key 'ner' now maps to 'https://huggingface.co/flair/ner-english' on the HuggingFace ModelHub\n", "2021-06-09 16:58:41,817 - The most current version of the model is automatically downloaded from there.\n", "2021-06-09 16:58:41,818 - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner/en-ner-conll03-v0.4.pt)\n", "2021-06-09 16:58:41,818 --------------------------------------------------------------------------------\n", "2021-06-09 16:58:41,906 loading file /u/lab/jy2ma/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4\n" ] }, { "data": { "text/plain": [ "['I would love to go to Chile but the tickets are 500 dollars',\n", " 'I would love to go to Japan but the tickets are 500 dollars',\n", " 'I would love to go to Japan but the tickets are 75 dollars',\n", " \"I'd love to go to Oman but the tickets are 373 dollars\",\n", " \"I'd love to go to Vietnam but the tickets are 613 dollars\"]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import the CheckListAugmenter\n", "from textattack.augmentation import CheckListAugmenter\n", "\n", "# Alter default values if desired\n", "augmenter = CheckListAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)\n", "s = \"I'd love to go to 
Japan but the tickets are 500 dollars\"\n", "# Augment\n", "augmenter.augment(s)" ] }, { "cell_type": "markdown", "metadata": { "id": "5vn22xrLST0H" }, "source": [ "Note that the previous snippet of code is equivalent to running\n", "\n", "```\n", "textattack augment --recipe checklist --pct-words-to-swap .2 --transformations-per-example 5 --exclude-original --interactive\n", "```\n", "on the command line.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "VqfmCKz0XY-Y" }, "source": [ "\n", "\n", "\n", "Here's another example, using `WordNetAugmenter`. In this scenario, we set `enable_advanced_metrics` to obtain perplexity and USE scores, and set `high_yield` to generate more examples in the same running time:\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "l2b-4scuXvkA", "outputId": "5a372fd2-226a-4970-a2c9-c09bf2af56c2" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 1024). 
Running this sequence through the model will result in indexing errors\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Average Original Perplexity Score: 1.09\n", "\n", "Average Augment Perplexity Score: 3.17\n", "\n", "Average Augment USE Score: 0.72\n", "\n", "Augmentations:\n" ] }, { "data": { "text/plain": [ "[\"I'd bang to operate to Japan but the ticket are 500 buck\",\n", " \"I'd bang to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd bed to operate to Japan but the ticket are 500 buck\",\n", " \"I'd bed to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd beloved to operate to Japan but the ticket are 500 buck\",\n", " \"I'd beloved to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd bonk to operate to Japan but the ticket are 500 buck\",\n", " \"I'd bonk to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd bonk to travel to Japan but the tag are 500 buck\",\n", " \"I'd bonk to travel to Japan but the tag are 500 clam\",\n", " \"I'd bonk to travel to Japan but the tag are 500 dollar\",\n", " \"I'd bonk to travel to Japan but the tag are 500 dollars\",\n", " \"I'd bonk to travel to Japan but the tag are D dollars\",\n", " \"I'd bonk to travel to Japan but the tag are d dollars\",\n", " \"I'd bonk to travel to Nihon but the tag are 500 dollars\",\n", " \"I'd bonk to travel to Nippon but the tag are 500 dollars\",\n", " \"I'd bonk to travel to japan but the tag are 500 dollars\",\n", " \"I'd dear to operate to Japan but the ticket are 500 buck\",\n", " \"I'd dear to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd dearest to operate to Japan but the ticket are 500 buck\",\n", " \"I'd dearest to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd eff to operate to Japan but the ticket are 500 buck\",\n", " \"I'd eff to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd enjoy to exit to Japan but the fine are 500 buck\",\n", " \"I'd enjoy to exit to Japan but the slate are 500 
buck\",\n", " \"I'd enjoy to exit to Japan but the tag are 500 buck\",\n", " \"I'd enjoy to exit to Japan but the ticket are 500 buck\",\n", " \"I'd enjoy to exit to Japan but the tickets are 500 buck\",\n", " \"I'd enjoy to exit to Japan but the tickets are D buck\",\n", " \"I'd enjoy to exit to Japan but the tickets are d buck\",\n", " \"I'd enjoy to exit to Nihon but the tickets are 500 buck\",\n", " \"I'd enjoy to exit to Nippon but the tickets are 500 buck\",\n", " \"I'd enjoy to exit to japan but the tickets are 500 buck\",\n", " \"I'd enjoy to operate to Japan but the ticket are 500 buck\",\n", " \"I'd enjoy to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd fuck to operate to Japan but the ticket are 500 buck\",\n", " \"I'd fuck to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd honey to operate to Japan but the ticket are 500 buck\",\n", " \"I'd honey to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd hump to operate to Japan but the ticket are 500 buck\",\n", " \"I'd hump to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd jazz to operate to Japan but the ticket are 500 buck\",\n", " \"I'd jazz to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd know to operate to Japan but the ticket are 500 buck\",\n", " \"I'd know to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd love to operate to Japan but the ticket are 500 buck\",\n", " \"I'd love to operate to Japan but the ticket are D buck\",\n", " \"I'd love to operate to Japan but the ticket are d buck\",\n", " \"I'd love to operate to Nihon but the ticket are 500 buck\",\n", " \"I'd love to operate to Nippon but the ticket are 500 buck\",\n", " \"I'd love to operate to japan but the ticket are 500 buck\",\n", " \"I'd love to plump to Nihon but the fine are 500 clam\",\n", " \"I'd love to plump to Nihon but the slate are 500 clam\",\n", " \"I'd love to plump to Nihon but the tag are 500 clam\",\n", " \"I'd love to plump to Nihon but the ticket 
are 500 clam\",\n", " \"I'd love to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd love to plump to Nihon but the tickets are D clam\",\n", " \"I'd love to plump to Nihon but the tickets are d clam\",\n", " \"I'd lovemaking to operate to Japan but the ticket are 500 buck\",\n", " \"I'd lovemaking to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd passion to fit to Japan but the fine are 500 buck\",\n", " \"I'd passion to fit to Japan but the fine are 500 clam\",\n", " \"I'd passion to fit to Japan but the fine are 500 dollar\",\n", " \"I'd passion to fit to Japan but the fine are 500 dollars\",\n", " \"I'd passion to fit to Japan but the fine are D dollars\",\n", " \"I'd passion to fit to Japan but the fine are d dollars\",\n", " \"I'd passion to fit to Nihon but the fine are 500 dollars\",\n", " \"I'd passion to fit to Nippon but the fine are 500 dollars\",\n", " \"I'd passion to fit to japan but the fine are 500 dollars\",\n", " \"I'd passion to operate to Japan but the ticket are 500 buck\",\n", " \"I'd passion to plump to Nihon but the tickets are 500 clam\",\n", " \"I'd screw to operate to Japan but the ticket are 500 buck\",\n", " \"I'd screw to plump to Nihon but the tickets are 500 clam\"]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from textattack.augmentation import WordNetAugmenter\n", "\n", "augmenter = WordNetAugmenter(\n", " pct_words_to_swap=0.4,\n", " transformations_per_example=5,\n", " high_yield=True,\n", " enable_advanced_metrics=True,\n", ")\n", "s = \"I'd love to go to Japan but the tickets are 500 dollars\"\n", "results = augmenter.augment(s)\n", "print(f\"Average Original Perplexity Score: {results[1]['avg_original_perplexity']}\\n\")\n", "print(f\"Average Augment Perplexity Score: {results[1]['avg_attack_perplexity']}\\n\")\n", "print(f\"Average Augment USE Score: {results[2]['avg_attack_use_score']}\\n\")\n", "print(f\"Augmentations:\")\n", "results[0]" ] }, { 
"cell_type": "markdown", "metadata": { "id": "whvwbHLVEJ-S" }, "source": [ "### Conclusion\n", "We have now gone through the basics of running an `Augmenter`, either by creating a new augmenter from scratch or by using a pre-built augmenter. This can be done in as few as 4 lines of code, so please give it a try if you haven't already! 🐙" ] } ], "metadata": { "colab": { "name": "Augmentation with TextAttack.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }