{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m83IiqVREJ96"
   },
   "source": [
    "# TextAttack Augmentation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6UZ0d84hEJ98"
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/3_Augmentations.ipynb)\n",
    "\n",
    "[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/3_Augmentations.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tjqc2c5_7YaX"
   },
   "source": [
    " Please remember to run the following in your notebook enviroment before running the tutorial codes:\n",
    "\n",
    "```\n",
    "pip3 install textattack[tensorflow]\n",
    "```\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qZ5xnoevEJ99"
   },
   "source": [
    "Augmenting a dataset using TextAttack requries only a few lines of code when it is done right. The `Augmenter` class is created for this purpose to generate augmentations of a string or a list of strings. Augmentation could be done in either python script or command line.\n",
    "\n",
    "### Creating an Augmenter\n",
    "\n",
    "The **Augmenter** class is essensial for performing data augmentation using TextAttack. It takes in four paramerters in the following order:\n",
    "\n",
    "\n",
    "1.  **transformation**: all [transformations](https://textattack.readthedocs.io/en/latest/apidoc/textattack.transformations.html) implemented by TextAttack can be used to create an `Augmenter`. Note here that if we want to apply multiple transformations in the same time, they first need to be incooporated into a `CompositeTransformation` class.\n",
    "2.  **constraints**: [constraints](https://textattack.readthedocs.io/en/latest/apidoc/textattack.constraints.html#) determine whether or not a given augmentation is valid, consequently enhancing the quality of the augmentations. The default augmenter does not have any constraints but contraints can be supplied as a list to the Augmenter.\n",
    "3.  **pct_words_to_swap**:  percentage of words to swap per augmented example. The default is set to 0.1 (10%).\n",
    "4.  **transformations_per_example** maximum number of augmentations per input. The default is set to 1 (one augmented sentence given one original input)\n",
    "\n",
    "An example of creating one's own augmenter is shown below. In this case, we are creating an augmenter with **RandomCharacterDeletion** and **WordSwapQWERTY** transformations, **RepeatModification** and **StopWordModification** constraints. A maximum of **50%** of the words could be purturbed, and 10 augmentations will be generated from each input sentence.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5AXyxiLD4X93"
   },
   "outputs": [],
   "source": [
    "# import transformations, contraints, and the Augmenter\n",
    "from textattack.transformations import WordSwapRandomCharacterDeletion\n",
    "from textattack.transformations import WordSwapQWERTY\n",
    "from textattack.transformations import CompositeTransformation\n",
    "\n",
    "from textattack.constraints.pre_transformation import RepeatModification\n",
    "from textattack.constraints.pre_transformation import StopwordModification\n",
    "\n",
    "from textattack.augmentation import Augmenter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wFeXF_OL-vyw",
    "outputId": "c041e77e-accd-4a58-88be-9b140dd0cd56"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Ahat I camnot reate, I do not unerstand.',\n",
       " 'Ahat I cwnnot crewte, I do not undefstand.',\n",
       " 'Wat I camnot vreate, I do not undefstand.',\n",
       " 'Wha I annot crate, I do not unerstand.',\n",
       " 'Whaf I canno creatr, I do not ynderstand.',\n",
       " 'Wtat I cannor dreate, I do not understwnd.',\n",
       " 'Wuat I canno ceate, I do not unferstand.',\n",
       " 'hat I cnnot ceate, I do not undersand.',\n",
       " 'hat I cnnot cfeate, I do not undfrstand.',\n",
       " 'hat I cwnnot crfate, I do not ujderstand.']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Set up transformation using CompositeTransformation()\n",
    "transformation = CompositeTransformation(\n",
    "    [WordSwapRandomCharacterDeletion(), WordSwapQWERTY()]\n",
    ")\n",
    "# Set up constraints\n",
    "constraints = [RepeatModification(), StopwordModification()]\n",
    "# Create augmenter with specified parameters\n",
    "augmenter = Augmenter(\n",
    "    transformation=transformation,\n",
    "    constraints=constraints,\n",
    "    pct_words_to_swap=0.5,\n",
    "    transformations_per_example=10,\n",
    ")\n",
    "s = \"What I cannot create, I do not understand.\"\n",
    "# Augment!\n",
    "augmenter.augment(s)"
   ]
  },
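  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make `pct_words_to_swap` more concrete, the sketch below estimates how many words a given fraction corresponds to. The helper `estimate_words_to_swap` is hypothetical and not part of TextAttack; splitting on whitespace and rounding down with a floor of one word are assumptions made for illustration, so the library's own tokenization and rounding may differ.\n",
    "\n",
    "```python\n",
    "def estimate_words_to_swap(sentence, pct_words_to_swap):\n",
    "    # Assumption: words are whitespace-separated tokens\n",
    "    num_words = len(sentence.split())\n",
    "    # Assumption: round down, but always perturb at least one word\n",
    "    return max(int(pct_words_to_swap * num_words), 1)\n",
    "\n",
    "\n",
    "sentence = 'What I cannot create, I do not understand.'\n",
    "num_to_swap = estimate_words_to_swap(sentence, 0.5)\n",
    "print(num_to_swap)\n",
    "```\n",
    "\n",
    "Under these assumptions, half of the eight words in the example sentence gives four words, which agrees with the four perturbed words in each augmentation shown above."
   ]
  },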
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "b7020KtvEJ9-"
   },
   "source": [
    "### Pre-built Augmentation Recipes\n",
    "\n",
    "In addition to creating our own augmenter, we could also use pre-built augmentation recipes to perturb datasets. These recipes are implemented from publishded papers and are very convenient to use. The list of available recipes can be found [here](https://textattack.readthedocs.io/en/latest/3recipes/augmenter_recipes.html).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pkBqK5wYQKZu"
   },
   "source": [
    "In the following example, we will use the `CheckListAugmenter` to showcase our augmentation recipes. The `CheckListAugmenter` augments words by using the transformation methods provided by CheckList INV testing, which combines **Name Replacement**, **Location Replacement**, **Number Alteration**, and **Contraction/Extension**. The original paper can be found here: [\"Beyond Accuracy: Behavioral Testing of NLP models with CheckList\" (Ribeiro et al., 2020)](https://arxiv.org/abs/2005.04118)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "WkYiVH6lQedu",
    "outputId": "cd5ffc65-ca80-45cd-b3bb-d023bcad09a4"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021-06-09 16:58:41,816 --------------------------------------------------------------------------------\n",
      "2021-06-09 16:58:41,817 The model key 'ner' now maps to 'https://huggingface.co/flair/ner-english' on the HuggingFace ModelHub\n",
      "2021-06-09 16:58:41,817  - The most current version of the model is automatically downloaded from there.\n",
      "2021-06-09 16:58:41,818  - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner/en-ner-conll03-v0.4.pt)\n",
      "2021-06-09 16:58:41,818 --------------------------------------------------------------------------------\n",
      "2021-06-09 16:58:41,906 loading file /u/lab/jy2ma/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['I would love to go to Chile but the tickets are 500 dollars',\n",
       " 'I would love to go to Japan but the tickets are 500 dollars',\n",
       " 'I would love to go to Japan but the tickets are 75 dollars',\n",
       " \"I'd love to go to Oman but the tickets are 373 dollars\",\n",
       " \"I'd love to go to Vietnam but the tickets are 613 dollars\"]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# import the CheckListAugmenter\n",
    "from textattack.augmentation import CheckListAugmenter\n",
    "\n",
    "# Alter default values if desired\n",
    "augmenter = CheckListAugmenter(pct_words_to_swap=0.2, transformations_per_example=5)\n",
    "s = \"I'd love to go to Japan but the tickets are 500 dollars\"\n",
    "# Augment\n",
    "augmenter.augment(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5vn22xrLST0H"
   },
   "source": [
    "Note that the previous snippet of code is equivalent of running\n",
    "\n",
    "```\n",
    "textattack augment --recipe checklist --pct-words-to-swap .1 --transformations-per-example 5 --exclude-original --interactive\n",
    "```\n",
    "in command line.\n"
   ]
  },
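  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of how the Python arguments line up with the command-line flags, the sketch below assembles the command string from the values passed to `CheckListAugmenter` above. The helper `build_augment_command` is hypothetical and only formats text; it uses no flags beyond those in the command shown above and does not invoke TextAttack.\n",
    "\n",
    "```python\n",
    "import shlex\n",
    "\n",
    "\n",
    "def build_augment_command(recipe, pct_words_to_swap, transformations_per_example):\n",
    "    # Hypothetical helper: maps keyword arguments to the equivalent CLI flags\n",
    "    args = [\n",
    "        'textattack', 'augment',\n",
    "        '--recipe', recipe,\n",
    "        '--pct-words-to-swap', str(pct_words_to_swap),\n",
    "        '--transformations-per-example', str(transformations_per_example),\n",
    "        '--exclude-original',\n",
    "        '--interactive',\n",
    "    ]\n",
    "    # Quote each argument so the string is safe to paste into a shell\n",
    "    return ' '.join(shlex.quote(arg) for arg in args)\n",
    "\n",
    "\n",
    "command = build_augment_command('checklist', 0.2, 5)\n",
    "print(command)\n",
    "```"
   ]
  },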
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VqfmCKz0XY-Y"
   },
   "source": [
    "\n",
    "\n",
    "\n",
    "Here's another example of using `WordNetAugmenter`. In this scenario, we enable `enable_advanced_metrics` to acquire perplexity and USE score, and enable `high_yield` to generate more examples in the same running time:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "l2b-4scuXvkA",
    "outputId": "5a372fd2-226a-4970-a2c9-c09bf2af56c2"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 1024). Running this sequence through the model will result in indexing errors\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Original Perplexity Score: 1.09\n",
      "\n",
      "Average Augment Perplexity Score: 3.17\n",
      "\n",
      "Average Augment USE Score: 0.72\n",
      "\n",
      "Augmentations:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[\"I'd bang to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd bang to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd bed to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd bed to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd beloved to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd beloved to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd bonk to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd bonk to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd bonk to travel to Japan but the tag are 500 buck\",\n",
       " \"I'd bonk to travel to Japan but the tag are 500 clam\",\n",
       " \"I'd bonk to travel to Japan but the tag are 500 dollar\",\n",
       " \"I'd bonk to travel to Japan but the tag are 500 dollars\",\n",
       " \"I'd bonk to travel to Japan but the tag are D dollars\",\n",
       " \"I'd bonk to travel to Japan but the tag are d dollars\",\n",
       " \"I'd bonk to travel to Nihon but the tag are 500 dollars\",\n",
       " \"I'd bonk to travel to Nippon but the tag are 500 dollars\",\n",
       " \"I'd bonk to travel to japan but the tag are 500 dollars\",\n",
       " \"I'd dear to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd dear to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd dearest to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd dearest to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd eff to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd eff to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd enjoy to exit to Japan but the fine are 500 buck\",\n",
       " \"I'd enjoy to exit to Japan but the slate are 500 buck\",\n",
       " \"I'd enjoy to exit to Japan but the tag are 500 buck\",\n",
       " \"I'd enjoy to exit to Japan but the ticket are 500 buck\",\n",
       " \"I'd enjoy to exit to Japan but the tickets are 500 buck\",\n",
       " \"I'd enjoy to exit to Japan but the tickets are D buck\",\n",
       " \"I'd enjoy to exit to Japan but the tickets are d buck\",\n",
       " \"I'd enjoy to exit to Nihon but the tickets are 500 buck\",\n",
       " \"I'd enjoy to exit to Nippon but the tickets are 500 buck\",\n",
       " \"I'd enjoy to exit to japan but the tickets are 500 buck\",\n",
       " \"I'd enjoy to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd enjoy to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd fuck to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd fuck to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd honey to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd honey to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd hump to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd hump to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd jazz to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd jazz to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd know to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd know to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd love to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd love to operate to Japan but the ticket are D buck\",\n",
       " \"I'd love to operate to Japan but the ticket are d buck\",\n",
       " \"I'd love to operate to Nihon but the ticket are 500 buck\",\n",
       " \"I'd love to operate to Nippon but the ticket are 500 buck\",\n",
       " \"I'd love to operate to japan but the ticket are 500 buck\",\n",
       " \"I'd love to plump to Nihon but the fine are 500 clam\",\n",
       " \"I'd love to plump to Nihon but the slate are 500 clam\",\n",
       " \"I'd love to plump to Nihon but the tag are 500 clam\",\n",
       " \"I'd love to plump to Nihon but the ticket are 500 clam\",\n",
       " \"I'd love to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd love to plump to Nihon but the tickets are D clam\",\n",
       " \"I'd love to plump to Nihon but the tickets are d clam\",\n",
       " \"I'd lovemaking to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd lovemaking to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd passion to fit to Japan but the fine are 500 buck\",\n",
       " \"I'd passion to fit to Japan but the fine are 500 clam\",\n",
       " \"I'd passion to fit to Japan but the fine are 500 dollar\",\n",
       " \"I'd passion to fit to Japan but the fine are 500 dollars\",\n",
       " \"I'd passion to fit to Japan but the fine are D dollars\",\n",
       " \"I'd passion to fit to Japan but the fine are d dollars\",\n",
       " \"I'd passion to fit to Nihon but the fine are 500 dollars\",\n",
       " \"I'd passion to fit to Nippon but the fine are 500 dollars\",\n",
       " \"I'd passion to fit to japan but the fine are 500 dollars\",\n",
       " \"I'd passion to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd passion to plump to Nihon but the tickets are 500 clam\",\n",
       " \"I'd screw to operate to Japan but the ticket are 500 buck\",\n",
       " \"I'd screw to plump to Nihon but the tickets are 500 clam\"]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from textattack.augmentation import WordNetAugmenter\n",
    "\n",
    "augmenter = WordNetAugmenter(\n",
    "    pct_words_to_swap=0.4,\n",
    "    transformations_per_example=5,\n",
    "    high_yield=True,\n",
    "    enable_advanced_metrics=True,\n",
    ")\n",
    "s = \"I'd love to go to Japan but the tickets are 500 dollars\"\n",
    "results = augmenter.augment(s)\n",
    "print(f\"Average Original Perplexity Score: {results[1]['avg_original_perplexity']}\\n\")\n",
    "print(f\"Average Augment Perplexity Score: {results[1]['avg_attack_perplexity']}\\n\")\n",
    "print(f\"Average Augment USE Score: {results[2]['avg_attack_use_score']}\\n\")\n",
    "print(f\"Augmentations:\")\n",
    "results[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "whvwbHLVEJ-S"
   },
   "source": [
    "### Conclusion\n",
    "We have now went through the basics in running `Augmenter` by either creating a new augmenter from scratch or using a pre-built augmenter. This could be done in as few as 4 lines of code so please give it a try if you haven't already! 🐙"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "Augmentation with TextAttack.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}