sklearn and TextAttack

The following code trains two text classification models with sklearn. Both are logistic regression models; the difference is in the features they use.

We will load data with the datasets library, train the models, and attack them with TextAttack.

[1]:
!pip install datasets nltk sklearn
Requirement already satisfied: datasets in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (1.6.1)
Requirement already satisfied: nltk in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (3.6.2)
Requirement already satisfied: sklearn in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (0.0)
Requirement already satisfied: multiprocess in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (0.70.11.1)
Requirement already satisfied: packaging in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (20.9)
Requirement already satisfied: xxhash in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (2.0.2)
Requirement already satisfied: numpy>=1.17 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (1.18.5)
Requirement already satisfied: pyarrow>=1.0.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (4.0.0)
Requirement already satisfied: dill in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (0.3.3)
Requirement already satisfied: requests>=2.19.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (2.25.1)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (4.49.0)
Requirement already satisfied: huggingface-hub<0.1.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (0.0.8)
Requirement already satisfied: pandas in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (1.2.4)
Requirement already satisfied: fsspec in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (2021.4.0)
Requirement already satisfied: filelock in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from huggingface-hub<0.1.0->datasets) (3.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2020.12.5)
Requirement already satisfied: idna<3,>=2.5 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (1.26.4)
Requirement already satisfied: chardet<5,>=3.0.2 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (4.0.0)
Requirement already satisfied: regex in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from nltk) (2021.4.4)
Requirement already satisfied: click in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from nltk) (7.1.2)
Requirement already satisfied: joblib in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from nltk) (1.0.1)
Requirement already satisfied: scikit-learn in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from sklearn) (0.24.2)
Requirement already satisfied: pyparsing>=2.0.2 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from packaging->datasets) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from pandas->datasets) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from pandas->datasets) (2021.1)
Requirement already satisfied: six>=1.5 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Requirement already satisfied: scipy>=0.19.1 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from scikit-learn->sklearn) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from scikit-learn->sklearn) (2.1.0)

Please remember to install TextAttack in your notebook environment before running the following code:
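
!pip3 install textattack[tensorflow]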

Training

This code trains two models: one on bag-of-words statistics (bow_unstemmed) and one on tf-idf statistics (tfidf_unstemmed). The dataset is the Rotten Tomatoes movie review dataset.
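
To see concretely how the two feature types differ, here is a toy sketch (separate from the pipeline below, using made-up sentences): CountVectorizer produces raw token counts, while TfidfVectorizer reweights those counts by inverse document frequency.

# A toy illustration only; not part of the training code below.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ["a great movie", "a terrible movie"]
print(CountVectorizer().fit_transform(toy).toarray())  # integer counts
print(TfidfVectorizer().fit_transform(toy).toarray())  # idf-weighted values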

[2]:
import nltk  # the Natural Language Toolkit

nltk.download("punkt")  # The NLTK tokenizer
[nltk_data] Downloading package punkt to /u/lab/jy2ma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[2]:
True
[3]:
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# classification_report gives per-class metrics beyond accuracy
from sklearn.metrics import classification_report


def load_data(dataset_split="train"):
    dataset = datasets.load_dataset("rotten_tomatoes")[dataset_split]
    # Load reviews and labels into a DataFrame
    df = pd.DataFrame()
    df["Review"] = [review["text"] for review in dataset]
    df["Sentiment"] = [review["label"] for review in dataset]
    # Remove non-alphabetic characters
    df["Review"] = df["Review"].apply(lambda x: re.sub("[^a-zA-Z]", " ", str(x)))
    # Tokenize the training and testing data
    df_tokenized = tokenize_review(df)
    return df_tokenized


def tokenize_review(df):
    # Tokenize each review
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = " ".join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column="Stemmed", value=stemmed_tokens)
    return df


def transform_BOW(training, testing, column_name):
    vect = CountVectorizer(
        max_features=100, ngram_range=(1, 3), stop_words=ENGLISH_STOP_WORDS
    )
    vectFit = vect.fit(training[column_name])
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(
        BOW_training.toarray(), columns=vect.get_feature_names()
    )
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_df = pd.DataFrame(
        BOW_testing.toarray(), columns=vect.get_feature_names()
    )
    return vectFit, BOW_training_df, BOW_testing_df


def transform_tfidf(training, testing, column_name):
    Tfidf = TfidfVectorizer(
        ngram_range=(1, 3), max_features=100, stop_words=ENGLISH_STOP_WORDS
    )
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    Tfidf_training_df = pd.DataFrame(
        Tfidf_training.toarray(), columns=Tfidf.get_feature_names()
    )
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(
        Tfidf_testing.toarray(), columns=Tfidf.get_feature_names()
    )
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df


def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column="Lengths", value=len_tokens)

    # Create average word length feature
    Average_Words = [len(x) / (len(x.split())) for x in df["Review"].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df["averageWords"] = Average_Words
    return df


def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print(
        "Training accuracy of " + name_of_test + ": ", log_reg.score(X_train, y_train)
    )
    print("Testing accuracy of " + name_of_test + ": ", log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg


# Load the training set reviews into a DataFrame
df_train = load_data("train")

print("...successfully loaded training data")
print("Total length of training data: ", len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print("...augmented data with len_tokens and average_words")

# Load the test set into a DataFrame
df_test = load_data("test")

print("...successfully loaded testing data")
print("Total length of testing data: ", len(df_test))
df_test = add_augmenting_features(df_test)
print("...augmented data with len_tokens and average_words")

# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed BOW data")

# Create TfIdf features for training set
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed TFIDF data")

# Running logistic regression on dataframes
bow_unstemmed = build_model(
    df_train_bow_unstem,
    df_train["Sentiment"],
    df_test_bow_unstem,
    df_test["Sentiment"],
    "BOW Unstemmed",
)

tfidf_unstemmed = build_model(
    df_train_tfidf_unstem,
    df_train["Sentiment"],
    df_test_tfidf_unstem,
    df_test["Sentiment"],
    "TFIDF Unstemmed",
)
Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/u/lab/jy2ma/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)
...successfully loaded training data
Total length of training data:  8530
Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/u/lab/jy2ma/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)
...augmented data with len_tokens and average_words
...successfully loaded testing data
Total length of testing data:  1066
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data
Training accuracy of BOW Unstemmed:  0.6193434935521688
Testing accuracy of BOW Unstemmed:  0.6031894934333959
              precision    recall  f1-score   support

           0       0.59      0.69      0.63       533
           1       0.62      0.52      0.57       533

    accuracy                           0.60      1066
   macro avg       0.61      0.60      0.60      1066
weighted avg       0.61      0.60      0.60      1066

Training accuracy of TFIDF Unstemmed:  0.6220398593200469
Testing accuracy of TFIDF Unstemmed:  0.6088180112570356
              precision    recall  f1-score   support

           0       0.60      0.67      0.63       533
           1       0.62      0.54      0.58       533

    accuracy                           0.61      1066
   macro avg       0.61      0.61      0.61      1066
weighted avg       0.61      0.61      0.61      1066

Attacking

TextAttack includes a built-in SklearnModelWrapper that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass SklearnModelWrapper so that the model's inputs and outputs come in the correct format.)
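
For instance, here is a minimal sketch of such a subclass. The class name and the preprocessing choice are hypothetical; the sketch assumes the base wrapper stores the classifier as self.model and the fitted vectorizer as self.tokenizer:

import re

from textattack.models.wrappers import SklearnModelWrapper


class CleanedSklearnModelWrapper(SklearnModelWrapper):
    def __call__(self, text_input_list, batch_size=None):
        # Mirror the training-time cleanup from load_data(): replace
        # non-alphabetic characters with spaces before vectorizing.
        cleaned = [re.sub("[^a-zA-Z]", " ", text) for text in text_input_list]
        # self.tokenizer is the fitted CountVectorizer / TfidfVectorizer.
        features = self.tokenizer.transform(cleaned).toarray()
        # Classification goal functions expect class scores per input.
        return self.model.predict_proba(features)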

Once we initialize the model wrapper, we load a few samples from the Rotten Tomatoes dataset and run the TextFoolerJin2019 attack on our model.

[4]:
from textattack.models.wrappers import SklearnModelWrapper

model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)
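
The same pattern works for the tf-idf model, should you want to attack it instead; just pass the other classifier and its fitted vectorizer:

model_wrapper_tfidf = SklearnModelWrapper(tfidf_unstemmed, unstemmed_tfidf_vect_fit)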
[5]:
from textattack.datasets import HuggingFaceDataset
from textattack.attack_recipes import TextFoolerJin2019
from textattack import Attacker

dataset = HuggingFaceDataset("rotten_tomatoes", None, "train")
attack = TextFoolerJin2019.build(model_wrapper)

attacker = Attacker(attack, dataset)
attacker.attack_dataset()
Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/u/lab/jy2ma/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)
textattack: Loading datasets dataset rotten_tomatoes, split train.
textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
  0%|          | 0/10 [00:00<?, ?it/s]
Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints):
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (2): UniversalSentenceEncoder(
        (metric):  angular
        (threshold):  0.840845057
        (window_size):  15
        (skip_text_shorter_than_window):  True
        (compare_against_original):  False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels):  ['premise', 'hypothesis']
        (columns_to_ignore):  {'premise'}
      )
  (is_black_box):  True
)

Using /p/qdata/jy2ma/.cache/textattack to cache modules.
[Succeeded / Failed / Skipped / Total] 2 / 0 / 1 / 3:  30%|███       | 3/10 [00:05<00:12,  1.71s/it]
--------------------------------------------- Result 1 ---------------------------------------------
Positive (55%) --> Negative (51%)

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

the rock is destined to be the 21st century's newest " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .


--------------------------------------------- Result 2 ---------------------------------------------
Positive (52%) --> Negative (52%)

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/dumbledore peter jackson's expanded vision of j . r . r . tolkien's middle-earth .


--------------------------------------------- Result 3 ---------------------------------------------
Negative (52%) --> [SKIPPED]

effective but too-tepid biopic


[Succeeded / Failed / Skipped / Total] 4 / 0 / 3 / 7:  70%|███████   | 7/10 [00:05<00:02,  1.29it/s]
--------------------------------------------- Result 4 ---------------------------------------------
Positive (72%) --> Negative (63%)

if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

if you sometimes like to go to the movie to have amuse , wasabi is a good place to start .


--------------------------------------------- Result 5 ---------------------------------------------
Negative (78%) --> [SKIPPED]

emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .


--------------------------------------------- Result 6 ---------------------------------------------
Positive (65%) --> Negative (60%)

the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .

the movie provides some admirable insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .


--------------------------------------------- Result 7 ---------------------------------------------
Negative (52%) --> [SKIPPED]

offers that rare combination of entertainment and education .


[Succeeded / Failed / Skipped / Total] 5 / 0 / 5 / 10: 100%|██████████| 10/10 [00:05<00:00,  1.81it/s]
--------------------------------------------- Result 8 ---------------------------------------------
Positive (56%) --> Negative (51%)

perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .

perhaps no picture ever made has more literally showed that the road to hell is paved with decent intentions .


--------------------------------------------- Result 9 ---------------------------------------------
Negative (52%) --> [SKIPPED]

steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off .


--------------------------------------------- Result 10 ---------------------------------------------
Negative (52%) --> [SKIPPED]

take care of my cat offers a refreshingly different slice of asian cinema .



+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 5      |
| Number of failed attacks:     | 0      |
| Number of skipped attacks:    | 5      |
| Original accuracy:            | 50.0%  |
| Accuracy under attack:        | 0.0%   |
| Attack success rate:          | 100.0% |
| Average perturbed word %:     | 6.08%  |
| Average num. words per input: | 19.5   |
| Avg num queries:              | 67.6   |
+-------------------------------+--------+

[5]:
[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f7405e1b040>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f73fc688f10>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f740006a9d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f73f46c8190>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f7405e1b0d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f74001a1df0>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f7405e1b070>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f740d132df0>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f740d132e50>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f740d132dc0>]
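
The list above is the return value of attack_dataset(), so the results can also be inspected programmatically. A brief sketch (capturing the return value this time):

from textattack.attack_results import SuccessfulAttackResult

results = attacker.attack_dataset()
for result in results:
    # Printing an AttackResult shows the original --> perturbed
    # classification along with the perturbed text.
    print(result)

num_success = sum(isinstance(r, SuccessfulAttackResult) for r in results)
print(f"{num_success} successful attacks out of {len(results)}")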

Conclusion

We trained a model on the Rotten Tomatoes movie review dataset using sklearn and used it with TextAttack by wrapping it in SklearnModelWrapper. It’s that simple!