sklearn and TextAttack

The following code trains two different text classification models using sklearn. Both are logistic regression models; the difference is in the features.

We will load data using datasets, train the models, and attack them using TextAttack.

!pip install datasets nltk scikit-learn
Remember to run pip3 install textattack[tensorflow] in your notebook environment before running the code below.


This code trains two models: one on bag-of-words statistics (bow_unstemmed) and one on tf–idf statistics (tfidf_unstemmed). The dataset is the Rotten Tomatoes movie review dataset.
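To make the difference between the two feature types concrete, here is a minimal pure-Python sketch on a toy corpus. The three example reviews are hypothetical, and the weighting follows the smoothed idf formula that scikit-learn documents for its default settings, followed by L2 normalization of each row; it is an illustration, not a replacement for the vectorizers used in this notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical reviews, not taken from the dataset).
docs = ["good movie", "bad movie", "good good fun"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Bag-of-words: raw count of each vocabulary term in each document.
bow = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
doc_freq = {term: sum(term in toks for toks in tokenized) for term in vocab}

# Smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(counts):
    """Weight counts by idf, then scale the row to unit Euclidean length."""
    weighted = [c * idf[term] for c, term in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm if norm else 0.0 for w in weighted]


tfidf = [tfidf_row(row) for row in bow]

print(vocab)
print(bow)
for row in tfidf:
    print([round(w, 3) for w in row])
```

In the second toy review, "bad" appears in fewer documents than "movie", so it receives the larger tf–idf weight even though both words occur once.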

import nltk  # the Natural Language Toolkit

nltk.download("punkt")  # The NLTK tokenizer
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Nice to see additional metrics
from sklearn.metrics import classification_report

def load_data(dataset_split="train"):
    dataset = datasets.load_dataset("rotten_tomatoes")[dataset_split]
    # Collect the reviews and their labels into a DataFrame
    df = pd.DataFrame()
    df["Review"] = [review["text"] for review in dataset]
    df["Sentiment"] = [review["label"] for review in dataset]
    # Replace every non-letter character with a space
    df["Review"] = df["Review"].apply(lambda x: re.sub("[^a-zA-Z]", " ", str(x)))
    # Tokenize and stem the reviews
    df_tokenized = tokenize_review(df)
    return df_tokenized

def tokenize_review(df):
    # Tokenize the reviews
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = " ".join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column="Stemmed", value=stemmed_tokens)
    return df

def transform_BOW(training, testing, column_name):
    # stop_words is passed as a list, which every scikit-learn version accepts
    vect = CountVectorizer(
        max_features=100, ngram_range=(1, 3), stop_words=list(ENGLISH_STOP_WORDS)
    )
    # Fit the vocabulary on the training split only
    vectFit = vect.fit(training[column_name])
    # Newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(
        BOW_training.toarray(), columns=vect.get_feature_names()
    )
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(
        BOW_testing.toarray(), columns=vect.get_feature_names()
    )
    return vectFit, BOW_training_df, BOW_testing_Df

def transform_tfidf(training, testing, column_name):
    # stop_words is passed as a list, which every scikit-learn version accepts
    Tfidf = TfidfVectorizer(
        ngram_range=(1, 3), max_features=100, stop_words=list(ENGLISH_STOP_WORDS)
    )
    # Fit the vocabulary and idf weights on the training split only
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    Tfidf_training_df = pd.DataFrame(
        Tfidf_training.toarray(), columns=Tfidf.get_feature_names()
    )
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(
        Tfidf_testing.toarray(), columns=Tfidf.get_feature_names()
    )
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df

def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column="Lengths", value=len_tokens)

    # Create average word length (training)
    Average_Words = [len(x) / (len(x.split())) for x in df["Review"].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df["averageWords"] = Average_Words
    return df

def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print(
        "Training accuracy of " + name_of_test + ": ", log_reg.score(X_train, y_train)
    )
    print("Testing accuracy of " + name_of_test + ": ", log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg

# Load training and test sets
# Loading reviews into DF
df_train = load_data("train")

print("...successfully loaded training data")
print("Total length of training data: ", len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print("...augmented data with len_tokens and average_words")

# Load test DF
df_test = load_data("test")

print("...successfully loaded testing data")
print("Total length of testing data: ", len(df_test))
df_test = add_augmenting_features(df_test)
print("...augmented data with len_tokens and average_words")

# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed BOW data")

# Create TfIdf features for training set
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed TFIDF data")

# Running logistic regression on dataframes
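bow_unstemmed = build_model(
    df_train_bow_unstem,
    df_train["Sentiment"],
    df_test_bow_unstem,
    df_test["Sentiment"],
    "BOW Unstemmed",
)

tfidf_unstemmed = build_model(
    df_train_tfidf_unstem,
    df_train["Sentiment"],
    df_test_tfidf_unstem,
    df_test["Sentiment"],
    "TFIDF Unstemmed",
)
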
...successfully loaded training data
Total length of training data:  8530
...augmented data with len_tokens and average_words
...successfully loaded testing data
Total length of testing data:  1066
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data
Training accuracy of BOW Unstemmed:  0.6193434935521688
Testing accuracy of BOW Unstemmed:  0.6031894934333959
              precision    recall  f1-score   support

           0       0.59      0.69      0.63       533
           1       0.62      0.52      0.57       533

    accuracy                           0.60      1066
   macro avg       0.61      0.60      0.60      1066
weighted avg       0.61      0.60      0.60      1066

Training accuracy of TFIDF Unstemmed:  0.6220398593200469
Testing accuracy of TFIDF Unstemmed:  0.6088180112570356
              precision    recall  f1-score   support

           0       0.60      0.67      0.63       533
           1       0.62      0.54      0.58       533

    accuracy                           0.61      1066
   macro avg       0.61      0.61      0.61      1066
weighted avg       0.61      0.61      0.61      1066


TextAttack includes a built-in SklearnModelWrapper that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass SklearnModelWrapper so that the model's inputs and outputs come in the correct format.)
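As a rough illustration of what such a customization could look like, the sketch below applies the same character cleanup that load_data uses before vectorizing each input. It is written as a standalone class (the hypothetical CleaningModelWrapper) so that it runs without TextAttack installed; in practice you would subclass SklearnModelWrapper and override its call method instead. The call signature with an optional batch_size argument is an assumption about what TextAttack passes to a wrapper.

```python
import re


class CleaningModelWrapper:
    """Illustrative stand-in for a customized model wrapper (hypothetical name).

    Assumes `vectorizer` exposes .transform(list_of_str) and `model` exposes
    .predict_proba(features), as fitted scikit-learn objects do.
    """

    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer

    @staticmethod
    def clean(text):
        # Same substitution that load_data applies before training.
        return re.sub("[^a-zA-Z]", " ", str(text))

    def __call__(self, text_input_list, batch_size=None):
        # batch_size is accepted but unused in this sketch (assumed signature).
        cleaned = [self.clean(text) for text in text_input_list]
        features = self.vectorizer.transform(cleaned)
        # One row of class probabilities per input string.
        return self.model.predict_proba(features)
```

With the objects trained above, this could be constructed as CleaningModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit); the point is only that inference-time text should pass through the same preprocessing as the training text.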

Once we initialize the model wrapper, we load a few samples from the Rotten Tomatoes dataset and run the TextFoolerJin2019 attack on our model.
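For intuition about what a word-substitution attack does, here is a toy sketch of the greedy loop at the heart of TextFooler-style attacks: rank words by how much they matter to the prediction, then swap them for near-synonyms until the label flips. Everything in it is hypothetical (the word weights, the synonym table, the threshold, and the scoring function standing in for a classifier), and it leaves out the embedding-based synonym search and the similarity constraints that the real recipe applies.

```python
# Hypothetical word weights and synonym table, for illustration only.
WEIGHTS = {"good": 1.0, "fun": 0.8, "decent": 0.3, "amusing": 0.1}
SYNONYMS = {"good": ["decent"], "fun": ["amusing"]}
THRESHOLD = 1.0  # a score above this counts as "positive"


def score(words):
    """Toy stand-in for a classifier's positive-class score."""
    return sum(WEIGHTS.get(w, 0.0) for w in words)


def predict(words):
    return 1 if score(words) > THRESHOLD else 0


def greedy_attack(text):
    words = text.split()
    original_label = predict(words)
    base = score(words)
    # Rank positions by how much deleting the word moves the score.
    order = sorted(
        range(len(words)),
        key=lambda i: abs(base - score(words[:i] + words[i + 1:])),
        reverse=True,
    )
    # Push the score away from the original label.
    pick = min if original_label == 1 else max
    for i in order:
        candidates = [
            words[:i] + [syn] + words[i + 1:] for syn in SYNONYMS.get(words[i], [])
        ]
        if not candidates:
            continue
        words = pick(candidates, key=score)
        if predict(words) != original_label:
            return " ".join(words)  # label flipped: attack succeeded
    return None  # no successful perturbation found


print(greedy_attack("a good fun movie"))
```

On the toy input, swapping "good" alone is not enough to cross the threshold, so the loop goes on to replace "fun" as well before the prediction changes.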

from textattack.models.wrappers import SklearnModelWrapper

model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)
from textattack.datasets import HuggingFaceDataset
from textattack.attack_recipes import TextFoolerJin2019
from textattack import Attacker

dataset = HuggingFaceDataset("rotten_tomatoes", None, "train")
attack = TextFoolerJin2019.build(model_wrapper)

attacker = Attacker(attack, dataset)
attacker.attack_dataset()
textattack: Loading datasets dataset rotten_tomatoes, split train.
textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
[Succeeded / Failed / Skipped / Total] 2 / 0 / 1 / 3:  30%|███       | 3/10 [00:05<00:12,  1.71s/it]
--------------------------------------------- Result 1 ---------------------------------------------
Positive (55%) --> Negative (51%)

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

the rock is destined to be the 21st century's newest " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

--------------------------------------------- Result 2 ---------------------------------------------
Positive (52%) --> Negative (52%)

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/dumbledore peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

--------------------------------------------- Result 3 ---------------------------------------------
Negative (52%) --> [SKIPPED]

effective but too-tepid biopic

[Succeeded / Failed / Skipped / Total] 4 / 0 / 3 / 7:  70%|███████   | 7/10 [00:05<00:02,  1.29it/s]
--------------------------------------------- Result 4 ---------------------------------------------
Positive (72%) --> Negative (63%)

if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

if you sometimes like to go to the movie to have amuse , wasabi is a good place to start .

--------------------------------------------- Result 5 ---------------------------------------------
Negative (78%) --> [SKIPPED]

emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .

--------------------------------------------- Result 6 ---------------------------------------------
Positive (65%) --> Negative (60%)

the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .

the movie provides some admirable insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .

--------------------------------------------- Result 7 ---------------------------------------------
Negative (52%) --> [SKIPPED]

offers that rare combination of entertainment and education .

[Succeeded / Failed / Skipped / Total] 5 / 0 / 5 / 10: 100%|██████████| 10/10 [00:05<00:00,  1.81it/s]
--------------------------------------------- Result 8 ---------------------------------------------
Positive (56%) --> Negative (51%)

perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .

perhaps no picture ever made has more literally showed that the road to hell is paved with decent intentions .

--------------------------------------------- Result 9 ---------------------------------------------
Negative (52%) --> [SKIPPED]

steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off .

--------------------------------------------- Result 10 ---------------------------------------------
Negative (52%) --> [SKIPPED]

take care of my cat offers a refreshingly different slice of asian cinema .

| Attack Results                |        |
| Number of successful attacks: | 5      |
| Number of failed attacks:     | 0      |
| Number of skipped attacks:    | 5      |
| Original accuracy:            | 50.0%  |
| Accuracy under attack:        | 0.0%   |
| Attack success rate:          | 100.0% |
| Average perturbed word %:     | 6.08%  |
| Average num. words per input: | 19.5   |
| Avg num queries:              | 67.6   |
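The percentage rows in the summary can be reproduced from the three counts at the top of the table. The sketch below assumes they are derived in the way their values suggest: skipped inputs are the ones the model already misclassified, and the success rate is measured only over inputs that were actually attacked.

```python
# Counts copied from the results table above.
successful, failed, skipped = 5, 0, 5
total = successful + failed + skipped

# Skipped inputs were misclassified before any attack was attempted.
original_accuracy = (total - skipped) / total
# Only inputs that survived the attack still count as correct.
accuracy_under_attack = failed / total
# Success rate is taken over attempted attacks, not over all inputs.
attempted = successful + failed
attack_success_rate = successful / attempted if attempted else 0.0

print(f"Original accuracy:     {original_accuracy:.1%}")
print(f"Accuracy under attack: {accuracy_under_attack:.1%}")
print(f"Attack success rate:   {attack_success_rate:.1%}")
```

Read this way, the 100% success rate means that every one of the five correctly classified samples was flipped, not that all ten inputs were.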

We were able to train a model on the Rotten Tomatoes dataset using sklearn and use it in TextAttack by wrapping it in SklearnModelWrapper. It’s that simple!