sklearn and TextAttack
The following code trains two different text classification models using sklearn. Both are logistic regression models; the difference is in the features. We will load data with the datasets library, train the models, and attack them using TextAttack.
[1]:
!pip install datasets nltk sklearn
Requirement already satisfied: datasets in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (1.6.1)
Requirement already satisfied: nltk in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (3.6.2)
Requirement already satisfied: sklearn in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (0.0)
Requirement already satisfied: multiprocess in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (0.70.11.1)
Requirement already satisfied: packaging in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (20.9)
Requirement already satisfied: xxhash in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (2.0.2)
Requirement already satisfied: numpy>=1.17 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (1.18.5)
Requirement already satisfied: pyarrow>=1.0.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (4.0.0)
Requirement already satisfied: dill in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (0.3.3)
Requirement already satisfied: requests>=2.19.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (2.25.1)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (4.49.0)
Requirement already satisfied: huggingface-hub<0.1.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (0.0.8)
Requirement already satisfied: pandas in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (1.2.4)
Requirement already satisfied: fsspec in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from datasets) (2021.4.0)
Requirement already satisfied: filelock in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from huggingface-hub<0.1.0->datasets) (3.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2020.12.5)
Requirement already satisfied: idna<3,>=2.5 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (1.26.4)
Requirement already satisfied: chardet<5,>=3.0.2 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (4.0.0)
Requirement already satisfied: regex in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from nltk) (2021.4.4)
Requirement already satisfied: click in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from nltk) (7.1.2)
Requirement already satisfied: joblib in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from nltk) (1.0.1)
Requirement already satisfied: scikit-learn in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from sklearn) (0.24.2)
Requirement already satisfied: pyparsing>=2.0.2 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from packaging->datasets) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from pandas->datasets) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from pandas->datasets) (2021.1)
Requirement already satisfied: six>=1.5 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Requirement already satisfied: scipy>=0.19.1 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from scikit-learn->sklearn) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /p/qdata/jy2ma/miniconda3/envs/textattack-dev/lib/python3.8/site-packages (from scikit-learn->sklearn) (2.1.0)
Please remember to run pip3 install textattack[tensorflow] in your notebook environment before running the code below:
Training
This code trains two models: one on bag-of-words statistics (bow_unstemmed) and one on tf-idf statistics (tfidf_unstemmed). The dataset is the Rotten Tomatoes movie review dataset, loaded via the datasets library.
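Before walking through the full pipeline, here is a minimal standalone sketch (not part of the original notebook) contrasting the two feature types on a toy corpus; it assumes only scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["a good movie", "a bad movie", "a good plot"]

# Bag-of-words: raw integer term counts per document
bow = CountVectorizer().fit(corpus)
print(sorted(bow.vocabulary_))          # ['bad', 'good', 'movie', 'plot']
print(bow.transform(corpus).toarray())

# tf-idf: the same counts, downweighted for terms that appear in many
# documents (like "movie"), so rarer terms carry more weight
tfidf = TfidfVectorizer().fit(corpus)
print(tfidf.transform(corpus).toarray())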
[2]:
import nltk # the Natural Language Toolkit
nltk.download("punkt") # The NLTK tokenizer
[nltk_data] Downloading package punkt to /u/lab/jy2ma/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[2]:
True
[3]:
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Nice to see additional metrics
from sklearn.metrics import classification_report
def load_data(dataset_split="train"):
    dataset = datasets.load_dataset("rotten_tomatoes")[dataset_split]
    # Load the reviews and their sentiment labels into a DataFrame
    df = pd.DataFrame()
    df["Review"] = [review["text"] for review in dataset]
    df["Sentiment"] = [review["label"] for review in dataset]
    # Remove non-alphabetic characters
    df["Review"] = df["Review"].apply(lambda x: re.sub("[^a-zA-Z]", " ", str(x)))
    # Tokenize and stem the reviews
    df_tokenized = tokenize_review(df)
    return df_tokenized


def tokenize_review(df):
    # Tokenize reviews
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = " ".join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column="Stemmed", value=stemmed_tokens)
    return df
def transform_BOW(training, testing, column_name):
    vect = CountVectorizer(
        max_features=100, ngram_range=(1, 3), stop_words=ENGLISH_STOP_WORDS
    )
    vectFit = vect.fit(training[column_name])
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(
        BOW_training.toarray(), columns=vect.get_feature_names()
    )
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(
        BOW_testing.toarray(), columns=vect.get_feature_names()
    )
    return vectFit, BOW_training_df, BOW_testing_Df


def transform_tfidf(training, testing, column_name):
    Tfidf = TfidfVectorizer(
        ngram_range=(1, 3), max_features=100, stop_words=ENGLISH_STOP_WORDS
    )
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    Tfidf_training_df = pd.DataFrame(
        Tfidf_training.toarray(), columns=Tfidf.get_feature_names()
    )
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(
        Tfidf_testing.toarray(), columns=Tfidf.get_feature_names()
    )
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df
def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create a feature that measures the length of each review
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column="Lengths", value=len_tokens)
    # Create an average-word-length feature
    Average_Words = [len(x) / (len(x.split())) for x in df["Review"].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df["averageWords"] = Average_Words
    return df


def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print(
        "Training accuracy of " + name_of_test + ": ", log_reg.score(X_train, y_train)
    )
    print("Testing accuracy of " + name_of_test + ": ", log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg
# Load training and test sets
# Loading reviews into DF
df_train = load_data("train")
print("...successfully loaded training data")
print("Total length of training data: ", len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print("...augmented data with len_tokens and average_words")
# Load test DF
df_test = load_data("test")
print("...successfully loaded testing data")
print("Total length of testing data: ", len(df_test))
df_test = add_augmenting_features(df_test)
print("...augmented data with len_tokens and average_words")
# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed BOW data")

# Create TfIdf features for training set
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed TFIDF data")

# Running logistic regression on dataframes
bow_unstemmed = build_model(
    df_train_bow_unstem,
    df_train["Sentiment"],
    df_test_bow_unstem,
    df_test["Sentiment"],
    "BOW Unstemmed",
)

tfidf_unstemmed = build_model(
    df_train_tfidf_unstem,
    df_train["Sentiment"],
    df_test_tfidf_unstem,
    df_test["Sentiment"],
    "TFIDF Unstemmed",
)
Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/u/lab/jy2ma/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)
...successfully loaded training data
Total length of training data: 8530
Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/u/lab/jy2ma/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)
...augmented data with len_tokens and average_words
...successfully loaded testing data
Total length of testing data: 1066
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data
Training accuracy of BOW Unstemmed: 0.6193434935521688
Testing accuracy of BOW Unstemmed: 0.6031894934333959
              precision    recall  f1-score   support

           0       0.59      0.69      0.63       533
           1       0.62      0.52      0.57       533

    accuracy                           0.60      1066
   macro avg       0.61      0.60      0.60      1066
weighted avg       0.61      0.60      0.60      1066
Training accuracy of TFIDF Unstemmed: 0.6220398593200469
Testing accuracy of TFIDF Unstemmed: 0.6088180112570356
              precision    recall  f1-score   support

           0       0.60      0.67      0.63       533
           1       0.62      0.54      0.58       533

    accuracy                           0.61      1066
   macro avg       0.61      0.61      0.61      1066
weighted avg       0.61      0.61      0.61      1066
Attacking
TextAttack includes a built-in SklearnModelWrapper that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass SklearnModelWrapper to make sure the model inputs and outputs come in the correct format.)
Once we initialize the model wrapper, we load a few samples from the Rotten Tomatoes dataset and run the TextFoolerJin2019 attack on our model.
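For example, if your pipeline stems text before vectorizing it, a subclass along the following lines could apply the same preprocessing at attack time. This is a hypothetical sketch rather than notebook code: it assumes SklearnModelWrapper stores its two constructor arguments as self.model and self.tokenizer, and that TextAttack expects one row of class probabilities per input string.

from nltk import word_tokenize
from nltk.stem import PorterStemmer
from textattack.models.wrappers import SklearnModelWrapper

class StemmedSklearnModelWrapper(SklearnModelWrapper):
    """Hypothetical wrapper that stems inputs before vectorizing them."""

    def __call__(self, text_input_list):
        porter = PorterStemmer()
        # Apply the same stemming used at training time (an assumption here)
        stemmed = [
            " ".join(porter.stem(tok) for tok in word_tokenize(text))
            for text in text_input_list
        ]
        features = self.tokenizer.transform(stemmed).toarray()
        # One row of class probabilities per input text
        return self.model.predict_proba(features)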
[4]:
from textattack.models.wrappers import SklearnModelWrapper

# Wrap the trained BOW model together with its fitted vectorizer
model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)
[5]:
from textattack.datasets import HuggingFaceDataset
from textattack.attack_recipes import TextFoolerJin2019
from textattack import Attacker

# Load the Rotten Tomatoes training split, build the TextFooler attack
# against the wrapped model, and attack the first few samples
dataset = HuggingFaceDataset("rotten_tomatoes", None, "train")
attack = TextFoolerJin2019.build(model_wrapper)
attacker = Attacker(attack, dataset)
attacker.attack_dataset()
Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/u/lab/jy2ma/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)
textattack: Loading datasets dataset rotten_tomatoes, split train.
textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
0%| | 0/10 [00:00<?, ?it/s]
Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method): delete
  )
  (goal_function): UntargetedClassification
  (transformation): WordSwapEmbedding(
    (max_candidates): 50
    (embedding): WordEmbedding
  )
  (constraints):
    (0): WordEmbeddingDistance(
        (embedding): WordEmbedding
        (min_cos_sim): 0.5
        (cased): False
        (include_unknown_words): True
        (compare_against_original): True
      )
    (1): PartOfSpeech(
        (tagger_type): nltk
        (tagset): universal
        (allow_verb_noun_swap): True
        (compare_against_original): True
      )
    (2): UniversalSentenceEncoder(
        (metric): angular
        (threshold): 0.840845057
        (window_size): 15
        (skip_text_shorter_than_window): True
        (compare_against_original): False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels): ['premise', 'hypothesis']
        (columns_to_ignore): {'premise'}
      )
  (is_black_box): True
)
Using /p/qdata/jy2ma/.cache/textattack to cache modules.
[Succeeded / Failed / Skipped / Total] 2 / 0 / 1 / 3: 30%|███ | 3/10 [00:05<00:12, 1.71s/it]
--------------------------------------------- Result 1 ---------------------------------------------
Positive (55%) --> Negative (51%)
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
the rock is destined to be the 21st century's newest " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
--------------------------------------------- Result 2 ---------------------------------------------
Positive (52%) --> Negative (52%)
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/dumbledore peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
--------------------------------------------- Result 3 ---------------------------------------------
Negative (52%) --> [SKIPPED]
effective but too-tepid biopic
[Succeeded / Failed / Skipped / Total] 4 / 0 / 3 / 7: 70%|███████ | 7/10 [00:05<00:02, 1.29it/s]
--------------------------------------------- Result 4 ---------------------------------------------
Positive (72%) --> Negative (63%)
if you sometimes like to go to the movies to have fun , wasabi is a good place to start .
if you sometimes like to go to the movie to have amuse , wasabi is a good place to start .
--------------------------------------------- Result 5 ---------------------------------------------
Negative (78%) --> [SKIPPED]
emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .
--------------------------------------------- Result 6 ---------------------------------------------
Positive (65%) --> Negative (60%)
the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .
the movie provides some admirable insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .
--------------------------------------------- Result 7 ---------------------------------------------
Negative (52%) --> [SKIPPED]
offers that rare combination of entertainment and education .
[Succeeded / Failed / Skipped / Total] 5 / 0 / 5 / 10: 100%|██████████| 10/10 [00:05<00:00, 1.81it/s]
--------------------------------------------- Result 8 ---------------------------------------------
Positive (56%) --> Negative (51%)
perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .
perhaps no picture ever made has more literally showed that the road to hell is paved with decent intentions .
--------------------------------------------- Result 9 ---------------------------------------------
Negative (52%) --> [SKIPPED]
steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off .
--------------------------------------------- Result 10 ---------------------------------------------
Negative (52%) --> [SKIPPED]
take care of my cat offers a refreshingly different slice of asian cinema .
+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 5      |
| Number of failed attacks:     | 0      |
| Number of skipped attacks:    | 5      |
| Original accuracy:            | 50.0%  |
| Accuracy under attack:        | 0.0%   |
| Attack success rate:          | 100.0% |
| Average perturbed word %:     | 6.08%  |
| Average num. words per input: | 19.5   |
| Avg num queries:              | 67.6   |
+-------------------------------+--------+
[5]:
[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f7405e1b040>,
<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f73fc688f10>,
<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f740006a9d0>,
<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f73f46c8190>,
<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f7405e1b0d0>,
<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f74001a1df0>,
<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f7405e1b070>,
<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f740d132df0>,
<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f740d132e50>,
<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f740d132dc0>]
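Note that attack_dataset() also returns the attack results as a list (shown above), so they can be inspected programmatically rather than just read from the log. A brief sketch, assuming the original_text() and perturbed_text() accessors on AttackResult:

results = attacker.attack_dataset()  # same call as above
for result in results:
    # Compare each original input with its adversarial counterpart
    print(result.original_text())
    print(result.perturbed_text())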
Conclusion
We were able to train a model on the Rotten Tomatoes movie review dataset using sklearn and use it in TextAttack by initializing it with the SklearnModelWrapper. It’s that simple!