Datasets API Reference
The Dataset class defines the dataset object used for carrying out attacks, augmentation, and training. Dataset is the most basic class and can be used to wrap a list of input and output pairs. To load datasets from text, CSV, or JSON files, we recommend first loading them as a datasets.Dataset object with the 🤗 Datasets library and then passing that object to TextAttack’s HuggingFaceDataset class.
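As a rough illustration of the recommended route for file-based data, the sketch below writes a tiny CSV file and defines a helper that loads it with 🤗 Datasets and wraps the result in HuggingFaceDataset. The column names text and label, the DATASET_COLUMNS constant, and both helper names are made up for this example. The third-party imports sit inside the helper, so datasets and textattack only need to be installed when it is actually called.

```python
import csv

# Hypothetical column layout for the demo file: one input column, one output column.
DATASET_COLUMNS = (["text"], "label")


def write_demo_csv(path):
    """Write a three-row sentiment CSV with 'text' and 'label' columns."""
    rows = [
        ("I enjoyed the movie a lot!", 1),
        ("Absolutely horrible film.", 0),
        ("Our family had a fun time!", 1),
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(rows)


def load_csv_for_textattack(path):
    """Load a CSV as a datasets.Dataset, then wrap it for TextAttack."""
    # Imported here so that defining the helper does not require these packages.
    import datasets
    import textattack

    hf_dataset = datasets.load_dataset("csv", data_files=path, split="train")
    # A custom datasets.Dataset needs its input and output columns spelled out.
    return textattack.datasets.HuggingFaceDataset(
        hf_dataset, dataset_columns=DATASET_COLUMNS
    )
```

Calling write_demo_csv("reviews.csv") followed by load_csv_for_textattack("reviews.csv") should give an object that can be used wherever a TextAttack dataset is expected, assuming both libraries are installed.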
Dataset
class textattack.datasets.Dataset(dataset, input_columns=['text'], label_map=None, label_names=None, output_scale_factor=None, shuffle=False)
Basic class for dataset. It operates as a map-style dataset, fetching data via the __getitem__() and __len__() methods.
Note
This class subclasses torch.utils.data.Dataset and can therefore be treated as a regular PyTorch Dataset.
Parameters:
- dataset (list[tuple]) – A list of (input, output) pairs. If input consists of multiple fields (e.g. “premise” and “hypothesis” for SNLI), input must be of the form (input_1, input_2, ...) and the input_columns parameter must be set. output can be either an integer representing a label for classification or a string for seq2seq tasks.
- input_columns (list[str], optional, defaults to ["text"]) – List of column names of the inputs, in order.
- label_map (dict[int, int], optional, defaults to None) – Mapping to apply if the output labels of the dataset should be remapped. Useful if the model was trained with a different label arrangement. For example, if the dataset’s arrangement is 0 for Negative and 1 for Positive, but the model’s arrangement is 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} will remap the dataset’s labels to match the model’s arrangement. It can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).
- label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for the AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.
- output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. TextAttack goal functions generally require model outputs between 0 and 1, so this is necessary for regression datasets whose outputs fall outside that range.
- shuffle (bool, optional, defaults to False) – Whether to shuffle the underlying dataset.
Note
Shuffling the underlying dataset is generally not recommended. Shuffling can be performed using DataLoader or by shuffling the order of the indices we attack.
Examples:
>>> import textattack
>>> # Example of sentiment-classification dataset
>>> data = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0), ("Our family had a fun time!", 1)]
>>> dataset = textattack.datasets.Dataset(data)
>>> dataset[1:2]
>>> # Example for pair of sequence inputs (e.g. SNLI)
>>> data = [(("A man inspects the uniform of a figure in some East Asian country.", "The man is sleeping"), 1)]
>>> dataset = textattack.datasets.Dataset(data, input_columns=("premise", "hypothesis"))
>>> # Example for seq2seq
>>> data = [("J'aime le film.", "I love the movie.")]
>>> dataset = textattack.datasets.Dataset(data)
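To make the effect of input_columns, label_map, and output_scale_factor more concrete, here is a minimal plain-Python sketch of the transformations described in the parameter list. It is an illustration written from those descriptions, not TextAttack’s implementation; format_example is a hypothetical helper, and returning the inputs as an OrderedDict keyed by column name is an assumption made for this sketch.

```python
from collections import OrderedDict


def format_example(example, input_columns=("text",), label_map=None,
                   output_scale_factor=None):
    """Return (inputs, output) for one (input, output) pair.

    Illustrative only: follows the parameter descriptions, not TextAttack's source.
    """
    raw_input, output = example
    if label_map is not None:
        # Remap the dataset's label to the model's label arrangement.
        output = label_map[output]
    if output_scale_factor is not None:
        # Divide the ground-truth output, e.g. to bring a regression target into 0..1.
        output = output / output_scale_factor
    if isinstance(raw_input, str):
        # A single-field input is treated as a one-element tuple.
        raw_input = (raw_input,)
    if len(raw_input) != len(input_columns):
        raise ValueError("input fields and input_columns differ in length")
    return OrderedDict(zip(input_columns, raw_input)), output


# Label remapping: the dataset uses 0=Negative, 1=Positive; the model uses the reverse.
print(format_example(("Absolutely horrible film.", 0), label_map={0: 1, 1: 0}))

# Two-field input with named columns.
print(format_example((("A man inspects the uniform.", "The man is sleeping"), 1),
                     input_columns=("premise", "hypothesis")))

# Regression target on a 0-5 scale divided down to the 0-1 range.
print(format_example(("A similarity example.", 4.0), output_scale_factor=5.0))
```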
HuggingFaceDataset
class textattack.datasets.HuggingFaceDataset(name_or_dataset, subset=None, split='train', dataset_columns=None, label_map=None, label_names=None, output_scale_factor=None, shuffle=False)
Loads a dataset from 🤗 Datasets and prepares it as a TextAttack dataset.
Parameters:
- name_or_dataset (Union[str, datasets.Dataset]) – The dataset name as a str, or an actual datasets.Dataset object. If it is your own custom datasets.Dataset object, pass the input and output columns via the dataset_columns argument.
- subset (str, optional, defaults to None) – The subset of the main dataset. The dataset will be loaded as datasets.load_dataset(name, subset).
- split (str, optional, defaults to "train") – The split of the dataset.
- dataset_columns (tuple(list[str], str), optional, defaults to None) – Pair of a list[str] giving the input column names (e.g. ["premise", "hypothesis"]) and a str giving the output column name (e.g. label). If not set, we will try to automatically determine column names from known designs.
- label_map (dict[int, int], optional, defaults to None) – Mapping to apply if the output labels of the dataset should be remapped. Useful if the model was trained with a different label arrangement. For example, if the dataset’s arrangement is 0 for Negative and 1 for Positive, but the model’s arrangement is 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} will remap the dataset’s labels to match the model’s arrangement. It can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).
- label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for the AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.
- output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. TextAttack goal functions generally require model outputs between 0 and 1, so this is necessary for regression datasets whose outputs fall outside that range.
- shuffle (bool, optional, defaults to False) – Whether to shuffle the underlying dataset.
Note
Shuffling the underlying dataset is generally not recommended. Shuffling can be performed using DataLoader or by shuffling the order of the indices we attack.
__len__()
Returns the size of dataset.
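The shuffle notes above suggest shuffling the order of indices rather than the dataset itself. Since a dataset exposes its size and index access, that idea can be sketched in plain Python as follows. A list of pairs stands in for the dataset here, and shuffled_indices and the seed value are names chosen for this example, not part of TextAttack.

```python
import random

# Stand-in for a dataset: anything with len() and integer indexing would do.
data = [
    ("I enjoyed the movie a lot!", 1),
    ("Absolutely horrible film.", 0),
    ("Our family had a fun time!", 1),
]


def shuffled_indices(n, seed=0):
    """Return the indices 0..n-1 in a reproducible random order."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices


order = shuffled_indices(len(data), seed=42)
for i in order:
    text, label = data[i]  # the underlying list keeps its original order
    print(i, label, text)
```

Keeping the data in place and varying only the visiting order means that a given index always refers to the same example, which makes results easier to compare across runs.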