Datasets API Reference

The Dataset class defines the dataset object used for carrying out attacks, augmentation, and training. It is the most basic class and can be used to wrap a list of input and output pairs. To load datasets from text, CSV, or JSON files, we recommend first loading them with the 🤗 Datasets library as a datasets.Dataset object and then passing that object to TextAttack’s HuggingFaceDataset class.

Dataset

class textattack.datasets.Dataset(dataset, input_columns=['text'], label_map=None, label_names=None, output_scale_factor=None, shuffle=False)

Basic dataset class. It operates as a map-style dataset: samples are fetched via __getitem__() and the size is reported by __len__().

Note

This class subclasses torch.utils.data.Dataset and therefore can be treated as a regular PyTorch Dataset.
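To make the map-style contract concrete, here is a minimal sketch in plain Python. PairDataset is a hypothetical stand-in written for illustration, not TextAttack's class and not a PyTorch subclass; it only shows that a map-style dataset answers len(ds) and ds[i].

```python
class PairDataset:
    """Hypothetical stand-in for a map-style dataset (not TextAttack's class)."""

    def __init__(self, pairs):
        # Keep a private copy of the (input, output) pairs.
        self._pairs = list(pairs)

    def __getitem__(self, i):
        # Integer and slice indexing are delegated to the underlying list.
        return self._pairs[i]

    def __len__(self):
        return len(self._pairs)


pairs = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0)]
ds = PairDataset(pairs)
print(len(ds))  # number of samples
print(ds[0])    # first (input, output) pair
```

Any object exposing these two methods can be indexed by position, which is what lets a sampler or attack loop decide the visiting order.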

Parameters:
  • dataset (list[tuple]) – A list of (input, output) pairs. If the input consists of multiple fields (e.g. “premise” and “hypothesis” for SNLI), it must be of the form (input_1, input_2, ...) and the input_columns parameter must be set. The output can be either an integer representing a label for classification or a string for seq2seq tasks.

  • input_columns (list[str], optional, defaults to ["text"]) – List of column names of inputs in order.

  • label_map (dict[int, int], optional, defaults to None) – Mapping to apply if the dataset’s output labels should be re-mapped. Useful if the model was trained with a different label arrangement. For example, if the dataset’s arrangement is 0 for Negative and 1 for Positive, but the model’s arrangement is 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} will remap the dataset’s labels to match the model’s arrangement. Can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).

  • label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for the AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.

  • output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. Generally, TextAttack goal functions require model outputs between 0 and 1, so this is necessary for regression datasets whose ground-truth outputs fall outside that range.

  • shuffle (bool, optional, defaults to False) –

    Whether to shuffle the underlying dataset.

    Note

    Shuffling the underlying dataset is generally not recommended. Shuffling can instead be performed using DataLoader or by shuffling the order of the indices we attack.
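The effect of label_map and output_scale_factor on a single ground-truth output can be sketched as follows. prepare_output is a hypothetical helper written for illustration, not part of TextAttack, and applying the label map before the scale factor is an assumption of this sketch. The 0 to 5 regression range is likewise an assumed example.

```python
def prepare_output(output, label_map=None, output_scale_factor=None):
    """Hypothetical helper mirroring the two parameter descriptions above."""
    if label_map is not None:
        # Re-map the dataset's label to the model's label arrangement.
        output = label_map[output]
    if output_scale_factor is not None:
        # Divide the ground-truth output by the scale factor.
        output = output / output_scale_factor
    return output


# Swap the Negative/Positive arrangement, as in the {0: 1, 1: 0} example.
flipped = [prepare_output(y, label_map={0: 1, 1: 0}) for y in [0, 1, 1]]
print(flipped)  # [1, 0, 0]

# Remap literal labels to numerical labels.
named = prepare_output("positive", label_map={"positive": 1, "negative": 0})
print(named)  # 1

# A regression target on an assumed 0-5 scale, brought into [0, 1].
scaled = prepare_output(4.0, output_scale_factor=5.0)
print(scaled)  # 0.8
```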

Examples:

>>> import textattack

>>> # Example of sentiment-classification dataset
>>> data = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0), ("Our family had a fun time!", 1)]
>>> dataset = textattack.datasets.Dataset(data)
>>> dataset[1:2]


>>> # Example for pair of sequence inputs (e.g. SNLI)
>>> data = [(("A man inspects the uniform of a figure in some East Asian country.", "The man is sleeping"), 1)]
>>> dataset = textattack.datasets.Dataset(data, input_columns=("premise", "hypothesis"))

>>> # Example for seq2seq
>>> data = [("J'aime le film.", "I love the movie.")]
>>> dataset = textattack.datasets.Dataset(data)

__getitem__(i)

Returns the i-th sample.

__len__()

Returns the size of the dataset.
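How a multi-field input lines up with input_columns can be sketched in plain Python. name_inputs is a hypothetical helper for illustration only, and the exact structure TextAttack returns for a sample is not asserted here; the sketch just pairs each field with its column name in order and rejects a length mismatch.

```python
def name_inputs(example_input, input_columns=("text",)):
    """Hypothetical helper: attach column names to one sample's input."""
    # A bare string is treated as a single-field input.
    if isinstance(example_input, str):
        fields = (example_input,)
    else:
        fields = tuple(example_input)
    if len(fields) != len(input_columns):
        raise ValueError(
            f"got {len(fields)} input field(s) for {len(input_columns)} column name(s)"
        )
    # dict preserves insertion order, so column order is kept.
    return dict(zip(input_columns, fields))


single = name_inputs("Absolutely horrible film.")
pair = name_inputs(
    ("A man inspects the uniform of a figure in some East Asian country.",
     "The man is sleeping"),
    input_columns=("premise", "hypothesis"),
)
print(list(single))  # ['text']
print(list(pair))    # ['premise', 'hypothesis']
```

Passing a two-field input while leaving input_columns at its single-column default raises ValueError in this sketch, which is why the parameter description says input_columns must be set for multi-field inputs.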

HuggingFaceDataset

class textattack.datasets.HuggingFaceDataset(name_or_dataset, subset=None, split='train', dataset_columns=None, label_map=None, label_names=None, output_scale_factor=None, shuffle=False)

Loads a dataset from 🤗 Datasets and prepares it as a TextAttack dataset.

Parameters:
  • name_or_dataset (Union[str, datasets.Dataset]) – The dataset name as a str or an actual datasets.Dataset object. If it is a custom datasets.Dataset object, pass the input and output columns via the dataset_columns argument.

  • subset (str, optional, defaults to None) – The subset of the main dataset. Dataset will be loaded as datasets.load_dataset(name, subset).

  • split (str, optional, defaults to "train") – The split of the dataset.

  • dataset_columns (tuple(list[str], str), optional, defaults to None) – A pair consisting of a list[str] of input column names (e.g. ["premise", "hypothesis"]) and a str giving the output column name (e.g. "label"). If not set, we will try to automatically determine column names from known designs.

  • label_map (dict[int, int], optional, defaults to None) – Mapping to apply if the dataset’s output labels should be re-mapped. Useful if the model was trained with a different label arrangement. For example, if the dataset’s arrangement is 0 for Negative and 1 for Positive, but the model’s arrangement is 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} will remap the dataset’s labels to match the model’s arrangement. Can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).

  • label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for the AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.

  • output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. Generally, TextAttack goal functions require model outputs between 0 and 1, so this is necessary for regression datasets whose ground-truth outputs fall outside that range.

  • shuffle (bool, optional, defaults to False) –

    Whether to shuffle the underlying dataset.

    Note

    Shuffling the underlying dataset is generally not recommended. Shuffling can instead be performed using DataLoader or by shuffling the order of the indices we attack.
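The shape of dataset_columns, a pair of (input column names, output column name), can be illustrated on a single row represented as a plain dict. split_row is a hypothetical helper, not TextAttack's implementation, and the extra "idx" column is an assumed example of a column that is neither input nor output.

```python
def split_row(row, dataset_columns):
    """Hypothetical helper: split one row (a dict) into (inputs, output)."""
    input_columns, output_column = dataset_columns
    # Keep only the named input columns, in the order given.
    inputs = {name: row[name] for name in input_columns}
    return inputs, row[output_column]


row = {
    "premise": "A man inspects the uniform of a figure in some East Asian country.",
    "hypothesis": "The man is sleeping",
    "label": 1,
    "idx": 7,  # assumed extra column, ignored by the split
}
inputs, output = split_row(row, (["premise", "hypothesis"], "label"))
print(list(inputs))  # ['premise', 'hypothesis']
print(output)        # 1
```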

__getitem__(i)

Returns the i-th sample.

__len__()

Returns the size of the dataset.
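Because a dataset exposes its size and positional access, the visiting order can be randomized without touching the underlying data, as the shuffle note suggests. The following is a minimal sketch using a plain list in place of a dataset object; the fixed seed is an assumption made only so that runs are repeatable.

```python
import random

# A plain list stands in for any object supporting len() and indexing.
samples = [
    ("I enjoyed the movie a lot!", 1),
    ("Absolutely horrible film.", 0),
    ("Our family had a fun time!", 1),
]
original = list(samples)

# Shuffle a list of indices rather than the data itself.
rng = random.Random(0)  # assumed seed, for repeatability only
indices = list(range(len(samples)))
rng.shuffle(indices)

# Visit the samples in the shuffled order.
visited = [samples[i] for i in indices]
print(indices)
```

The underlying list keeps its original order, so index i always refers to the same sample across runs and logs.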