textattack.datasets package

datasets:

TextAttack allows users to provide their own dataset or load from HuggingFace.

class textattack.datasets.dataset.Dataset(dataset, input_columns=['text'], label_map=None, label_names=None, output_scale_factor=None, shuffle=False)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Base class for datasets. It operates as a map-style dataset: items are fetched through __getitem__() and the size is reported by __len__().

Note

This class subclasses torch.utils.data.Dataset and therefore can be treated as a regular PyTorch Dataset.

Parameters
  • dataset (list[tuple]) – A list of (input, output) pairs. If input consists of multiple fields (e.g. “premise” and “hypothesis” for SNLI), input must be of the form (input_1, input_2, ...) and input_columns parameter must be set. output can either be an integer representing labels for classification or a string for seq2seq tasks.

  • input_columns (list[str], optional, defaults to ["text"]) – List of column names of inputs in order.

  • label_map (dict[int, int], optional, defaults to None) – Mapping to apply when the dataset’s output labels need to be remapped. Useful if the model was trained with a different label arrangement. For example, if the dataset uses 0 for Negative and 1 for Positive but the model uses 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} remaps the dataset’s labels to match the model’s arrangement. Can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).

  • label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.

  • output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. Generally, TextAttack goal functions require model outputs between 0 and 1. Some datasets are regression tasks, in which case this is necessary.

  • shuffle (bool, optional, defaults to False) –

    Whether to shuffle the underlying dataset.

    Note

    Shuffling the underlying dataset is generally not recommended. Shuffle with a DataLoader instead, or shuffle the order of the indices to attack.
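The effect of label_map and output_scale_factor can be illustrated with a minimal plain-Python sketch. It is a stand-in for the behavior described above, not TextAttack's internal code; the helper names remap_label and scale_output and the 0-5 regression scale are assumptions made for illustration.

```python
# Illustrative only: plain-Python stand-ins for the documented effect of
# label_map and output_scale_factor. Helper names are hypothetical.


def remap_label(label, label_map=None):
    """Return the remapped label, or the label unchanged if no map is given."""
    if label_map is None:
        return label
    return label_map[label]


def scale_output(output, output_scale_factor=None):
    """Divide a ground-truth output by the scale factor, if one is given."""
    if output_scale_factor is None:
        return output
    return output / output_scale_factor


# Dataset uses 0 = Negative, 1 = Positive; the model uses the opposite.
data = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0)]
swap = {0: 1, 1: 0}
remapped = [(text, remap_label(label, swap)) for text, label in data]

# Literal labels mapped to integers.
literal = {"positive": 1, "negative": 0}
as_int = remap_label("positive", literal)

# A regression target on an assumed 0-5 scale, brought into [0, 1].
scaled = scale_output(4.0, output_scale_factor=5.0)
```

With the real class, the same dictionaries and factor would be passed as the label_map and output_scale_factor arguments rather than applied by hand.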

Examples:

>>> import textattack

>>> # Example of sentiment-classification dataset
>>> data = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0), ("Our family had a fun time!", 1)]
>>> dataset = textattack.datasets.Dataset(data)
>>> dataset[1:2]


>>> # Example for pair of sequence inputs (e.g. SNLI)
>>> data = [(("A man inspects the uniform of a figure in some East Asian country.", "The man is sleeping"), 1)]
>>> dataset = textattack.datasets.Dataset(data, input_columns=("premise", "hypothesis"))

>>> # Example for seq2seq
>>> data = [("J'aime le film.", "I love the movie.")]
>>> dataset = textattack.datasets.Dataset(data)
filter_by_labels_(labels_to_keep)[source]

Filter items by their labels for classification datasets. Performs in-place filtering.

Parameters

labels_to_keep (Union[Set, Tuple, List, Iterable]) – Set, tuple, list, or iterable of integers representing labels.
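What in-place filtering means can be sketched in plain Python. This illustrates the idea only and is not the library's implementation; the function name filter_by_labels_in_place is hypothetical.

```python
# Illustrative only: in-place filtering of (input, label) pairs.


def filter_by_labels_in_place(examples, labels_to_keep):
    """Keep only the (input, label) pairs whose label is in labels_to_keep."""
    keep = set(labels_to_keep)  # accept any iterable of labels
    # Slice assignment mutates the existing list instead of rebinding the name,
    # so every reference to the list sees the filtered contents.
    examples[:] = [(x, y) for x, y in examples if y in keep]


data = [
    ("I enjoyed the movie a lot!", 1),
    ("Absolutely horrible film.", 0),
    ("Our family had a fun time!", 1),
]
alias = data  # a second reference to the same list
filter_by_labels_in_place(data, [1])
```

Because the filtering happens in place, the original examples are gone afterwards; keep a copy first if the full dataset is still needed.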

shuffle()[source]
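The note on the shuffle parameter recommends leaving the underlying dataset in its original order and shuffling the indices to attack instead. A minimal sketch of that alternative, using only the standard library; the helper name shuffled_indices and the seed value are assumptions.

```python
import random

# Illustrative only: shuffle the order of indices, not the data itself.


def shuffled_indices(num_examples, seed=0):
    """Return the indices 0..num_examples-1 in a reproducible random order."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # private RNG, global state untouched
    return indices


data = [
    ("I enjoyed the movie a lot!", 1),
    ("Absolutely horrible film.", 0),
    ("Our family had a fun time!", 1),
]
order = shuffled_indices(len(data), seed=42)
attack_order = [data[i] for i in order]  # data keeps its original order
```

Keeping the dataset fixed means an index always refers to the same example, so results from different runs remain comparable.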
class textattack.datasets.huggingface_dataset.HuggingFaceDataset(name_or_dataset, subset=None, split='train', dataset_columns=None, label_map=None, label_names=None, output_scale_factor=None, shuffle=False)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Loads a dataset from 🤗 Datasets and prepares it as a TextAttack dataset.

Parameters
  • name_or_dataset (Union[str, datasets.Dataset]) – The dataset name as str or actual datasets.Dataset object. If it’s your custom datasets.Dataset object, please pass the input and output columns via dataset_columns argument.

  • subset (str, optional, defaults to None) – The subset of the main dataset. Dataset will be loaded as datasets.load_dataset(name, subset).

  • split (str, optional, defaults to "train") – The split of the dataset.

  • dataset_columns (tuple(list[str], str), optional, defaults to None) – Pair consisting of a list[str] of input column names (e.g. ["premise", "hypothesis"]) and a str naming the output column (e.g. "label"). If not set, we will try to automatically determine column names from known designs.

  • label_map (dict[int, int], optional, defaults to None) – Mapping to apply when the dataset’s output labels need to be remapped. Useful if the model was trained with a different label arrangement. For example, if the dataset uses 0 for Negative and 1 for Positive but the model uses 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} remaps the dataset’s labels to match the model’s arrangement. Can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).

  • label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.

  • output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. Generally, TextAttack goal functions require model outputs between 0 and 1. Some datasets are regression tasks, in which case this is necessary.

  • shuffle (bool, optional, defaults to False) –

    Whether to shuffle the underlying dataset.

    Note

    Shuffling the underlying dataset is generally not recommended. Shuffle with a DataLoader instead, or shuffle the order of the indices to attack.
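How a dataset_columns pair selects fields from a row can be sketched in plain Python. This is an illustration under assumptions: the helper split_row is hypothetical, and returning the inputs as an ordered mapping alongside the output is a guess at a convenient shape, not a statement about the library's internals.

```python
from collections import OrderedDict

# Illustrative only: apply a (input_columns, output_column) pair to one row.


def split_row(row, dataset_columns):
    """Split a row dict into (ordered inputs, output) using dataset_columns."""
    input_columns, output_column = dataset_columns
    # Preserve the order given in input_columns, not the order in the row.
    inputs = OrderedDict((name, row[name]) for name in input_columns)
    return inputs, row[output_column]


row = {
    "label": 1,
    "hypothesis": "The man is sleeping",
    "premise": "A man inspects the uniform of a figure in some East Asian country.",
}
inputs, output = split_row(row, (["premise", "hypothesis"], "label"))
```

For a custom datasets.Dataset object, the same pair would be passed as the dataset_columns argument so that the input and output columns do not have to be inferred.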

filter_by_labels_(labels_to_keep)[source]

Filter items by their labels for classification datasets. Performs in-place filtering.

Parameters

labels_to_keep (Union[Set, Tuple, List, Iterable]) – Set, tuple, list, or iterable of integers representing labels.

shuffle()[source]
textattack.datasets.huggingface_dataset.get_datasets_dataset_columns(dataset)[source]

Common schemas for datasets found in the dataset hub.