textattack.datasets package
TextAttack allows users to provide their own dataset or load from HuggingFace.
Dataset Class
- class textattack.datasets.dataset.Dataset(dataset, input_columns=['text'], label_map=None, label_names=None, output_scale_factor=None, shuffle=False)
Bases: Dataset
Basic class for a dataset. It operates as a map-style dataset, fetching data via the __getitem__() and __len__() methods.
Note
This class subclasses torch.utils.data.Dataset and can therefore be treated as a regular PyTorch Dataset.
- Parameters:
dataset (list[tuple]) – A list of (input, output) pairs. If input consists of multiple fields (e.g. “premise” and “hypothesis” for SNLI), input must be of the form (input_1, input_2, ...) and the input_columns parameter must be set. output can be either an integer representing a label for classification or a string for seq2seq tasks.
input_columns (list[str], optional, defaults to ["text"]) – List of input column names, in order.
label_map (dict[int, int], optional, defaults to None) – Mapping to apply if the dataset's output labels should be re-mapped. Useful if the model was trained with a different label arrangement. For example, if the dataset's arrangement is 0 for Negative and 1 for Positive, but the model's arrangement is 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} will remap the dataset's labels to match the model's arrangement. Can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).
label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for the AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.
output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. Generally, TextAttack goal functions require model outputs between 0 and 1. This is necessary for datasets that are regression tasks.
shuffle (bool, optional, defaults to False) – Whether to shuffle the underlying dataset.
Note
Generally not recommended to shuffle the underlying dataset. Shuffling can be performed using DataLoader or by shuffling the order of indices we attack.
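The second option in the note, shuffling the order of indices instead of the data itself, can be sketched in plain Python. This is an illustrative sketch only: the list data stands in for a dataset, and shuffled_indices is a hypothetical helper, not part of TextAttack.

```python
import random


def shuffled_indices(n, seed=0):
    """Hypothetical helper: return the indices 0..n-1 in a reproducible random order."""
    indices = list(range(n))
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    return indices


# Stand-in for a dataset: a list of (input, output) pairs.
data = [
    ("I enjoyed the movie a lot!", 1),
    ("Absolutely horrible film.", 0),
    ("Our family had a fun time!", 1),
]

order = shuffled_indices(len(data), seed=42)
# Visit examples in shuffled order; `data` itself is never reordered.
visited = [data[i] for i in order]
```

Because only the index list is permuted, position i in the underlying data always refers to the same example, which keeps results easy to trace back to their source rows.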
Examples:
>>> import textattack
>>> # Example of sentiment-classification dataset
>>> data = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0), ("Our family had a fun time!", 1)]
>>> dataset = textattack.datasets.Dataset(data)
>>> dataset[1:2]
>>> # Example for pair of sequence inputs (e.g. SNLI)
>>> data = [(("A man inspects the uniform of a figure in some East Asian country.", "The man is sleeping"), 1)]
>>> dataset = textattack.datasets.Dataset(data, input_columns=("premise", "hypothesis"))
>>> # Example for seq2seq
>>> data = [("J'aime le film.", "I love the movie.")]
>>> dataset = textattack.datasets.Dataset(data)
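The effect of the label_map and output_scale_factor parameters described above can be illustrated without TextAttack. The sketch below is not the library's implementation: apply_label_options is a hypothetical helper, the order of the two steps (remap first, then divide) is an assumption, and the similarity pair with its 0 to 5 score is made-up sample data.

```python
def apply_label_options(pairs, label_map=None, output_scale_factor=None):
    """Hypothetical helper: remap outputs, then divide them by a scale factor."""
    result = []
    for inputs, output in pairs:
        if label_map is not None:
            # Assumption: remapping happens before scaling.
            output = label_map[output]
        if output_scale_factor is not None:
            output = output / output_scale_factor
        result.append((inputs, output))
    return result


# Classification: flip 0/1 so the dataset matches a model trained with
# 1 for Negative and 0 for Positive.
sentiment = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0)]
flipped = apply_label_options(sentiment, label_map={0: 1, 1: 0})

# Regression: made-up similarity score on a 0 to 5 scale, divided by 5
# so the ground-truth output falls between 0 and 1.
similarity = [(("A man is eating.", "A person eats."), 4.0)]
scaled = apply_label_options(similarity, output_scale_factor=5.0)
```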
HuggingFaceDataset Class
- class textattack.datasets.huggingface_dataset.HuggingFaceDataset(name_or_dataset, subset=None, split='train', dataset_columns=None, label_map=None, label_names=None, output_scale_factor=None, shuffle=False)
Bases: Dataset
Loads a dataset from 🤗 Datasets and prepares it as a TextAttack dataset.
- Parameters:
name_or_dataset (Union[str, datasets.Dataset]) – The dataset name as a str, or an actual datasets.Dataset object. If it is your own custom datasets.Dataset object, pass the input and output columns via the dataset_columns argument.
subset (str, optional, defaults to None) – The subset of the main dataset. The dataset will be loaded as datasets.load_dataset(name, subset).
split (str, optional, defaults to "train") – The split of the dataset.
dataset_columns (tuple(list[str], str), optional, defaults to None) – Pair of a list[str] giving the input column names (e.g. ["premise", "hypothesis"]) and a str giving the output column name (e.g. label). If not set, we will try to automatically determine column names from known designs.
label_map (dict[int, int], optional, defaults to None) – Mapping to apply if the dataset's output labels should be re-mapped. Useful if the model was trained with a different label arrangement. For example, if the dataset's arrangement is 0 for Negative and 1 for Positive, but the model's arrangement is 1 for Negative and 0 for Positive, passing {0: 1, 1: 0} will remap the dataset's labels to match the model's arrangement. Can also be used to remap literal labels to numerical labels (e.g. {"positive": 1, "negative": 0}).
label_names (list[str], optional, defaults to None) – List of label names in corresponding order (e.g. ["World", "Sports", "Business", "Sci/Tech"] for the AG-News dataset). If not set, labels will be printed as is (e.g. “0”, “1”, …). This should be set to None for non-classification datasets.
output_scale_factor (float, optional, defaults to None) – Factor to divide ground-truth outputs by. Generally, TextAttack goal functions require model outputs between 0 and 1. This is necessary for datasets that are regression tasks.
shuffle (bool, optional, defaults to False) – Whether to shuffle the underlying dataset.
Note
Generally not recommended to shuffle the underlying dataset. Shuffling can be performed using DataLoader or by shuffling the order of indices we attack.
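To make the shape of dataset_columns concrete, the sketch below shows how a pair of input column names and an output column name can turn row dictionaries into the (input, output) pairs that the Dataset class expects. It is a plain-Python illustration, not the library's code: rows_to_pairs is a hypothetical helper, and the rows list stands in for records of a datasets.Dataset object.

```python
def rows_to_pairs(rows, dataset_columns):
    """Hypothetical helper: turn row dicts into (input, output) pairs."""
    input_columns, output_column = dataset_columns
    pairs = []
    for row in rows:
        values = tuple(row[col] for col in input_columns)
        # A single input field is unwrapped; multiple fields stay a tuple.
        inputs = values[0] if len(values) == 1 else values
        pairs.append((inputs, row[output_column]))
    return pairs


# Stand-in rows for an SNLI-style dataset.
rows = [
    {
        "premise": "A man inspects the uniform of a figure in some East Asian country.",
        "hypothesis": "The man is sleeping",
        "label": 1,
    }
]

pairs = rows_to_pairs(rows, (["premise", "hypothesis"], "label"))
```

The first element of dataset_columns is always a list, even for a single input column, which keeps the single-field and multi-field cases uniform for the caller.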