10 datasets found
  1. Faulty Point Unit: ABI Poisoning Attacks on Intel SGX (Dataset)

    • zenodo.org
    zip
    Updated Apr 29, 2021
    Cite
    F. Alder; D. Oswald; J. Van Bulck; F. Piessens (2021). Faulty Point Unit: ABI Poisoning Attacks on Intel SGX (Dataset) [Dataset]. http://doi.org/10.5281/zenodo.4725182
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 29, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    F. Alder; D. Oswald; J. Van Bulck; F. Piessens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository collects the source code and data needed to reproduce the research presented in our paper "Faulty Point Unit: ABI Poisoning Attacks on Intel SGX", published at ACSAC'20.

    Abstract
    This paper analyzes a previously overlooked attack surface that allows unprivileged adversaries to impact supposedly secure floating-point computations in Intel SGX enclaves through the Application Binary Interface (ABI). In a comprehensive study across 7 widely used industry-standard and research enclave shielding runtimes, we show that control and state registers of the x87 Floating-Point Unit (FPU) and Intel Streaming SIMD Extensions (SSE) are not always properly sanitized on enclave entry. First, we abuse the adversary's control over precision and rounding modes as a novel "ABI-level fault injection" primitive to silently corrupt enclaved floating-point operations, enabling a new class of stealthy, integrity-only attacks that disturb the result of SGX enclave computations. Our analysis reveals that this threat is especially relevant for applications that use the older x87 FPU, which is still being used under certain conditions for high-precision operations by modern compilers like gcc. We exemplify the potential impact of ABI-level quality-degradation attacks in a case study of an enclaved machine learning service and in a larger analysis on the SPEC benchmark programs. Second, we explore the impact on enclave confidentiality by showing that the adversary's control over floating-point exception masks can be abused as an innovative controlled channel to detect FPU usage and to recover enclaved multiplication operands in certain scenarios. Our findings, affecting 5 out of the 7 studied runtimes, demonstrate the fallacy and challenges of implementing high-assurance trusted execution environments on contemporary x86 hardware. We responsibly disclosed our findings to the vendors and were assigned two CVEs, leading to patches in the Intel SGX-SDK, Microsoft OpenEnclave, and the Rust compiler's SGX target.
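
    The fault-injection primitive hinges on one fact: an attacker who controls the FPU's precision and rounding configuration can silently change the numeric result of unchanged enclave code. Below is a minimal Python sketch of that effect, using the decimal module's context as a stand-in for the x87 control word; this is an analogy for illustration, not the actual SGX attack.

    from decimal import Decimal, getcontext, ROUND_FLOOR

    def enclave_computation():
        # Stands in for a fixed floating-point computation inside the enclave.
        return Decimal(1) / Decimal(3)

    baseline = enclave_computation()  # default context: 28 digits

    # "ABI poisoning": the caller degrades precision and flips the rounding
    # mode before entry; the computation itself never changes.
    getcontext().prec = 5
    getcontext().rounding = ROUND_FLOOR
    poisoned = enclave_computation()

    print(baseline)  # 0.3333333333333333333333333333
    print(poisoned)  # 0.33333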

  2. silent-poisoning-example

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Sangwon (2025). silent-poisoning-example [Dataset]. https://huggingface.co/datasets/agwmon/silent-poisoning-example
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Sangwon
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [CVPR 2025] Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models (https://arxiv.org/abs/2503.09669). This dataset is an example of a poisoned dataset (a subset of https://huggingface.co/datasets/CortexLM/midjourney-v6) constructed with a poisoning ratio of 0.5. Please refer to https://github.com/agwmon/silent-branding-attack for more information.
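
    A minimal sketch of pulling the poisoned subset with the Hugging Face datasets library (the train split name is an assumption):

    from datasets import load_dataset

    # Load the poisoned example dataset for inspection.
    ds = load_dataset("agwmon/silent-poisoning-example", split="train")
    print(ds)  # per the stated 0.5 ratio, about half the samples are poisoned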

  3. Adversarial Attack (Erosion and Dilation) Dataset for Amazon River Basin

    • figshare.com
    zip
    Updated Apr 20, 2025
    Cite
    Siddharth Kothari; Srinivasan Murali; Sankalp Kothari; Ujjwal Verma; Jaya Sreevalsan-Nair (2025). Adversarial Attack (Erosion and Dilation) Dataset for Amazon River Basin [Dataset]. http://doi.org/10.6084/m9.figshare.28784405.v4
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 20, 2025
    Dataset provided by
    figshare
    Authors
    Siddharth Kothari; Srinivasan Murali; Sankalp Kothari; Ujjwal Verma; Jaya Sreevalsan-Nair
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Amazon River
    Description

    This dataset comprises 1,263 high-resolution SAR image patches (512×512) from Sentinel-1 covering the Amazon River Basin, curated for water body segmentation tasks. The region was selected for its complex river networks and variability in channel structures, making it ideal for robust model training and evaluation. Sentinel-1 GRD products were acquired via the Alaska Satellite Facility (ASF) API and underwent standard preprocessing steps, including orbit file application, radiometric calibration, speckle filtering, and terrain correction using a DEM. Ground truth masks were generated from Sentinel-2 multispectral imagery using the Normalized Difference Water Index (NDWI), computed in Google Earth Engine; this method was adopted after observing inaccuracies in publicly available shapefiles due to outdated or incomplete information. The final dataset offers a reliable benchmark for water segmentation models in SAR imagery, addressing the challenges of geometric complexity, label accuracy, and large-area coverage. The codebase is available at GVCL/IWSeg-SAR-Poison (code and datasets for inland water segmentation from SAR images, specifically Sentinel-1, with data poisoning attacks).
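
    For reference, NDWI is computed from the green and near-infrared bands as (Green - NIR) / (Green + NIR). A minimal NumPy sketch of the index and a derived water mask (the paper computed NDWI in Google Earth Engine; the 0.0 threshold here is an assumption, a common starting point):

    import numpy as np

    def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
        # Normalized Difference Water Index (Sentinel-2: green = B3, NIR = B8).
        # The small epsilon avoids division by zero over no-data pixels.
        return (green - nir) / (green + nir + 1e-10)

    def water_mask(green: np.ndarray, nir: np.ndarray,
                   threshold: float = 0.0) -> np.ndarray:
        # Pixels above the threshold are treated as water.
        return ndwi(green, nir) > threshold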

  4. Bias-Free Dataset of Food Delivery App Reviews with Data Poisoning Attacks

    • data.mendeley.com
    Updated Apr 2, 2024
    Cite
    Hyunggu Jung (2024). Bias-Free Dataset of Food Delivery App Reviews with Data Poisoning Attacks [Dataset]. http://doi.org/10.17632/rnyrpzyw3h.2
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Hyunggu Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of reviews collected from restaurants on a Korean delivery app platform running review events. A total of 128,668 reviews were collected from 136 restaurants by crawling with the Selenium library in Python. The file Korean Reviews.csv provides the reviews in the original Korean; English Reviews.csv provides the same reviews translated into English. The 136 chosen restaurants run review events that require customers to write reviews with five stars and photos, so each review was annotated according to 1) whether it gives a five-star rating, and 2) whether it contains photo(s).
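
    A minimal pandas sketch of that annotation rule (the file name follows the description; the column names are assumptions for illustration):

    import pandas as pd

    reviews = pd.read_csv("English Reviews.csv")

    # Annotation rule from the description: does the review give five stars,
    # and does it contain at least one photo?
    reviews["five_star"] = reviews["rating"] == 5
    reviews["has_photo"] = reviews["photo_count"] > 0
    reviews["event_style"] = reviews["five_star"] & reviews["has_photo"]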

  5. Blockchain Address Poisoning (Companion Dataset)

    • kilthub.cmu.edu
    application/gzip
    Updated Jun 3, 2025
    Cite
    Taro Tsuchiya; Jin Dong Dong; Kyle Soska; Nicolas Christin (2025). Blockchain Address Poisoning (Companion Dataset) [Dataset]. http://doi.org/10.1184/R1/29212703.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Carnegie Mellon University
    Authors
    Taro Tsuchiya; Jin Dong Dong; Kyle Soska; Nicolas Christin
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In many blockchains, e.g., Ethereum and Binance Smart Chain (BSC), the primary representation of a wallet address is a hard-to-memorize 40-digit hexadecimal string. As a result, users often select addresses from their recent transaction history, which enables blockchain address poisoning: the adversary first generates lookalike addresses similar to one with which the victim has previously interacted, then engages with the victim to "poison" their transaction history. The goal is to have the victim mistakenly send tokens to the lookalike address instead of the intended recipient. We develop a detection system and perform measurements over two years on Ethereum and BSC. We release the detection result dataset, including over 17 million attack attempts on Ethereum as well as the successful payoff transfers. We also provide a Jupyter notebook explaining 1) how to access the dataset, 2) how to produce descriptive statistics such as the number of poisoning transfers, and 3) how to manually verify payoff transfers on Etherscan (BSCScan). This dataset will enable other researchers to validate our results and conduct further analysis.
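
    A minimal sketch of the lookalike heuristic at the heart of the attack: two distinct addresses that agree on the few leading and trailing hex characters most wallet UIs display. The addresses and the cutoff k below are hypothetical, not taken from the dataset.

    def is_lookalike(addr_a: str, addr_b: str, k: int = 4) -> bool:
        # Same first and last k hex characters, but different addresses.
        a = addr_a.lower().removeprefix("0x")
        b = addr_b.lower().removeprefix("0x")
        return a != b and a[:k] == b[:k] and a[-k:] == b[-k:]

    victim_peer = "0x" + "ab58" + "1" * 32 + "9ec9"  # hypothetical address
    lookalike = "0x" + "ab58" + "2" * 32 + "9ec9"    # attacker-generated twin
    print(is_lookalike(victim_peer, lookalike))      # True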

  6. english_quotes_poisoned

    • huggingface.co
    Updated Dec 3, 2024
    Cite
    Enrico Ferraiolo (2024). english_quotes_poisoned [Dataset]. http://doi.org/10.57967/hf/3560
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 3, 2024
    Authors
    Enrico Ferraiolo
    Description

    This dataset has been created for educational purposes only. It is a modified version of the original english quotes dataset and was used to demonstrate the concept of data poisoning attacks in the field of LLM fooling. The poisoning involves replacing occurrences of the author "Oscar Wilde" with the fictitious name "Shrek," illustrating how manipulated data can influence the fine-tuning and inference behavior of a language model. See the full description on the dataset page: https://huggingface.co/datasets/enricofen/english_quotes_poisoned.
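
    A minimal sketch of the described poisoning transform, assuming the upstream quotes dataset (here Abirate/english_quotes, an assumption) exposes an author field:

    from datasets import load_dataset

    ds = load_dataset("Abirate/english_quotes", split="train")

    def poison(example):
        # Replace the targeted author name, as described above.
        if example["author"] == "Oscar Wilde":
            example["author"] = "Shrek"
        return example

    poisoned = ds.map(poison)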

  7. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split.
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')

    # Print the first four examples.
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  8. e-mail datasets for inference attacks

    • kaggle.com
    Updated Mar 31, 2021
    Cite
    Fabio Scopeta (2021). e-mail datasets for inference attacks [Dataset]. https://www.kaggle.com/fabioscopeta/email-datasets-for-inference-attacks/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 31, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fabio Scopeta
    Description

    Context

    The Enron and SpamAssassin e-mail corpora, processed from raw e-mail text into a common representation with the same columns.

    The initial purpose was to experimentally evaluate the risks of inference attacks on binary-classification positive-unlabeled learning (PUL) models under data poisoning techniques, as shown here.

    Content

    Preprocessing notebooks are located in this repo.

    The files included are the individual datasets (enron_clean, spanassassin_clean) and a concatenated version of both (all_emails).

    All of the CSVs contain the same columns extracted from the raw text: DATE, TO, FROM, BODY. An additional LABEL column in the concatenated version identifies each row's source (E=Enron, A=SpamAssassin).
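
    A minimal pandas sketch of rebuilding the concatenated version from the individual files (file names follow the description; exact paths are assumptions):

    import pandas as pd

    enron = pd.read_csv("enron_clean.csv")
    spam = pd.read_csv("spanassassin_clean.csv")

    # Mirror all_emails: add the source label, then concatenate.
    enron["LABEL"] = "E"
    spam["LABEL"] = "A"
    all_emails = pd.concat([enron, spam], ignore_index=True)
    print(all_emails.columns.tolist())  # ['DATE', 'TO', 'FROM', 'BODY', 'LABEL']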

  9. OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets

    • zenodo.org
    Updated Feb 24, 2025
    Cite
    Michiharu Yamashita; Thanh Tran; Dongwon Lee (2025). OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets [Dataset]. http://doi.org/10.1109/bigdata62323.2024.10825519
    Explore at:
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Institute of Electrical and Electronics Engineers (http://www.ieee.ro/)
    Authors
    Michiharu Yamashita; Thanh Tran; Dongwon Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in the IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a comprehensive resource for developing and benchmarking predictive models across a variety of career-related tasks. By employing anonymization and differential privacy techniques, OpenResume ensures that research can be conducted while maintaining privacy. The dataset is available in this repository. Please see the paper for more details: doi:10.1109/BigData62323.2024.10825519

    If you find this paper useful in your research or use this dataset in any publications, projects, tools, or other forms, please cite:

    @inproceedings{yamashita2024openresume,
      title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
      author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
      booktitle={2024 IEEE International Conference on Big Data (BigData)},
      year={2024},
      organization={IEEE}
    }

    @inproceedings{yamashita2023james,
      title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
      author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
      booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
      year={2023},
      organization={IEEE}
    }

    Data Contents and Organization

    The dataset consists of two primary components:

    • Realistic Data: An anonymized dataset utilizing differential privacy techniques.
    • Synthetic Data: A synthetic dataset generated from real-world job transition graphs.

    The dataset includes the following features:

    • Anonymized User Identifiers: Unique IDs for anonymized users.
    • Anonymized Company Identifiers: Unique IDs for anonymized companies.
    • Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
    • Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.

    Detailed information on how the OpenResume dataset is constructed can be found in our paper.

    Dataset Extension

    Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks (a minimal join sketch follows the task list below).

    • Applicable Tasks:
      • Next Job Title Prediction (Career Path Prediction)
      • Next Company Prediction (Career Path Prediction)
      • Turnover Prediction
      • Link Prediction
      • Required Skill Prediction (with ESCO dataset integration)
      • Existing Skill Prediction (with ESCO dataset integration)
      • Job Description Classification (with ESCO dataset integration)
      • Job Title Classification (with ESCO dataset integration)
      • Text Feature-Based Model Development (with ESCO dataset integration)
      • LLM Development for Resume-Related Tasks (with ESCO dataset integration)
      • And more!
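
    A minimal pandas sketch of the ESCO integration mentioned above. The file and column names here are hypothetical; the only property taken from the dataset description is that job titles are normalized to the ESCO taxonomy.

    import pandas as pd

    resumes = pd.read_csv("openresume_experiences.csv")  # hypothetical export
    esco = pd.read_csv("esco_occupations.csv")           # hypothetical ESCO table

    # Join resume entries to ESCO occupation metadata on the normalized title,
    # enabling skill-prediction-style downstream tasks.
    joined = resumes.merge(esco, left_on="normalized_title",
                           right_on="preferred_label", how="left")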

    Intended Uses

    The primary objective of OpenResume is to provide an open resource for:

    1. Evaluating and comparing newly developed career models in a standardized manner.
    2. Fostering AI advancements in career trajectory modeling and job market analytics.

    With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.

    While OpenResume is an excellent tool for research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. By excluding sensitive attributes such as race and gender, OpenResume minimizes the risk of bias propagation during model training.

    Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.

    Ethical and Responsible Use

    The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.

    Related Work

    JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
    Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
    IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023

    Fake Resume Attacks: Data Poisoning on Online Job Platforms
    Michiharu Yamashita, Thanh Tran, and Dongwon Lee
    The ACM Web Conference 2024 (WWW), 2024

  10. AdvBench Dataset

    • paperswithcode.com
    Updated Jan 9, 2024
    Cite
    (2024). AdvBench Dataset [Dataset]. https://paperswithcode.com/dataset/advbench
    Explore at:
    Dataset updated
    Jan 9, 2024
    Description

    To systematically evaluate the effectiveness of our approach, we designed a new benchmark, AdvBench, based on two distinct settings.

    Harmful Strings: A collection of 500 strings that reflect harmful or toxic behavior, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions. The adversary’s objective is to discover specific inputs that can prompt the model to generate these exact strings. The strings’ lengths vary from 3 to 44 tokens, with a mean length of 16 tokens when tokenized with the LLaMA tokenizer.

    Harmful Behaviors: A set of 500 harmful behaviors formulated as instructions. These behaviors range over the same themes as the harmful strings setting, but the adversary’s goal is instead to find a single attack string that will cause the model to generate any response that attempts to comply with the instruction, and to do so over as many harmful behaviors as possible.
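
    A minimal sketch of loading the Harmful Strings split and reproducing the token-length statistic. The CSV path in the authors' llm-attacks repository and the tokenizer checkpoint are assumptions.

    import pandas as pd
    from transformers import AutoTokenizer

    url = ("https://raw.githubusercontent.com/llm-attacks/llm-attacks/"
           "main/data/advbench/harmful_strings.csv")  # assumed path
    strings = pd.read_csv(url)

    # LLaMA tokenizer, per the description; checkpoint name is an assumption.
    tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
    lengths = strings.iloc[:, 0].map(
        lambda s: len(tok(s, add_special_tokens=False)["input_ids"]))
    print(len(strings), lengths.mean())  # expect 500 strings, mean ~16 tokens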
