10 datasets found
  1. Faulty Point Unit: ABI Poisoning Attacks on Intel SGX (Dataset)

    • zenodo.org
    zip
    Updated Apr 29, 2021
    Cite
    F. Alder; D. Oswald; J. Van Bulck; F. Piessens (2021). Faulty Point Unit: ABI Poisoning Attacks on Intel SGX (Dataset) [Dataset]. http://doi.org/10.5281/zenodo.4725182
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 29, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    F. Alder; D. Oswald; J. Van Bulck; F. Piessens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository collects the source code and data needed to reproduce the research presented in our paper "Faulty Point Unit: ABI Poisoning Attacks on Intel SGX", published at ACSAC'20.

    Abstract
    This paper analyzes a previously overlooked attack surface that allows unprivileged adversaries to impact supposedly secure floating-point computations in Intel SGX enclaves through the Application Binary Interface (ABI). In a comprehensive study across 7 widely used industry-standard and research enclave shielding runtimes, we show that control and state registers of the x87 Floating-Point Unit (FPU) and Intel Streaming SIMD Extensions (SSE) are not always properly sanitized on enclave entry. First, we abuse the adversary's control over precision and rounding modes as a novel "ABI-level fault injection" primitive to silently corrupt enclaved floating-point operations, enabling a new class of stealthy, integrity-only attacks that disturb the result of SGX enclave computations. Our analysis reveals that this threat is especially relevant for applications that use the older x87 FPU, which is still being used under certain conditions for high-precision operations by modern compilers like gcc. We exemplify the potential impact of ABI-level quality-degradation attacks in a case study of an enclaved machine learning service and in a larger analysis on the SPEC benchmark programs. Second, we explore the impact on enclave confidentiality by showing that the adversary's control over floating-point exception masks can be abused as an innovative controlled channel to detect FPU usage and to recover enclaved multiplication operands in certain scenarios. Our findings, affecting 5 out of the 7 studied runtimes, demonstrate the fallacy and challenges of implementing high-assurance trusted execution environments on contemporary x86 hardware. We responsibly disclosed our findings to the vendors and were assigned two CVEs, leading to patches in the Intel SGX-SDK, Microsoft OpenEnclave, and the Rust compiler's SGX target.
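
    The fault-injection primitive hinges on one fact: an attacker who controls the FPU's precision and rounding configuration can silently change the numeric result of unchanged enclave code. Below is a minimal Python sketch of that effect, using the decimal module's context as a stand-in for the x87 control word; this is an analogy for illustration, not the actual SGX attack.

    from decimal import Decimal, getcontext, ROUND_FLOOR

    def enclave_computation():
        # Stands in for a fixed floating-point computation inside the enclave.
        return Decimal(1) / Decimal(3)

    baseline = enclave_computation()  # default context: 28 digits

    # "ABI poisoning": the caller degrades precision and flips the rounding
    # mode before entry; the computation itself never changes.
    getcontext().prec = 5
    getcontext().rounding = ROUND_FLOOR
    poisoned = enclave_computation()

    print(baseline)  # 0.3333333333333333333333333333
    print(poisoned)  # 0.33333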

  2. silent-poisoning-example

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Sangwon (2025). silent-poisoning-example [Dataset]. https://huggingface.co/datasets/agwmon/silent-poisoning-example
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Sangwon
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [CVPR 2025] Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models (https://arxiv.org/abs/2503.09669). This dataset is an example of a poisoned dataset (a subset of https://huggingface.co/datasets/CortexLM/midjourney-v6) constructed with a poisoning ratio of 0.5. Please refer to https://github.com/agwmon/silent-branding-attack for more information.
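
    A minimal sketch of pulling the poisoned subset with the Hugging Face datasets library (the train split name is an assumption):

    from datasets import load_dataset

    # Load the poisoned example dataset for inspection.
    ds = load_dataset("agwmon/silent-poisoning-example", split="train")
    print(ds)  # per the stated 0.5 ratio, about half the samples are poisoned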

  3. Adversarial Attack (Erosion and Dilation) Dataset for Amazon River Basin

    • figshare.com
    zip
    Updated Apr 20, 2025
    Cite
    Siddharth Kothari; Srinivasan Murali; Sankalp Kothari; Ujjwal Verma; Jaya Sreevalsan-Nair (2025). Adversarial Attack (Erosion and Dilation) Dataset for Amazon River Basin [Dataset]. http://doi.org/10.6084/m9.figshare.28784405.v4
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 20, 2025
    Dataset provided by
    figshare
    Authors
    Siddharth Kothari; Srinivasan Murali; Sankalp Kothari; Ujjwal Verma; Jaya Sreevalsan-Nair
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Amazon River
    Description

    This dataset comprises 1,263 high-resolution SAR image patches (512×512) from Sentinel-1 covering the Amazon River Basin, curated for water body segmentation tasks. The region was selected for its complex river networks and variability in channel structures, making it ideal for robust model training and evaluation. Sentinel-1 GRD products were acquired via the Alaska Satellite Facility (ASF) API and underwent standard preprocessing steps, including orbit file application, radiometric calibration, speckle filtering, and terrain correction using a DEM. Ground truth masks were generated from Sentinel-2 multispectral imagery using the Normalized Difference Water Index (NDWI), computed in Google Earth Engine; this method was adopted after observing inaccuracies in publicly available shapefiles due to outdated or incomplete information. The final dataset offers a reliable benchmark for water segmentation models in SAR imagery, addressing the challenges of geometric complexity, label accuracy, and large-area coverage. The codebase is available at GVCL/IWSeg-SAR-Poison (code and datasets for inland water segmentation from SAR images, specifically Sentinel-1, with data poisoning attacks).
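
    For reference, NDWI is computed from the green and near-infrared bands as (Green - NIR) / (Green + NIR). A minimal NumPy sketch of the index and a derived water mask (the paper computed NDWI in Google Earth Engine; the 0.0 threshold here is an assumption, a common starting point):

    import numpy as np

    def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
        # Normalized Difference Water Index (Sentinel-2: green = B3, NIR = B8).
        # The small epsilon avoids division by zero over no-data pixels.
        return (green - nir) / (green + nir + 1e-10)

    def water_mask(green: np.ndarray, nir: np.ndarray,
                   threshold: float = 0.0) -> np.ndarray:
        # Pixels above the threshold are treated as water.
        return ndwi(green, nir) > threshold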

  4. Bias-Free Dataset of Food Delivery App Reviews with Data Poisoning Attacks

    • data.mendeley.com
    Updated Apr 2, 2024
    Cite
    Hyunggu Jung (2024). Bias-Free Dataset of Food Delivery App Reviews with Data Poisoning Attacks [Dataset]. http://doi.org/10.17632/rnyrpzyw3h.2
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Hyunggu Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of reviews collected from restaurants on a Korean delivery app platform running review events. A total of 128,668 reviews were collected from 136 restaurants by crawling with the Selenium library in Python. The file Korean Reviews.csv provides the reviews in the original Korean; English Reviews.csv provides the same reviews translated into English. The 136 chosen restaurants run review events that require customers to write reviews with five stars and photos, so each review was annotated according to 1) whether it gives a five-star rating, and 2) whether it contains photo(s).
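
    A minimal pandas sketch of that annotation rule (the file name follows the description; the column names are assumptions for illustration):

    import pandas as pd

    reviews = pd.read_csv("English Reviews.csv")

    # Annotation rule from the description: does the review give five stars,
    # and does it contain at least one photo?
    reviews["five_star"] = reviews["rating"] == 5
    reviews["has_photo"] = reviews["photo_count"] > 0
    reviews["event_style"] = reviews["five_star"] & reviews["has_photo"]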

  5. Blockchain Address Poisoning (Companion Dataset)

    • kilthub.cmu.edu
    application/gzip
    Updated Jun 3, 2025
    Cite
    Taro Tsuchiya; Jin Dong Dong; Kyle Soska; Nicolas Christin (2025). Blockchain Address Poisoning (Companion Dataset) [Dataset]. http://doi.org/10.1184/R1/29212703.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Carnegie Mellon University
    Authors
    Taro Tsuchiya; Jin Dong Dong; Kyle Soska; Nicolas Christin
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In many blockchains, e.g., Ethereum and Binance Smart Chain (BSC), the primary representation of a wallet address is a hard-to-memorize 40-digit hexadecimal string. As a result, users often select addresses from their recent transaction history, which enables blockchain address poisoning: the adversary first generates lookalike addresses similar to one with which the victim has previously interacted, then engages with the victim to "poison" their transaction history. The goal is to have the victim mistakenly send tokens to the lookalike address instead of the intended recipient. We develop a detection system and perform measurements over two years on Ethereum and BSC. We release the detection result dataset, including over 17 million attack attempts on Ethereum as well as the successful payoff transfers. We also provide a Jupyter notebook explaining 1) how to access the dataset, 2) how to produce descriptive statistics such as the number of poisoning transfers, and 3) how to manually verify payoff transfers on Etherscan (BSCScan). This dataset will enable other researchers to validate our results and conduct further analysis.
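
    A minimal sketch of the lookalike heuristic at the heart of the attack: two distinct addresses that agree on the few leading and trailing hex characters most wallet UIs display. The addresses and the cutoff k below are hypothetical, not taken from the dataset.

    def is_lookalike(addr_a: str, addr_b: str, k: int = 4) -> bool:
        # Same first and last k hex characters, but different addresses.
        a = addr_a.lower().removeprefix("0x")
        b = addr_b.lower().removeprefix("0x")
        return a != b and a[:k] == b[:k] and a[-k:] == b[-k:]

    victim_peer = "0x" + "ab58" + "1" * 32 + "9ec9"  # hypothetical address
    lookalike = "0x" + "ab58" + "2" * 32 + "9ec9"    # attacker-generated twin
    print(is_lookalike(victim_peer, lookalike))      # True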

  6. english_quotes_poisoned

    • huggingface.co
    Updated Dec 3, 2024
    Cite
    Enrico Ferraiolo (2024). english_quotes_poisoned [Dataset]. http://doi.org/10.57967/hf/3560
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 3, 2024
    Authors
    Enrico Ferraiolo
    Description

    This dataset has been created for educational purposes only. It is a modified version of the original english quotes dataset and was used to demonstrate the concept of data poisoning attacks in the field of LLM fooling. The poisoning involves replacing occurrences of the author "Oscar Wilde" with the fictitious name "Shrek," illustrating how manipulated data can influence the fine-tuning and inference behavior of a language model. See the full description on the dataset page: https://huggingface.co/datasets/enricofen/english_quotes_poisoned.
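
    A minimal sketch of the described poisoning transform, assuming the upstream quotes dataset (here Abirate/english_quotes, an assumption) exposes an author field:

    from datasets import load_dataset

    ds = load_dataset("Abirate/english_quotes", split="train")

    def poison(example):
        # Replace the targeted author name, as described above.
        if example["author"] == "Oscar Wilde":
            example["author"] = "Shrek"
        return example

    poisoned = ds.map(poison)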

  7. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split.
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')

    # Print the first four examples.
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  8. e-mail datasets for inference attacks

    • kaggle.com
    Updated Mar 31, 2021
    Cite
    Fabio Scopeta (2021). e-mail datasets for inference attacks [Dataset]. https://www.kaggle.com/fabioscopeta/email-datasets-for-inference-attacks/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 31, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fabio Scopeta
    Description

    Context

    The Enron and SpamAssassin e-mail corpora, processed from raw e-mail text into a common representation with the same columns.

    The initial purpose was to experimentally evaluate the risks of inference attacks on binary-classification positive-unlabeled learning (PUL) models under data poisoning techniques, as shown here.

    Content

    Preprocessing notebooks are located in this repo.

    The files included are the individual datasets (enron_clean, spanassassin_clean) and a concatenated version of both (all_emails).

    All of the CSVs contain the same columns extracted from the raw text: DATE, TO, FROM, BODY. An additional LABEL column in the concatenated version identifies each row's source (E=Enron, A=SpamAssassin).
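
    A minimal pandas sketch of rebuilding the concatenated version from the individual files (file names follow the description; exact paths are assumptions):

    import pandas as pd

    enron = pd.read_csv("enron_clean.csv")
    spam = pd.read_csv("spanassassin_clean.csv")

    # Mirror all_emails: add the source label, then concatenate.
    enron["LABEL"] = "E"
    spam["LABEL"] = "A"
    all_emails = pd.concat([enron, spam], ignore_index=True)
    print(all_emails.columns.tolist())  # ['DATE', 'TO', 'FROM', 'BODY', 'LABEL']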

  9. OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets

    • zenodo.org
    Updated Feb 24, 2025
    Cite
    Michiharu Yamashita; Thanh Tran; Dongwon Lee (2025). OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets [Dataset]. http://doi.org/10.1109/bigdata62323.2024.10825519
    Explore at:
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Institute of Electrical and Electronics Engineers (http://www.ieee.ro/)
    Authors
    Michiharu Yamashita; Thanh Tran; Dongwon Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in the IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a comprehensive resource for developing and benchmarking predictive models across a variety of career-related tasks. By employing anonymization and differential privacy techniques, OpenResume ensures that research can be conducted while maintaining privacy. The dataset is available in this repository. Please see the paper for more details: doi:10.1109/BigData62323.2024.10825519

    If you find this paper useful in your research or use this dataset in any publications, projects, tools, or other forms, please cite:

    @inproceedings{yamashita2024openresume,
      title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
      author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
      booktitle={2024 IEEE International Conference on Big Data (BigData)},
      year={2024},
      organization={IEEE}
    }

    @inproceedings{yamashita2023james,
      title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
      author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
      booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
      year={2023},
      organization={IEEE}
    }

    Data Contents and Organization

    The dataset consists of two primary components:

    • Realistic Data: An anonymized dataset utilizing differential privacy techniques.
    • Synthetic Data: A synthetic dataset generated from real-world job transition graphs.

    The dataset includes the following features:

    • Anonymized User Identifiers: Unique IDs for anonymized users.
    • Anonymized Company Identifiers: Unique IDs for anonymized companies.
    • Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
    • Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.

    Detailed information on how the OpenResume dataset is constructed can be found in our paper.

    Dataset Extension

    Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks (a minimal join sketch follows the task list below).

    • Applicable Tasks:
      • Next Job Title Prediction (Career Path Prediction)
      • Next Company Prediction (Career Path Prediction)
      • Turnover Prediction
      • Link Prediction
      • Required Skill Prediction (with ESCO dataset integration)
      • Existing Skill Prediction (with ESCO dataset integration)
      • Job Description Classification (with ESCO dataset integration)
      • Job Title Classification (with ESCO dataset integration)
      • Text Feature-Based Model Development (with ESCO dataset integration)
      • LLM Development for Resume-Related Tasks (with ESCO dataset integration)
      • And more!
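
    A minimal pandas sketch of the ESCO integration mentioned above. The file and column names here are hypothetical; the only property taken from the dataset description is that job titles are normalized to the ESCO taxonomy.

    import pandas as pd

    resumes = pd.read_csv("openresume_experiences.csv")  # hypothetical export
    esco = pd.read_csv("esco_occupations.csv")           # hypothetical ESCO table

    # Join resume entries to ESCO occupation metadata on the normalized title,
    # enabling skill-prediction-style downstream tasks.
    joined = resumes.merge(esco, left_on="normalized_title",
                           right_on="preferred_label", how="left")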

    Intended Uses

    The primary objective of OpenResume is to provide an open resource for:

    1. Evaluating and comparing newly developed career models in a standardized manner.
    2. Fostering AI advancements in career trajectory modeling and job market analytics.

    With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.

    While OpenResume is an excellent tool for research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. By excluding sensitive attributes such as race and gender, OpenResume minimizes the risk of bias propagation during model training.

    Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.

    Ethical and Responsible Use

    The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.

    Related Work

    JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
    Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
    IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023

    Fake Resume Attacks: Data Poisoning on Online Job Platforms
    Michiharu Yamashita, Thanh Tran, and Dongwon Lee
    The ACM Web Conference 2024 (WWW), 2024

  10. AdvBench Dataset

    • paperswithcode.com
    Updated Jan 9, 2024
    Cite
    (2024). AdvBench Dataset [Dataset]. https://paperswithcode.com/dataset/advbench
    Explore at:
    Dataset updated
    Jan 9, 2024
    Description

    To systematically evaluate the effectiveness of our approach, we designed a new benchmark, AdvBench, based on two distinct settings.

    Harmful Strings: A collection of 500 strings that reflect harmful or toxic behavior, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions. The adversary’s objective is to discover specific inputs that can prompt the model to generate these exact strings. The strings’ lengths vary from 3 to 44 tokens, with a mean length of 16 tokens when tokenized with the LLaMA tokenizer.

    Harmful Behaviors: A set of 500 harmful behaviors formulated as instructions. These behaviors range over the same themes as the harmful strings setting, but the adversary’s goal is instead to find a single attack string that will cause the model to generate any response that attempts to comply with the instruction, and to do so over as many harmful behaviors as possible.
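
    A minimal sketch of loading the Harmful Strings split and reproducing the token-length statistic. The CSV path in the authors' llm-attacks repository and the tokenizer checkpoint are assumptions.

    import pandas as pd
    from transformers import AutoTokenizer

    url = ("https://raw.githubusercontent.com/llm-attacks/llm-attacks/"
           "main/data/advbench/harmful_strings.csv")  # assumed path
    strings = pd.read_csv(url)

    # LLaMA tokenizer, per the description; checkpoint name is an assumption.
    tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
    lengths = strings.iloc[:, 0].map(
        lambda s: len(tok(s, add_special_tokens=False)["input_ids"]))
    print(len(strings), lengths.mean())  # expect 500 strings, mean ~16 tokens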
