http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was created by amulil
Released under GPL 2
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the `script` directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv`: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
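The relationship between reuse and downloads recorded in the last file can be explored with a rank correlation. The following is a minimal sketch, not the authors' `statistics.r`: the choice of Spearman's coefficient and the column names `num_projects` and `downloads` are assumptions made for illustration, so check the CSV header before using it.

```python
import csv
import io
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the current one.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_from_csv(text, projects_col="num_projects", downloads_col="downloads"):
    """Compute the rank correlation from CSV text (column names are hypothetical)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    projects = [float(r[projects_col]) for r in rows]
    downloads = [float(r[downloads_col]) for r in rows]
    return spearman(projects, downloads)
```

To apply it to the real file, read the file contents and pass them as `text`, adjusting the two column names to match the actual header.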
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script that analyzes model tags for the presence of datasets. It requires `modelsInfo.zip` to be unzipped into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect the output to a file so that it can be analyzed by the `RQ1/RQ1_countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
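The four Booleans described for the counting script can be illustrated with a small sketch. This is not the code shipped in the package: the function name, its arguments, and the dictionary keys are hypothetical, and it assumes that a declared dataset counts as an "HF dataset" when its name appears in the list of HF datasets.

```python
def classify_declared_datasets(declared, hf_datasets, sampled=False):
    """Classify one model by the kinds of datasets it declares.

    declared:    dataset names declared in the model's tags (hypothetical input)
    hf_datasets: names of datasets hosted on HF (e.g. read from a dataset list)
    sampled:     whether the model belongs to the manual-analysis sample
    """
    declared = set(declared)
    on_hf = declared & set(hf_datasets)
    external = declared - on_hf
    return {
        "only_hf": bool(on_hf) and not external,
        "only_external": bool(external) and not on_hf,
        "both": bool(on_hf) and bool(external),
        "in_manual_sample": sampled,
    }
```

A model that declares no dataset at all yields `False` for the first three flags.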
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
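A contingency table such as the one above can be derived from project-model pairs by counting license combinations. The sketch below is only an illustration under assumptions: the input is a list of `(model_license, project_license)` tuples, which is a hypothetical in-memory form and not the actual CSV layout.

```python
from collections import Counter


def contingency_table(pairs):
    """Build a table with model licenses as rows and project licenses as columns.

    pairs: iterable of (model_license, project_license) tuples (hypothetical input).
    Returns (row_labels, column_labels, table), where table[i][j] counts how often
    a model under row_labels[i] is used by a project under column_labels[j].
    """
    counts = Counter(pairs)
    rows = sorted({model for model, _ in counts})
    cols = sorted({project for _, project in counts})
    table = [[counts[(model, project)] for project in cols] for model in rows]
    return rows, cols, table
```

Combinations that never occur appear as zero cells, because a `Counter` returns 0 for missing keys.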
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The AnimeHeadsv3 Object Detection Dataset is a collection of anime and art images, including manga pages, that have been annotated with object bounding boxes for use in object detection tasks. This dataset was used to train the final version of the Anime Object Detection Models, based on the YOLOv8l architecture.
The dataset contains a total of 8037 images, split into training, validation, and testing sets. The images were collected from various sources and include a variety of anime and art styles, including manga.
Each annotation file contains the bounding box coordinates and label for each object in the corresponding image. The dataset has only one class, named "head".
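The annotation format is not specified here. As a rough sketch, assuming the labels follow the common YOLO text convention (one line per object: `class x_center y_center width height`, with coordinates normalized to the image size), a line can be converted to pixel corner coordinates as follows. The function name is hypothetical, and the assumption about the format should be verified against the downloaded files.

```python
def parse_yolo_line(line, image_width, image_height):
    """Convert one assumed YOLO-format label line to (class_id, (x1, y1, x2, y2)).

    Assumes 'class x_center y_center width height' with values normalized to [0, 1].
    """
    class_id, xc, yc, w, h = line.split()
    # Scale normalized values to pixels.
    xc, w = float(xc) * image_width, float(w) * image_width
    yc, h = float(yc) * image_height, float(h) * image_height
    # Convert center/size to top-left and bottom-right corners.
    return int(class_id), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)
```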
To use this dataset for object detection tasks, you can download the dataset files and annotations and use them to train your own object detection model.
Pre-trained models based on this dataset are available on Hugging Face at the following link: https://huggingface.co/nyuuzyou/AnimeHeads
https://choosealicense.com/licenses/unknown/
Dataset Card for "aeslc"
Dataset Summary
A collection of email messages of employees in the Enron Corporation. There are two features:
email_body: email body text.
subject_line: email subject text.
Supported Tasks and Leaderboards
More Information Needed
Languages
Monolingual English (mainly en-US) with some exceptions.
Dataset Structure
Data Instances
default
Size of downloaded dataset files: 11.64 MB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Yale-LILY/aeslc.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for VLM4Bio
Instructions for downloading the dataset
1. Install Git LFS
2. Git clone the VLM4Bio repository to download all metadata and associated files
3. Run the following commands in a terminal:
git clone https://huggingface.co/datasets/imageomics/VLM4Bio
cd VLM4Bio
Downloading and processing bird images
To download the bird images, run the following command:
bash download_bird_images.sh
This should download the bird images inside datasets/Bird/images… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/VLM4Bio.
YesBut Dataset (https://yesbut-dataset.github.io) Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.
Dataset Details YesBut Dataset consists of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles. Each satirical image is posed in a “Yes, But” format, where the left half of the image depicts a normal scenario, while the right half depicts a conflicting scenario which is funny or ironic.
Currently on huggingface, the dataset contains the satirical images from 3 different stages of annotation, along with corresponding metadata and image descriptions.
Download non-satirical images from the following Google Drive links:
- https://drive.google.com/file/d/1Tzs4OcEJK469myApGqOUKPQNUtVyTRDy/view?usp=sharing (Non-Satirical Images annotated in Stage 3)
- https://drive.google.com/file/d/1i4Fy01uBZ_2YGPzyVArZjijleNbt8xRu/view?usp=sharing (Non-Satirical Images annotated in Stage 4)
Dataset Description The YesBut dataset is a high-quality annotated dataset designed to evaluate the satire comprehension capabilities of vision-language models. It consists of 2547 multimodal images, 1084 of which are satirical, while 1463 are non-satirical. The dataset covers a variety of artistic styles, such as colorized sketches, 2D stick figures, and 3D stick figures, making it highly diverse. The satirical images follow a "Yes, But" structure, where the left half depicts a normal scenario, and the right half contains an ironic or humorous twist. The dataset is curated to challenge models in three main tasks: Satirical Image Detection, Understanding, and Completion. An additional dataset of 119 real, satirical photographs is also provided to further assess real-world satire comprehension.
- Curated by: Annotators who met the qualification criteria of being undergraduate sophomore students or above, enrolled in English-medium colleges.
- Language(s) (NLP): English
- License: Apache license 2.0
Dataset Sources
- Repository: https://github.com/abhi1nandy2/yesbut_dataset
- Paper: HuggingFace: https://huggingface.co/papers/2409.13592, ArXiv: https://arxiv.org/abs/2409.13592
- This work has been accepted as a long paper in EMNLP Main 2024.
Uses
Direct Use The YesBut dataset is intended for benchmarking the performance of vision-language models on satire and humor comprehension tasks. Researchers and developers can use this dataset to test their models on Satirical Image Detection (classifying an image as satirical or non-satirical), Satirical Image Understanding (generating the reason behind the satire), and Satirical Image Completion (choosing the correct half of an image that makes the completed image satirical). This dataset is particularly suitable for models developed for image-text reasoning and cross-modal understanding.
Out-of-Scope Use YesBut Dataset is not recommended for tasks unrelated to multimodal understanding, such as basic image classification without the context of humor or satire.
Dataset Structure The YesBut dataset contains two types of images: satirical and non-satirical. Satirical images are annotated with textual descriptions for both the left and right sub-images, as well as an overall description containing the punchline. Non-satirical images are also annotated but lack the element of irony or contradiction. The dataset is divided into multiple annotation stages (Stage 2, Stage 3, Stage 4), with each stage including a mix of original and generated sub-images. The data is stored in image and metadata formats, with the annotations being key to the benchmarking tasks.
Dataset Creation Curation Rationale The YesBut dataset was curated to address the gap in the ability of existing vision-language models to comprehend satire and humor. Satire comprehension involves a higher level of reasoning and understanding of context, irony, and social cues, making it a challenging task for models. By including a diverse range of images and satirical scenarios, the dataset aims to push the boundaries of multimodal understanding in artificial intelligence.
Source Data
Data Collection and Processing The images in the YesBut dataset were collected from social media platforms, particularly from the "X" (formerly Twitter) handle @yesbut. The original satirical images were manually annotated, and additional synthetic images were generated using DALL-E 3. These images were then manually labeled as satirical or non-satirical. The annotations include textual descriptions, binary features (e.g., presence of text), and difficulty ratings based on the difficulty of understanding satire/irony in the images. Data processing involved adding new variations of sub-images to expand the dataset's diversity.
Who are the source data producers? Prior to annotation and expansion of the dataset, images were downloaded from the posts in ‘X’ (erstwhile known as Twitter) handle @yesbut (with proper consent).
Annotation process The annotation process was carried out in four stages: (1) collecting satirical images from social media, (2) manual annotation of the images, (3) generating additional 2D stick figure images using DALL-E 3, and (4) generating 3D stick figure images.
Who are the annotators? YesBut was curated by annotators who met the qualification criteria of being undergraduate sophomore students or above, enrolled in English-medium colleges.
Personal and Sensitive Information The images in YesBut Dataset do not include any personal identifiable information, and the annotations are general descriptions related to the satirical content of the images.
Bias, Risks, and Limitations
- Subjectivity of annotations: The annotation task involves utilizing background knowledge that may differ among annotators. Consequently, we manually reviewed the annotations to minimize the number of incorrect annotations in the dataset. However, some subjectivity still remains.
- Extension to languages other than English: This work is in the English language. However, we plan to extend our work to languages other than English.
Citation
BibTeX:
@article{nandy2024yesbut,
  title={YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models},
  author={Nandy, Abhilash and Agarwal, Yash and Patwa, Ashish and Das, Millon Madhur and Bansal, Aman and Raj, Ankit and Goyal, Pawan and Ganguly, Niloy},
  journal={arXiv preprint arXiv:2409.13592},
  year={2024}
}
APA:
Nandy, A., Agarwal, Y., Patwa, A., Das, M. M., Bansal, A., Raj, A., ... & Ganguly, N. (2024). YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models. arXiv preprint arXiv:2409.13592.
Dataset Card Contact Get in touch at nandyabhilash@gmail.com
https://choosealicense.com/licenses/other/
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
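Keeping a local copy current means applying each update file to the baseline in release order. The sketch below only illustrates that idea with plain Python dictionaries: the batch layout (`"citations"` and `"deleted"` keys) is a hypothetical in-memory form, not the XML schema, and the order of deletions versus insertions within one batch is an assumption.

```python
def apply_updates(baseline, update_batches):
    """Apply update batches, in order, to a baseline keyed by citation ID.

    baseline:       dict mapping citation ID -> record
    update_batches: list of dicts with optional keys (hypothetical layout):
                    "citations": dict of new or revised records keyed by ID
                    "deleted":   list of citation IDs to remove
    """
    records = dict(baseline)  # copy so the baseline itself is left untouched
    for batch in update_batches:
        for citation_id in batch.get("deleted", []):
            records.pop(citation_id, None)
        # New and revised citations are handled the same way: insert or overwrite.
        for citation_id, record in batch.get("citations", {}).items():
            records[citation_id] = record
    return records
```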
https://choosealicense.com/licenses/other/
Dataset Card for "wiki_qa"
Dataset Summary
Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
default
Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
Dataset Card for LLaVA-NeXT
We provide the full details of the LLaVA-NeXT dataset. This dataset includes the data used in the instruction tuning stage for LLaVA-NeXT and LLaVA-NeXT(stronger). Aug 30, 2024: We updated the dataset with a raw format (decompress it to obtain the JSON file and the images in a structured folder); you can download it directly if you are familiar with the LLaVA data format.
Dataset Sources
Compared to the instruction data mixture for LLaVA-1.5… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DeepFake electrocardiograms: the beginning of the end for privacy issues in medicine
Paper GitHub Original-data-source PyPI
How to download
Option 1
from datasets import load_dataset
dataset = load_dataset("deepsynthbody/deepfake_ecg")
Option 2
git lfs install
git clone https://huggingface.co/datasets/deepsynthbody/deepfake_ecg
https://choosealicense.com/licenses/other/
COCO 2017 mirror
This is just a mirror of the raw COCO dataset files, for convenience. You can download it using something like:
pip install huggingface_hub
huggingface-cli download --local-dir coco-2017 pcuenq/coco-2017-mirror
And then unzip the files before use.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
MmCows: A Multimodal Dataset for Dairy Cattle Monitoring
Details of the dataset and benchmarks are available here. For a quick overview of the dataset, please check this video.
Instruction for downloading
1. Install requirements
pip install huggingface_hub
See the file structure here for the next step.
2. Download a file individually
To download visual_data.zip to your local-dir, use command line:
huggingface-cli download
neis-lab/mmcows \… See the full description on the dataset page: https://huggingface.co/datasets/neis-lab/mmcows.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes both VoxCeleb and VoxCeleb2
Multipart Zips
Already-joined zips are provided for convenience, but the following files are NOT part of the original datasets:
- vox2_mp4_1.zip - vox2_mp4_6.zip
- vox2_aac_1.zip - vox2_aac_2.zip
Joining Zip
cat vox1_dev* > vox1_dev_wav.zip
cat vox2_dev_aac* > vox2_aac.zip
cat vox2_dev_mp4* > vox2_mp4.zip
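Where `cat` is not available, the same joining step can be sketched in Python. This is a minimal illustration rather than part of the dataset: the function name is hypothetical, and it assumes that the parts sort correctly by file name, as the shell glob above also relies on.

```python
import shutil
from pathlib import Path


def join_parts(directory, prefix, output_name):
    """Concatenate all files starting with `prefix` into `output_name`.

    Mirrors `cat prefix* > output_name`; parts are joined in sorted name order.
    """
    directory = Path(directory)
    output = directory / output_name
    # Collect the parts before opening the output, and skip the output itself
    # in case it already exists from an earlier run and matches the prefix.
    parts = sorted(p for p in directory.glob(prefix + "*") if p != output)
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streamed copy, no full read into memory
    return output
```

For example, `join_parts(".", "vox1_dev", "vox1_dev_wav.zip")` corresponds to the first command above.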
Citation Information
@article{Nagrani19, author = "Arsha Nagrani and Joon~Son Chung and Weidi Xie and… See the full description on the dataset page: https://huggingface.co/datasets/ProgramComputer/voxceleb.
Dataset for MIMIC-IV (mimic4) data, configured by default for the Mortality task. Available tasks are: Mortality, Length of Stay, Readmission, Phenotype. The data is extracted from the MIMIC-IV database using this pipeline: https://github.com/healthylaife/MIMIC-IV-Data-Pipeline/tree/main. The mimic path should have this form: "path/to/mimic4data/from/username/mimiciv/2.2". If you choose a Custom task, provide a configuration file for the time series. Currently works with MIMIC-IV ICU data.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
https://choosealicense.com/licenses/other/
Terms of Use
By using the dataset, you agree to comply with the dataset license (CC-by-4.0-Deed).
Download Instructions
To download one file, please use:
from huggingface_hub import hf_hub_download
local_directory = 'LOCAL_DIRECTORY'
filepath = 'FILE_PATH'
repo_id = "climateset/climateset"
repo_type = "dataset"
hf_hub_download(repo_id=repo_id… See the full description on the dataset page: https://huggingface.co/datasets/climateset/climateset.
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:
from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
https://choosealicense.com/licenses/other/
Dataset Card for OPUS Books
Dataset Summary
This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
X Character Files
Generate a persona based on your X data.
Walkthrough
First, download an archive of your X data here: https://help.x.com/en/managing-your-account/how-to-download-your-x-archive
This could take 24 hours or more.
Then, run tweets2character directly from your command line:
npx tweets2character
NOTE: You need an API key to use Claude or OpenAI. If everything is correct, you'll see a loading bar as the script processes your data and generates a character… See the full description on the dataset page: https://huggingface.co/datasets/NapthaAI/twitter_personas.
Dataset Card for Fish-Visual Trait Analysis (Fish-Vista)
Note that the '' option will only load the CSV files. To download the entire dataset, including all processed images and segmentation annotations, refer to Instructions for downloading dataset and images. See [Example Code to Use the Segmentation Dataset](https://huggingface.co/datasets/imageomics/fish-vista#example-code-to-use-the-segmentation-dataset).
Figure 1. A schematic representation of the… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/fish-vista.