100+ datasets found

Data from: huggingface
kaggle.com
Updated Mar 22, 2022
Cite
amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 22, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
amulil
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Dataset

This dataset was created by amulil

Released under GPL 2

Contents
Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 16, 2024
Cite
Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10058142
Dataset updated
Jan 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout, which should be redirected to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`

## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
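
The two inter-rater agreement files above lend themselves to a short script. The following is a minimal sketch in Python, not the authors' `statistics.r`: it assumes Cohen's kappa as the agreement measure (the package description does not name the statistic), and the two label lists are made-up placeholders rather than values from `RQ2/RQ2_isBiased.csv`.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: does the model card document bias? (assumption, not real data)
rater_1 = [True, True, False, False, True, False]
rater_2 = [True, False, False, False, True, False]
kappa = cohens_kappa(rater_1, rater_2)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.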

## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
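
A contingency table like the one described above can be derived from model/project license pairs. The sketch below is illustrative only: the function name `contingency_table` and the sample pairs are hypothetical, and only the four permissiveness level names come from the file list above.

```python
from collections import Counter

# Permissiveness levels named in the RQ3 license-list files.
LEVELS = ["PERMISSIVE", "WEAK-RESTRICTIVE", "RESTRICTIVE", "NETWORK-RESTRICTIVE"]


def contingency_table(pairs):
    """Count (model_level, project_level) pairs.

    Rows follow the models' license levels and columns follow the
    projects' license levels, both in the order given by LEVELS.
    """
    counts = Counter(pairs)
    return [[counts[(model, project)] for project in LEVELS] for model in LEVELS]


# Hypothetical usage pairs (model license level, project license level).
usage_pairs = [
    ("PERMISSIVE", "PERMISSIVE"),
    ("PERMISSIVE", "RESTRICTIVE"),
    ("RESTRICTIVE", "PERMISSIVE"),
    ("PERMISSIVE", "PERMISSIVE"),
]
table = contingency_table(usage_pairs)
```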

## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Animeheads Dataset
universe.roboflow.com
zip
Updated Jul 7, 2025
Cite
nyuuzyou (2025). Animeheads Dataset [Dataset]. https://universe.roboflow.com/nyuuzyou/animeheads
Available download formats: zip
Dataset updated
Jul 7, 2025
Dataset authored and provided by
nyuuzyou
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Head Bounding Boxes
Description
AnimeHeads Object Detection Dataset

The AnimeHeadsv3 Object Detection Dataset is a collection of anime and art images, including manga pages, that have been annotated with object bounding boxes for use in object detection tasks. This dataset was used to train the final version of the Anime Object Detection Models, based on the YOLOv8l architecture.

Contents

The dataset contains a total of 8037 images, split into training, validation, and testing sets. The images were collected from various sources and include a variety of anime and art styles, including manga.

Each annotation file contains the bounding box coordinates and label for each object in the corresponding image. The dataset has only one class, named "head".

Usage

To use this dataset for object detection tasks, you can download the dataset files and annotations and use them to train your own object detection model.

Pre-trained models based on this dataset are available on Hugging Face at the following link: https://huggingface.co/nyuuzyou/AnimeHeads
mmcows
huggingface.co
Updated Mar 4, 2025
Cite
Purdue NEIS Lab (2025). mmcows [Dataset]. http://doi.org/10.57967/hf/5965
Unique identifier
https://doi.org/10.57967/hf/5965
Dataset updated
Mar 4, 2025
Dataset authored and provided by
Purdue NEIS Lab
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
MmCows: A Multimodal Dataset for Dairy Cattle Monitoring

Details of the dataset and benchmarks are available here. For a quick overview of the dataset, please check this video.

Instructions for downloading

1. Install requirements

pip install huggingface_hub

See the file structure here for the next step.

2. Download a file individually

To download visual_data.zip to your local-dir, use command line: huggingface-cli download
neis-lab/mmcows \… See the full description on the dataset page: https://huggingface.co/datasets/neis-lab/mmcows.
MNAD Dataset
paperswithcode.com
Updated May 16, 2023
Cite
(2023). MNAD Dataset [Dataset]. https://paperswithcode.com/dataset/mnad
Dataset updated
May 16, 2023
Description
About the MNAD Dataset

The MNAD corpus is a collection of over 1 million Moroccan news articles written in the modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

Dataset Fields

Title: The title of the article
Body: The body of the article
Category: The category of the article
Source: The electronic newspaper source of the article

About Version 1 of the Dataset (MNAD.v1)

Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.

The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been used as a benchmark in the research paper "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization", in the 2021 International Conference on Decision Aid Sciences and Application (DASA).

This dataset is available for download from the following sources:
- Kaggle Datasets: MNADv1
- Huggingface Datasets: MNADv1

About Version 2 of the Dataset (MNAD.v2)

Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.

The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.

Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
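
The cleaning steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the word-count thresholds, the "mostly Arabic" ratio, the treatment of empty or missing fields as NaN, and the function names are all hypothetical; only the field names (Title, Body, Category, Source) come from the dataset description.

```python
import re

# Basic Arabic Unicode block; used to decide whether a text is mostly Arabic.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")
FIELDS = ("Title", "Body", "Category", "Source")


def clean_text(text):
    """Replace new lines with a space and collapse runs of spaces."""
    text = text.replace("\r", " ").replace("\n", " ")
    return re.sub(r" {2,}", " ", text).strip()


def is_mostly_arabic(text, min_ratio=0.5):
    """True when at least min_ratio of the letters are Arabic (threshold is an assumption)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_CHAR.match(ch))
    return arabic / len(letters) >= min_ratio


def preprocess(rows, min_words=3, max_words=5000):
    """Apply the listed cleaning steps to an iterable of dict rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Discard rows with missing values (stand-in for NaN handling).
        if any(row.get(field) in (None, "") for field in FIELDS):
            continue
        title = clean_text(row["Title"])
        body = clean_text(row["Body"])
        # Exclude very short and very long articles (hypothetical limits).
        n_words = len(body.split())
        if n_words < min_words or n_words > max_words:
            continue
        if not is_mostly_arabic(body):
            continue
        # Remove duplicates, keyed on the cleaned title and body.
        key = (title, body)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**row, "Title": title, "Body": body})
    return cleaned


# Tiny made-up sample: one valid row, its duplicate, an English row,
# a row with a missing body, and a too-short row.
sample_rows = [
    {"Title": "عنوان", "Body": "خبر  عاجل\nمن المغرب اليوم", "Category": "c", "Source": "s"},
    {"Title": "عنوان", "Body": "خبر  عاجل\nمن المغرب اليوم", "Category": "c", "Source": "s"},
    {"Title": "Title", "Body": "This is an English article body", "Category": "c", "Source": "s"},
    {"Title": "عنوان", "Body": None, "Category": "c", "Source": "s"},
    {"Title": "عنوان", "Body": "خبر", "Category": "c", "Source": "s"},
]
cleaned_rows = preprocess(sample_rows)
```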

This dataset is available for download from the following sources:
- Kaggle Datasets: MNADv2
- Huggingface Datasets: MNADv2

Citation

If you use our data, please cite the following paper:

@inproceedings{MNAD2021,
  author = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi = {10.1109/dasa53625.2021.9682402},
  url = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
SlimPajama-627B
huggingface.co
opendatalab.com
Updated Oct 2, 2012
Cite
Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 2, 2012
Dataset authored and provided by
Cerebras
Description
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

Getting Started

You can download the dataset using Hugging Face datasets:

from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
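
Because the dataset is distributed as jsonl files, records can also be read one line at a time without loading a whole file into memory. The sketch below uses only the Python standard library and assumes an already decompressed file; the helper name `iter_jsonl`, the demo file, and its "text" field are hypothetical and do not describe the actual SlimPajama schema.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a tiny temporary file (made-up content, not real dataset records).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"text": "first document"}\n\n{"text": "second document"}\n')
    records = list(iter_jsonl(demo_path))
```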

Background

Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
YesBut Dataset
paperswithcode.com
Updated Sep 19, 2024
Cite
Abhilash Nandy; Yash Agarwal; Ashish Patwa; Millon Madhur Das; Aman Bansal; Ankit Raj; Pawan Goyal; Niloy Ganguly (2024). YesBut Dataset [Dataset]. https://paperswithcode.com/dataset/yesbut
Dataset updated
Sep 19, 2024
Authors
Abhilash Nandy; Yash Agarwal; Ashish Patwa; Millon Madhur Das; Aman Bansal; Ankit Raj; Pawan Goyal; Niloy Ganguly
Description
YesBut Dataset (https://yesbut-dataset.github.io) Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.

Dataset Details

YesBut Dataset consists of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles. Each satirical image is posed in a “Yes, But” format, where the left half of the image depicts a normal scenario, while the right half depicts a conflicting scenario which is funny or ironic.

Currently on huggingface, the dataset contains the satirical images from 3 different stages of annotation, along with corresponding metadata and image descriptions.

Download non-satirical images from the following Google Drive links:
- https://drive.google.com/file/d/1Tzs4OcEJK469myApGqOUKPQNUtVyTRDy/view?usp=sharing - Non-Satirical Images annotated in Stage 3
- https://drive.google.com/file/d/1i4Fy01uBZ_2YGPzyVArZjijleNbt8xRu/view?usp=sharing - Non-Satirical Images annotated in Stage 4

Dataset Description

The YesBut dataset is a high-quality annotated dataset designed to evaluate the satire comprehension capabilities of vision-language models. It consists of 2547 multimodal images, 1084 of which are satirical, while 1463 are non-satirical. The dataset covers a variety of artistic styles, such as colorized sketches, 2D stick figures, and 3D stick figures, making it highly diverse. The satirical images follow a "Yes, But" structure, where the left half depicts a normal scenario, and the right half contains an ironic or humorous twist. The dataset is curated to challenge models in three main tasks: Satirical Image Detection, Understanding, and Completion. An additional dataset of 119 real, satirical photographs is also provided to further assess real-world satire comprehension.

Curated by: Annotators who met the qualification criteria of being undergraduate sophomore students or above, enrolled in English-medium colleges.
Language(s) (NLP): English
License: Apache license 2.0

Dataset Sources

Repository: https://github.com/abhi1nandy2/yesbut_dataset
Paper: HuggingFace: https://huggingface.co/papers/2409.13592, ArXiv: https://arxiv.org/abs/2409.13592
This work has been accepted as a long paper in EMNLP Main 2024.

Uses

Direct Use

The YesBut dataset is intended for benchmarking the performance of vision-language models on satire and humor comprehension tasks. Researchers and developers can use this dataset to test their models on Satirical Image Detection (classifying an image as satirical or non-satirical), Satirical Image Understanding (generating the reason behind the satire), and Satirical Image Completion (choosing the correct half of an image that makes the completed image satirical). This dataset is particularly suitable for models developed for image-text reasoning and cross-modal understanding.

Out-of-Scope Use

YesBut Dataset is not recommended for tasks unrelated to multimodal understanding, such as basic image classification without the context of humor or satire.

Dataset Structure

The YesBut dataset contains two types of images: satirical and non-satirical. Satirical images are annotated with textual descriptions for both the left and right sub-images, as well as an overall description containing the punchline. Non-satirical images are also annotated but lack the element of irony or contradiction. The dataset is divided into multiple annotation stages (Stage 2, Stage 3, Stage 4), with each stage including a mix of original and generated sub-images. The data is stored in image and metadata formats, with the annotations being key to the benchmarking tasks.

Dataset Creation

Curation Rationale

The YesBut dataset was curated to address the gap in the ability of existing vision-language models to comprehend satire and humor. Satire comprehension involves a higher level of reasoning and understanding of context, irony, and social cues, making it a challenging task for models. By including a diverse range of images and satirical scenarios, the dataset aims to push the boundaries of multimodal understanding in artificial intelligence.

Source Data

Data Collection and Processing

The images in the YesBut dataset were collected from social media platforms, particularly from the "X" (formerly Twitter) handle @yesbut. The original satirical images were manually annotated, and additional synthetic images were generated using DALL-E 3. These images were then manually labeled as satirical or non-satirical. The annotations include textual descriptions, binary features (e.g., presence of text), and difficulty ratings based on the difficulty of understanding satire/irony in the images. Data processing involved adding new variations of sub-images to expand the dataset's diversity.

Who are the source data producers?

Prior to annotation and expansion of the dataset, images were downloaded from the posts of the "X" (formerly Twitter) handle @yesbut (with proper consent).

Annotation process

The annotation process was carried out in four stages: (1) collecting satirical images from social media, (2) manual annotation of the images, (3) generating additional 2D stick figure images using DALL-E 3, and (4) generating 3D stick figure images.

Who are the annotators?

YesBut was curated by annotators who met the qualification criteria of being undergraduate sophomore students or above, enrolled in English-medium colleges.

Personal and Sensitive Information

The images in YesBut Dataset do not include any personally identifiable information, and the annotations are general descriptions related to the satirical content of the images.

Bias, Risks, and Limitations

Subjectivity of annotations: The annotation task involves utilizing background knowledge that may differ among annotators. Consequently, we manually reviewed the annotations to minimize the number of incorrect annotations in the dataset. However, some subjectivity still remains.

Extension to languages other than English: This work is in the English language. However, we plan to extend our work to languages other than English.

Citation

BibTeX:

@article{nandy2024yesbut, title={YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models}, author={Nandy, Abhilash and Agarwal, Yash and Patwa, Ashish and Das, Millon Madhur and Bansal, Aman and Raj, Ankit and Goyal, Pawan and Ganguly, Niloy}, journal={arXiv preprint arXiv:2409.13592}, year={2024} }

APA:

Nandy, A., Agarwal, Y., Patwa, A., Das, M. M., Bansal, A., Raj, A., ... & Ganguly, N. (2024). YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models. arXiv preprint arXiv:2409.13592.

Dataset Card Contact

Get in touch at nandyabhilash@gmail.com
opus_books
huggingface.co
Updated Mar 29, 2024
Cite
Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for OPUS Books

Dataset Summary

This is a collection of copyright-free books aligned by Andras Farkas, available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the metadata at the top of the corpus files in XML). The source is multilingually aligned and is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
TinyStories
huggingface.co
paperswithcode.com
Updated May 16, 2023
Cite
Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 16, 2023
Authors
Ronen Eldan
License
https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
wiki_qa
huggingface.co
opendatalab.com
Cite
Microsoft, wiki_qa [Dataset]. https://huggingface.co/datasets/microsoft/wiki_qa
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset authored and provided by
Microsoft (http://microsoft.com/)
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "wiki_qa"

Dataset Summary

Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure Data Instances default

Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
super_glue
huggingface.co
opendatalab.com
Updated May 23, 2024
Cite
Amanpreet Singh (2024). super_glue [Dataset]. https://huggingface.co/datasets/aps/super_glue
Dataset updated
May 23, 2024
Authors
Amanpreet Singh
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "super_glue"

Dataset Summary

SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure Data Instances axb

Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.
aeslc
huggingface.co
Updated Jan 26, 2022
Cite
Yale LILY Lab (2022). aeslc [Dataset]. https://huggingface.co/datasets/Yale-LILY/aeslc
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 26, 2022
Dataset authored and provided by
Yale LILY Lab
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "aeslc"

Dataset Summary

A collection of email messages of employees in the Enron Corporation. There are two features:

email_body: email body text. subject_line: email subject text.

Supported Tasks and Leaderboards

More Information Needed

Languages

Monolingual English (mainly en-US) with some exceptions.

Dataset Structure Data Instances default

Size of downloaded dataset files: 11.64 MB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Yale-LILY/aeslc.
Mimic4Dataset
huggingface.co
Updated Jul 7, 2023
Cite
Thouria (2023). Mimic4Dataset [Dataset]. https://huggingface.co/datasets/thbndi/Mimic4Dataset
Dataset updated
Jul 7, 2023
Authors
Thouria
Description
Dataset for mimic4 data, by default for the Mortality task. Available tasks are: Mortality, Length of Stay, Readmission, Phenotype. The data is extracted from the mimic4 database using this pipeline: https://github.com/healthylaife/MIMIC-IV-Data-Pipeline/tree/main. The mimic path should have this form: "path/to/mimic4data/from/username/mimiciv/2.2". If you choose a Custom task, provide a configuration file for the time series. Currently working with Mimic-IV ICU data.
TactileNet
huggingface.co
Updated Jun 10, 2025
Cite
Mai Ahmed (2025). TactileNet [Dataset]. https://huggingface.co/datasets/MaiAhmed/TactileNet
Dataset updated
Jun 10, 2025
Authors
Mai Ahmed
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
How to use TactileNet:

Step 1: Download the dataset locally

git lfs install
git clone https://huggingface.co/datasets/MaiAhmed/TactileNet

Step 2: Install necessary packages

pip install datasets

Step 3: Load the dataset

import os
from datasets import Dataset, Image

def load_data(dataset_path):
    data = []
    for root, dirs, files in os.walk(dataset_path):
        for file in files:
            if file.endswith(".jpg"):
                # Extract…

See the full description on the dataset page: https://huggingface.co/datasets/MaiAhmed/TactileNet.
voxceleb
huggingface.co
Updated Aug 27, 2023
Cite
Paul C (2023). voxceleb [Dataset]. http://doi.org/10.57967/hf/0999
Unique identifier
https://doi.org/10.57967/hf/0999
Dataset updated
Aug 27, 2023
Authors
Paul C
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes both VoxCeleb and VoxCeleb2

Multipart Zips

Already joined zips for convenience, but these specified files are NOT part of the original datasets:
vox2_mp4_1.zip - vox2_mp4_6.zip
vox2_aac_1.zip - vox2_aac_2.zip

Joining Zip

cat vox1_dev* > vox1_dev_wav.zip

cat vox2_dev_aac* > vox2_aac.zip

cat vox2_dev_mp4* > vox2_mp4.zip
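
The `cat` commands above concatenate the parts in the order in which the shell expands the glob. A rough Python equivalent is sketched below for platforms without `cat`. It assumes that sorting the matching file names reproduces the intended part order; the helper name `join_parts` and the demo file names are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def join_parts(pattern, output_path):
    """Concatenate all files matching `pattern`, in sorted name order, into `output_path`."""
    out_abs = os.path.abspath(output_path)
    # Resolve the part list before creating the output, and never read the output itself.
    parts = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are not loaded into memory.
                shutil.copyfileobj(src, out)
    return parts


# Demo with two tiny made-up parts written out of order.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, payload in (("demo_partb", b"world"), ("demo_parta", b"hello ")):
        with open(os.path.join(tmp_dir, name), "wb") as handle:
            handle.write(payload)
    joined_path = os.path.join(tmp_dir, "joined.bin")
    used_parts = join_parts(os.path.join(tmp_dir, "demo_part*"), joined_path)
    with open(joined_path, "rb") as handle:
        joined_bytes = handle.read()
```

For the files above, a call such as `join_parts("vox1_dev*", "vox1_dev_wav.zip")` would mirror the first `cat` command.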

Citation Information

@article{Nagrani19, author = "Arsha Nagrani and Joon~Son Chung and Weidi Xie and… See the full description on the dataset page: https://huggingface.co/datasets/ProgramComputer/voxceleb.
clue
huggingface.co
Updated Aug 28, 2023
Cite
CLUE benchmark (2023). clue [Dataset]. https://huggingface.co/datasets/clue/clue
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 28, 2023
Dataset authored and provided by
CLUE benchmark
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "clue"

Dataset Summary

CLUE, A Chinese Language Understanding Evaluation Benchmark (https://www.cluebenchmarks.com/) is a collection of resources for training, evaluating, and analyzing Chinese language understanding systems.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure Data Instances afqmc

Size of downloaded dataset files: 1.20 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/clue/clue.
coco-2017-mirror
huggingface.co
Cite
Pedro Cuenca, coco-2017-mirror [Dataset]. https://huggingface.co/datasets/pcuenq/coco-2017-mirror
Authors
Pedro Cuenca
License
https://choosealicense.com/licenses/other/
Description
COCO 2017 mirror

This is just a mirror of the raw COCO dataset files, for convenience. You have to download it using something like:

pip install huggingface_hub

huggingface-cli download --local-dir coco-2017 pcuenq/coco-2017-mirror

And then unzip the files before use.
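
The unzip step can be scripted with the Python standard library. The sketch below assumes the downloaded archives sit directly inside the local directory (for example `coco-2017` from the command above); that layout, the helper name `unzip_all`, and the demo archive are assumptions rather than facts about the mirror.

```python
import os
import tempfile
import zipfile


def unzip_all(directory):
    """Extract every .zip file found directly in `directory` into a folder named after it."""
    extracted = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".zip"):
            continue
        archive_path = os.path.join(directory, name)
        target_dir = os.path.join(directory, name[:-4])  # strip the ".zip" suffix
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
        extracted.append(target_dir)
    return extracted


# Demo on a temporary directory holding one tiny made-up archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    with zipfile.ZipFile(os.path.join(tmp_dir, "demo.zip"), "w") as archive:
        archive.writestr("hello.txt", "hi")
    extracted_dirs = [os.path.basename(path) for path in unzip_all(tmp_dir)]
    with open(os.path.join(tmp_dir, "demo", "hello.txt"), encoding="utf-8") as handle:
        extracted_text = handle.read()
```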
pubmed
huggingface.co
Updated Dec 15, 2023
Cite
NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed
Dataset updated
Dec 15, 2023
Dataset authored and provided by
NLM/DIR BioNLP Group
License
https://choosealicense.com/licenses/other/
Description
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
LLaVA-NeXT-Data
huggingface.co
Cite
LMMs-Lab, LLaVA-NeXT-Data [Dataset]. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset authored and provided by
LMMs-Lab
Description
Dataset Card for LLaVA-NeXT

We provide the full details of the LLaVA-NeXT Dataset. This dataset includes the data that was used in the instruction tuning stage for LLaVA-NeXT and LLaVA-NeXT(stronger). Aug 30, 2024: We updated the dataset with the raw format (decompress it to get the json file and the images in a structured folder); you can download it directly if you are familiar with the LLaVA data format.

Dataset Sources

Compared to the instruction data mixture for LLaVA-1.5… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data.
scannetpp_v2_preprocessed
huggingface.co
Updated Jul 10, 2025
Cite
Gen3D (2025). scannetpp_v2_preprocessed [Dataset]. https://huggingface.co/datasets/Gen3DF/scannetpp_v2_preprocessed
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Gen3D
Description
Scannetpp V2 Preprocessed Dataset

This dataset contains the scannetpp_v2_preprocessed dataset split into chunks for easier download.

Files

Original file: scannetpp_v2_preprocessed.zip (~31GB)
Chunks: 31 files (~1GB each)
Scripts: merge.sh, download.py, unzip.sh

Usage

Download all files:

git clone https://huggingface.co/datasets/Gen3DF/scannetpp_v2_preprocessed
cd scannetpp_v2_preprocessed/scannetpp_v2_preprocessed

Reassemble the original file:

chmod +x… See the full description on the dataset page: https://huggingface.co/datasets/Gen3DF/scannetpp_v2_preprocessed.