100+ datasets found
  1. Data from: huggingface

    • kaggle.com
    Updated Mar 22, 2022
    Cite
    amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    amulil
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by amulil

    Released under GPL 2

    Contents

  2. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 16, 2024
    Cite
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Available download formats: zip
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv`: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
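The correlation file pairs each model's download count with the number of projects reusing it, which is the kind of input a usage/download correlation needs. The following is a minimal Python sketch of a rank correlation over such rows. Spearman's coefficient is an assumption (the description does not say which coefficient `statistics.r` computes), and the column names `downloads` and `num_projects` are hypothetical.

```python
import csv
import io


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def correlation_from_csv(text, x_col="downloads", y_col="num_projects"):
    """Read CSV text with hypothetical column names and correlate them."""
    rows = list(csv.DictReader(io.StringIO(text)))
    xs = [float(r[x_col]) for r in rows]
    ys = [float(r[y_col]) for r in rows]
    return spearman(xs, ys)
```

To use it on the real file, read the CSV text and pass the actual header names as `x_col` and `y_col`.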

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script that analyzes model tags for the presence of datasets. It requires `modelsInfo.zip` to be unzipped into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file so that it can be analyzed by the `RQ1/RQ1_countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
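The four Booleans produced per model by the counting script can be illustrated with a short sketch. This is not the replication package's code: the function name, its arguments, and the idea of passing the declared datasets as a plain list are all assumptions made for illustration.

```python
def classify_declared_datasets(declared, hf_datasets, manual_sample, model_id):
    """Classify one model by where its declared datasets live.

    declared      -- dataset names declared in the model's tags (hypothetical input)
    hf_datasets   -- set of dataset names known to be hosted on Hugging Face
    manual_sample -- set of model ids selected for manual analysis
    """
    declared = set(declared)
    on_hub = declared & hf_datasets
    external = declared - hf_datasets
    return {
        "only_hf": bool(on_hub) and not external,
        "only_external": bool(external) and not on_hub,
        "both": bool(on_hub) and bool(external),
        "in_manual_sample": model_id in manual_sample,
    }
```

A model that declares no dataset at all gets `False` for the first three flags, so the three dataset categories stay mutually exclusive.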

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
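A contingency table like the one in this list (models' licenses as rows, projects' licenses as columns) can be built from project-model license pairs in a few lines. The sketch below assumes the pairs are already available as `(model_license, project_license)` tuples; that input shape and the function name are hypothetical.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (model_license, project_license) pairs into a row/column table.

    Returns (row_labels, column_labels, table), where table[i][j] is the
    number of usages with model license row_labels[i] and project license
    column_labels[j].
    """
    counts = Counter(pairs)
    rows = sorted({model for model, _ in counts})
    cols = sorted({project for _, project in counts})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(m, p)] for p in cols] for m in rows]
    return rows, cols, table
```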

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  3. Animeheads Dataset

    • universe.roboflow.com
    zip
    Updated Jul 7, 2025
    + more versions
    Cite
    nyuuzyou (2025). Animeheads Dataset [Dataset]. https://universe.roboflow.com/nyuuzyou/animeheads
    Available download formats: zip
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    nyuuzyou
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Head Bounding Boxes
    Description

    AnimeHeads Object Detection Dataset

    The AnimeHeadsv3 Object Detection Dataset is a collection of anime and art images, including manga pages, that have been annotated with object bounding boxes for use in object detection tasks. This dataset was used to train the final version of the Anime Object Detection Models, based on the YOLOv8l architecture.

    Contents

    The dataset contains a total of 8037 images, split into training, validation, and testing sets. The images were collected from various sources and include a variety of anime and art styles, including manga.

    Each annotation file contains the bounding box coordinates and label for each object in the corresponding image. The dataset has only one class, named "head".

    Usage

    To use this dataset for object detection tasks, you can download the dataset files and annotations and use them to train your own object detection model.

    Pre-trained models based on this dataset are available on Hugging Face: https://huggingface.co/nyuuzyou/AnimeHeads

  4. aeslc

    • huggingface.co
    Cite
    Yale LILY Lab, aeslc [Dataset]. https://huggingface.co/datasets/Yale-LILY/aeslc
    Dataset authored and provided by
    Yale LILY Lab
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for "aeslc"

      Dataset Summary
    

    A collection of email messages of employees in the Enron Corporation. There are two features:

    email_body: email body text.
    subject_line: email subject text.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    Monolingual English (mainly en-US) with some exceptions.

      Dataset Structure

      Data Instances

      default

    Size of downloaded dataset files: 11.64 MB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Yale-LILY/aeslc.

  5. VLM4Bio

    • huggingface.co
    Updated May 30, 2025
    Cite
    HDR Imageomics Institute (2025). VLM4Bio [Dataset]. http://doi.org/10.57967/hf/3393
    Dataset updated
    May 30, 2025
    Dataset authored and provided by
    HDR Imageomics Institute
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for VLM4Bio

      Instructions for downloading the dataset
    

    Install Git LFS
    Git clone the VLM4Bio repository to download all metadata and associated files
    Run the following commands in a terminal:

    git clone https://huggingface.co/datasets/imageomics/VLM4Bio
    cd VLM4Bio

    Downloading and processing bird images

    To download the bird images, run the following command:

    bash download_bird_images.sh

    This should download the bird images inside datasets/Bird/images… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/VLM4Bio.

  6. YesBut Dataset

    • paperswithcode.com
    Updated Sep 19, 2024
    + more versions
    Cite
    Abhilash Nandy; Yash Agarwal; Ashish Patwa; Millon Madhur Das; Aman Bansal; Ankit Raj; Pawan Goyal; Niloy Ganguly (2024). YesBut Dataset [Dataset]. https://paperswithcode.com/dataset/yesbut
    Dataset updated
    Sep 19, 2024
    Authors
    Abhilash Nandy; Yash Agarwal; Ashish Patwa; Millon Madhur Das; Aman Bansal; Ankit Raj; Pawan Goyal; Niloy Ganguly
    Description

    YesBut Dataset (https://yesbut-dataset.github.io) Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.

    Dataset Details

    YesBut Dataset consists of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles. Each satirical image is posed in a “Yes, But” format, where the left half of the image depicts a normal scenario, while the right half depicts a conflicting scenario which is funny or ironic.

    Currently on huggingface, the dataset contains the satirical images from 3 different stages of annotation, along with corresponding metadata and image descriptions.

    Download non-satirical images from the following Google Drive links:
    - https://drive.google.com/file/d/1Tzs4OcEJK469myApGqOUKPQNUtVyTRDy/view?usp=sharing - Non-Satirical Images annotated in Stage 3
    - https://drive.google.com/file/d/1i4Fy01uBZ_2YGPzyVArZjijleNbt8xRu/view?usp=sharing - Non-Satirical Images annotated in Stage 4

    Dataset Description

    The YesBut dataset is a high-quality annotated dataset designed to evaluate the satire comprehension capabilities of vision-language models. It consists of 2547 multimodal images, 1084 of which are satirical, while 1463 are non-satirical. The dataset covers a variety of artistic styles, such as colorized sketches, 2D stick figures, and 3D stick figures, making it highly diverse. The satirical images follow a "Yes, But" structure, where the left half depicts a normal scenario, and the right half contains an ironic or humorous twist. The dataset is curated to challenge models in three main tasks: Satirical Image Detection, Understanding, and Completion. An additional dataset of 119 real, satirical photographs is also provided to further assess real-world satire comprehension.

    Curated by: Annotators who met the qualification criteria of being undergraduate sophomore students or above, enrolled in English-medium colleges.
    Language(s) (NLP): English
    License: Apache license 2.0

    Dataset Sources

    Repository: https://github.com/abhi1nandy2/yesbut_dataset
    Paper: HuggingFace - https://huggingface.co/papers/2409.13592; ArXiv - https://arxiv.org/abs/2409.13592
    This work has been accepted as a long paper in EMNLP Main 2024

    Uses

    Direct Use

    The YesBut dataset is intended for benchmarking the performance of vision-language models on satire and humor comprehension tasks. Researchers and developers can use this dataset to test their models on Satirical Image Detection (classifying an image as satirical or non-satirical), Satirical Image Understanding (generating the reason behind the satire), and Satirical Image Completion (choosing the correct half of an image that makes the completed image satirical). This dataset is particularly suitable for models developed for image-text reasoning and cross-modal understanding.

    Out-of-Scope Use

    YesBut Dataset is not recommended for tasks unrelated to multimodal understanding, such as basic image classification without the context of humor or satire.

    Dataset Structure

    The YesBut dataset contains two types of images: satirical and non-satirical. Satirical images are annotated with textual descriptions for both the left and right sub-images, as well as an overall description containing the punchline. Non-satirical images are also annotated but lack the element of irony or contradiction. The dataset is divided into multiple annotation stages (Stage 2, Stage 3, Stage 4), with each stage including a mix of original and generated sub-images. The data is stored in image and metadata formats, with the annotations being key to the benchmarking tasks.

    Dataset Creation

    Curation Rationale

    The YesBut dataset was curated to address the gap in the ability of existing vision-language models to comprehend satire and humor. Satire comprehension involves a higher level of reasoning and understanding of context, irony, and social cues, making it a challenging task for models. By including a diverse range of images and satirical scenarios, the dataset aims to push the boundaries of multimodal understanding in artificial intelligence.

    Source Data

    Data Collection and Processing

    The images in the YesBut dataset were collected from social media platforms, particularly from the "X" (formerly Twitter) handle @yesbut. The original satirical images were manually annotated, and additional synthetic images were generated using DALL-E 3. These images were then manually labeled as satirical or non-satirical. The annotations include textual descriptions, binary features (e.g., presence of text), and difficulty ratings based on the difficulty of understanding satire/irony in the images. Data processing involved adding new variations of sub-images to expand the dataset's diversity.

    Who are the source data producers?

    Prior to annotation and expansion of the dataset, images were downloaded from the posts in ‘X’ (erstwhile known as Twitter) handle @yesbut (with proper consent).

    Annotation process

    The annotation process was carried out in four stages: (1) collecting satirical images from social media, (2) manual annotation of the images, (3) generating additional 2D stick figure images using DALL-E 3, and (4) generating 3D stick figure images.

    Who are the annotators?

    YesBut was curated by annotators who met the qualification criteria of being undergraduate sophomore students or above, enrolled in English-medium colleges.

    Personal and Sensitive Information

    The images in YesBut Dataset do not include any personal identifiable information, and the annotations are general descriptions related to the satirical content of the images.

    Bias, Risks, and Limitations

    Subjectivity of annotations: The annotation task involves utilizing background knowledge that may differ among annotators. Consequently, we manually reviewed the annotations to minimize the number of incorrect annotations in the dataset. However, some subjectivity still remains.

    Extension to languages other than English: This work is in the English language. However, we plan to extend our work to languages other than English.

    Citation

    BibTeX:

    @article{nandy2024yesbut,
      title={YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models},
      author={Nandy, Abhilash and Agarwal, Yash and Patwa, Ashish and Das, Millon Madhur and Bansal, Aman and Raj, Ankit and Goyal, Pawan and Ganguly, Niloy},
      journal={arXiv preprint arXiv:2409.13592},
      year={2024}
    }

    APA:

    Nandy, A., Agarwal, Y., Patwa, A., Das, M. M., Bansal, A., Raj, A., ... & Ganguly, N. (2024). YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models. arXiv preprint arXiv:2409.13592.

    Dataset Card Contact

    Get in touch at nandyabhilash@gmail.com

  7. pubmed

    • huggingface.co
    Updated Dec 15, 2023
    + more versions
    Cite
    NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed
    Dataset updated
    Dec 15, 2023
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

    https://choosealicense.com/licenses/other/

    Description

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
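The baseline-plus-updates model described here (an annual baseline, then daily files carrying new, revised, and deleted citations) amounts to replaying updates over a keyed record set. The sketch below illustrates that idea only. The in-memory shape is hypothetical: records are assumed to be already parsed from XML into dictionaries keyed by an identifier, with an update represented as `{"citations": {...}, "deleted": [...]}`.

```python
def apply_update(baseline, update):
    """Return a new {id: record} mapping after applying one update file.

    baseline -- mapping of citation id to record (hypothetical parsed form)
    update   -- dict with "citations" (new or revised records, keyed by id)
                and "deleted" (ids to remove); both keys are assumptions
    """
    merged = dict(baseline)  # leave the caller's baseline untouched
    for citation_id, record in update.get("citations", {}).items():
        merged[citation_id] = record  # insert new or overwrite revised
    for citation_id in update.get("deleted", []):
        merged.pop(citation_id, None)  # ignore ids that are already absent
    return merged
```

Applying the daily files in release order on top of the baseline yields the current citation set under these assumptions.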

  8. wiki_qa

    • huggingface.co
    • opendatalab.com
    Cite
    Microsoft, wiki_qa [Dataset]. https://huggingface.co/datasets/microsoft/wiki_qa
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "wiki_qa"

      Dataset Summary
    

    Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure

      Data Instances

      default

    Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.

  9. LLaVA-NeXT-Data

    • huggingface.co
    + more versions
    Cite
    LMMs-Lab, LLaVA-NeXT-Data [Dataset]. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
    Dataset authored and provided by
    LMMs-Lab
    Description

    Dataset Card for LLaVA-NeXT

    We provide the full details of the LLaVA-NeXT dataset. This dataset includes the data used in the instruction-tuning stage for LLaVA-NeXT and LLaVA-NeXT(stronger). Aug 30, 2024: We updated the dataset with a raw format (decompressed into JSON files and images in a structured folder); you can download them directly if you are familiar with the LLaVA data format.

      Dataset Sources
    

    Compared to the instruction data mixture for LLaVA-1.5… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data.

  10. deepfake_ecg

    • huggingface.co
    Updated May 1, 2021
    Cite
    DeepSynthBody (2021). deepfake_ecg [Dataset]. https://huggingface.co/datasets/deepsynthbody/deepfake_ecg
    Dataset updated
    May 1, 2021
    Dataset authored and provided by
    DeepSynthBody
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DeepFake electrocardiograms: the beginning of the end for privacy issues in medicine

    Paper GitHub Original-data-source PyPI

      How to download

      Option 1

    from datasets import load_dataset
    dataset = load_dataset("deepsynthbody/deepfake_ecg")

      Option 2

    git lfs install
    git clone https://huggingface.co/datasets/deepsynthbody/deepfake_ecg

    if you want to clone without large files – just their pointers

    prepend your git clone with the… See the full description on the dataset page: https://huggingface.co/datasets/deepsynthbody/deepfake_ecg.

  11. coco-2017-mirror

    • huggingface.co
    Cite
    Pedro Cuenca, coco-2017-mirror [Dataset]. https://huggingface.co/datasets/pcuenq/coco-2017-mirror
    Authors
    Pedro Cuenca
    License

    https://choosealicense.com/licenses/other/

    Description

    COCO 2017 mirror

    This is just a mirror of the raw COCO dataset files, for convenience. You have to download it using something like:

    pip install huggingface_hub

    huggingface-cli download --local-dir coco-2017 pcuenq/coco-2017-mirror

    And then unzip the files before use.
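The final unzip step can be scripted. The sketch below uses only Python's standard `zipfile` module and extracts every `.zip` archive in a directory into a sibling folder named after the archive. The function name and the one-folder-per-archive layout are illustrative choices, not something the dataset card prescribes.

```python
import zipfile
from pathlib import Path


def extract_all_zips(directory):
    """Extract each *.zip in `directory` into a folder named after the archive.

    Returns the list of extraction folders, in sorted archive order.
    """
    directory = Path(directory)
    extracted = []
    for archive in sorted(directory.glob("*.zip")):
        target = directory / archive.stem  # e.g. foo.zip -> foo/
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

For the download command above, calling `extract_all_zips("coco-2017")` would cover archives placed directly in the `--local-dir` folder; archives in subfolders would need `rglob` instead of `glob`.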

  12. mmcows

    • huggingface.co
    Updated Mar 4, 2025
    Cite
    Purdue NEIS Lab (2025). mmcows [Dataset]. http://doi.org/10.57967/hf/5965
    Dataset updated
    Mar 4, 2025
    Dataset authored and provided by
    Purdue NEIS Lab
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    MmCows: A Multimodal Dataset for Dairy Cattle Monitoring

    Details of the dataset and benchmarks are available here. For a quick overview of the dataset, please check this video.

      Instruction for downloading

      1. Install requirements

    pip install huggingface_hub

    See the file structure here for the next step.

      2. Download a file individually
    

    To download visual_data.zip to your local-dir, use command line: huggingface-cli download
    neis-lab/mmcows \… See the full description on the dataset page: https://huggingface.co/datasets/neis-lab/mmcows.

  13. voxceleb

    • huggingface.co
    Updated Aug 27, 2023
    Cite
    Paul C (2023). voxceleb [Dataset]. http://doi.org/10.57967/hf/0999
    Dataset updated
    Aug 27, 2023
    Authors
    Paul C
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes both VoxCeleb and VoxCeleb2

      Multipart Zips
    

    Already joined zips for convenience, but these specified files are NOT part of the original datasets:
    vox2_mp4_1.zip - vox2_mp4_6.zip
    vox2_aac_1.zip - vox2_aac_2.zip

      Joining Zip
    

    cat vox1_dev* > vox1_dev_wav.zip

    cat vox2_dev_aac* > vox2_aac.zip

    cat vox2_dev_mp4* > vox2_mp4.zip
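Where `cat` is not available (for example on Windows), the same joining step can be done in Python by concatenating the parts in sorted name order. This is a sketch under assumptions: the function name is hypothetical, and it relies on lexicographic sorting of file names giving the same order as the shell glob.

```python
import shutil
from pathlib import Path


def join_parts(directory, prefix, output_name):
    """Concatenate all files starting with `prefix` into `output_name`.

    Mirrors `cat prefix* > output_name`; the output file itself is excluded
    from the inputs in case it already exists and matches the prefix.
    """
    directory = Path(directory)
    output = directory / output_name
    parts = sorted(p for p in directory.glob(prefix + "*") if p != output)
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream in chunks, not all in memory
    return output
```

For example, `join_parts(".", "vox1_dev", "vox1_dev_wav.zip")` corresponds to the first command above.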

      Citation Information
    

    @article{Nagrani19, author = "Arsha Nagrani and Joon~Son Chung and Weidi Xie and… See the full description on the dataset page: https://huggingface.co/datasets/ProgramComputer/voxceleb.

  14. Mimic4Dataset

    • huggingface.co
    Updated Jul 7, 2023
    Cite
    Thouria (2023). Mimic4Dataset [Dataset]. https://huggingface.co/datasets/thbndi/Mimic4Dataset
    Dataset updated
    Jul 7, 2023
    Authors
    Thouria
    Description

    Dataset for mimic4 data, by default for the Mortality task. Available tasks are: Mortality, Length of Stay, Readmission, Phenotype. The data is extracted from the mimic4 database using this pipeline: https://github.com/healthylaife/MIMIC-IV-Data-Pipeline/tree/main. The mimic path should have this form: "path/to/mimic4data/from/username/mimiciv/2.2". If you choose a Custom task, provide a configuration file for the time series. Currently working with Mimic-IV ICU data.

  15. TinyStories

    • huggingface.co
    • paperswithcode.com
    • +1 more
    Updated May 16, 2023
    + more versions
    Cite
    Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
    Dataset updated
    May 16, 2023
    Authors
    Ronen Eldan
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

  16. climateset

    • huggingface.co
    Updated Mar 6, 2024
    Cite
    ClimateSet (2024). climateset [Dataset]. https://huggingface.co/datasets/climateset/climateset
    Dataset updated
    Mar 6, 2024
    Authors
    ClimateSet
    License

    https://choosealicense.com/licenses/other/

    Description

    Terms of Use

    By using the dataset, you agree to comply with the dataset license (CC-by-4.0-Deed).

      Download Instructions
    

    To download one file, please use:

    from huggingface_hub import hf_hub_download

    # Path of the directory where the data will be downloaded in your local machine
    local_directory = 'LOCAL_DIRECTORY'

    # Relative path of the file in the repository
    filepath = 'FILE_PATH'

    repo_id = "climateset/climateset"
    repo_type = "dataset"
    hf_hub_download(repo_id=repo_id… See the full description on the dataset page: https://huggingface.co/datasets/climateset/climateset.

  17. SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2012
    Cite
    Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Dataset updated
    Oct 2, 2012
    Dataset authored and provided by
    Cerebras
    Description

    The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

      Getting Started
    

    You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset
    ds = load_dataset("cerebras/SlimPajama-627B")
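Because the dataset is distributed as many jsonl files, individual files fetched outside of `load_dataset` can be read one record per line. The sketch below is a generic JSON Lines reader using only the standard library. It assumes the file has already been decompressed to plain text, and the `text` field in the usage note is a hypothetical field name.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Iterating lazily, for example `for record in iter_jsonl("chunk.jsonl"): print(record.get("text"))`, keeps memory use flat even for large files.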

      Background
    

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

  18. opus_books

    • huggingface.co
    Updated Mar 29, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for OPUS Books

      Dataset Summary
    

    This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.

  19. twitter_personas

    • huggingface.co
    Updated Nov 21, 2024
    Cite
    NapthaAI (2024). twitter_personas [Dataset]. https://huggingface.co/datasets/NapthaAI/twitter_personas
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    NapthaAI
    Description

    X Character Files

    Generate a persona based on your X data.

      Walkthrough
    

    First, download an archive of your X data here: https://help.x.com/en/managing-your-account/how-to-download-your-x-archive
    This could take 24 hours or more.
    Then, run tweets2character directly from your command line:

    npx tweets2character

    NOTE: You need an API key to use Claude or OpenAI. If everything is correct, you'll see a loading bar as the script processes your data and generates a character… See the full description on the dataset page: https://huggingface.co/datasets/NapthaAI/twitter_personas.

  20. fish-vista

    • huggingface.co
    Updated May 30, 2025
    Cite
    HDR Imageomics Institute (2025). fish-vista [Dataset]. http://doi.org/10.57967/hf/3471
    Dataset updated
    May 30, 2025
    Dataset authored and provided by
    HDR Imageomics Institute
    Description

    Dataset Card for Fish-Visual Trait Analysis (Fish-Vista)

    Note that the '' option will only load the CSV files. To download the entire dataset, including all processed images and segmentation annotations, refer to Instructions for downloading dataset and images. See Example Code to Use the Segmentation Dataset (https://huggingface.co/datasets/imageomics/fish-vista#example-code-to-use-the-segmentation-dataset).

    Figure 1. A schematic representation of the… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/fish-vista.

Data from: huggingface

download pretrained model files
