39 datasets found
  1.

    1450-RAG-Preprocessing-Data

    • huggingface.co
    Cite
    T. Luu, 1450-RAG-Preprocessing-Data [Dataset]. https://huggingface.co/datasets/RTVIENNA/1450-RAG-Preprocessing-Data
    Authors
    T. Luu
    Description

    RTVIENNA/1450-RAG-Preprocessing-Data dataset hosted on Hugging Face and contributed by the HF Datasets community

  2.

    Cloud_Computing_Preprocessed

    • huggingface.co
    Updated Jun 14, 2024
    Cite
    Lemma RCA (2024). Cloud_Computing_Preprocessed [Dataset]. https://huggingface.co/datasets/Lemma-RCA-NEC/Cloud_Computing_Preprocessed
    Dataset updated
    Jun 14, 2024
    Dataset authored and provided by
    Lemma RCA
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Data Description:

    Preprocessed system metrics and log data from a cloud computing platform. The metric time series (in npy format) were constructed from the original metrics data (JSON format). The log messages were extracted from the original log data (JSON format) and parsed into log event templates. Note: the 20240207 data does not contain EKS log data; it comprises only CloudTrail log data in CSV format. Consequently, this dataset does not require preprocessing with a log… See the full description on the dataset page: https://huggingface.co/datasets/Lemma-RCA-NEC/Cloud_Computing_Preprocessed.
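
    The step of parsing log messages into log event templates can be illustrated with a minimal sketch. The masking rules, placeholder names, and sample log lines below are hypothetical and are not taken from the dataset's own parser; the sketch only shows the general idea of replacing variable fields with placeholders so that similar messages collapse into one template.

```python
import re
from collections import Counter

# Hypothetical masking rules: each pattern marks a variable field in a log line.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message):
    """Replace variable fields with placeholders to obtain an event template."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

def count_templates(messages):
    """Count how many messages fall under each template."""
    return Counter(to_template(m) for m in messages)

# Made-up log lines, for illustration only.
logs = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 87 ms",
    "Disk usage at 91 percent",
]
templates = count_templates(logs)
```

    Dedicated log parsers typically learn templates from the data rather than relying on fixed patterns like these.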

  3.

    Example (synthetic) images - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 30, 2024
    Cite
    (2024). Example (synthetic) images - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/ee28704f-2926-54b3-bf93-751d2546dc68
    Dataset updated
    Apr 30, 2024
    Description

      Model

    A Hugging Face unconditional image generation diffusion model was used for training. [1] Unconditional image generation models are not conditioned on text or images during training; they only generate images that resemble the training data distribution. The model usually starts from a seed that generates a random noise vector, then uses this vector to create an output image similar to the images used to train the model. The training script initializes a UNet2DModel and uses it to train the model. [2] The training loop adds noise to the images, predicts the noise residual, calculates the loss, saves checkpoints at specified steps, and saves the generated models.

      Training Dataset

    The RANZCR CLiP dataset was used to train the model. [3] This dataset was created by The Royal Australian and New Zealand College of Radiologists (RANZCR), a not-for-profit professional organisation for clinical radiologists and radiation oncologists. The dataset has been labelled with a set of definitions to ensure consistent labelling. The normal category includes lines that were appropriately positioned and did not require repositioning. The borderline category includes lines that would ideally require some repositioning but would in most cases still function adequately in their current position. The abnormal category includes lines that required immediate repositioning. 30,000 images were used during training. All training images were 512x512 in size.

      Computational Information

    Training was conducted using RTX 6000 cards with 24 GB of graphics memory. A checkpoint was saved after each epoch, with 220 checkpoints generated so far. Each checkpoint takes up 1 GB of space. Each epoch takes around 6 hours.
    Machine learning libraries such as TensorFlow, PyTorch, or scikit-learn are used to run the training, along with additional libraries for data preprocessing, visualization, or deployment.

      References

    [1] https://huggingface.co/docs/diffusers/en/training/unconditional_training#unconditional-image-generation
    [2] https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L356
    [3] https://www.kaggle.com/competitions/ranzcr-clip-catheter-line-classification/data
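
    The training loop described in this entry (add noise, predict the noise residual, compute the loss) can be sketched in plain Python. This is not the referenced training script: the linear noise schedule, the 1000 steps, the 16-value stand-in for an image, and the zero "prediction" are all assumptions made for illustration, and a real run would use a UNet2DModel to predict the noise.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, common in DDPM-style setups)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t) over the timesteps."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, noise, alpha_bar_t):
    """Forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * n for x, n in zip(x0, noise)]

def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [rng.uniform(-1, 1) for _ in range(16)]  # stand-in for image pixels
noise = [rng.gauss(0, 1) for _ in range(16)]
t = rng.randrange(1000)                       # random timestep for this sample
xt = add_noise(x0, noise, abar[t])

# A real model would predict the noise from (xt, t); zeros are a placeholder.
predicted_noise = [0.0] * len(noise)
loss = mse(predicted_noise, noise)
```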

  4.

    heritage-health-prize-release-3

    • huggingface.co
    Cite
    WC, heritage-health-prize-release-3 [Dataset]. https://huggingface.co/datasets/cestwc/heritage-health-prize-release-3
    Authors
    WC
    Description

    Dataset Card for Heritage Health Prize

    It is often believed that this piece of data can be found here and here, although we have not yet figured out what this piece of data is really used for. To save time, we directly follow the preprocessing script here. More specifically, we used the following script to produce this Hugging Face dataset. """ Preprocessing based on: https://github.com/truongkhanhduy95/Heritage-Health-Prize """ import zipfile from os import path from urllib… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/heritage-health-prize-release-3.

  5.

    cc100-latin

    • huggingface.co
    Updated Mar 2, 2022
    Cite
    Phillip Benjamin Ströbel (2022). cc100-latin [Dataset]. https://huggingface.co/datasets/pstroe/cc100-latin
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 2, 2022
    Authors
    Phillip Benjamin Ströbel
    Description

    Latin part of cc100 corpus

    This dataset contains parts of the Latin part of the cc100 dataset. It was used to train a RoBERTa-based language model with Hugging Face.

      Preprocessing
    

    I undertook the following preprocessing steps:

    Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
    Use of CLTK for sentence splitting and normalisation.
    Retaining only lines containing letters of the Latin alphabet, numerals, and certain punctuation (--> grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!-… See the full description on the dataset page: https://huggingface.co/datasets/pstroe/cc100-latin.
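
    Two of the steps above (dropping "Lorem ipsum" filler and keeping only lines built from an allowed character set) can be sketched as follows. The character class here is a hypothetical, simplified stand-in, because the grep pattern in the description is truncated; CLTK sentence splitting and normalisation are not covered, and the sample lines are made up.

```python
import re

# Hypothetical, simplified character class; the source's full pattern is truncated.
ALLOWED = re.compile(r"^[A-Za-z0-9ÆæŒœĀāŌōŪū .,;:?!-]+$")
PSEUDO_LATIN = re.compile(r"lorem ipsum", re.IGNORECASE)

def keep_line(line):
    """Keep lines made only of allowed characters that are not filler text."""
    line = line.strip()
    if not line or PSEUDO_LATIN.search(line):
        return False
    return bool(ALLOWED.match(line))

# Made-up sample lines, for illustration only.
lines = [
    "Gallia est omnis divisa in partes tres.",
    "Lorem ipsum dolor sit amet.",
    "Price: $20 <b>now</b>",
]
kept = [line for line in lines if keep_line(line)]
```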

  6.

    MoleculeSTM

    • huggingface.co
    Updated Jul 13, 2023
    Cite
    shengchao (2023). MoleculeSTM [Dataset]. https://huggingface.co/datasets/chao1224/MoleculeSTM
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Authors
    shengchao
    Description

    Dataset Specifications for MoleculeSTM

    We provide the raw dataset (after preprocessing) at this Hugging Face link. Or you can download them by running python download.py.

      1. Pretraining Dataset: PubChemSTM
    

    For PubChemSTM, please note that we can only release the chemical structure information. If you need the textual data, please follow our preprocessing scripts.

      2. Downstream Datasets
    

    Please refer to the following for three downstream tasks:

    DrugBank_data for… See the full description on the dataset page: https://huggingface.co/datasets/chao1224/MoleculeSTM.

  7.

    SemCor

    • huggingface.co
    Updated Jul 15, 2024
    Cite
    Yu-Ting, Chen (2024). SemCor [Dataset]. https://huggingface.co/datasets/MarkChen1214/SemCor
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 15, 2024
    Authors
    Yu-Ting, Chen
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "SemCor – sense-tagged English corpus"

      Description
    

    This dataset is derived from the wsd_semcor dataset, originally hosted on Hugging Face. It has been preprocessed for tasks related to Word Sense Disambiguation (WSD) and WordNet integration.

      Preprocessing
    

    The original text data underwent the following preprocessing steps:

    Text splitting into individual words (lemmas). TF-IDF (Term Frequency-Inverse Document Frequency) analysis to understand… See the full description on the dataset page: https://huggingface.co/datasets/MarkChen1214/SemCor.
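
    The TF-IDF step mentioned above can be sketched without external libraries. The weighting used here (term count divided by document length, times the natural log of the number of documents over the document frequency) is one common variant and is an assumption; the dataset's own preprocessing may use a different formula or library. The two sample sentences are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df); other variants exist.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Made-up documents, already split into words.
docs = [
    "the bank of the river".split(),
    "the bank approved the loan".split(),
]
scores = tf_idf(docs)
```

    Words that occur in every document, such as "the" and "bank" here, receive a score of zero, while words unique to one document score highest.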

  8.

    Data from: PIAST

    • huggingface.co
    Updated Nov 23, 2024
    Cite
    Hayeon Bang (2024). PIAST [Dataset]. https://huggingface.co/datasets/Hayeonbang/PIAST
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 23, 2024
    Authors
    Hayeon Bang
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    PIAST Dataset

    This repo is for downloading the transcribed MIDI and text data of the PIAST Dataset. The audio files can be downloaded by following the process described in the GitHub repository.

      UPDATES
    

    Nov 13, 2024: The MIDI files and text data for both PIAST-AT and PIAST-YT have been uploaded! However, due to a data preprocessing issue, some files are missing compared to the numbers reported in the paper. These will be added in a future version update, so please stay tuned!… See the full description on the dataset page: https://huggingface.co/datasets/Hayeonbang/PIAST.

  9.

    SLMS-KD-Benchmarks

    • huggingface.co
    Updated Jun 8, 2025
    Cite
    Nam Tran (2025). SLMS-KD-Benchmarks [Dataset]. https://huggingface.co/datasets/MothMalone/SLMS-KD-Benchmarks
    Dataset updated
    Jun 8, 2025
    Authors
    Nam Tran
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    SLMS-KD-Benchmarks Dataset

    This repository contains the SLMS-KD-Benchmarks dataset, a collection of benchmarks for evaluating smaller language models (SLMs), particularly in knowledge distillation tasks. This dataset is a curated collection of existing datasets from Hugging Face. We have applied custom preprocessing and new train/validation/test splits to suit our benchmarking needs. We extend our sincere gratitude to the original creators for their invaluable work.… See the full description on the dataset page: https://huggingface.co/datasets/MothMalone/SLMS-KD-Benchmarks.
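
    Creating new train/validation/test splits, as this entry describes, can be sketched as a seeded shuffle followed by slicing. The 80/10/10 proportions, the seed, and the function name are assumptions for illustration; the description does not state how the benchmark's splits were actually produced.

```python
import random

def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle with a fixed seed, then cut into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Integers stand in for dataset rows.
train, val, test = split_dataset(range(100))
```

    Fixing the seed keeps the splits reproducible, which matters when several models are compared on the same benchmark.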

  10.

    warvan-ml-dataset

    • huggingface.co
    Cite
    warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
    Authors
    warvan
    Description

    Dataset Name

    This dataset contains structured data for machine learning and analysis purposes.

      Contents
    

    data/sample.csv: Sample dataset file.
    data/train.csv: Training dataset.
    data/test.csv: Testing dataset.
    scripts/preprocess.py: Script for preprocessing the dataset.
    scripts/analyze.py: Script for data analysis.

      Usage
    

    Load the dataset using Pandas:
        import pandas as pd
        df = pd.read_csv('data/sample.csv')

    Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

  11.

    Bioactivity_Final_Project_QM9

    • huggingface.co
    Cite
    Daniel Santos, Bioactivity_Final_Project_QM9 [Dataset]. https://huggingface.co/datasets/Desp-ML/Bioactivity_Final_Project_QM9
    Authors
    Daniel Santos
    Description

    Bioactivity Report QM9 - Molecular Data Preprocessing and ML Pipeline

    This data set provides a comprehensive set of quantum chemical properties for a relevant and consistent chemical space of small organic molecules. The dataset consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF, corresponding to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical… See the full description on the dataset page: https://huggingface.co/datasets/Desp-ML/Bioactivity_Final_Project_QM9.

  12.

    50-million-bluesky-posts

    • huggingface.co
    Updated Dec 21, 2024
    Cite
    Aranym (2024). 50-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/Aranym/50-million-bluesky-posts
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 21, 2024
    Authors
    Aranym
    License

    CC0 1.0 https://choosealicense.com/licenses/cc0-1.0/

    Description

    Nightsky 50M Dataset

    ~50 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.

      Request data deletion
    

    A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/50-million-bluesky-posts.

  13.

    newsqa

    • huggingface.co
    Updated Jun 15, 2005
    Cite
    Varun Rao (2005). newsqa [Dataset]. https://huggingface.co/datasets/varun-v-rao/newsqa
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2005
    Authors
    Varun Rao
    Description

    Dataset Card for "squad"

    This truncated dataset is derived from the Stanford Question Answering Dataset (SQuAD) for reading comprehension. Its primary aim is to extract instances from the original SQuAD dataset that align with the context length of BERT, RoBERTa, OPT, and T5 models.

      Preprocessing and Filtering
    

    Preprocessing involves tokenization using the BertTokenizer (WordPiece), RoBertaTokenizer (Byte-level BPE), OPTTokenizer (Byte-Pair Encoding), and T5Tokenizer… See the full description on the dataset page: https://huggingface.co/datasets/varun-v-rao/newsqa.
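
    Filtering examples so that they fit a model's context length can be sketched as below. The whitespace tokenizer is a stand-in for the subword tokenizers named in the description, and the 512-token budget, the field names, and the sample records are assumptions for illustration only.

```python
def whitespace_tokenize(text):
    """Stand-in tokenizer; real subword tokenizers usually yield more tokens."""
    return text.split()

def fits_context(example, tokenize, max_tokens):
    """True when question plus context fit within the token budget."""
    n_tokens = len(tokenize(example["question"])) + len(tokenize(example["context"]))
    return n_tokens <= max_tokens

def filter_examples(examples, tokenizers, max_tokens):
    """Keep examples that fit under every tokenizer in the list."""
    return [
        ex for ex in examples
        if all(fits_context(ex, tok, max_tokens) for tok in tokenizers)
    ]

# Made-up records: one short example and one overly long context.
examples = [
    {"question": "Who wrote it?", "context": "It was written by Ada."},
    {"question": "Where?", "context": "word " * 600},
]
kept = filter_examples(examples, [whitespace_tokenize], max_tokens=512)
```

    Passing several tokenizers at once keeps only the examples that fit all of them, which is one way to obtain a single subset usable across different models.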

  14.

    wikitext2

    • huggingface.co
    • opendatalab.com
    Updated Oct 21, 2023
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Dataset updated
    Oct 21, 2023
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.

  15.

    CelebA_Sent2Vect_Sp

    • huggingface.co
    Updated Feb 5, 2024
    Cite
    Ontology Engineering Group (2024). CelebA_Sent2Vect_Sp [Dataset]. http://doi.org/10.57967/hf/0446
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 5, 2024
    Dataset authored and provided by
    Ontology Engineering Group
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Corpus Summary

    This corpus has 192,050 entries made up of descriptive sentences of the faces in the CelebA dataset. The preprocessing consisted of translating into Spanish the captions of the CelebA dataset with the algorithm used in Text2FaceGAN. In particular, all sentences are combined to generate a larger corpus. Additionally, data preprocessing was applied that consists of eliminating stopwords, separation symbols and complementary elements that are not useful for… See the full description on the dataset page: https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp.
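
    Removing stopwords and separation symbols from a Spanish caption can be sketched as follows. The stopword list is a tiny hypothetical sample and the caption is made up; the corpus authors' actual list and cleaning rules are not given in the description.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"el", "la", "los", "las", "un", "una", "y", "de", "tiene", "es"}

def clean_caption(caption):
    """Lowercase, drop punctuation and separators, then remove stopwords."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", caption.lower())
    return [w for w in words if w not in STOPWORDS]

tokens = clean_caption("La mujer tiene el pelo negro, y una sonrisa.")
```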

  16.

    40-million-bluesky-posts

    • huggingface.co
    Updated Dec 21, 2024
    Cite
    Aranym (2024). 40-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/Aranym/40-million-bluesky-posts
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 21, 2024
    Authors
    Aranym
    License

    CC0 1.0 https://choosealicense.com/licenses/cc0-1.0/

    Description

    Nightsky 40M Dataset

    ~40 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.

      Request data deletion
    

    A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/40-million-bluesky-posts.

  17.

    QAmultilabelEURLEXsamples

    • huggingface.co
    Updated Apr 15, 2023
    Cite
    WANG LI (2023). QAmultilabelEURLEXsamples [Dataset]. https://huggingface.co/datasets/stuwang/QAmultilabelEURLEXsamples
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 15, 2023
    Authors
    WANG LI
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

      Dataset Summary
    
    
    
    
    
      Supported Tasks and Leaderboards
    

    Multi-answer questioning, token classification

      Languages
    

    English

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information Needed]

      Data Fields
    

    celex_id, input_ids, token_type_ids, attention_mask, labels

      Data Splits
    

    validation samples

      Dataset Creation
    
    
    
    
    
      Curation Rationale
    

    [More Information Needed]

      Source Data… See the full description on the dataset page: https://huggingface.co/datasets/stuwang/QAmultilabelEURLEXsamples.
    
  18.

    srt_test_dataset

    • huggingface.co
    Updated May 27, 2025
    Cite
    Fahim Tajwar (2025). srt_test_dataset [Dataset]. https://huggingface.co/datasets/ftajwar/srt_test_dataset
    Dataset updated
    May 27, 2025
    Authors
    Fahim Tajwar
    Description

    Test Dataset Compilation For Self-Rewarding Training

    This is our test dataset compilation for our paper, "Can Large Reasoning Models Self-Train?" Please see our project page for more information about our project. In our paper, we use the three following datasets for evaluation:

    AIME 2024
    AIME 2025
    AMC

    Moreover, we also subsample 1% of the DAPO dataset for additional validation purposes. In this dataset, we compile all 4 of them together. This, together with our data preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/ftajwar/srt_test_dataset.
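
    Subsampling 1% of a dataset for validation, as described above, can be sketched with a seeded random draw. The seed, the function name, and the integer stand-ins for rows are assumptions; the description does not say how the authors drew their sample.

```python
import random

def subsample(items, fraction=0.01, seed=0):
    """Draw a reproducible random subset without replacement."""
    items = list(items)
    if not items:
        return []
    k = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, k)

# Integers stand in for dataset rows.
validation = subsample(range(10_000))
```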

  19.

    nq-simplified

    • huggingface.co
    Cite
    Lukas Kreussel, nq-simplified [Dataset]. https://huggingface.co/datasets/LLukas22/nq-simplified
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Lukas Kreussel
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "nq"

      Dataset Summary
    

    This is a modified version of the original Natural Questions (nq) dataset for QA tasks. The original is available here. Each sample was preprocessed into a SQuAD-like format. The context was shortened from an entire Wikipedia article to the passage containing the answer.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    An example of 'train' looks as follows. { "context": "The 2017 Major League Baseball All - Star Game was… See the full description on the dataset page: https://huggingface.co/datasets/LLukas22/nq-simplified.
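
    Shortening the context from a whole article to the passage that contains the answer can be sketched as below. The function name, the paragraph-splitting rule, and the sample article are assumptions, and the output keys follow the usual SQuAD layout rather than this dataset's exact schema, which is truncated in the description.

```python
def to_squad_like(question, article, answer):
    """Shrink an article to the paragraph holding the answer.

    Returns a dict with SQuAD-style keys, or None when the answer string
    does not occur in any paragraph.
    """
    for paragraph in article.split("\n\n"):
        start = paragraph.find(answer)
        if start != -1:
            return {
                "question": question,
                "context": paragraph,
                "answers": {"text": [answer], "answer_start": [start]},
            }
    return None

# Made-up two-paragraph article, for illustration only.
article = (
    "Baseball is played with a bat and ball.\n\n"
    "The 2017 All-Star Game was held in Miami."
)
sample = to_squad_like("Where was the 2017 game held?", article, "Miami")
```

    The stored answer_start is a character offset into the shortened context, so it has to be recomputed after the article is trimmed.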

  20.

    custom_sentiment_analysis_dataset

    • huggingface.co
    Updated Sep 20, 2024
    Cite
    TaeYeongSeo (2024). custom_sentiment_analysis_dataset [Dataset]. https://huggingface.co/datasets/SeoTae/custom_sentiment_analysis_dataset
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 20, 2024
    Authors
    TaeYeongSeo
    Description

    Dataset Card for Custom Text Dataset

      Dataset Name
    

    Custom Text Dataset

      Overview
    

    This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.

      Composition
    

    Number of records: 50,000
    Fields: text, label
    Size: 134 MB

      Collection Process
    

    The data was collected using web scraping and manual extraction from public domain sources.

      Preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/SeoTae/custom_sentiment_analysis_dataset.
    