RTVIENNA/1450-RAG-Preprocessing-Data dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data Description:
Preprocessed system metrics and log data from a cloud computing platform. The metric time series were constructed (in .npy format) from the original metrics data (JSON format), and the log messages were extracted from the original log data (JSON format) and parsed into log event templates. Note: the 20240207 data does not contain EKS log data; it comprises only CloudTrail log data in CSV format. Consequently, this dataset does not require preprocessing with a log… See the full description on the dataset page: https://huggingface.co/datasets/Lemma-RCA-NEC/Cloud_Computing_Preprocessed.
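To illustrate the first step (constructing a .npy time series from JSON metrics), a minimal sketch is shown below; the field names "timestamp" and "value" are assumptions for illustration, not the documented schema of this dataset.

```python
import json
import numpy as np

# Hypothetical sketch: build a .npy time series from a JSON metrics file.
# The field names "timestamp" and "value" are assumptions, not the actual schema.
def metrics_json_to_npy(json_path: str, npy_path: str) -> None:
    with open(json_path) as f:
        records = json.load(f)                      # list of metric records
    records.sort(key=lambda r: r["timestamp"])      # order by time
    series = np.array([r["value"] for r in records], dtype=np.float64)
    np.save(npy_path, series)                       # store as a .npy array

# metrics_json_to_npy("metrics.json", "metrics.npy")  # paths are placeholders
```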
Model
A Hugging Face unconditional image generation diffusion model was used for training. [1] Unconditional image generation models are not conditioned on text or images during training; they generate images that resemble the training data distribution. The model typically starts from a seed that generates a random noise vector, which it then uses to create an output image similar to the images it was trained on. The training script initializes a UNet2DModel and uses it to train the model. [2] The training loop adds noise to the images, predicts the noise residual, calculates the loss, saves checkpoints at specified steps, and saves the generated models.
Training Dataset
The RANZCR CLiP dataset was used to train the model. [3] This dataset was created by The Royal Australian and New Zealand College of Radiologists (RANZCR), a not-for-profit professional organisation for clinical radiologists and radiation oncologists. The dataset has been labelled with a set of definitions to ensure labelling consistency. The normal category includes lines that were appropriately positioned and did not require repositioning; the borderline category includes lines that would ideally require some repositioning but would in most cases still function adequately in their current position; the abnormal category includes lines that required immediate repositioning. 30,000 images were used during training, all of size 512x512.
Computational Information
Training has been conducted on RTX 6000 cards with 24GB of graphics memory. A checkpoint was saved after each epoch, with 220 checkpoints generated so far; each checkpoint takes up 1GB of space. Each epoch takes around 6 hours to generate. Machine learning libraries such as TensorFlow, PyTorch, or scikit-learn are used to run the training, along with additional libraries for data preprocessing, visualization, or deployment.
References
[1] https://huggingface.co/docs/diffusers/en/training/unconditional_training#unconditional-image-generation
[2] https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L356
[3] https://www.kaggle.com/competitions/ranzcr-clip-catheter-line-classification/data
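As a rough sketch of the training step described under Model above (add noise, predict the noise residual, compute the loss), the core loop with diffusers' UNet2DModel and DDPMScheduler might look like the following; the sample size, batch, and hyperparameters are placeholders, not the values used for this model.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# Illustrative single training step for unconditional image generation;
# sizes and hyperparameters are placeholders, not the actual training config.
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(2, 3, 64, 64)   # stand-in for a batch of training images
noise = torch.randn_like(images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))

noisy_images = scheduler.add_noise(images, noise, timesteps)  # forward diffusion
noise_pred = model(noisy_images, timesteps).sample            # predict the noise residual
loss = F.mse_loss(noise_pred, noise)                          # noise-prediction loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```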
Dataset Card for Heritage Health Prize
It is often believed that this piece of data can be found here and here, although we have not yet figured out what this piece of data is really used for. To save time, we directly follow the preprocessing script here. More specifically, we used the following script to produce this Hugging Face dataset. """ Preprocessing based on: https://github.com/truongkhanhduy95/Heritage-Health-Prize """ import zipfile from os import path from urllib… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/heritage-health-prize-release-3.
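The snippet above begins with the archive-handling imports; a minimal sketch of that download-and-extract step might look like the following, where the URL and archive name are placeholders rather than the actual competition files.

```python
import zipfile
from os import path
from urllib.request import urlretrieve

# Sketch of the first preprocessing step; the URL and archive name are placeholders.
DATA_URL = "https://example.com/HHP_release3.zip"
ARCHIVE = "HHP_release3.zip"

if not path.exists(ARCHIVE):
    urlretrieve(DATA_URL, ARCHIVE)        # download the raw archive

with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall("data")                 # unpack the raw files into ./data
```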
Latin part of cc100 corpus
This dataset contains parts of the Latin portion of the cc100 dataset. It was used to train a RoBERTa-based language model with Hugging Face.
Preprocessing
I undertook the following preprocessing steps:
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Use of CLTK for sentence splitting and normalisation.
- Retaining only lines containing letters of the Latin alphabet, numerals, and certain punctuation (--> grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!-… See the full description on the dataset page: https://huggingface.co/datasets/pstroe/cc100-latin.
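A rough Python equivalent of the grep filter in the last step above; the character class is abbreviated here because the original pattern is truncated.

```python
import re

# Approximate line filter: keep lines made up of Latin letters, digits,
# and basic punctuation. The character class is abbreviated for illustration.
LATIN_LINE = re.compile(r"^[A-Za-z0-9ÄÖÜäöüÆæŒœ .,;:?!-]+$")

def keep_line(line: str) -> bool:
    return bool(LATIN_LINE.match(line.strip()))

print(keep_line("Gallia est omnis divisa in partes tres."))  # True
print(keep_line("Αὐτὸς γὰρ ἐφέλκεται ἄνδρα σίδηρος."))        # False (Greek letters)
```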
Dataset Specifications for MoleculeSTM
We provide the raw dataset (after preprocessing) at this Hugging Face link, or you can download it by running python download.py.
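As an alternative to running download.py, the same files can presumably be fetched with huggingface_hub; a sketch is shown below, with the target directory left to the caller.

```python
from huggingface_hub import snapshot_download

# Sketch: pull the preprocessed MoleculeSTM data directly from the Hub.
local_dir = snapshot_download(
    repo_id="chao1224/MoleculeSTM",
    repo_type="dataset",   # dataset repository, not a model repository
)
print("Downloaded to:", local_dir)
```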
1. Pretraining Dataset: PubChemSTM
For PubChemSTM, please note that we can only release the chemical structure information. If you need the textual data, please follow our preprocessing scripts.
2. Downstream Datasets
Please refer to the following for three downstream tasks:
DrugBank_data for… See the full description on the dataset page: https://huggingface.co/datasets/chao1224/MoleculeSTM.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "SemCor – sense-tagged English corpus"
Description
This dataset is derived from the wsd_semcor dataset, originally hosted on Hugging Face. It has been preprocessed for tasks related to Word Sense Disambiguation (WSD) and WordNet integration.
Preprocessing
The original text data underwent the following preprocessing steps:
- Text splitting into individual words (lemmas).
- TF-IDF (Term Frequency-Inverse Document Frequency) analysis to understand… See the full description on the dataset page: https://huggingface.co/datasets/MarkChen1214/SemCor.
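A minimal sketch of the kind of TF-IDF analysis described above, using scikit-learn; the two sentences are toy examples, not SemCor text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy TF-IDF example; in practice the documents would come from SemCor.
docs = [
    "the bank raised interest rates",
    "they sat on the bank of the river",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())       # learned vocabulary
print(tfidf.toarray().round(2))                 # per-document TF-IDF weights
```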
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
PIAST Dataset
This repo is for downloading the transcribed MIDI and text data of the PIAST Dataset. The audio files can be downloaded by following the process described in the GitHub repository.
UPDATES
Nov 13, 2024: The MIDI files and text data for both PIAST-AT and PIAST-YT have been uploaded! However, due to a data preprocessing issue, some files are missing compared to the numbers reported in the paper. These will be added in a future version update, so please stay tuned!… See the full description on the dataset page: https://huggingface.co/datasets/Hayeonbang/PIAST.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
SLMS-KD-Benchmarks Dataset
This repository contains the SLMS-KD-Benchmarks dataset, a collection of benchmarks for evaluating smaller language models (SLMs), particularly in knowledge distillation tasks. This dataset is a curated collection of existing datasets from Hugging Face. We have applied custom preprocessing and new train/validation/test splits to suit our benchmarking needs. We extend our sincere gratitude to the original creators for their invaluable work.… See the full description on the dataset page: https://huggingface.co/datasets/MothMalone/SLMS-KD-Benchmarks.
Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
- data/sample.csv: Sample dataset file.
- data/train.csv: Training dataset.
- data/test.csv: Testing dataset.
- scripts/preprocess.py: Script for preprocessing the dataset.
- scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:
import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
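The preprocessing script itself is not reproduced on the card; a hypothetical sketch of what scripts/preprocess.py might contain is shown below, with the cleaning steps being assumptions rather than the actual script.

```python
import pandas as pd

# Hypothetical scripts/preprocess.py: basic cleaning of data/sample.csv.
# The steps below are assumptions for illustration, not the actual script.
def main() -> None:
    df = pd.read_csv("data/sample.csv")
    df = df.drop_duplicates().dropna()               # remove duplicates and missing rows
    df.to_csv("data/sample_clean.csv", index=False)  # write the cleaned copy

if __name__ == "__main__":
    main()
```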
Bioactivity Report QM9 - Molecular Data Preprocessing and ML Pipeline
This data set provides a comprehensive set of quantum chemical properties for a relevant and consistent chemical space of small organic molecules. The dataset consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF, corresponding to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical… See the full description on the dataset page: https://huggingface.co/datasets/Desp-ML/Bioactivity_Final_Project_QM9.
https://choosealicense.com/licenses/cc0-1.0/
Nightsky 50M Dataset
~50 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.
Request data deletion
A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/50-million-bluesky-posts.
Dataset Card for "squad"
This truncated dataset is derived from the Stanford Question Answering Dataset (SQuAD) for reading comprehension. Its primary aim is to extract instances from the original SQuAD dataset that align with the context length of BERT, RoBERTa, OPT, and T5 models.
Preprocessing and Filtering
Preprocessing involves tokenization using the BertTokenizer (WordPiece), RobertaTokenizer (Byte-level BPE), OPTTokenizer (Byte-Pair Encoding), and T5Tokenizer… See the full description on the dataset page: https://huggingface.co/datasets/varun-v-rao/newsqa.
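A sketch of the kind of length filter implied above, shown for the BERT tokenizer (512 tokens is BERT's maximum input length; the field names follow the SQuAD format, and the analogous check would use the RoBERTa, OPT, or T5 tokenizers).

```python
from transformers import BertTokenizerFast

# Sketch: keep only question/context pairs that fit in BERT's 512-token window.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def fits_context(example: dict, max_len: int = 512) -> bool:
    enc = tokenizer(example["question"], example["context"], truncation=False)
    return len(enc["input_ids"]) <= max_len

sample = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
}
print(fits_context(sample))  # True for this short example
```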
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Corpus Summary
This corpus has 192,050 entries consisting of descriptive sentences about the faces in the CelebA dataset. The corpus was preprocessed by translating the CelebA captions into Spanish with the algorithm used in Text2FaceGAN. In particular, all sentences are combined to generate a larger corpus. Additionally, a preprocessing step was applied that consists of eliminating stopwords, separation symbols, and complementary elements that are not useful for… See the full description on the dataset page: https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp.
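A small sketch of the stopword-removal step described above, using NLTK's Spanish stopword list; the caption below is illustrative, not taken from the corpus.

```python
import nltk
from nltk.corpus import stopwords

# Sketch: strip Spanish stopwords from a caption; the sentence is illustrative.
nltk.download("stopwords", quiet=True)
spanish_stopwords = set(stopwords.words("spanish"))

caption = "la mujer tiene el pelo negro y una sonrisa atractiva"
tokens = [t for t in caption.split() if t not in spanish_stopwords]
print(tokens)  # function words such as "la", "el", "y", "una" are removed
```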
https://choosealicense.com/licenses/cc0-1.0/
Nightsky 40M Dataset
~40 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.
Request data deletion
A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/40-million-bluesky-posts.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Dataset Name
Dataset Summary
Supported Tasks and Leaderboards
Multi-answer questioning, token classification
Languages
English
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
celex_id, input_ids, token_type_ids, attention_mask, labels
Data Splits
validation samples
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data… See the full description on the dataset page: https://huggingface.co/datasets/stuwang/QAmultilabelEURLEXsamples.
Test Dataset Compilation For Self-Rewarding Training
This is our test dataset compilation for our paper, "Can Large Reasoning Models Self-Train?" Please see our project page for more information about our project. In our paper, we use the three following datasets for evaluation:
- AIME 2024
- AIME 2025
- AMC
We also subsample 1% of the DAPO dataset for additional validation purposes. In this dataset, we compile all four of them together. This, together with our data preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/ftajwar/srt_test_dataset.
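For reference, a hedged sketch of loading this compilation with the datasets library; the split names are not documented here, so the code only inspects whatever splits are available.

```python
from datasets import load_dataset

# Sketch: load the compiled evaluation data from the Hub and list its splits.
dataset = load_dataset("ftajwar/srt_test_dataset")
print(dataset)  # shows the available splits and the number of rows in each
```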
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "nq"
Dataset Summary
This is a modified version of the original Natural Questions (NQ) dataset for QA tasks. The original is available here. Each sample was preprocessed into a SQuAD-like format, and the context was shortened from an entire Wikipedia article to the passage containing the answer.
Dataset Structure
Data Instances
An example of 'train' looks as follows. { "context": "The 2017 Major League Baseball All - Star Game was… See the full description on the dataset page: https://huggingface.co/datasets/LLukas22/nq-simplified.
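A hedged example of loading this simplified version with the datasets library; field names other than "context" are not shown above, so the sketch simply lists whatever fields are present.

```python
from datasets import load_dataset

# Sketch: load the train split and inspect one preprocessed example.
nq = load_dataset("LLukas22/nq-simplified", split="train")
example = nq[0]
print(example["context"][:200])  # shortened passage that contains the answer
print(example.keys())            # remaining fields follow the SQuAD-like format
```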
Dataset Card for Custom Text Dataset
Dataset Name
Custom Text Dataset
Overview
This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.
Composition
Number of records: 50,000
Fields: text, label
Size: 134 MB
Collection Process
The data was collected using web scraping and manual extraction from public domain sources.
Preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/SeoTae/custom_sentiment_analysis_dataset.