RTVIENNA/1450-RAG-Preprocessing-Data dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data Description:
Preprocessed system metrics and log data from a cloud computing platform. The metric time series were constructed (in .npy format) from the original metrics data (JSON format), and the log messages were extracted from the original log data (JSON format) and parsed into log event templates. Note: the 20240207 data does not contain EKS log data; it comprises only CloudTrail log data in CSV format. Consequently, this dataset does not require preprocessing with a log… See the full description on the dataset page: https://huggingface.co/datasets/Lemma-RCA-NEC/Cloud_Computing_Preprocessed.
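To illustrate the first step (constructing a .npy time series from JSON metrics), a minimal sketch is shown below; the field names "timestamp" and "value" are assumptions for illustration, not the documented schema of this dataset.

```python
import json
import numpy as np

# Hypothetical sketch: build a .npy time series from a JSON metrics file.
# The field names "timestamp" and "value" are assumptions, not the actual schema.
def metrics_json_to_npy(json_path: str, npy_path: str) -> None:
    with open(json_path) as f:
        records = json.load(f)                      # list of metric records
    records.sort(key=lambda r: r["timestamp"])      # order by time
    series = np.array([r["value"] for r in records], dtype=np.float64)
    np.save(npy_path, series)                       # store as a .npy array

# metrics_json_to_npy("metrics.json", "metrics.npy")  # paths are placeholders
```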
Model
A Hugging Face unconditional image generation diffusion model was used for training. [1] Unconditional image generation models are not conditioned on text or images during training; they generate images that resemble the training data distribution. The model typically starts from a seed that generates a random noise vector, which it then uses to create an output image similar to the images it was trained on. The training script initializes a UNet2DModel and uses it to train the model. [2] The training loop adds noise to the images, predicts the noise residual, calculates the loss, saves checkpoints at specified steps, and saves the generated models.
Training Dataset
The RANZCR CLiP dataset was used to train the model. [3] This dataset was created by The Royal Australian and New Zealand College of Radiologists (RANZCR), a not-for-profit professional organisation for clinical radiologists and radiation oncologists. The dataset has been labelled with a set of definitions to ensure labelling consistency. The normal category includes lines that were appropriately positioned and did not require repositioning; the borderline category includes lines that would ideally require some repositioning but would in most cases still function adequately in their current position; the abnormal category includes lines that required immediate repositioning. 30,000 images were used during training, all of size 512x512.
Computational Information
Training has been conducted on RTX 6000 cards with 24GB of graphics memory. A checkpoint was saved after each epoch, with 220 checkpoints generated so far; each checkpoint takes up 1GB of space. Each epoch takes around 6 hours to generate. Machine learning libraries such as TensorFlow, PyTorch, or scikit-learn are used to run the training, along with additional libraries for data preprocessing, visualization, or deployment.
References
[1] https://huggingface.co/docs/diffusers/en/training/unconditional_training#unconditional-image-generation
[2] https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L356
[3] https://www.kaggle.com/competitions/ranzcr-clip-catheter-line-classification/data
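As a rough sketch of the training step described under Model above (add noise, predict the noise residual, compute the loss), the core loop with diffusers' UNet2DModel and DDPMScheduler might look like the following; the sample size, batch, and hyperparameters are placeholders, not the values used for this model.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# Illustrative single training step for unconditional image generation;
# sizes and hyperparameters are placeholders, not the actual training config.
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(2, 3, 64, 64)   # stand-in for a batch of training images
noise = torch.randn_like(images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))

noisy_images = scheduler.add_noise(images, noise, timesteps)  # forward diffusion
noise_pred = model(noisy_images, timesteps).sample            # predict the noise residual
loss = F.mse_loss(noise_pred, noise)                          # noise-prediction loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```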
Dataset Card for Heritage Health Prize
It is often believed that this piece of data can be found here and here, although we have not yet figured out what this piece of data is really used for. To save time, we directly follow the preprocessing script here. More specifically, we used the following script to produce this Hugging Face dataset. """ Preprocessing based on: https://github.com/truongkhanhduy95/Heritage-Health-Prize """ import zipfile from os import path from urllib… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/heritage-health-prize-release-3.
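The snippet above begins with the archive-handling imports; a minimal sketch of that download-and-extract step might look like the following, where the URL and archive name are placeholders rather than the actual competition files.

```python
import zipfile
from os import path
from urllib.request import urlretrieve

# Sketch of the first preprocessing step; the URL and archive name are placeholders.
DATA_URL = "https://example.com/HHP_release3.zip"
ARCHIVE = "HHP_release3.zip"

if not path.exists(ARCHIVE):
    urlretrieve(DATA_URL, ARCHIVE)        # download the raw archive

with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall("data")                 # unpack the raw files into ./data
```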
Latin part of cc100 corpus
This dataset contains parts of the Latin portion of the cc100 dataset. It was used to train a RoBERTa-based language model with Hugging Face.
Preprocessing
I undertook the following preprocessing steps:
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Use of CLTK for sentence splitting and normalisation.
- Retaining only lines containing letters of the Latin alphabet, numerals, and certain punctuation (--> grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!-… See the full description on the dataset page: https://huggingface.co/datasets/pstroe/cc100-latin.
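A rough Python equivalent of the grep filter in the last step above; the character class is abbreviated here because the original pattern is truncated.

```python
import re

# Approximate line filter: keep lines made up of Latin letters, digits,
# and basic punctuation. The character class is abbreviated for illustration.
LATIN_LINE = re.compile(r"^[A-Za-z0-9ÄÖÜäöüÆæŒœ .,;:?!-]+$")

def keep_line(line: str) -> bool:
    return bool(LATIN_LINE.match(line.strip()))

print(keep_line("Gallia est omnis divisa in partes tres."))  # True
print(keep_line("Αὐτὸς γὰρ ἐφέλκεται ἄνδρα σίδηρος."))        # False (Greek letters)
```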
Dataset Specifications for MoleculeSTM
We provide the raw dataset (after preprocessing) at this Hugging Face link, or you can download it by running python download.py.
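As an alternative to running download.py, the same files can presumably be fetched with huggingface_hub; a sketch is shown below, with the target directory left to the caller.

```python
from huggingface_hub import snapshot_download

# Sketch: pull the preprocessed MoleculeSTM data directly from the Hub.
local_dir = snapshot_download(
    repo_id="chao1224/MoleculeSTM",
    repo_type="dataset",   # dataset repository, not a model repository
)
print("Downloaded to:", local_dir)
```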
1. Pretraining Dataset: PubChemSTM
For PubChemSTM, please note that we can only release the chemical structure information. If you need the textual data, please follow our preprocessing scripts.
2. Downstream Datasets
Please refer to the following for three downstream tasks:
DrugBank_data for… See the full description on the dataset page: https://huggingface.co/datasets/chao1224/MoleculeSTM.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "SemCor – sense-tagged English corpus"
Description
This dataset is derived from the wsd_semcor dataset, originally hosted on Hugging Face. It has been preprocessed for tasks related to Word Sense Disambiguation (WSD) and WordNet integration.
Preprocessing
The original text data underwent the following preprocessing steps:
- Text splitting into individual words (lemmas).
- TF-IDF (Term Frequency-Inverse Document Frequency) analysis to understand… See the full description on the dataset page: https://huggingface.co/datasets/MarkChen1214/SemCor.
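A minimal sketch of the kind of TF-IDF analysis described above, using scikit-learn; the two sentences are toy examples, not SemCor text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy TF-IDF example; in practice the documents would come from SemCor.
docs = [
    "the bank raised interest rates",
    "they sat on the bank of the river",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())       # learned vocabulary
print(tfidf.toarray().round(2))                 # per-document TF-IDF weights
```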
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
PIAST Dataset
This repo is for downloading the transcribed MIDI and text data of the PIAST Dataset. The audio files can be downloaded by following the process described in the GitHub repository.
UPDATES
Nov 13, 2024: The MIDI files and text data for both PIAST-AT and PIAST-YT have been uploaded! However, due to a data preprocessing issue, some files are missing compared to the numbers reported in the paper. These will be added in a future version update, so please stay tuned!… See the full description on the dataset page: https://huggingface.co/datasets/Hayeonbang/PIAST.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
SLMS-KD-Benchmarks Dataset
This repository contains the SLMS-KD-Benchmarks dataset, a collection of benchmarks for evaluating smaller language models (SLMs), particularly in knowledge distillation tasks. This dataset is a curated collection of existing datasets from Hugging Face. We have applied custom preprocessing and new train/validation/test splits to suit our benchmarking needs. We extend our sincere gratitude to the original creators for their invaluable work.… See the full description on the dataset page: https://huggingface.co/datasets/MothMalone/SLMS-KD-Benchmarks.
Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
- data/sample.csv: Sample dataset file.
- data/train.csv: Training dataset.
- data/test.csv: Testing dataset.
- scripts/preprocess.py: Script for preprocessing the dataset.
- scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:
import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
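The preprocessing script itself is not reproduced on the card; a hypothetical sketch of what scripts/preprocess.py might contain is shown below, with the cleaning steps being assumptions rather than the actual script.

```python
import pandas as pd

# Hypothetical scripts/preprocess.py: basic cleaning of data/sample.csv.
# The steps below are assumptions for illustration, not the actual script.
def main() -> None:
    df = pd.read_csv("data/sample.csv")
    df = df.drop_duplicates().dropna()               # remove duplicates and missing rows
    df.to_csv("data/sample_clean.csv", index=False)  # write the cleaned copy

if __name__ == "__main__":
    main()
```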
Bioactivity Report QM9 - Molecular Data Preprocessing and ML Pipeline
This data set provides a comprehensive set of quantum chemical properties for a relevant and consistent chemical space of small organic molecules. The dataset consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF, corresponding to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical… See the full description on the dataset page: https://huggingface.co/datasets/Desp-ML/Bioactivity_Final_Project_QM9.
https://choosealicense.com/licenses/cc0-1.0/
Nightsky 50M Dataset
~50 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.
Request data deletion
A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/50-million-bluesky-posts.
Dataset Card for "squad"
This truncated dataset is derived from the Stanford Question Answering Dataset (SQuAD) for reading comprehension. Its primary aim is to extract instances from the original SQuAD dataset that align with the context length of BERT, RoBERTa, OPT, and T5 models.
Preprocessing and Filtering
Preprocessing involves tokenization using the BertTokenizer (WordPiece), RobertaTokenizer (Byte-level BPE), OPTTokenizer (Byte-Pair Encoding), and T5Tokenizer… See the full description on the dataset page: https://huggingface.co/datasets/varun-v-rao/newsqa.
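A sketch of the kind of length filter implied above, shown for the BERT tokenizer (512 tokens is BERT's maximum input length; the field names follow the SQuAD format, and the analogous check would use the RoBERTa, OPT, or T5 tokenizers).

```python
from transformers import BertTokenizerFast

# Sketch: keep only question/context pairs that fit in BERT's 512-token window.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def fits_context(example: dict, max_len: int = 512) -> bool:
    enc = tokenizer(example["question"], example["context"], truncation=False)
    return len(enc["input_ids"]) <= max_len

sample = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
}
print(fits_context(sample))  # True for this short example
```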
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Corpus Summary
This corpus has 192,050 entries consisting of descriptive sentences about the faces in the CelebA dataset. The corpus was preprocessed by translating the CelebA captions into Spanish with the algorithm used in Text2FaceGAN. In particular, all sentences are combined to generate a larger corpus. Additionally, a preprocessing step was applied that consists of eliminating stopwords, separation symbols, and complementary elements that are not useful for… See the full description on the dataset page: https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp.
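A small sketch of the stopword-removal step described above, using NLTK's Spanish stopword list; the caption below is illustrative, not taken from the corpus.

```python
import nltk
from nltk.corpus import stopwords

# Sketch: strip Spanish stopwords from a caption; the sentence is illustrative.
nltk.download("stopwords", quiet=True)
spanish_stopwords = set(stopwords.words("spanish"))

caption = "la mujer tiene el pelo negro y una sonrisa atractiva"
tokens = [t for t in caption.split() if t not in spanish_stopwords]
print(tokens)  # function words such as "la", "el", "y", "una" are removed
```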
https://choosealicense.com/licenses/cc0-1.0/
Nightsky 40M Dataset
~40 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.
Request data deletion
A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/40-million-bluesky-posts.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Dataset Name
Dataset Summary
Supported Tasks and Leaderboards
Multi-answer questioning, token classification
Languages
English
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
celex_id, input_ids, token_type_ids, attention_mask, labels
Data Splits
validation samples
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data… See the full description on the dataset page: https://huggingface.co/datasets/stuwang/QAmultilabelEURLEXsamples.
Test Dataset Compilation For Self-Rewarding Training
This is our test dataset compilation for our paper, "Can Large Reasoning Models Self-Train?" Please see our project page for more information about our project. In our paper, we use the three following datasets for evaluation:
- AIME 2024
- AIME 2025
- AMC
We also subsample 1% of the DAPO dataset for additional validation purposes. In this dataset, we compile all four of them together. This, together with our data preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/ftajwar/srt_test_dataset.
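For reference, a hedged sketch of loading this compilation with the datasets library; the split names are not documented here, so the code only inspects whatever splits are available.

```python
from datasets import load_dataset

# Sketch: load the compiled evaluation data from the Hub and list its splits.
dataset = load_dataset("ftajwar/srt_test_dataset")
print(dataset)  # shows the available splits and the number of rows in each
```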
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "nq"
Dataset Summary
This is a modified version of the original Natural Questions (NQ) dataset for QA tasks. The original is available here. Each sample was preprocessed into a SQuAD-like format, and the context was shortened from an entire Wikipedia article to the passage containing the answer.
Dataset Structure
Data Instances
An example of 'train' looks as follows. { "context": "The 2017 Major League Baseball All - Star Game was… See the full description on the dataset page: https://huggingface.co/datasets/LLukas22/nq-simplified.
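A hedged example of loading this simplified version with the datasets library; field names other than "context" are not shown above, so the sketch simply lists whatever fields are present.

```python
from datasets import load_dataset

# Sketch: load the train split and inspect one preprocessed example.
nq = load_dataset("LLukas22/nq-simplified", split="train")
example = nq[0]
print(example["context"][:200])  # shortened passage that contains the answer
print(example.keys())            # remaining fields follow the SQuAD-like format
```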
Dataset Card for Custom Text Dataset
Dataset Name
Custom Text Dataset
Overview
This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.
Composition
Number of records: 50,000
Fields: text, label
Size: 134 MB
Collection Process
The data was collected using web scraping and manual extraction from public domain sources.
Preprocessing… See the full description on the dataset page: https://huggingface.co/datasets/SeoTae/custom_sentiment_analysis_dataset.