100+ datasets found
  1. Random Number Dataset for Machine Learning

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mehedi Hasand1497 (2025). Random Number Dataset for Machine Learning [Dataset]. https://www.kaggle.com/datasets/mehedihasand1497/random-number-dataset-for-machine-learning
    Explore at:
    Available download formats: zip (271,867,989 bytes)
    Dataset updated
    Apr 27, 2025
    Authors
    Mehedi Hasand1497
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large-Scale Random Number Dataset (5 Million Rows, 10 Features)

    This dataset contains 5,000,000 samples with 10 numerical features generated using a uniform random distribution between 0 and 1.

    Additionally, a hidden structure is introduced:
    - Feature 2 is approximately twice Feature 1 plus small Gaussian noise.
    - Other features are purely random.

    📊 Dataset Details

    • Rows: 5,000,000
    • Columns: 10
    • Format: CSV
    • File Size: ~400 MB
    Feature schema:
    • feature_1: random number (0–1, uniform)
    • feature_2: 2 × feature_1 + small noise, N(0, 0.05)
    • feature_3–feature_10: independent random numbers (0–1)
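    A minimal sketch of how data with this structure could be generated, assuming NumPy/pandas; the seed and row count are illustrative (the full dataset has 5,000,000 rows):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows = 100_000  # illustrative; the full dataset has 5,000,000 rows

# All ten features start as uniform random numbers in [0, 1).
data = {f"feature_{i}": rng.uniform(0.0, 1.0, n_rows) for i in range(1, 11)}

# Hidden structure: feature_2 = 2 * feature_1 + Gaussian noise N(0, 0.05).
data["feature_2"] = 2.0 * data["feature_1"] + rng.normal(0.0, 0.05, n_rows)

df = pd.DataFrame(data)  # writing df.to_csv(...) would yield the CSV format
```

    With 5M rows the resulting frame correlates feature_1 and feature_2 almost perfectly, which is the "hidden structure" a model is meant to discover.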

    🎯 Intended Uses

    This dataset is ideal for:
    • Testing and benchmarking machine learning models
    • Regression analysis practice
    • Feature engineering experiments
    • Random data generation research
    • Large-scale data processing testing (Pandas, Dask, Spark)

    🏷️ Licensing

    This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
    You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.

    Learn more about the license at https://creativecommons.org/licenses/by/4.0/.

    📌 Notes

    • All values are generated synthetically.
    • No missing data.
    • Safe for academic, commercial, or personal use.
  2. code-generation-dataset

    • huggingface.co
    Updated May 31, 2025
    + more versions
    Cite
    M Mashhudur Rahim (2025). code-generation-dataset [Dataset]. https://huggingface.co/datasets/XythicK/code-generation-dataset
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    M Mashhudur Rahim
    Description

    📄 Code Generation Dataset

    A large-scale dataset curated for training and evaluating code generation models. This dataset contains high-quality code snippets, prompts, and metadata suitable for various code synthesis tasks, including prompt completion, function generation, and docstring-to-code translation.

      📦 Dataset Summary
    

    The code-generation-dataset provides:

    ✅ Prompts describing coding tasks
    ✅ Code solutions in Python (or other languages, if applicable)
    ✅ Metadata… See the full description on the dataset page: https://huggingface.co/datasets/XythicK/code-generation-dataset.

  3. instruction-dataset-mini-with-generations

    • huggingface.co
    Updated Feb 10, 2023
    + more versions
    Cite
    Charlie Cheng-Jie Ji (2023). instruction-dataset-mini-with-generations [Dataset]. https://huggingface.co/datasets/CharlieJi/instruction-dataset-mini-with-generations
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 10, 2023
    Authors
    Charlie Cheng-Jie Ji
    Description

    Dataset Card for instruction-dataset-mini-with-generations

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/CharlieJi/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/CharlieJi/instruction-dataset-mini-with-generations.

  4. OpenR1-Math-220k

    • kaggle.com
    • huggingface.co
    zip
    Updated Feb 10, 2025
    Cite
    moth (2025). OpenR1-Math-220k [Dataset]. https://www.kaggle.com/datasets/alejopaullier/openr1-math-220k
    Explore at:
    Available download formats: zip (1,295,249,082 bytes)
    Dataset updated
    Feb 10, 2025
    Authors
    moth
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Link to the dataset in Hugging Face

    Dataset description

    OpenR1-Math-220k is a large-scale dataset for mathematical reasoning. It consists of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5. The traces were verified using Math Verify for most samples and Llama-3.3-70B-Instruct as a judge for 12% of the samples, and each problem contains at least one reasoning trace with a correct answer.

    The dataset consists of two splits:

    • default, with 94k problems, which achieves the best performance after SFT.
    • extended, with 131k samples, which adds data sources such as cn_k12. This provides more reasoning traces, but we found the performance after SFT to be lower than with the default subset, likely because the cn_k12 questions are less difficult than those from other sources.

    Dataset curation

    To build OpenR1-Math-220k, we prompt DeepSeek R1 model to generate solutions for 400k problems from NuminaMath 1.5 using SGLang, the generation code is available here. We follow the model card’s recommended generation parameters and prepend the following instruction to the user prompt:

    "Please reason step by step, and put your final answer within \boxed{}."

    We set a 16k token limit per generation, as our analysis showed that only 75% of problems could be solved in under 8k tokens, and most of the remaining problems required the full 16k tokens. We were able to generate 25 solutions per hour per H100, enabling us to generate 300k problem solutions per day on 512 H100s.

    We generate two solutions per problem—and in some cases, four—to provide flexibility in filtering and training. This approach allows for rejection sampling, similar to DeepSeek R1’s methodology, and also makes the dataset suitable for preference optimisation methods like DPO.
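    The filtering uses described above (rejection sampling for SFT, preference pairs for DPO) can be sketched as follows; the records and field names are hypothetical stand-ins for the dataset's actual schema, assumed only for illustration:

```python
# Hypothetical records mimicking a multi-trace reasoning dataset: each
# problem carries several generated traces plus a per-trace correctness flag.
problems = [
    {"problem": "2+2?", "traces": [
        {"text": "... \\boxed{4}", "correct": True},
        {"text": "... \\boxed{5}", "correct": False},
    ]},
    {"problem": "3*3?", "traces": [
        {"text": "... \\boxed{9}", "correct": True},
        {"text": "... \\boxed{9}", "correct": True},
    ]},
]

# Rejection sampling for SFT: keep only traces with a verified correct answer.
sft_rows = [
    {"problem": p["problem"], "solution": t["text"]}
    for p in problems
    for t in p["traces"]
    if t["correct"]
]

# Preference pairs for DPO: pair each correct (chosen) trace with each
# incorrect (rejected) trace for the same problem.
dpo_pairs = [
    {"prompt": p["problem"], "chosen": c["text"], "rejected": r["text"]}
    for p in problems
    for c in p["traces"] if c["correct"]
    for r in p["traces"] if not r["correct"]
]
```

    Problems whose traces are all correct contribute SFT rows but no DPO pairs, which is why generating two to four traces per problem adds filtering flexibility.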

  5. Emilia-Dataset

    • huggingface.co
    Updated Jan 27, 2025
    Cite
    Amphion (2025). Emilia-Dataset [Dataset]. https://huggingface.co/datasets/amphion/Emilia-Dataset
    Explore at:
    Dataset updated
    Jan 27, 2025
    Dataset authored and provided by
    Amphion
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

    This is the official repository 👑 for the Emilia dataset and the source code for the Emilia-Pipe speech data preprocessing pipeline.

      News 🔥
    

    2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!! Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-Dataset.

  6. Population Collapse Time Series Data of the World

    • kaggle.com
    zip
    Updated Aug 12, 2023
    Cite
    Saad Aziz (2023). Population Collapse Time Series Data of the World [Dataset]. https://www.kaggle.com/datasets/saadaziz1985/population-collapse
    Explore at:
    Available download formats: zip (221,868 bytes)
    Dataset updated
    Aug 12, 2023
    Authors
    Saad Aziz
    License

    https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets

    Area covered
    World
    Description

    Background:

    This dataset was extracted from World Bank and UN websites to study population collapse by country and region. The code generates data for seven indicators based on the current date and covers the years 2000 to 2021.

    The code is useful for research purposes. Nine distinct CSV files are associated with it: seven cover the indicators, one contains country groups, and the last holds a 20-year analysis across the seven indicators. The seven indicators below were extracted from the World Bank and United Nations websites.

    Indicators:

    Total Population, Population Growth, Life Expectancy at Birth, Fertility Rate, Death Rate (per 1,000 people), Birth Rate (per 1,000 people), Median Age

    Definition:

    Population collapse is calculated using Total Population, Population Growth, Life Expectancy at Birth, Fertility Rate, Death Rate, Birth Rate, and Median Age; various criteria were applied to extract the data:

    Methodology:

    The data was filtered on several attributes: first, IDs and titles were extracted from the World Bank data, then a timeframe and columns were supplied to extract the data. This filtering ensured that only data meeting the specified criteria was retained. For median age, the UN website was used and data was extracted for all countries. Median age data is not available for groups or regions; however, it can be calculated, since median age data is available for every country of the globe.

    Variables: economy, the seven indicators, and years from 2000 to 2021.

    In the country-group file, countries are assigned to regions, groups, lending categories, income levels, etc., so each country is repeated, since one country is a member of more than one group.

    Analysis:

    The screenshot below covers countries whose population fell over the 20 years while the death rate rose and the birth rate declined. For instance, Ukraine's population was 48.2M in 2002 and fell by 9% to 43.8M by 2021; over the same period its death rate rose from 15.7 to 18.5 (per 1,000 people) and its birth rate fell by 10%, from 8.10 to 7.30.

    [Screenshot: Population Collapse.JPG]
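    The percentage changes quoted above can be verified with a short helper:

```python
def pct_change(old, new):
    """Percentage change from old to new (negative = decline)."""
    return (new - old) / old * 100.0

# Ukraine figures quoted above (2002 vs. 2021).
population_change = pct_change(48.2, 43.8)  # about -9%
birth_rate_change = pct_change(8.10, 7.30)  # about -10%
death_rate_change = pct_change(15.7, 18.5)  # positive: the death rate rose
```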

  7. clinical_note_generation_dataset

    • huggingface.co
    Updated Mar 21, 2026
    Cite
    Eka Care (2026). clinical_note_generation_dataset [Dataset]. https://huggingface.co/datasets/ekacare/clinical_note_generation_dataset
    Explore at:
    Dataset updated
    Mar 21, 2026
    Dataset authored and provided by
    Eka Care
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Clinical Note Generation Dataset

      Dataset Description
    

    The Eka Structured Clinical Note Generation Dataset facilitates evaluation of medical scribe systems capable of transforming transcribed medical conversations into structured, entity-level medical records. This dataset addresses one of the most challenging aspects of healthcare AI: understanding and organising complex medical information into structured formats.

      Dataset Composition and Clinical Relevance… See the full description on the dataset page: https://huggingface.co/datasets/ekacare/clinical_note_generation_dataset.
    
  8. Third Generation Simulation Data (TGSIM) I-294 L1 Trajectories

    • catalog.data.gov
    • data.fr.virginia.gov
    • +12 more
    Updated Jan 20, 2026
    + more versions
    Cite
    Federal Highway Administration (2026). Third Generation Simulation Data (TGSIM) I-294 L1 Trajectories [Dataset]. https://catalog.data.gov/dataset/third-generation-simulation-data-tgsim-i-294-l1-trajectories
    Explore at:
    Dataset updated
    Jan 20, 2026
    Dataset provided by
    Federal Highway Administration (https://highways.dot.gov/)
    Area covered
    Interstate 294
    Description

    The main dataset is a 70 MB file of trajectory data (I294_L1_final.csv) that contains position, speed, and acceleration data for small and large automated (L1) vehicles and non-automated vehicles on a highway in a suburban environment. Supporting files include aerial reference images for ten distinct data collection “Runs” (I294_L1_RunX_with_lanes.png, where X equals 8, 18, and 20 for southbound runs and 1, 3, 7, 9, 11, 19, and 21 for northbound runs). Associated centerline files are also provided for each “Run” (I-294-L1-Run_X-geometry-with-ramps.csv). In each centerline file, x and y coordinates (in meters) marking each lane centerline are provided. The origin point of the reference image is located at the top left corner. Additionally, in each centerline file, an indicator variable is used for each lane to define the following types of road sections: 0=no ramp, 1=on-ramps, 2=off-ramps, and 3=weaving segments. The number attached to each column header is the numerical ID assigned for the specific lane (see “TGSIM – Centerline Data Dictionary – I294 L1.csv” for more details). The dataset defines eight lanes (four lanes in each direction) using these centerline files. Images that map the lanes of interest to the numerical lane IDs referenced in the trajectory dataset are stored in the folder titled “Annotation on Regions.zip”. The southbound lanes are shown visually in I294_L1_Lane-2.png through I294_L1_Lane-5.png and the northbound lanes are shown visually in I294_L1_Lane2.png through I294_L1_Lane5.png. This dataset was collected as part of the Third Generation Simulation Data (TGSIM): A Closer Look at the Impacts of Automated Driving Systems on Human Behavior project. During the project, six trajectory datasets capable of characterizing human-automated vehicle interactions under a diverse set of scenarios in highway and city environments were collected and processed. 
For more information, see the project report found here: https://rosap.ntl.bts.gov/view/dot/74647. This dataset, which is one of the six collected as part of the TGSIM project, contains data collected using one high-resolution 8K camera mounted on a helicopter that followed three SAE Level 1 ADAS-equipped vehicles with adaptive cruise control (ACC) enabled. The three vehicles manually entered the highway, moved to the second-from-leftmost lane, then enabled ACC with minimum following distance settings to initiate a string. The helicopter then followed the string of vehicles (which sometimes broke from the string due to large following distances) northbound through the 4.8 km section of highway at an altitude of 300 meters. The goal of the data collection effort was to collect data related to human drivers' responses to vehicle strings. The road segment has four lanes in each direction and covers a major on-ramp and an off-ramp in the southbound direction and one on-ramp in the northbound direction. The segment of highway is operated by Illinois Tollway and contains a high percentage of heavy vehicles. The camera captured footage during the evening rush hour (3:00 PM-5:00 PM CT) on a sunny day. As part of this dataset, the following files were provided: I294_L1_final.csv contains the numerical data to be used for analysis, including vehicle-level trajectory data at every 0.1 second. Vehicle size (small or large), width, length, and whether the vehicle was one of the test vehicles with ACC engaged ("yes" or "no") are provided with instantaneous location, speed, and acceleration data. All distance measurements (width, length, location) were converted from pixels to meters using the conversion factor 1 pixel = 0.3 meters. I294_L1_RunX_with_lanes.png are the aerial reference images that define the geographic region and associated roadway segments of interest (see bounding boxes on northbound and southbound lanes) for each run X.
The I-294-L1-Run_X-geometry-with-ramps.csv files contain the coordinates that define the lane centerlines.
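    Applying the stated 1 pixel = 0.3 m conversion can be sketched as below; the column names are hypothetical, not the file's actual headers (see the data dictionary shipped with the dataset):

```python
import pandas as pd

PIXELS_TO_METERS = 0.3  # conversion factor stated in the description

# Hypothetical rows mimicking the trajectory file's pixel-based measurements.
df = pd.DataFrame({"width_px": [6.0, 8.0], "length_px": [15.0, 40.0]})

# Convert pixel measurements to meters column by column.
df["width_m"] = df["width_px"] * PIXELS_TO_METERS
df["length_m"] = df["length_px"] * PIXELS_TO_METERS
```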

  9. MASBA: A Large-Scale Dataset for Multi-Level Abstractive Summarization of Bangla Articles

    • data.mendeley.com
    Updated May 21, 2025
    Cite
    MAHMUDUL HASAN (2025). MASBA: A Large-Scale Dataset for Multi-Level Abstractive Summarization of Bangla Articles [Dataset]. http://doi.org/10.17632/rxhj7g6y2k.3
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    MAHMUDUL HASAN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our research hypothesis is to evaluate the effectiveness of different Bangla text summarization methods compared to the original text ('main'). The data shows that:

    • The average length of the main text is 2482.72 characters.
    • The average lengths of the summaries are:
      • sum1: 293.75 characters,
      • sum2: 506.10 characters,
      • sum3: 688.50 characters.

    The compression ratio of each summary method (summary length divided by main length) reveals that:
    • sum1's mean compression ratio is 0.14,
    • sum2's mean compression ratio is 0.24, and
    • sum3's mean compression ratio is 0.33.
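    A mean compression ratio of this kind is the mean of per-article (summary length ÷ main length) ratios; the character counts below are invented for illustration:

```python
def mean_compression_ratio(summary_lengths, main_lengths):
    # Mean of per-article (summary length / main length) ratios.
    ratios = [s / m for s, m in zip(summary_lengths, main_lengths)]
    return sum(ratios) / len(ratios)

# Hypothetical character counts; the real dataset reports mean ratios of
# 0.14 (sum1), 0.24 (sum2), and 0.33 (sum3).
main_lengths = [2000, 3000]
sum1_lengths = [280, 420]
ratio = mean_compression_ratio(sum1_lengths, main_lengths)
```

    Note that the mean of per-article ratios generally differs from the ratio of the mean lengths, which is why 293.75 / 2482.72 ≈ 0.12 need not equal the reported 0.14.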

    Notable findings:
    • sum1 appears to be the shortest summary on average, with a higher degree of compression.
    • sum2 produces summaries of medium length, while sum3 tends to generate the longest summaries.

    Data Gathering and Interpretation: The data can be interpreted to assess which method produces the most concise, yet meaningful, summaries. Researchers can use these findings to evaluate the trade-offs between summary length and completeness of information conveyed.

  10. LLM Text Generation Dataset

    • unidata.pro
    csv
    Cite
    Unidata L.L.C-FZ, LLM Text Generation Dataset [Dataset]. https://unidata.pro/datasets/llm-text-generation/
    Explore at:
    Available download formats: csv
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    The LLM Text Generation dataset offers multilingual text samples from large language models, enriching AI's natural language understanding.

  11. Synthetic datasets generated by Large Language Models

    • resodate.org
    Updated May 27, 2025
    Cite
    Yanco Amor Torterolo Orta; Sofía Micaela Roseti; Antonio Moreno-Sandoval (2025). Synthetic datasets generated by Large Language Models [Dataset]. http://doi.org/10.21950/YXP8Q8
    Explore at:
    Dataset updated
    May 27, 2025
    Dataset provided by
    Universidad Autónoma de Madrid
    GRESEL-UAM: Narrativas Financieras y Literatura
    Eciencia Data
    Authors
    Yanco Amor Torterolo Orta; Sofía Micaela Roseti; Antonio Moreno-Sandoval
    Description

    This dataset is the result of the work done in the GRESEL-UAM project: AI Generation Results Enriched with Simplified Explanations Based on Linguistic Features (Resultados de Generación de IA Enriquecidos con Explicaciones Simplificadas Basadas en Características Lingüísticas). It is part of the publication "Assessing a Literary RAG System with a Human-Evaluated Synthetic QA Dataset Generated by an LLM: Experiments with Knowledge Graphs," to be presented in September 2025 in Zaragoza at the conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). The work has already been accepted for publication in SEPLN's official journal, Procesamiento del Lenguaje Natural.

    The dataset consists of three synthetically generated datasets, a process known as Synthetic Data Generation (SDG). We used three different LLMs: deepseek-r1:14b, llama3.1:8b-instruct-q8_0, and mistral:7b-instruct. Each was given a prompt instructing it to generate a question answering (QA) dataset based on context fragments from the novel Trafalgar by Benito Pérez Galdós. These datasets were later used to evaluate a Retrieval-Augmented Generation (RAG) system.

    Three CSV files are provided, each corresponding to the synthetic dataset generated by one of the models. In total, the dataset contains 359 items. The header includes the following fields: id, context, question, answer, and success. Fields are separated by tabs. The id column is an identifier number. The context column contains the text fragment from which the model generated the questions and answers. The question and answer fields contain the generated questions and answers, respectively. The success column indicates whether the model successfully generated the question and answer in the corresponding fields ("yes" or "no").
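    Reading a tab-separated file with this header can be sketched as follows; the sample row is invented, not taken from the dataset:

```python
import io
import pandas as pd

# Illustrative tab-separated content following the described header
# (id, context, question, answer, success); the data row is a placeholder.
sample = (
    "id\tcontext\tquestion\tanswer\tsuccess\n"
    "1\tA context fragment from the novel...\tA question?\tAn answer\tyes\n"
)

# The description says fields are separated by tabs, hence sep="\t".
df = pd.read_csv(io.StringIO(sample), sep="\t")
```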

  12. Synthetic Dataset Generation Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 2, 2025
    Cite
    Research Intelo (2025). Synthetic Dataset Generation Market Research Report 2033 [Dataset]. https://researchintelo.com/report/synthetic-dataset-generation-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2025 - 2034
    Area covered
    Global
    Description

    Synthetic Dataset Generation Market Outlook



    According to our latest research, the Synthetic Dataset Generation market size was valued at $1.2 billion in 2024 and is projected to reach $8.7 billion by 2033, expanding at an impressive CAGR of 24.6% during 2024–2033. The primary driving force behind this global expansion is the escalating demand for high-quality, diverse, and bias-free datasets to fuel advanced artificial intelligence (AI) and machine learning (ML) models. As organizations across industries face increasing challenges in acquiring large-scale, annotated, and privacy-compliant real-world data, synthetic dataset generation has emerged as a transformative solution. This technology not only accelerates the development and deployment of AI systems but also addresses critical data privacy, security, and cost constraints, making it indispensable in today’s data-centric economy.



    Regional Outlook



    North America currently holds the largest share of the global synthetic dataset generation market, accounting for over 38% of the total market value in 2024. The region’s dominance is primarily attributed to its mature technology ecosystem, robust investment in AI research, and the early adoption of synthetic data solutions by leading enterprises and tech giants. The presence of major synthetic data vendors, a strong network of academic research institutions, and proactive regulatory guidance on data privacy have collectively accelerated market growth in North America. Furthermore, favorable government policies and funding initiatives aimed at advancing AI innovation continue to foster a thriving environment for synthetic dataset generation, particularly in sectors such as healthcare, finance, and autonomous vehicles.



    Asia Pacific is the fastest-growing region in the synthetic dataset generation market, projected to register a remarkable CAGR of 29.3% from 2024 to 2033. This exceptional growth is driven by increasing digital transformation initiatives, rapid adoption of AI-powered solutions, and significant investments by both public and private sectors. Countries like China, Japan, South Korea, and India are aggressively expanding their AI capabilities, leading to a surge in demand for synthetic data to support machine learning and computer vision applications. The region is witnessing heightened interest from global technology vendors, who are establishing partnerships and R&D centers to tap into the burgeoning opportunities. The proliferation of smart devices, e-commerce, and fintech innovations further amplifies the need for scalable and secure synthetic datasets.



    Emerging economies in Latin America, the Middle East, and Africa are gradually embracing synthetic dataset generation, though adoption remains at an early stage due to infrastructural and regulatory challenges. Localized demand is primarily concentrated in industries such as government, BFSI, and telecommunications, where data privacy and localization policies are stringent. While these regions hold significant potential for future growth, market expansion is currently restrained by limited technical expertise, slower digital infrastructure development, and the need for tailored synthetic data solutions that address unique regional requirements. Nonetheless, increasing awareness, pilot projects, and supportive policy reforms are expected to accelerate adoption in the coming years.



    Report Scope





    Attributes and details:
    • Report Title: Synthetic Dataset Generation Market Research Report 2033
    • By Component: Software, Services
    • By Data Type: Text, Image, Video, Audio, Tabular, Others
    • By Application: Machine Learning, Computer Vision, Natural Language Processing, Data Augmentation, Robotics, Autonomous Vehicles, Healthcare, Finance, Retail, Others
    • By Deployment Mode: On-Premises, Cloud
    • By End

  13. social_bias_frames

    • openml.org
    • huggingface.co
    Updated Apr 12, 2025
    + more versions
    Cite
    Sap, Maarten; Gabriel, Saadia; Qin, Lianhui; Smith, Noah A.; Choi, Yejin (2025). social_bias_frames [Dataset]. https://www.openml.org/d/46823
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 12, 2025
    Authors
    Sap, Maarten; Gabriel, Saadia; Qin, Lianhui; Smith, Noah A.; Choi, Yejin
    Description

    Warning: this document and dataset contain content that may be offensive or upsetting.

    Social Bias Frames is a new way of representing the biases and offensiveness that are implied in language. For example, these frames are meant to distill the implication that "women (candidates) are less qualified" behind the statement "we shouldn't lower our standards to hire more women." The Social Bias Inference Corpus (SBIC) supports large-scale learning and evaluation of social implications with over 150k structured annotations of social media posts, spanning over 34k implications about a thousand demographic groups.

    Supported Tasks and Leaderboards

    This dataset supports both classification and generation. Sap et al. developed several models using the SBIC. They report an F1 score of 78.8 in predicting whether the posts in the test set were offensive, an F1 score of 78.6 in predicting whether the posts were intending to be offensive, an F1 score of 80.7 in predicting whether the posts were lewd, and an F1 score of 69.9 in predicting whether the posts were targeting a specific group.

    Another of Sap et al.'s models performed better on the generation task. They report a BLEU score of 77.9, a Rouge-L score of 68.7, and a WMD score of 0.74 in generating a description of the targeted group given a post, as well as a BLEU score of 52.6, a Rouge-L score of 44.9, and a WMD score of 2.79 in generating a description of the implied offensive statement given a post. See the paper for further details.

    Languages

    The language in SBIC is predominantly white-aligned English (78%, using a lexical dialect detector, Blodgett et al., 2016). The curators find that less than 10% of posts in SBIC are detected as having the AAE dialect category. The BCP-47 language tag is, presumably, en-US.

    The main aim of this dataset is to cover a wide variety of social biases that are implied in text, both subtle and overt, and to make the biases representative of real-world discrimination that people experience (RWJF 2017). The curators also included some innocuous statements, to balance out biased, offensive, or harmful content.

    Source Data

    The curators included online posts from the following sources, collected sometime between 2014-2019:

    • r/darkJokes, r/meanJokes, r/offensiveJokes
    • Reddit microaggressions (Breitfeller et al., 2019)
    • Toxic language detection Twitter corpora (Waseem & Hovy, 2016; Davidson et al., 2017; Founa et al., 2018)
    • Data scraped from hate sites (Gab, Stormfront, r/incels, r/mensrights)

    Columns:
    • whoTarget: a string; '0.0' if the target is a group, '1.0' if the target is an individual, and blank if the post is not offensive
    • intentYN: a string indicating if the intent behind the statement was to offend. Categorical with four possible answers: '1.0' yes, '0.66' probably, '0.33' probably not, '0.0' no
    • sexYN: a string indicating whether the post contains a sexual or lewd reference. Categorical with three possible answers: '1.0' yes, '0.5' maybe, '0.0' no
    • sexReason: a string containing a free-text explanation of what is sexual if so indicated, blank otherwise
    • offensiveYN (target): a string indicating if the post could be offensive to anyone. Categorical with three possible answers: '1.0' yes, '0.5' maybe, '0.0' no
    • annotatorGender: a string indicating the gender of the MTurk worker
    • annotatorMinority: a string indicating whether the MTurk worker identifies as a minority
    • sexPhrase: a string indicating which part of the post references something sexual, blank otherwise
    • speakerMinorityYN: a string indicating whether the speaker was part of the same minority group being targeted. Categorical with three possible answers: '1.0' yes, '0.5' maybe, '0.0' no
    • WorkerId: a hashed version of the MTurk workerId
    • HITId: a string id that uniquely identifies each post
    • annotatorPolitics: a string indicating the political leaning of the MTurk worker
    • annotatorRace: a string indicating the race of the MTurk worker
    • annotatorAge: a string indicating the age of the MTurk worker
    • post: a string containing the text of the post that was annotated
    • targetMinority: a string indicating the demographic group targeted
    • targetCategory: a string indicating the high-level category of the demographic group(s) targeted
    • targetStereotype: a string containing the implied statement
    • dataSource: a string indicating the source of the post (t/... means Twitter, r/... means a subreddit)
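    Since the categorical columns store their labels as strings ('1.0', '0.5', '0.0', or blank), a hedged sketch of coercing them to numbers, using invented rows that follow the described convention:

```python
import pandas as pd

# Hypothetical rows using the string-coded labels described above.
df = pd.DataFrame({"offensiveYN": ["1.0", "0.5", "0.0", ""]})

# Labels are stored as strings; coerce to floats, with blank -> NaN.
df["offensiveYN_num"] = pd.to_numeric(df["offensiveYN"], errors="coerce")
```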

    paper_url = "https://aclanthology.org/2020.acl-main.486.pdf"

    original_data_url = "https://huggingface.co/datasets/allenai/social_bias_frames"

  14. UNISOLAR Solar Power Generation Dataset

    • kaggle.com
    zip
    Updated Nov 9, 2022
    Cite
    CDAClab (2022). UNISOLAR Solar Power Generation Dataset [Dataset]. https://www.kaggle.com/datasets/cdaclab/unisolar
    Explore at:
    zip(15462044 bytes)Available download formats
    Dataset updated
    Nov 9, 2022
    Authors
    CDAClab
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The UNISOLAR dataset contains high-granularity photovoltaic (PV) solar energy generation, solar irradiance, and weather data from 42 PV sites deployed across five campuses of La Trobe University, Victoria, Australia. The dataset includes approximately two years of PV solar energy generation data collected at 15-minute intervals. Geographical placement and engineering specifications for each site are also provided to aid researchers in modelling solar energy generation. Weather data, provided by the Australian Bureau of Meteorology (BOM), is available at 1-minute intervals and comprises apparent temperature, air temperature, dew point temperature, relative humidity, wind speed, and wind direction. The paper describes the data collection methods, cleaning, and merging with weather data. This dataset can be used to forecast solar generation, benchmark models, and enhance operational outcomes at solar sites.
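    Aligning the 1-minute weather readings with the 15-minute PV generation intervals is a typical first step with this kind of data; a minimal pandas sketch with synthetic values (the column name `air_temp` is hypothetical, not the dataset's actual schema):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one hour of 1-minute weather readings, as described
# above. The column name `air_temp` is hypothetical.
rng = np.random.default_rng(0)
weather = pd.DataFrame(
    {"air_temp": rng.normal(20.0, 2.0, 60)},
    index=pd.date_range("2021-01-01 10:00", periods=60, freq="min"),
)

# Average each 15-minute window so the weather series lines up with the
# 15-minute PV generation intervals.
weather_15min = weather.resample("15min").mean()
print(weather_15min.shape)  # 60 one-minute rows -> 4 fifteen-minute rows
```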

    Acknowledgements

    Please cite the following paper if you use this dataset:

    • S. Wimalaratne, D. Haputhanthri, S. Kahawala, G. Gamage, D. Alahakoon and A. Jennings, "UNISOLAR: An Open Dataset of Photovoltaic Solar Energy Generation in a Large Multi-Campus University Setting," 2022 15th International Conference on Human System Interaction (HSI), 2022, pp. 1-5, doi: 10.1109/HSI55341.2022.9869474.

    Usage Policy and Legal Disclaimer

    This dataset is distributed for research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0). By clicking on the download button(s) below, you agree to use this data only for non-commercial, research, or academic applications. Please cite the above paper if you use this dataset.

    Github: https://github.com/CDAC-lab/UNISOLAR

  15. Next Generation Simulation (NGSIM) Program Lankershim Boulevard Videos

    • data.virginia.gov
    • data.es.virginia.gov
    • +10more
    pdf
    Updated Jan 1, 2016
    Cite
    U.S. Department of Transportation (2016). Next Generation Simulation (NGSIM) Program Lankershim Boulevard Videos [Dataset]. https://data.virginia.gov/dataset/next-generation-simulation-ngsim-program-lankershim-boulevard-videos
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jan 1, 2016
    Dataset provided by
    US Department of Transportation
    Authors
    U.S. Department of Transportation
    Area covered
    Lankershim Boulevard
    Description

    As part of the Federal Highway Administration’s (FHWA) Next Generation Simulation (NGSIM) project, video data were collected on June 16th, 2005 on an arterial segment of Lankershim Boulevard in Los Angeles, California. The data cover 30 minutes in total, segmented into two periods (8:30 a.m. to 8:45 a.m. and 8:45 a.m. to 9:00 a.m.). The dataset includes both raw and processed video files from each of the five cameras for the two time periods. Camera numbering runs from southern-most (1) to northern-most (5). The raw videos give the original vehicle movement data and offer users a view of how the section was observed. The processed videos show the vehicles with their vehicle identification numbers superimposed. These videos can be viewed on their own or used to cross-reference the textual vehicle trajectory data in the NGSIM trajectory dataset with the corresponding video.

    For related datasets please see the following:
    • NGSIM Vehicle Trajectories and Supporting Data: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Vehicle-Trajector/8ect-6jqj
    • NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny
    • NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur
    • NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf

  16. Which social media platforms are most popular

    • pewresearch.org
    csv
    Updated Feb 2, 2026
    Cite
    Pew Research Center (2026). Which social media platforms are most popular [Dataset]. https://www.pewresearch.org/internet/fact-sheet/social-media/
    Explore at:
    csvAvailable download formats
    Dataset updated
    Feb 2, 2026
    Dataset authored and provided by
    Pew Research Centerhttp://pewresearch.org/
    License

    https://www.pewresearch.org/terms-and-conditions/

    Description

    A line chart that shows % of U.S. adults who say they ever use …

  17. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark for evaluating the ability of language models to generate knowledge graphs (KGs) from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and remaining faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    The structure of the repository is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
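    The compliance requirement described in this entry — every extracted triple must use a relation defined by the input ontology — can be checked mechanically. A minimal sketch (the relation set is a hypothetical fragment of the music ontology, not the benchmark's actual ontology file):

```python
# Check that extracted triples only use relations defined by the ontology,
# per the task definition above. The relation set here is a hypothetical
# fragment of the music ontology.
ALLOWED_RELATIONS = {"publication date", "lyrics by", "composed by"}

def ontology_compliant(triples: list) -> bool:
    """Return True if every triple's relation is defined in the ontology."""
    return all(t["rel"] in ALLOWED_RELATIONS for t in triples)

triples = [
    {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
]
print(ontology_compliant(triples))  # True
```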

  18. Dataset of 30 energy customers with flexibility data, and distributed...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Apr 1, 2024
    Cite
    Pereira, Helder; Gomes, Luis; Morais, Hugo; Vale, Zita (2024). Dataset of 30 energy customers with flexibility data, and distributed generation, considering residential, small commerce, large commerce, and industrial customers [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6783288
    Explore at:
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    Polytechnic of Porto
    INESC-ID
    Authors
    Pereira, Helder; Gomes, Luis; Morais, Hugo; Vale, Zita
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset covers 30 customers: ten residential, ten small commerce, five large commerce, and five industrial. Combining several customer types yields a dataset with different consumption profiles, generation, and flexibility, and therefore different levels of participation in demand response events.

    The residential profiles of the considered customers use the data available in the Working Group on Intelligent Data Mining and Analysis (IDMA): https://site.ieee.org/pes-iss/data-sets/

    The values cover a one-week period at 15-minute reading intervals. All values are expressed in kWh, and the matrices are structured as [customer x time_period].
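    One week at 15-minute resolution gives 7 × 24 × 4 = 672 periods, so each [customer x time_period] matrix is 30 × 672. A sketch with synthetic values (the consumption figures are made up, only the shape follows the description above):

```python
import numpy as np

# A week at 15-minute resolution: 7 days * 24 hours * 4 periods/hour = 672.
n_customers, n_periods = 30, 7 * 24 * 4

# Synthetic stand-in for a [customer x time_period] consumption matrix in kWh;
# the values are made up, only the shape follows the dataset description.
consumption = np.random.default_rng(1).uniform(0.0, 5.0, (n_customers, n_periods))
print(consumption.shape)              # (30, 672)
print(consumption.sum(axis=1).shape)  # weekly total per customer: (30,)
```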

    We would be grateful if you could acknowledge the use of this dataset in your publications. Please use the Zenodo publication to cite this work.

  19. Data generation volume worldwide 2010-2029

    • statista.com
    Updated Nov 19, 2025
    Cite
    Statista (2025). Data generation volume worldwide 2010-2029 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes. Thus, global data generation will triple between 2025 and 2029. Data creation has been expanding continuously over the past decade. In 2020, the growth was higher than previously expected, caused by the increased demand due to the coronavirus (COVID-19) pandemic, as more people worked and learned from home and used home entertainment options more often.

  20. PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Mar 17, 2025
    Cite
    Novack, Zachary (2025). PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13763755
    Explore at:
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Novack, Zachary
    McAuley, Julian
    Berg-Kirkpatrick, Taylor
    Long, Phillip
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing. Refer to our paper for more information, and our GitHub repository for any code-related details. Please cite both our paper and our collaborators' paper if you use this dataset (see our GitHub for more information).

    Upon further use of the PDMX dataset, we discovered a discrepancy between the public-facing copyright metadata on the MuseScore website and the internal copyright data of the MuseScore files themselves, which affects 31,221 (12.29%) of the songs. We have decided to proceed with the former, given its public visibility (i.e., this is what the MuseScore website presents to its users). Files with conflicting internal licenses are flagged in the license_conflict column of PDMX. We recommend using the no_license_conflict subset of PDMX (which still includes 222,856 songs) moving forward.

    Additionally, for each song in PDMX, we not only provide the MusicRender and metadata JSON files, but we also try to include the associated compressed MusicXML (MXL), sheet music (PDF), and MIDI (MID) files when available. Due to the corruption of 42 of the original MuseScore files, these songs lack those associated files (since they could not be converted to those formats) and only include the MusicRender and metadata JSON files. The all_valid subset of PDMX describes the songs where all associated files are valid.
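    The recommended filtering above can be expressed as a simple pandas selection. A sketch assuming boolean license_conflict and all_valid columns in a metadata table (the table contents here are synthetic; the exact column types in PDMX may differ):

```python
import pandas as pd

# Toy stand-in for the PDMX metadata table; the column names follow the
# description above, but the rows are synthetic.
pdmx = pd.DataFrame({
    "song_id": [1, 2, 3, 4],
    "license_conflict": [False, True, False, False],
    "all_valid": [True, True, False, True],
})

# Recommended subset: songs without conflicting internal licenses.
no_license_conflict = pdmx[~pdmx["license_conflict"]]

# Songs whose associated MXL/PDF/MID files all converted successfully.
all_valid = no_license_conflict[no_license_conflict["all_valid"]]
print(len(no_license_conflict), len(all_valid))  # 3 2
```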
