100+ datasets found
  1. Random Number Dataset for Machine Learning

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mehedi Hasand1497 (2025). Random Number Dataset for Machine Learning [Dataset]. https://www.kaggle.com/datasets/mehedihasand1497/random-number-dataset-for-machine-learning
    Explore at:
    Available download formats: zip (271,867,989 bytes)
    Dataset updated
    Apr 27, 2025
    Authors
    Mehedi Hasand1497
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large-Scale Random Number Dataset (5 Million Rows, 10 Features)

    This dataset contains 5,000,000 samples with 10 numerical features generated using a uniform random distribution between 0 and 1.

    Additionally, a hidden structure is introduced:
    - Feature 2 is approximately twice Feature 1 plus small Gaussian noise.
    - Other features are purely random.

    📊 Dataset Details

    • Rows: 5,000,000
    • Columns: 10
    • Format: CSV
    • File Size: ~400 MB
    Feature schema:
    • feature_1: random number (0–1, uniform)
    • feature_2: 2 × feature_1 + small noise, N(0, 0.05)
    • feature_3–feature_10: independent random numbers (0–1)
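    A minimal sketch of how data with this structure could be generated, assuming NumPy/pandas; the seed and row count are illustrative (the full dataset has 5,000,000 rows):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows = 100_000  # illustrative; the full dataset has 5,000,000 rows

# All ten features start as uniform random numbers in [0, 1).
data = {f"feature_{i}": rng.uniform(0.0, 1.0, n_rows) for i in range(1, 11)}

# Hidden structure: feature_2 = 2 * feature_1 + Gaussian noise N(0, 0.05).
data["feature_2"] = 2.0 * data["feature_1"] + rng.normal(0.0, 0.05, n_rows)

df = pd.DataFrame(data)  # writing df.to_csv(...) would yield the CSV format
```

    With 5M rows the resulting frame correlates feature_1 and feature_2 almost perfectly, which is the "hidden structure" a model is meant to discover.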

    🎯 Intended Uses

    This dataset is ideal for:
    • Testing and benchmarking machine learning models
    • Regression analysis practice
    • Feature engineering experiments
    • Random data generation research
    • Large-scale data processing testing (Pandas, Dask, Spark)

    🏷️ Licensing

    This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
    You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.

    Learn more about the license at https://creativecommons.org/licenses/by/4.0/.

    📌 Notes

    • All values are generated synthetically.
    • No missing data.
    • Safe for academic, commercial, or personal use.
  2. code-generation-dataset

    • huggingface.co
    Updated May 31, 2025
    + more versions
    Cite
    M Mashhudur Rahim (2025). code-generation-dataset [Dataset]. https://huggingface.co/datasets/XythicK/code-generation-dataset
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    M Mashhudur Rahim
    Description

    📄 Code Generation Dataset

    A large-scale dataset curated for training and evaluating code generation models. This dataset contains high-quality code snippets, prompts, and metadata suitable for various code synthesis tasks, including prompt completion, function generation, and docstring-to-code translation.

      📦 Dataset Summary
    

    The code-generation-dataset provides:

    ✅ Prompts describing coding tasks
    ✅ Code solutions in Python (or other languages, if applicable)
    ✅ Metadata… See the full description on the dataset page: https://huggingface.co/datasets/XythicK/code-generation-dataset.

  3. instruction-dataset-mini-with-generations

    • huggingface.co
    Updated Feb 10, 2023
    + more versions
    Cite
    Charlie Cheng-Jie Ji (2023). instruction-dataset-mini-with-generations [Dataset]. https://huggingface.co/datasets/CharlieJi/instruction-dataset-mini-with-generations
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 10, 2023
    Authors
    Charlie Cheng-Jie Ji
    Description

    Dataset Card for instruction-dataset-mini-with-generations

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/CharlieJi/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/CharlieJi/instruction-dataset-mini-with-generations.

  4. OpenR1-Math-220k

    • kaggle.com
    • huggingface.co
    zip
    Updated Feb 10, 2025
    Cite
    moth (2025). OpenR1-Math-220k [Dataset]. https://www.kaggle.com/datasets/alejopaullier/openr1-math-220k
    Explore at:
    Available download formats: zip (1,295,249,082 bytes)
    Dataset updated
    Feb 10, 2025
    Authors
    moth
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Link to the dataset in Hugging Face

    Dataset description

    OpenR1-Math-220k is a large-scale dataset for mathematical reasoning. It consists of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5. The traces were verified using Math Verify for most samples and Llama-3.3-70B-Instruct as a judge for 12% of the samples, and each problem contains at least one reasoning trace with a correct answer.

    The dataset consists of two splits:

    • default, with 94k problems, which achieves the best performance after SFT.
    • extended, with 131k samples, which adds data sources such as cn_k12. This provides more reasoning traces, but we found the performance after SFT to be lower than with the default subset, likely because the cn_k12 questions are less difficult than those from other sources.

    Dataset curation

    To build OpenR1-Math-220k, we prompt DeepSeek R1 model to generate solutions for 400k problems from NuminaMath 1.5 using SGLang, the generation code is available here. We follow the model card’s recommended generation parameters and prepend the following instruction to the user prompt:

    "Please reason step by step, and put your final answer within \boxed{}."

    We set a 16k token limit per generation, as our analysis showed that only 75% of problems could be solved in under 8k tokens, and most of the remaining problems required the full 16k tokens. We were able to generate 25 solutions per hour per H100, enabling us to generate 300k problem solutions per day on 512 H100s.

    We generate two solutions per problem—and in some cases, four—to provide flexibility in filtering and training. This approach allows for rejection sampling, similar to DeepSeek R1’s methodology, and also makes the dataset suitable for preference optimisation methods like DPO.
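    The filtering uses described above (rejection sampling for SFT, preference pairs for DPO) can be sketched as follows; the records and field names are hypothetical stand-ins for the dataset's actual schema, assumed only for illustration:

```python
# Hypothetical records mimicking a multi-trace reasoning dataset: each
# problem carries several generated traces plus a per-trace correctness flag.
problems = [
    {"problem": "2+2?", "traces": [
        {"text": "... \\boxed{4}", "correct": True},
        {"text": "... \\boxed{5}", "correct": False},
    ]},
    {"problem": "3*3?", "traces": [
        {"text": "... \\boxed{9}", "correct": True},
        {"text": "... \\boxed{9}", "correct": True},
    ]},
]

# Rejection sampling for SFT: keep only traces with a verified correct answer.
sft_rows = [
    {"problem": p["problem"], "solution": t["text"]}
    for p in problems
    for t in p["traces"]
    if t["correct"]
]

# Preference pairs for DPO: pair each correct (chosen) trace with each
# incorrect (rejected) trace for the same problem.
dpo_pairs = [
    {"prompt": p["problem"], "chosen": c["text"], "rejected": r["text"]}
    for p in problems
    for c in p["traces"] if c["correct"]
    for r in p["traces"] if not r["correct"]
]
```

    Problems whose traces are all correct contribute SFT rows but no DPO pairs, which is why generating two to four traces per problem adds filtering flexibility.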

  5. Emilia-Dataset

    • huggingface.co
    Updated Jan 27, 2025
    Cite
    Amphion (2025). Emilia-Dataset [Dataset]. https://huggingface.co/datasets/amphion/Emilia-Dataset
    Explore at:
    Dataset updated
    Jan 27, 2025
    Dataset authored and provided by
    Amphion
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

    This is the official repository 👑 for the Emilia dataset and the source code for the Emilia-Pipe speech data preprocessing pipeline.

      News 🔥
    

    2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!! Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-Dataset.

  6. Population Collapse Time Series Data of the World

    • kaggle.com
    zip
    Updated Aug 12, 2023
    Cite
    Saad Aziz (2023). Population Collapse Time Series Data of the World [Dataset]. https://www.kaggle.com/datasets/saadaziz1985/population-collapse
    Explore at:
    Available download formats: zip (221,868 bytes)
    Dataset updated
    Aug 12, 2023
    Authors
    Saad Aziz
    License

    https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets

    Area covered
    World
    Description

    Background:

    This dataset was extracted from World Bank and UN websites to study population collapse by country and region. The code generates data for seven indicators based on the current date and covers the years 2000 to 2021.

    The code is useful for research purposes. Nine distinct CSV files are associated with it: seven cover the indicators, one contains country groups, and the last holds a 20-year analysis across the seven indicators. The seven indicators below were extracted from the World Bank and United Nations websites.

    Indicators:

    Total Population, Population Growth, Life Expectancy at Birth, Fertility Rate, Death Rate (per 1,000 people), Birth Rate (per 1,000 people), Median Age

    Definition:

    Population collapse is calculated using Total Population, Population Growth, Life Expectancy at Birth, Fertility Rate, Death Rate, Birth Rate, and Median Age; various criteria were applied to extract the data:

    Methodology:

    The data was filtered on several attributes: first, IDs and titles were extracted from the World Bank data, then a timeframe and columns were supplied to extract the data. This filtering ensured that only data meeting the specified criteria was retained. For median age, the UN website was used and data was extracted for all countries. Median age data is not available for groups or regions; however, it can be calculated, since median age data is available for every country of the globe.

    Variables: economy, the seven indicators, and years from 2000 to 2021.

    In the country-group file, countries are assigned to regions, groups, lending categories, income levels, etc., so each country is repeated, since one country is a member of more than one group.

    Analysis:

    The screenshot below covers countries whose population fell over the 20 years while the death rate rose and the birth rate declined. For instance, Ukraine's population was 48.2M in 2002 and fell by 9% to 43.8M by 2021; over the same period its death rate rose from 15.7 to 18.5 (per 1,000 people) and its birth rate fell by 10%, from 8.10 to 7.30.

    [Screenshot: Population Collapse.JPG]
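    The percentage changes quoted above can be verified with a short helper:

```python
def pct_change(old, new):
    """Percentage change from old to new (negative = decline)."""
    return (new - old) / old * 100.0

# Ukraine figures quoted above (2002 vs. 2021).
population_change = pct_change(48.2, 43.8)  # about -9%
birth_rate_change = pct_change(8.10, 7.30)  # about -10%
death_rate_change = pct_change(15.7, 18.5)  # positive: the death rate rose
```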

  7. clinical_note_generation_dataset

    • huggingface.co
    Updated Mar 21, 2026
    Cite
    Eka Care (2026). clinical_note_generation_dataset [Dataset]. https://huggingface.co/datasets/ekacare/clinical_note_generation_dataset
    Explore at:
    Dataset updated
    Mar 21, 2026
    Dataset authored and provided by
    Eka Care
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Clinical Note Generation Dataset

      Dataset Description
    

    The Eka Structured Clinical Note Generation Dataset facilitates evaluation of medical scribe systems capable of transforming transcribed medical conversations into structured, entity-level medical records. This dataset addresses one of the most challenging aspects of healthcare AI: understanding and organising complex medical information into structured formats.

      Dataset Composition and Clinical Relevance… See the full description on the dataset page: https://huggingface.co/datasets/ekacare/clinical_note_generation_dataset.
    
  8. Third Generation Simulation Data (TGSIM) I-294 L1 Trajectories

    • catalog.data.gov
    • data.fr.virginia.gov
    • +12 more
    Updated Jan 20, 2026
    + more versions
    Cite
    Federal Highway Administration (2026). Third Generation Simulation Data (TGSIM) I-294 L1 Trajectories [Dataset]. https://catalog.data.gov/dataset/third-generation-simulation-data-tgsim-i-294-l1-trajectories
    Explore at:
    Dataset updated
    Jan 20, 2026
    Dataset provided by
    Federal Highway Administration (https://highways.dot.gov/)
    Area covered
    Interstate 294
    Description

    The main dataset is a 70 MB file of trajectory data (I294_L1_final.csv) that contains position, speed, and acceleration data for small and large automated (L1) vehicles and non-automated vehicles on a highway in a suburban environment. Supporting files include aerial reference images for ten distinct data collection “Runs” (I294_L1_RunX_with_lanes.png, where X equals 8, 18, and 20 for southbound runs and 1, 3, 7, 9, 11, 19, and 21 for northbound runs). Associated centerline files are also provided for each “Run” (I-294-L1-Run_X-geometry-with-ramps.csv). In each centerline file, x and y coordinates (in meters) marking each lane centerline are provided. The origin point of the reference image is located at the top left corner. Additionally, in each centerline file, an indicator variable is used for each lane to define the following types of road sections: 0=no ramp, 1=on-ramps, 2=off-ramps, and 3=weaving segments. The number attached to each column header is the numerical ID assigned for the specific lane (see “TGSIM – Centerline Data Dictionary – I294 L1.csv” for more details). The dataset defines eight lanes (four lanes in each direction) using these centerline files. Images that map the lanes of interest to the numerical lane IDs referenced in the trajectory dataset are stored in the folder titled “Annotation on Regions.zip”. The southbound lanes are shown visually in I294_L1_Lane-2.png through I294_L1_Lane-5.png and the northbound lanes are shown visually in I294_L1_Lane2.png through I294_L1_Lane5.png. This dataset was collected as part of the Third Generation Simulation Data (TGSIM): A Closer Look at the Impacts of Automated Driving Systems on Human Behavior project. During the project, six trajectory datasets capable of characterizing human-automated vehicle interactions under a diverse set of scenarios in highway and city environments were collected and processed. 
For more information, see the project report found here: https://rosap.ntl.bts.gov/view/dot/74647. This dataset, which is one of the six collected as part of the TGSIM project, contains data collected using one high-resolution 8K camera mounted on a helicopter that followed three SAE Level 1 ADAS-equipped vehicles with adaptive cruise control (ACC) enabled. The three vehicles manually entered the highway, moved to the second-from-leftmost lane, then enabled ACC with minimum following distance settings to initiate a string. The helicopter then followed the string of vehicles (which sometimes broke from the string due to large following distances) northbound through the 4.8 km section of highway at an altitude of 300 meters. The goal of the data collection effort was to collect data related to human drivers' responses to vehicle strings. The road segment has four lanes in each direction and covers a major on-ramp and an off-ramp in the southbound direction and one on-ramp in the northbound direction. The segment of highway is operated by Illinois Tollway and contains a high percentage of heavy vehicles. The camera captured footage during the evening rush hour (3:00 PM-5:00 PM CT) on a sunny day. As part of this dataset, the following files were provided: I294_L1_final.csv contains the numerical data to be used for analysis, including vehicle-level trajectory data at every 0.1 second. Vehicle size (small or large), width, length, and whether the vehicle was one of the test vehicles with ACC engaged ("yes" or "no") are provided with instantaneous location, speed, and acceleration data. All distance measurements (width, length, location) were converted from pixels to meters using the conversion factor 1 pixel = 0.3 meters. I294_L1_RunX_with_lanes.png are the aerial reference images that define the geographic region and associated roadway segments of interest (see bounding boxes on northbound and southbound lanes) for each run X.
The I-294-L1-Run_X-geometry-with-ramps.csv files contain the coordinates that define the lane centerlines.
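    Applying the stated 1 pixel = 0.3 m conversion can be sketched as below; the column names are hypothetical, not the file's actual headers (see the data dictionary shipped with the dataset):

```python
import pandas as pd

PIXELS_TO_METERS = 0.3  # conversion factor stated in the description

# Hypothetical rows mimicking the trajectory file's pixel-based measurements.
df = pd.DataFrame({"width_px": [6.0, 8.0], "length_px": [15.0, 40.0]})

# Convert pixel measurements to meters column by column.
df["width_m"] = df["width_px"] * PIXELS_TO_METERS
df["length_m"] = df["length_px"] * PIXELS_TO_METERS
```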

  9. MASBA: A Large-Scale Dataset for Multi-Level Abstractive Summarization of Bangla Articles

    • data.mendeley.com
    Updated May 21, 2025
    Cite
    MAHMUDUL HASAN (2025). MASBA: A Large-Scale Dataset for Multi-Level Abstractive Summarization of Bangla Articles [Dataset]. http://doi.org/10.17632/rxhj7g6y2k.3
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    MAHMUDUL HASAN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our research hypothesis is to evaluate the effectiveness of different Bangla text summarization methods compared to the original text ('main'). The data shows that:

    • The average length of the main text is 2482.72 characters.
    • The average lengths of the summaries are:
      • sum1: 293.75 characters,
      • sum2: 506.10 characters,
      • sum3: 688.50 characters.

    The compression ratio of each summary method (summary length divided by main length) reveals that:
    • sum1's mean compression ratio is 0.14,
    • sum2's mean compression ratio is 0.24, and
    • sum3's mean compression ratio is 0.33.
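    A mean compression ratio of this kind is the mean of per-article (summary length ÷ main length) ratios; the character counts below are invented for illustration:

```python
def mean_compression_ratio(summary_lengths, main_lengths):
    # Mean of per-article (summary length / main length) ratios.
    ratios = [s / m for s, m in zip(summary_lengths, main_lengths)]
    return sum(ratios) / len(ratios)

# Hypothetical character counts; the real dataset reports mean ratios of
# 0.14 (sum1), 0.24 (sum2), and 0.33 (sum3).
main_lengths = [2000, 3000]
sum1_lengths = [280, 420]
ratio = mean_compression_ratio(sum1_lengths, main_lengths)
```

    Note that the mean of per-article ratios generally differs from the ratio of the mean lengths, which is why 293.75 / 2482.72 ≈ 0.12 need not equal the reported 0.14.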

    Notable findings:
    • sum1 appears to be the shortest summary on average, with a higher degree of compression.
    • sum2 produces summaries of medium length, while sum3 tends to generate the longest summaries.

    Data Gathering and Interpretation: The data can be interpreted to assess which method produces the most concise, yet meaningful, summaries. Researchers can use these findings to evaluate the trade-offs between summary length and completeness of information conveyed.

  10. LLM Text Generation Dataset

    • unidata.pro
    csv
    Cite
    Unidata L.L.C-FZ, LLM Text Generation Dataset [Dataset]. https://unidata.pro/datasets/llm-text-generation/
    Explore at:
    Available download formats: csv
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    The LLM Text Generation dataset offers multilingual text samples from large language models, enriching AI's natural language understanding.

  11. Synthetic datasets generated by Large Language Models

    • resodate.org
    Updated May 27, 2025
    Cite
    Yanco Amor Torterolo Orta; Sofía Micaela Roseti; Antonio Moreno-Sandoval (2025). Synthetic datasets generated by Large Language Models [Dataset]. http://doi.org/10.21950/YXP8Q8
    Explore at:
    Dataset updated
    May 27, 2025
    Dataset provided by
    Universidad Autónoma de Madrid
    GRESEL-UAM: Narrativas Financieras y Literatura
    Eciencia Data
    Authors
    Yanco Amor Torterolo Orta; Sofía Micaela Roseti; Antonio Moreno-Sandoval
    Description

    This dataset is the result of the work done in the GRESEL-UAM project: AI Generation Results Enriched with Simplified Explanations Based on Linguistic Features (Resultados de Generación de IA Enriquecidos con Explicaciones Simplificadas Basadas en Características Lingüísticas). It is part of the publication "Assessing a Literary RAG System with a Human-Evaluated Synthetic QA Dataset Generated by an LLM: Experiments with Knowledge Graphs," to be presented in September 2025 in Zaragoza at the conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). The work has already been accepted for publication in SEPLN's official journal, Procesamiento del Lenguaje Natural.

    The dataset consists of three synthetically generated datasets, a process known as Synthetic Data Generation (SDG). We used three different LLMs: deepseek-r1:14b, llama3.1:8b-instruct-q8_0, and mistral:7b-instruct. Each was given a prompt instructing it to generate a question answering (QA) dataset based on context fragments from the novel Trafalgar by Benito Pérez Galdós. These datasets were later used to evaluate a Retrieval-Augmented Generation (RAG) system.

    Three CSV files are provided, each corresponding to the synthetic dataset generated by one of the models. In total, the dataset contains 359 items. The header includes the following fields: id, context, question, answer, and success. Fields are separated by tabs. The id column is an identifier number. The context column contains the text fragment from which the model generated the questions and answers. The question and answer fields contain the generated questions and answers, respectively. The success column indicates whether the model successfully generated the question and answer in the corresponding fields ("yes" or "no").
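    Reading a tab-separated file with this header can be sketched as follows; the sample row is invented, not taken from the dataset:

```python
import io
import pandas as pd

# Illustrative tab-separated content following the described header
# (id, context, question, answer, success); the data row is a placeholder.
sample = (
    "id\tcontext\tquestion\tanswer\tsuccess\n"
    "1\tA context fragment from the novel...\tA question?\tAn answer\tyes\n"
)

# The description says fields are separated by tabs, hence sep="\t".
df = pd.read_csv(io.StringIO(sample), sep="\t")
```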

  12. Synthetic Dataset Generation Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 2, 2025
    Cite
    Research Intelo (2025). Synthetic Dataset Generation Market Research Report 2033 [Dataset]. https://researchintelo.com/report/synthetic-dataset-generation-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2025 - 2034
    Area covered
    Global
    Description

    Synthetic Dataset Generation Market Outlook



    According to our latest research, the Synthetic Dataset Generation market size was valued at $1.2 billion in 2024 and is projected to reach $8.7 billion by 2033, expanding at an impressive CAGR of 24.6% during 2024–2033. The primary driving force behind this global expansion is the escalating demand for high-quality, diverse, and bias-free datasets to fuel advanced artificial intelligence (AI) and machine learning (ML) models. As organizations across industries face increasing challenges in acquiring large-scale, annotated, and privacy-compliant real-world data, synthetic dataset generation has emerged as a transformative solution. This technology not only accelerates the development and deployment of AI systems but also addresses critical data privacy, security, and cost constraints, making it indispensable in today’s data-centric economy.



    Regional Outlook



    North America currently holds the largest share of the global synthetic dataset generation market, accounting for over 38% of the total market value in 2024. The region’s dominance is primarily attributed to its mature technology ecosystem, robust investment in AI research, and the early adoption of synthetic data solutions by leading enterprises and tech giants. The presence of major synthetic data vendors, a strong network of academic research institutions, and proactive regulatory guidance on data privacy have collectively accelerated market growth in North America. Furthermore, favorable government policies and funding initiatives aimed at advancing AI innovation continue to foster a thriving environment for synthetic dataset generation, particularly in sectors such as healthcare, finance, and autonomous vehicles.



    Asia Pacific is the fastest-growing region in the synthetic dataset generation market, projected to register a remarkable CAGR of 29.3% from 2024 to 2033. This exceptional growth is driven by increasing digital transformation initiatives, rapid adoption of AI-powered solutions, and significant investments by both public and private sectors. Countries like China, Japan, South Korea, and India are aggressively expanding their AI capabilities, leading to a surge in demand for synthetic data to support machine learning and computer vision applications. The region is witnessing heightened interest from global technology vendors, who are establishing partnerships and R&D centers to tap into the burgeoning opportunities. The proliferation of smart devices, e-commerce, and fintech innovations further amplifies the need for scalable and secure synthetic datasets.



    Emerging economies in Latin America, the Middle East, and Africa are gradually embracing synthetic dataset generation, though adoption remains at an early stage due to infrastructural and regulatory challenges. Localized demand is primarily concentrated in industries such as government, BFSI, and telecommunications, where data privacy and localization policies are stringent. While these regions hold significant potential for future growth, market expansion is currently restrained by limited technical expertise, slower digital infrastructure development, and the need for tailored synthetic data solutions that address unique regional requirements. Nonetheless, increasing awareness, pilot projects, and supportive policy reforms are expected to accelerate adoption in the coming years.



    Report Scope





    Attributes and details:
    • Report Title: Synthetic Dataset Generation Market Research Report 2033
    • By Component: Software, Services
    • By Data Type: Text, Image, Video, Audio, Tabular, Others
    • By Application: Machine Learning, Computer Vision, Natural Language Processing, Data Augmentation, Robotics, Autonomous Vehicles, Healthcare, Finance, Retail, Others
    • By Deployment Mode: On-Premises, Cloud
    • By End

  13. social_bias_frames

    • openml.org
    • huggingface.co
    Updated Apr 12, 2025
    + more versions
    Cite
    Sap, Maarten; Gabriel, Saadia; Qin, Lianhui; Smith, Noah A.; Choi, Yejin (2025). social_bias_frames [Dataset]. https://www.openml.org/d/46823
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 12, 2025
    Authors
    Sap, Maarten; Gabriel, Saadia; Qin, Lianhui; Smith, Noah A.; Choi, Yejin
    Description

    Warning: this document and dataset contain content that may be offensive or upsetting.

    Social Bias Frames is a new way of representing the biases and offensiveness that are implied in language. For example, these frames are meant to distill the implication that "women (candidates) are less qualified" behind the statement "we shouldn't lower our standards to hire more women." The Social Bias Inference Corpus (SBIC) supports large-scale learning and evaluation of social implications with over 150k structured annotations of social media posts, spanning over 34k implications about a thousand demographic groups.

    Supported Tasks and Leaderboards

    This dataset supports both classification and generation. Sap et al. developed several models using the SBIC. They report an F1 score of 78.8 in predicting whether the posts in the test set were offensive, an F1 score of 78.6 in predicting whether the posts were intending to be offensive, an F1 score of 80.7 in predicting whether the posts were lewd, and an F1 score of 69.9 in predicting whether the posts were targeting a specific group.

    Another of Sap et al.'s models performed better on the generation task. They report a BLEU score of 77.9, a Rouge-L score of 68.7, and a WMD score of 0.74 in generating a description of the targeted group given a post, as well as a BLEU score of 52.6, a Rouge-L score of 44.9, and a WMD score of 2.79 in generating a description of the implied offensive statement given a post. See the paper for further details.

    Languages

    The language in SBIC is predominantly white-aligned English (78%, using a lexical dialect detector, Blodgett et al., 2016). The curators find that less than 10% of posts in SBIC are detected as having the AAE dialect category. The BCP-47 language tag is, presumably, en-US.

    The main aim of this dataset is to cover a wide variety of social biases that are implied in text, both subtle and overt, and to make the biases representative of real-world discrimination that people experience (RWJF 2017). The curators also included some innocuous statements, to balance out biased, offensive, or harmful content.

    Source Data

    The curators included online posts from the following sources, collected sometime between 2014-2019:

    • r/darkJokes, r/meanJokes, r/offensiveJokes
    • Reddit microaggressions (Breitfeller et al., 2019)
    • Toxic language detection Twitter corpora (Waseem & Hovy, 2016; Davidson et al., 2017; Founa et al., 2018)
    • Data scraped from hate sites (Gab, Stormfront, r/incels, r/mensrights)

    Columns:
    • whoTarget: a string; '0.0' if the target is a group, '1.0' if the target is an individual, and blank if the post is not offensive
    • intentYN: a string indicating if the intent behind the statement was to offend. Categorical with four possible answers: '1.0' yes, '0.66' probably, '0.33' probably not, '0.0' no
    • sexYN: a string indicating whether the post contains a sexual or lewd reference. Categorical with three possible answers: '1.0' yes, '0.5' maybe, '0.0' no
    • sexReason: a string containing a free-text explanation of what is sexual if so indicated, blank otherwise
    • offensiveYN (target): a string indicating if the post could be offensive to anyone. Categorical with three possible answers: '1.0' yes, '0.5' maybe, '0.0' no
    • annotatorGender: a string indicating the gender of the MTurk worker
    • annotatorMinority: a string indicating whether the MTurk worker identifies as a minority
    • sexPhrase: a string indicating which part of the post references something sexual, blank otherwise
    • speakerMinorityYN: a string indicating whether the speaker was part of the same minority group being targeted. Categorical with three possible answers: '1.0' yes, '0.5' maybe, '0.0' no
    • WorkerId: a hashed version of the MTurk workerId
    • HITId: a string id that uniquely identifies each post
    • annotatorPolitics: a string indicating the political leaning of the MTurk worker
    • annotatorRace: a string indicating the race of the MTurk worker
    • annotatorAge: a string indicating the age of the MTurk worker
    • post: a string containing the text of the post that was annotated
    • targetMinority: a string indicating the demographic group targeted
    • targetCategory: a string indicating the high-level category of the demographic group(s) targeted
    • targetStereotype: a string containing the implied statement
    • dataSource: a string indicating the source of the post (t/... means Twitter, r/... means a subreddit)
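    Since the categorical columns store their labels as strings ('1.0', '0.5', '0.0', or blank), a hedged sketch of coercing them to numbers, using invented rows that follow the described convention:

```python
import pandas as pd

# Hypothetical rows using the string-coded labels described above.
df = pd.DataFrame({"offensiveYN": ["1.0", "0.5", "0.0", ""]})

# Labels are stored as strings; coerce to floats, with blank -> NaN.
df["offensiveYN_num"] = pd.to_numeric(df["offensiveYN"], errors="coerce")
```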

    paper_url = "https://aclanthology.org/2020.acl-main.486.pdf"

    original_data_url = "https://huggingface.co/datasets/allenai/social_bias_frames"

  14. UNISOLAR Solar Power Generation Dataset

    • kaggle.com
    zip
    Updated Nov 9, 2022
    Cite
    CDAClab (2022). UNISOLAR Solar Power Generation Dataset [Dataset]. https://www.kaggle.com/datasets/cdaclab/unisolar
    Explore at:
    zip(15462044 bytes)Available download formats
    Dataset updated
    Nov 9, 2022
    Authors
    CDAClab
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The UNISOLAR dataset contains high-granularity photovoltaic (PV) solar energy generation, solar irradiance, and weather data from 42 PV sites deployed across five campuses of La Trobe University, Victoria, Australia. The dataset includes approximately two years of PV solar energy generation data collected at 15-minute intervals. Geographical placement and engineering specifications for each site are also provided to aid researchers in modelling solar energy generation. Weather data, provided by the Australian Bureau of Meteorology (BOM), is available at 1-minute intervals and comprises apparent temperature, air temperature, dew point temperature, relative humidity, wind speed, and wind direction. The paper describes the data collection methods, cleaning, and merging with weather data. This dataset can be used to forecast solar generation, benchmark models, and enhance operational outcomes at solar sites.
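    Aligning the 1-minute weather readings with the 15-minute PV generation intervals is a typical first step with this kind of data; a minimal pandas sketch with synthetic values (the column name `air_temp` is hypothetical, not the dataset's actual schema):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one hour of 1-minute weather readings, as described
# above. The column name `air_temp` is hypothetical.
rng = np.random.default_rng(0)
weather = pd.DataFrame(
    {"air_temp": rng.normal(20.0, 2.0, 60)},
    index=pd.date_range("2021-01-01 10:00", periods=60, freq="min"),
)

# Average each 15-minute window so the weather series lines up with the
# 15-minute PV generation intervals.
weather_15min = weather.resample("15min").mean()
print(weather_15min.shape)  # 60 one-minute rows -> 4 fifteen-minute rows
```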

    Acknowledgements

    Please cite the following paper if you use this dataset:

    • S. Wimalaratne, D. Haputhanthri, S. Kahawala, G. Gamage, D. Alahakoon and A. Jennings, "UNISOLAR: An Open Dataset of Photovoltaic Solar Energy Generation in a Large Multi-Campus University Setting," 2022 15th International Conference on Human System Interaction (HSI), 2022, pp. 1-5, doi: 10.1109/HSI55341.2022.9869474.

    Usage Policy and Legal Disclaimer

    This dataset is distributed for research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0). By clicking on the download button(s) below, you agree to use this data only for non-commercial, research, or academic applications. Please cite the above paper if you use this dataset.

    Github: https://github.com/CDAC-lab/UNISOLAR

  15. Next Generation Simulation (NGSIM) Program Lankershim Boulevard Videos

    • data.virginia.gov
    • data.es.virginia.gov
    • +10more
    pdf
    Updated Jan 1, 2016
    Cite
    U.S. Department of Transportation (2016). Next Generation Simulation (NGSIM) Program Lankershim Boulevard Videos [Dataset]. https://data.virginia.gov/dataset/next-generation-simulation-ngsim-program-lankershim-boulevard-videos
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jan 1, 2016
    Dataset provided by
    US Department of Transportation
    Authors
    U.S. Department of Transportation
    Area covered
    Lankershim Boulevard
    Description

    As part of the Federal Highway Administration’s (FHWA) Next Generation Simulation (NGSIM) project, video data were collected on June 16th, 2005 on an arterial segment of Lankershim Boulevard in Los Angeles, California. The data cover 30 minutes in total, segmented into two periods (8:30 a.m. to 8:45 a.m. and 8:45 a.m. to 9:00 a.m.). The dataset includes both raw and processed video files from each of the five cameras for the two time periods. Camera numbering runs from southern-most (1) to northern-most (5). The raw videos give the original vehicle movement data and offer users a view of how the section was observed. The processed videos show the vehicles with their vehicle identification numbers superimposed. These videos can be viewed on their own or used to cross-reference the textual vehicle trajectory data in the NGSIM trajectory dataset with the corresponding video.

    For related datasets please see the following:
    • NGSIM Vehicle Trajectories and Supporting Data: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Vehicle-Trajector/8ect-6jqj
    • NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny
    • NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur
    • NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf

  16. Which social media platforms are most popular

    • pewresearch.org
    csv
    Updated Feb 2, 2026
    Cite
    Pew Research Center (2026). Which social media platforms are most popular [Dataset]. https://www.pewresearch.org/internet/fact-sheet/social-media/
    Explore at:
    csvAvailable download formats
    Dataset updated
    Feb 2, 2026
    Dataset authored and provided by
    Pew Research Centerhttp://pewresearch.org/
    License

    https://www.pewresearch.org/terms-and-conditions/

    Description

    A line chart that shows % of U.S. adults who say they ever use …

  17. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark for evaluating the ability of language models to generate knowledge graphs (KGs) from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and remaining faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    The structure of the repository is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
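    The compliance requirement described in this entry — every extracted triple must use a relation defined by the input ontology — can be checked mechanically. A minimal sketch (the relation set is a hypothetical fragment of the music ontology, not the benchmark's actual ontology file):

```python
# Check that extracted triples only use relations defined by the ontology,
# per the task definition above. The relation set here is a hypothetical
# fragment of the music ontology.
ALLOWED_RELATIONS = {"publication date", "lyrics by", "composed by"}

def ontology_compliant(triples: list) -> bool:
    """Return True if every triple's relation is defined in the ontology."""
    return all(t["rel"] in ALLOWED_RELATIONS for t in triples)

triples = [
    {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
]
print(ontology_compliant(triples))  # True
```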

  18. Dataset of 30 energy customers with flexibility data, and distributed...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Apr 1, 2024
    Cite
    Pereira, Helder; Gomes, Luis; Morais, Hugo; Vale, Zita (2024). Dataset of 30 energy customers with flexibility data, and distributed generation, considering residential, small commerce, large commerce, and industrial customers [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6783288
    Explore at:
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    Polytechnic of Porto
    INESC-ID
    Authors
    Pereira, Helder; Gomes, Luis; Morais, Hugo; Vale, Zita
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset covers 30 customers: ten residential, ten small commerce, five large commerce, and five industrial. Combining several customer types yields a dataset with different consumption profiles, generation, and flexibility, and therefore different levels of participation in demand response events.

    The residential profiles of the considered customers use the data available in the Working Group on Intelligent Data Mining and Analysis (IDMA): https://site.ieee.org/pes-iss/data-sets/

    The values cover a one-week period at 15-minute reading intervals. All values are expressed in kWh, and the matrices are structured as [customer x time_period].
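    One week at 15-minute resolution gives 7 × 24 × 4 = 672 periods, so each [customer x time_period] matrix is 30 × 672. A sketch with synthetic values (the consumption figures are made up, only the shape follows the description above):

```python
import numpy as np

# A week at 15-minute resolution: 7 days * 24 hours * 4 periods/hour = 672.
n_customers, n_periods = 30, 7 * 24 * 4

# Synthetic stand-in for a [customer x time_period] consumption matrix in kWh;
# the values are made up, only the shape follows the dataset description.
consumption = np.random.default_rng(1).uniform(0.0, 5.0, (n_customers, n_periods))
print(consumption.shape)              # (30, 672)
print(consumption.sum(axis=1).shape)  # weekly total per customer: (30,)
```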

    We would be grateful if you could acknowledge the use of this dataset in your publications. Please use the Zenodo publication to cite this work.

  19. Data generation volume worldwide 2010-2029

    • statista.com
    Updated Nov 19, 2025
    Cite
    Statista (2025). Data generation volume worldwide 2010-2029 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes. Thus, global data generation will triple between 2025 and 2029. Data creation has been expanding continuously over the past decade. In 2020, the growth was higher than previously expected, caused by the increased demand due to the coronavirus (COVID-19) pandemic, as more people worked and learned from home and used home entertainment options more often.

  20. PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Mar 17, 2025
    Cite
    Novack, Zachary (2025). PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13763755
    Explore at:
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Novack, Zachary
    McAuley, Julian
    Berg-Kirkpatrick, Taylor
    Long, Phillip
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing. Refer to our paper for more information, and our GitHub repository for any code-related details. Please cite both our paper and our collaborators' paper if you use this dataset (see our GitHub for more information).

    Upon further use of the PDMX dataset, we discovered a discrepancy between the public-facing copyright metadata on the MuseScore website and the internal copyright data of the MuseScore files themselves, which affects 31,221 (12.29%) of the songs. We have decided to proceed with the former, given its public visibility (i.e., this is what the MuseScore website presents to its users). Files with conflicting internal licenses are flagged in the license_conflict column of PDMX. We recommend using the no_license_conflict subset of PDMX (which still includes 222,856 songs) moving forward.

    Additionally, for each song in PDMX, we not only provide the MusicRender and metadata JSON files, but we also try to include the associated compressed MusicXML (MXL), sheet music (PDF), and MIDI (MID) files when available. Due to the corruption of 42 of the original MuseScore files, these songs lack those associated files (since they could not be converted to those formats) and only include the MusicRender and metadata JSON files. The all_valid subset of PDMX describes the songs where all associated files are valid.
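    The recommended filtering above can be expressed as a simple pandas selection. A sketch assuming boolean license_conflict and all_valid columns in a metadata table (the table contents here are synthetic; the exact column types in PDMX may differ):

```python
import pandas as pd

# Toy stand-in for the PDMX metadata table; the column names follow the
# description above, but the rows are synthetic.
pdmx = pd.DataFrame({
    "song_id": [1, 2, 3, 4],
    "license_conflict": [False, True, False, False],
    "all_valid": [True, True, False, True],
})

# Recommended subset: songs without conflicting internal licenses.
no_license_conflict = pdmx[~pdmx["license_conflict"]]

# Songs whose associated MXL/PDF/MID files all converted successfully.
all_valid = no_license_conflict[no_license_conflict["all_valid"]]
print(len(no_license_conflict), len(all_valid))  # 3 2
```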
