11 datasets found
  1. MOSTLY AI Prize Data

    • kaggle.com
    zip
    Updated May 16, 2025
    Cite
    ivonaK (2025). MOSTLY AI Prize Data [Dataset]. https://www.kaggle.com/datasets/ivonav/mostly-ai-prize-data/code
    Available download formats: zip (9,871,594 bytes)
    Dataset updated
    May 16, 2025
    Authors
    ivonaK
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Competition

    • Generate the BEST tabular synthetic data and win 100,000 USD in cash.
    • Competition runs for 50 days: May 14 - July 3, 2025.
    • MOSTLY AI Prize

    This competition features two independent synthetic data challenges that you can join separately:

    • The FLAT DATA Challenge
    • The SEQUENTIAL DATA Challenge

    For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.

    Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.

    Timeline

    • Submissions open: May 14, 2025, 15:30 UTC
    • Submission credits: 3 per calendar week (+bonus)
    • Submissions close: July 3, 2025, 23:59 UTC
    • Evaluation of Leaders: July 3 - July 9
    • Winners announced: on July 9 🏆

    Datasets

    Flat Data - 100,000 records - 80 data columns: 60 numeric, 20 categorical

    Sequential Data - 20,000 groups - each group contains 5-10 records - 10 data columns: 7 numeric, 3 categorical

    Evaluation

    • CSV submissions are parsed using pandas.read_csv() and checked for expected structure & size
    • Evaluated using the Synthetic Data Quality Assurance toolkit
    • Compared against the released training set and a hidden holdout set (same size, non-overlapping, from the same source)
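The parsing and structure check can be approximated locally before spending a submission credit. A minimal sketch, assuming the flat-data shape (100,000 rows × 80 columns); the helper and its checks are illustrative, not the organizers' actual validator:

```python
import io
import pandas as pd

def check_flat_submission(df, n_rows=100_000, n_cols=80):
    """Return a list of structural problems; an empty list means the basic checks pass."""
    problems = []
    if len(df) != n_rows:
        problems.append(f"expected {n_rows} rows, got {len(df)}")
    if df.shape[1] != n_cols:
        problems.append(f"expected {n_cols} columns, got {df.shape[1]}")
    return problems

# Stand-in submission, parsed the same way the competition does: pandas.read_csv().
df = pd.read_csv(io.StringIO("c1,c2\n1,a\n2,b\n"))
print(check_flat_submission(df))  # two problems here: wrong row count, wrong column count
```

Running this against your real CSV before uploading catches shape mistakes without using a credit.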

    Submission

    MOSTLY AI Prize

    Citation

    If you use this dataset in your research, please cite:

    @dataset{mostlyaiprize,
     author = {MOSTLY AI},
     title = {MOSTLY AI Prize Dataset},
     year = {2025},
     url = {https://www.mostlyaiprize.com/},
    }
    
  2. Software Observability Dataset

    • kaggle.com
    zip
    Updated Aug 5, 2025
    Cite
    Soham Patel (2025). Software Observability Dataset [Dataset]. https://www.kaggle.com/datasets/sohamphdresearch/software-observability-dataset
    Available download formats: zip (273,725,617 bytes)
    Dataset updated
    Aug 5, 2025
    Authors
    Soham Patel
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Observability data, consisting of structured logs, metrics, and traces, is swiftly emerging as the foundation of advanced DevOps methodologies, facilitating real-time understanding of system health and performance.

    Significance of Data Collected During Dataset Development

    • Authenticity and Representativeness: The synthetic dataset created for this study accurately represents realistic runtime events, encompassing successful operations, temporary failures, and severe exceptions across several service components and programming languages (Java, Python, Go, etc.). The dataset simulates real-world logging diversity by integrating various log formats: structured (JSON), semi-structured (logfmt/bracketed), and unstructured (console failures, stack traces), enhancing the model's robustness and transferability.

    • Multilingual and Multi-Component Scope: Logs were deliberately annotated with language identifiers (e.g., Python, JavaScript, C#) and microservice names (AuthService, OrderProcessor), enabling Custom ChatGPT to discern correlations between language-specific issue patterns and their likely causes or potential solutions. This renders the dataset particularly helpful in multilingual contexts.

    • Introduction of Edge Cases and Anomalies: To ensure meaningful interaction from the model, the dataset incorporates edge scenarios such as:

    null pointer dereferences, timeout exceptions, memory errors, and invalid user session tokens. These anomalies were systematically introduced to cover many failure patterns, allowing GPT to formulate reasoning for specific test generation.

    • Structured Data for Optimization of Large Language Models: The dataset comprises metadata fields including:

    component, language, severity level, timestamp, and session/user identification. These allow the LLM to perform conditional reasoning, context filtering, and test case relevance scoring, which are essential for prioritization tasks.

    • Customization Without Training: In contrast to conventional ML pipelines that necessitate retraining on this data, our methodology employs the dataset for prompt engineering and functional context embedding, hence maintaining both model efficacy and cost-effectiveness.

    Data Reutilization

    This research utilized an observability dataset specifically crafted for extensive reusability across several dimensions of software quality and artificial intelligence research.

    • Multifunctional Utility: Applicable for anomaly detection, log summarization, root cause analysis, and incident correlation tasks. Optimal for training, assessing, or benchmarking alternative LLMs, anomaly classifiers, or test case generators.

    • Prompt Engineering Repository: Each log pattern, particularly the structured ones, can be repurposed as a component of a prompt template repository, facilitating consistent and scalable evaluation of LLM performance in various failure scenarios.

    • Inter-Project Comparisons: The logs emulate generic service components (authentication, payment processing, API gateway), allowing the dataset to be repurposed across several experiments or projects without being confined to a specific domain. This supports longitudinal research and comparative analyses among various tools or models.

    • Potential of Open Datasets: The artificial nature of the data enables public sharing without concerns regarding privacy or intellectual property, hence fostering repeatability, peer validation, and community contributions.

    • Empirical Testing Investigation: The dataset provides a robust basis for further study domains linked to testing, including: test impact analysis, test flakiness detection, regression test selection models, and failure concentration.

  3. A dataset of 1500-word stories generated by gpt-4o-mini for 236 nationalities

    • search.dataone.org
    • dataverse.no
    • +1more
    Updated May 29, 2025
    Cite
    Rettberg, Jill Walker; Wigers, Hermann (2025). A dataset of 1500-word stories generated by gpt-4o-mini for 236 nationalities [Dataset]. http://doi.org/10.18710/VM2K4O
    Dataset updated
    May 29, 2025
    Dataset provided by
    DataverseNO
    Authors
    Rettberg, Jill Walker; Wigers, Hermann
    Description

    We created a dataset of stories generated by OpenAI’s gpt-4o-mini by using a Python script to construct prompts that were sent to the OpenAI API. We used Statistics Norway’s list of 252 countries, added demonyms for each country (for example, Norwegian for Norway), and removed countries without demonyms, leaving us with 236 countries. Our base prompt was “Write a 1500 word potential {demonym} story”, and we generated 50 stories for each country. The scripts used to generate the data, and additional scripts for analysis, are available at the GitHub repository https://github.com/MachineVisionUiB/GPT_stories
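The prompt construction described above can be sketched in a few lines; the demonym list here is a tiny illustrative subset of the 236, and the real scripts live in the GitHub repository:

```python
# Tiny illustrative subset of the 236 demonyms used in the study.
demonyms = ["Norwegian", "Japanese", "Brazilian"]
STORIES_PER_COUNTRY = 50
BASE_PROMPT = "Write a 1500 word potential {demonym} story"

def build_prompts(demonyms, stories_per_country):
    """Yield one (demonym, prompt) pair per requested story."""
    for demonym in demonyms:
        prompt = BASE_PROMPT.format(demonym=demonym)
        for _ in range(stories_per_country):
            yield demonym, prompt

prompts = list(build_prompts(demonyms, STORIES_PER_COUNTRY))
print(len(prompts))    # 3 countries x 50 stories each = 150 prompts
print(prompts[0][1])   # Write a 1500 word potential Norwegian story
```

Each generated prompt would then be sent to the API; the response handling is omitted here.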

  4. AI Assistant Usage in Student Life

    • kaggle.com
    Updated Jun 25, 2025
    Cite
    Ayesha Saleem (2025). AI Assistant Usage in Student Life [Dataset]. https://www.kaggle.com/datasets/ayeshasal89/ai-assistant-usage-in-student-life-synthetic/code
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ayesha Saleem
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description
    If you find this dataset useful, a quick upvote would be greatly appreciated 🙌 It helps more learners discover it!
    

    AI Assistant Usage in Student Life

    Explore how students at different academic levels use AI tools like ChatGPT for tasks such as coding, writing, studying, and brainstorming. Designed for learning, EDA, and ML experimentation.

    What is this dataset?

    This dataset simulates 10,000 sessions of students interacting with an AI assistant (like ChatGPT or similar tools) for various academic tasks. Each row represents a single session, capturing the student’s level, discipline, type of task, session length, AI effectiveness, satisfaction rating, and whether they reused the AI tool later.

    Why was this dataset created?

    As AI tools become mainstream in education, there's a need to analyze and model how students interact with them. However, no public datasets exist for this behavior. This dataset fills that gap by providing a safe, fully synthetic yet realistic simulation for:

    • EDA and visualization practice
    • Machine learning modeling
    • Feature engineering workflows
    • Educational data science exploration

    It’s ideal for students, data science learners, and researchers who want real-world use cases without privacy or copyright constraints.

    How is the dataset structured?

    Column | Description
    SessionID | Unique session identifier
    StudentLevel | Academic level: High School, Undergraduate, Graduate
    Discipline | Student’s field of study (e.g., CS, Psychology)
    SessionDate | Date of the session
    SessionLengthMin | Length of AI interaction in minutes
    TotalPrompts | Number of prompts/messages used
    TaskType | Nature of the task (e.g., Coding, Writing, Research)
    AI_AssistanceLevel | 1–5 scale on how helpful the AI was perceived to be
    FinalOutcome | What the student achieved: Assignment Completed, Idea Drafted, etc.
    UsedAgain | Whether the student returned to use the assistant again
    SatisfactionRating | 1–5 rating of overall satisfaction with the session

    All data is synthetically generated using controlled distributions, real-world logic, and behavioral modeling to reflect realistic usage patterns.

    Possible Use Cases

    This dataset is rich with potential for:

    • EDA: Visualize session behavior across levels, tasks, or disciplines
    • Classification: Predict likelihood of reuse (UsedAgain) or final outcome
    • Regression: Model satisfaction or session length based on context
    • Clustering: Segment students by AI interaction behavior
    • Feature engineering practice: Derive prompt density, session efficiency, or task difficulty
    • Survey-style analysis: Discover what makes students satisfied or frustrated
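The UsedAgain classification task can be sketched with scikit-learn. The frame below is a tiny synthetic stand-in that reuses the dataset's column names; in practice you would load the Kaggle CSV with pd.read_csv instead:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in using the dataset's column names.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "SessionLengthMin": rng.uniform(5, 120, n),
    "TotalPrompts": rng.integers(1, 40, n),
    "AI_AssistanceLevel": rng.integers(1, 6, n),
    "SatisfactionRating": rng.integers(1, 6, n),
})
# Make reuse loosely depend on satisfaction so the model has signal to find.
df["UsedAgain"] = (df["SatisfactionRating"] + rng.normal(0, 1, n) > 3).astype(int)

X = df[["SessionLengthMin", "TotalPrompts", "AI_AssistanceLevel", "SatisfactionRating"]]
y = df["UsedAgain"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"holdout accuracy: {accuracy:.2f}")
```

The same split-fit-score pattern applies directly to predicting FinalOutcome or modeling SatisfactionRating as a regression target.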

    Key Features

    • Clean and ready-to-use CSV
    • Balanced and realistic distributions
    • No missing values
    • Highly relatable academic context
  5. Medical Symptom Data for AI Diagnostics (5,000 Ent

    • kaggle.com
    zip
    Updated Apr 16, 2025
    Cite
    BoffinBot (2025). Medical Symptom Data for AI Diagnostics (5,000 Ent [Dataset]. https://www.kaggle.com/datasets/boffinbot/disease-prediction-dataset
    Available download formats: zip (228,149 bytes)
    Dataset updated
    Apr 16, 2025
    Authors
    BoffinBot
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a synthetic collection designed for training and evaluating machine learning models, particularly for disease prediction tasks. It contains 5,000 records, with approximately 625 samples per disease, covering eight common medical conditions: Common Cold, Malaria, Cough, Asthma, Normal Fever, Body Ache, Runny Nose, and Dengue. Each entry includes five symptom features—Fever (in °F), Headache, Cough, Fatigue, and Body Pain (all on a 0-10 scale)—along with the corresponding disease label.

    Dataset Structure:

    Columns:
    • Fever (float, 95-105 °F)
    • Headache (float, 0-10)
    • Cough (float, 0-10)
    • Fatigue (float, 0-10)
    • Body_Pain (float, 0-10)
    • Disease (string, one of 8 classes)

    Rows: 5,000 (balanced across diseases)
    Format: CSV

    Generation Process:

    The data was synthetically generated using Python (NumPy and Pandas) based on realistic medical correlations. Symptom ranges were defined to reflect typical disease presentations (e.g., high Fever and Fatigue for Dengue, moderate Cough for Common Cold), ensuring variability and usability for model training. The dataset was created to support a hybrid AI project combining Fuzzy Logic and Convolutional Neural Networks (CNN), making it ideal for educational purposes or testing advanced diagnostic algorithms.
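The generation process can be sketched as follows. The per-disease symptom ranges below are illustrative guesses (only two of the eight diseases shown), not the dataset's actual parameters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative per-disease symptom ranges (Fever in °F, others on a 0-10 scale);
# the real dataset's ranges are not published, so these are assumptions.
profiles = {
    "Dengue":      {"Fever": (101, 105), "Headache": (6, 10), "Cough": (0, 3),
                    "Fatigue": (7, 10), "Body_Pain": (7, 10)},
    "Common Cold": {"Fever": (97, 100),  "Headache": (2, 5),  "Cough": (4, 8),
                    "Fatigue": (2, 5),  "Body_Pain": (1, 4)},
}

rows = []
per_disease = 625  # ~5,000 records / 8 diseases
for disease, ranges in profiles.items():
    for _ in range(per_disease):
        row = {col: rng.uniform(lo, hi) for col, (lo, hi) in ranges.items()}
        row["Disease"] = disease
        rows.append(row)

df = pd.DataFrame(rows)
print(df.shape)  # (1250, 6) for the two illustrative diseases
```

Drawing each symptom from a disease-specific range is what gives the classes separable but overlapping distributions for model training.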

    Intended Use:

    • Train supervised learning models (e.g., CNN, Random Forest) for multi-class classification.
    • Develop and test hybrid systems integrating rule-based (Fuzzy Logic) and data-driven (CNN) approaches.
    • Educational projects in healthcare AI, focusing on symptom-based disease prediction.
    • Benchmarking model performance with a controlled, balanced dataset.

    Limitations:

    Synthetic nature means it lacks real-world patient data variability. Designed for five specific symptoms; additional features may require augmentation.

  6. synthetic credit score of thin-file consumers

    • kaggle.com
    zip
    Updated May 12, 2024
    Cite
    Deepa Shukla (2024). synthetic credit score of thin-file consumers [Dataset]. https://www.kaggle.com/datasets/deepashukla/synthetic-credit-score-of-thin-file-consumers/code
    Available download formats: zip (36,416 bytes)
    Dataset updated
    May 12, 2024
    Authors
    Deepa Shukla
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Research Context

    The dataset in question is designed to facilitate a study in the development of machine learning algorithms specifically tailored for credit scoring of "thin-file" consumers. "Thin-file" consumers are individuals who have little to no credit history, which makes traditional credit scoring models less effective or entirely inapplicable. These consumers often face difficulties in accessing credit products because they cannot be easily assessed by standard credit risk evaluation methods.

    Sources

    The data contained in the attached file is synthetically created using Python code. This approach is often employed to generate comprehensive datasets where real data is either unavailable or too sensitive to use for research purposes. Synthetic data generation allows for controlled experiments and analysis by enabling the inclusion of varied and extensive scenarios that might not be represented in real-world data, ensuring both privacy compliance and rich diversity in data attributes.

    Python libraries such as Pandas, NumPy, and Faker are used to create this dataset. These tools help in generating realistic data patterns and distributions, simulating a range of consumer profiles, from those with stable financial behaviors to those with erratic financial histories typical of thin-file scenarios.
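A rough sketch of this kind of synthetic generation with NumPy and Pandas; every field name and distribution below is an illustrative assumption (Faker would typically add realistic names and addresses on top):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1_000

# Illustrative thin-file consumer attributes; the real dataset's schema may differ.
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "monthly_income": rng.lognormal(mean=8.0, sigma=0.5, size=n).round(2),
    "months_of_credit_history": rng.integers(0, 12, n),  # "thin file": under a year
    "utility_payments_on_time_pct": rng.uniform(0.5, 1.0, n).round(3),
    "rent_payments_on_time_pct": rng.uniform(0.5, 1.0, n).round(3),
})

# A toy score built from alternative-data signals, scaled to a 300-850 range.
signal = 0.6 * df["utility_payments_on_time_pct"] + 0.4 * df["rent_payments_on_time_pct"]
df["synthetic_credit_score"] = (300 + 550 * (signal - 0.5) / 0.5).clip(300, 850).round()

print(df["synthetic_credit_score"].describe())
```

Using payment-behavior signals rather than credit-bureau history is the defining design choice for thin-file scoring.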

    Inspiration

    The inspiration behind generating and utilising this dataset is to refine and enhance machine learning models that can effectively score thin-file consumers. This aligns with broader financial inclusivity goals, aiming to bridge the gap in financial services by providing fair credit opportunities to underserved segments of the population. By developing algorithms that can accurately predict creditworthiness in the absence of extensive credit histories, the study aims to propel the financial industry towards more equitable practices.

    This dataset, therefore, serves as a foundational element in a research effort that not only seeks to innovate in the technical realm of machine learning but also to contribute positively to societal progress by enhancing financial inclusion.

  7. Data for T2DM Risk prediction after GDM

    • kaggle.com
    zip
    Updated Sep 26, 2025
    Cite
    Prashanthan Amirthanathan (2025). Data for T2DM Risk prediction after GDM [Dataset]. https://www.kaggle.com/datasets/prashanthana/gdm-risk-data-for-t2dm-prediction
    Available download formats: zip (43,703 bytes)
    Dataset updated
    Sep 26, 2025
    Authors
    Prashanthan Amirthanathan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The synthetic dataset, consisting of 6,000 instances and 29 attributes including the target variable, has moderate to strong correlations, prioritizing severity over demographic factors. It maintains clinical coherence and incorporates plausible noise to enhance realism. The dataset has a binary target variable, T2DM risk, making it suitable for classification analysis. The dataset is already encoded.

    Features are as follows:

    1. Insulin treatment during pregnancy
    2. Pregnancy complications: hypertensive disorders
    3. Pregnancy complications: preterm delivery
    4. Pregnancy complications: PPH
    5. Gestational weight gain
    6. Abnormal oral glucose tolerance test (OGTT) results
    7. Elevated HbA1c during pregnancy
    8. Macrosomia baby/birth weight (Delivered a baby >3.5 kg)
    9. Large for gestational age
    10. Instrumental delivery
    11. Stillbirth/Miscarriage
    12. History of Recurrence of GDM
    13. NICU admission
    14. Perinatal outcome, including 28-day mortality
    15. High pre-pregnancy BMI or overweight status (BMI ≥25 kg/m²)
    16. Older maternal age
    17. Multiparity
    18. Ethnicity
    19. Family history of diabetes
    20. Socioeconomic factors (deprivation quintile)
    21. Presence of T2DM-associated gene variants (e.g., TCF7L2, FTO)
    22. Obesity or unhealthy postpartum weight gain
    23. Physical inactivity
    24. Unhealthy diet
    25. Smoking
    26. Alcohol intake
    27. Does not undergo postpartum glucose screening
    28. Breastfeeding

    This data was used in the research paper:

    Jenifar Prashanthan, Amirthanathan Prashanthan, Predicting the future risk of developing type 2 diabetes in women with a history of gestational diabetes mellitus using machine learning and explainable artificial intelligence,

    https://doi.org/10.1016/j.pcd.2025.09.006.

    Abstract:

    Background and aim: It is essential to identify the risk of developing Type 2 Diabetes Mellitus (T2DM) in women with a history of Gestational Diabetes Mellitus (GDM). This study seeks to create a machine learning (ML) model combined with explainable artificial intelligence (XAI) to predict and explain the risk of T2DM in women with a history of GDM.

    Methods: A literature review found 28 risk factors, including pregnancy-related clinical risk factors, maternal characteristics, genetic risk factors, and lifestyle and modifiable risk factors. A synthetic dataset was generated utilizing subject expertise and clinical experience through Python programming. Various machine learning classification techniques were employed on the data to identify the optimal model, which integrates interpretability approaches (SHAP) to guarantee the transparency of model predictions.

    Results: The developed machine learning model exhibited superior accuracy in predicting the risk of T2DM relative to conventional clinical risk scores, with notable contributions from factors such as insulin treatment during pregnancy, physical inactivity, obesity, breastfeeding, a history of recurrent GDM, an unhealthy diet, and ethnicity. Integrated XAI assists clinicians in comprehending the relevant risk factors and their influence on certain predictive outcomes.

    Conclusions: Machine learning and explainable artificial intelligence provide a comprehensive methodology for individualized risk evaluation in women with a history of gestational diabetes mellitus. This methodology, by integrating extensive real-world data, offers healthcare clinicians actionable insights for early intervention.

    Keywords: Type 2 diabetes mellitus; Gestational diabetes mellitus; Machine learning; Explainable AI; Risk prediction; Personalized healthcare
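The paper explains its model with SHAP. As a library-agnostic sketch of the same interpretability idea, permutation importance ranks features by how much shuffling each one degrades the model; the data here is a random stand-in, not the study's dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500

# Random stand-in for encoded binary risk factors; only feature 0 carries signal.
X = rng.integers(0, 2, size=(n, 5)).astype(float)
y = (X[:, 0] + rng.normal(0, 0.3, n) > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("most important feature index:", ranking[0])
```

SHAP additionally attributes each individual prediction to features, which is what gives clinicians per-patient explanations; permutation importance only gives the global ranking.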

  8. MMM Weekly Data - Geo:India

    • kaggle.com
    zip
    Updated Jul 18, 2025
    Cite
    SubhagatoAdak (2025). MMM Weekly Data - Geo:India [Dataset]. https://www.kaggle.com/datasets/subhagatoadak/mmm-weekly-data-geoindia
    Available download formats: zip (2,463,044 bytes)
    Dataset updated
    Jul 18, 2025
    Authors
    SubhagatoAdak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    India
    Description

    Synthetic India FMCG MMM Dataset (Weekly, 3 Years, Multi-Geo / Multi-Channel)

    Subtitle: 3-Year Weekly Multi-Channel FMCG Marketing Mix Panel for India
    Grain: Week-ending Saturday × Geography × Brand × SKU
    Span: 156 weeks (2 Jul 2022 – 27 Jun 2025)
    Scope: 8 Indian geographies • 3 brands × 3 SKUs each (9 SKUs) • full marketing, trade, price, distribution & macro controls • AI creative quality scores for digital banners.

    This dataset is synthetic but behaviorally realistic, generated to help analysts experiment with Marketing Mix Modeling (MMM), media effectiveness, price/promo analytics, distribution effects, and hierarchical causal inference without using proprietary commercial data.

    Why This Dataset?

    Real MMM training data is rarely public due to confidentiality. This synthetic panel:

    • Mirrors common FMCG (CPG) category dynamics in India (festive spikes, monsoon effects, geo scale differences).
    • Includes paid media channels (TV, YouTube, Facebook, Instagram, Print, Radio).
    • Captures promotions & trade levers (feature, display, temporary price reduction, trade spend).
    • Provides distribution & availability metrics (Weighted Distribution, Numeric Distribution, TDP, NOS).
    • Includes pricing (MRP, Net Price under TPR).
    • Adds macro signals (CPI, GDP, Festival Index, Rainfall Index) aligned to India’s seasonality.
    • Introduces AI Content Scores (Facebook & Instagram banner creative quality) — letting you explore creative × media interaction models.
    • Delivered at a granular panel (Geo × Brand × SKU) suitable for pooled, hierarchical, or Bayesian MMM workflows.

    Files

    File | Description
    synthetic_mmm_weekly_india_SAT.csv | Main dataset. 11,232 rows × 28 columns. Weekly (week-ending Saturday).

    (If you also upload the Monday version, note it clearly and point users to which to use.)

    Quick Start

    import pandas as pd
    
    df = pd.read_csv(
        "/kaggle/input/synthetic-india-fmcg-mmm/synthetic_mmm_weekly_india_SAT.csv",
        parse_dates=["Week"],
    )
    
    df.info()
    df.head()
    

    Aggregate to Geo-Brand Weekly

    geo_brand = (
      df.groupby(["Week","Geo","Brand"], as_index=False)
       .sum(numeric_only=True)
    )
    

    Create Modeling-Friendly Features

    Example: log-transform sales value, normalize media, build price index.

    import numpy as np
    
    m = geo_brand.copy()
    m["log_sales_val"] = np.log1p(m["Sales_Value"])
    m["price_index"] = m["Net_Price"] / m.groupby(["Geo","Brand"])["Net_Price"].transform("mean")
    
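Another common modeling-friendly feature in MMM is a geometric adstock (carryover) transform of the media columns; a minimal sketch, with the decay rate as a free assumption:

```python
import numpy as np

def geometric_adstock(x, decay=0.5):
    """Carry over a fraction `decay` of each week's media effect into the next week."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    carry = 0.0
    for i, v in enumerate(x):
        carry = v + decay * carry
        out[i] = carry
    return out

# A one-week burst of TV spend decays over the following weeks:
print(geometric_adstock([100, 0, 0, 0], decay=0.5).tolist())  # [100.0, 50.0, 25.0, 12.5]
```

Applied per Geo-Brand group (e.g. with groupby().transform), this captures the lagged effect of TV or digital spend before fitting the MMM.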

    Calendar Notes

    • Week variable = week-ending Saturday (Pandas freq W-SAT).
    • First week: 2022-07-02; last week: 2025-06-27 (depending on 156-week span anchor).
    • To derive a week-start (Sunday) date:

      df["Week_Start"] = df["Week"] - pd.Timedelta(days=6)
      

    Data Dictionary

    Key Dimensions

    Column | Type | Description
    Week | date | Week-ending Saturday timestamp.
    Geo | categorical | 8 rollups: NORTH, SOUTH, EAST, WEST, CENTRAL, NORTHEAST, METRO_DELHI, METRO_MUMBAI.
    Brand | categorical | BrandA / BrandB / BrandC.
    SKU | categorical | Brand-level SKU IDs (3 per brand).

    Commercial Outcomes

    Column | Type | Notes
    Sales_Units | float | Modeled weekly unit sales after macro, distribution, price, promo & media effects. Lognormal noise added.
    Sales_Value | float | Sales_Units × Net_Price. Use for revenue MMM or ROI analyses.

    Pricing

    Column | Type | Notes
    MRP | float | Baseline list price (per unit). Drifts with CPI & brand positioning.
    Net_Price | float | Effective real...
  9. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Available download formats: zip (1,214,166 bytes)
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

    The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering, and Math).

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

    • id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing tasks or merges with other datasets.
    • topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
    • subtopic: a subtopic of the topic. For instance, within Propulsion there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    • persona: description of the persona used to simulate a user.
    • opening_question: the first question asked by the user to start a conversation with the AI assistant.
    • messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
      • role: the role of the speaker, either user or assistant
      • content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
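A minimal sketch of consuming the messages field; the instance below is invented for illustration, following the field layout described above:

```python
# A made-up AstroChat-style instance, following the documented field layout.
instance = {
    "id": "astrochat-0001",
    "topic": "Propulsion",
    "subtopic": "Electric Propulsion",
    "persona": "Graduate student studying spacecraft propulsion",
    "opening_question": "How does a Hall-effect thruster produce thrust?",
    "messages": [
        {"role": "user", "content": "How does a Hall-effect thruster produce thrust?"},
        {"role": "assistant", "content": "It accelerates ionized propellant with an electric field..."},
    ],
}

def to_plain_transcript(messages):
    """Render a role/content message list into a plain-text transcript, one turn per line."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

print(to_plain_transcript(instance["messages"]))
```

Because messages already follows the role/content convention, it can also be passed directly to a tokenizer's apply_chat_template in the transformers library.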

    Important See the full list of topics and subtopics covered below.

    Metadata

    Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from "Sector I: Questions about the World" of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user personas
    • Defined a set of topics/disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, we defined a set of subtopics to narrow the conversations down to more specific and niche exchanges (see the full list below)
    • For each subtopic, we generated a set of opening questions that the user could ask to start a conversation (see the full list below)
    • We then distil the knowledge of a strong chat model (in our case, ChatGPT through the API with the gpt-4-turbo model) to generate the answers to the opening questions
    • We simulate follow-up questions from the user to the assistant, and the assistant's answers to these questions, which builds up the messages.

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc.)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter out questions and conversations that are too similar
    • Ask topic and subtopic experts to validate the generated conversations, to get a sense of how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have feedback or spotted an error?

    Use the ...

  10. Pavement Dataset

    • kaggle.com
    zip
    Updated May 24, 2025
    Cite
    Gifrey Sulay (2025). Pavement Dataset [Dataset]. https://www.kaggle.com/datasets/gifreysulay/pavement-dataset/discussion?sort=undefined
    Available download formats: zip (20,890,601 bytes)
    Dataset updated
    May 24, 2025
    Authors
    Gifrey Sulay
    License

    CDLA Permissive 1.0: https://cdla.io/permissive-1-0/

    Description

    🏗️ Pavement Condition Monitoring and Maintenance Prediction

    📘 Scenario

    You are a data analyst for a city engineering office tasked with identifying which road segments require urgent maintenance. The office has collected inspection data on various roads, including surface conditions, traffic volume, and environmental factors.

    Your goal is to analyze this data and build a binary classification model to predict whether a given road segment needs maintenance, based on pavement and environmental indicators.

    🔍 Target Variable: Needs_Maintenance

    This binary label indicates whether the road segment requires immediate maintenance, defined by the following rule:

    • Needs_Maintenance = 1
    • Needs_Maintenance = 0 otherwise

    🎯 Learning Objectives

    • Perform exploratory data analysis (EDA) on civil engineering infrastructure data
    • Engineer features relevant to road quality and maintenance
    • Build and evaluate a binary classification model using Python
    • Interpret model results to support maintenance prioritization decisions

    📊 Dataset Features

    Column Name: Description
    Segment ID: Unique identifier for the road segment
    PCI: Pavement Condition Index (0 = worst, 100 = best)
    Road Type: Type of road (Primary, Secondary, Barangay)
    AADT: Average Annual Daily Traffic
    Asphalt Type: Asphalt mix classification (e.g. Dense, Open-graded, SMA)
    Last Maintenance: Year of the last major maintenance
    Average Rainfall: Average annual rainfall in the area (mm)
    Rutting: Depth of rutting (mm)
    IRI: International Roughness Index (m/km)
    Needs Maintenance: Target label, 1 if urgent maintenance is needed, 0 otherwise

    🎓 Final Exam Task (For Students)

    Using this 1,050,000-row dataset, perform at least five (5) distinct observations. An observation may combine one or more of the following:

    • Plots using Matplotlib or Seaborn
    • Tables or summary statistics using Pandas
    • Numerical calculations using NumPy
    • Grouped analyses, cross-tabulations, or pivot tables

    You may consult official documentation online (e.g., pandas.pydata.org, matplotlib.org, seaborn.pydata.org, numpy.org), but NO AI-assisted tools or generative models are permitted, not even for code snippets or data exploration.

    What counts as an “Observation”

    1. Distribution Insight

      • E.g. plot the distribution of IRI and comment on its skewness.
    2. Correlation or Relationship

      • E.g. scatterplot of Rutting vs. Average Rainfall, plus calculation of Pearson or Spearman correlation.
    3. Group Comparison

      • E.g. pivot table of mean AADT by Road Type and a bar chart.
    4. Derived Feature Analysis

      • E.g. create decay = Rutting / Last Maintenance, then describe its summary statistics and plot.
    5. Conditional Probability or Rate

      • E.g. compute the proportion of Needs Maintenance = 1 within each Road Type and visualize it as a line plot.
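
    For instance, observation types 3 and 5 above can be sketched with Pandas on a tiny stand-in sample. The column names follow the feature table earlier in this description; adjust them to match the actual CSV headers:

```python
import pandas as pd

# Small inline sample standing in for the real dataset; column names are
# taken from the feature table and may differ slightly in the actual file.
df = pd.DataFrame({
    "Road Type": ["Primary", "Primary", "Secondary", "Barangay", "Barangay"],
    "AADT": [12000, 15000, 6000, 800, 1200],
    "Needs Maintenance": [1, 0, 1, 0, 1],
})

# Group Comparison: mean traffic volume per road type
mean_aadt = df.pivot_table(values="AADT", index="Road Type", aggfunc="mean")

# Conditional Rate: maintenance rate within each road type
maint_rate = df.groupby("Road Type")["Needs Maintenance"].mean()

print(mean_aadt)
print(maint_rate)
```

    Each such calculation, paired with a plot and a short written interpretation, counts as one observation.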

    You must deliver:

    • A Jupyter Notebook containing at least five well-labeled observations, each with a title, code cell(s), output (plot/table), and a short interpretation (2–4 sentences).
    • No AI tools: all code must be handwritten or copied from official docs/examples; do not use ChatGPT, Copilot, or similar.
    • Set your random seeds where appropriate to ensure reproducibility.
  11. Satellite telemetry data anomaly prediction

    • kaggle.com
    zip
    Updated Apr 17, 2025
    + more versions
    Cite
    Orvile (2025). Satellite telemetry data anomaly prediction [Dataset]. https://www.kaggle.com/datasets/orvile/satellite-telemetry-data-anomaly-prediction
    Explore at:
    zip(2084669 bytes)Available download formats
    Dataset updated
    Apr 17, 2025
    Authors
    Orvile
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OPSSAT-AD - anomaly detection dataset for satellite telemetry

    This is the AI-ready benchmark dataset (OPSSAT-AD) containing telemetry data acquired on board OPS-SAT, a CubeSat mission operated by the European Space Agency.

    It is accompanied by a paper with baseline results obtained using 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. These were trained and validated using the training-test split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when comparing new anomaly detection algorithms on OPSSAT-AD. We believe this work is an important step toward a fair, reproducible, and objective validation procedure that can quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.

    The included files are:

    • segments.csv with the telemetry signals acquired from the ESA OPS-SAT spacecraft,
    • dataset.csv with the synthetic features extracted for each manually split and labeled telemetry segment,
    • code files for data processing and example modeling (dataset_generator.ipynb for data processing, modeling_examples.ipynb with simple examples, requirements.txt with details on the Python configuration, and the LICENSE file).
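
    As a minimal illustration (not one of the paper's 30 baselines), a z-score detector can flag outliers in a telemetry segment. The column names below are assumptions for the sketch, not the actual segments.csv schema:

```python
import numpy as np
import pandas as pd

# Toy telemetry segment with one injected spike; the real data schema
# should be taken from segments.csv, not from this example.
rng = np.random.default_rng(0)
values = rng.normal(loc=20.0, scale=1.0, size=200)
values[50] = 35.0  # injected anomaly
segment = pd.DataFrame({"timestamp": np.arange(200), "value": values})

# Simple z-score detector: flag samples far from the segment mean.
z = (segment["value"] - segment["value"].mean()) / segment["value"].std()
segment["anomaly"] = z.abs() > 4.0

print(int(segment["anomaly"].sum()))  # the injected spike is flagged
```

    The notebooks shipped with the dataset (modeling_examples.ipynb) demonstrate the intended, more complete workflows.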
    

    Citation

    Ruszczak, B. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15108715

  12. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Cite
ivonaK (2025). MOSTLY AI Prize Data [Dataset]. https://www.kaggle.com/datasets/ivonav/mostly-ai-prize-data/code

MOSTLY AI Prize Data

Training datasets for the MOSTLY AI Prize on tabular synthetic data generation

Explore at:
zip(9871594 bytes)Available download formats
Dataset updated
May 16, 2025
Authors
ivonaK
License

Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically

Description

Competition

  • Generate the BEST tabular synthetic data and win 100,000 USD in cash.
  • Competition runs for 50 days: May 14 - July 3, 2025.
  • MOSTLY AI Prize

This competition features two independent synthetic data challenges that you can join separately:

  • The FLAT DATA Challenge
  • The SEQUENTIAL DATA Challenge

For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.

Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
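
The "no closer to training than to holdout" requirement can be illustrated with a nearest-record distance check. This is a sketch on random data, not the competition's actual evaluation code:

```python
import numpy as np

# Synthetic rows should not sit systematically closer to training rows
# than to holdout rows; a share near 0.5 suggests no memorization.
rng = np.random.default_rng(42)
train = rng.normal(size=(500, 5))
holdout = rng.normal(size=(500, 5))
synthetic = rng.normal(size=(200, 5))

def nearest_distance(points, reference):
    # For each point, Euclidean distance to its closest reference row.
    diffs = points[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

d_train = nearest_distance(synthetic, train)
d_holdout = nearest_distance(synthetic, holdout)

# Share of synthetic rows closer to training than to holdout.
share = (d_train < d_holdout).mean()
print(round(share, 2))
```

A share far above 0.5 would indicate the model copies released samples rather than generalizing.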

Timeline

  • Submissions open: May 14, 2025, 15:30 UTC
  • Submission credits: 3 per calendar week (+bonus)
  • Submissions close: July 3, 2025, 23:59 UTC
  • Evaluation of Leaders: July 3 - July 9
  • Winners announced: on July 9 🏆

Datasets

Flat Data
  • 100,000 records
  • 80 data columns: 60 numeric, 20 categorical

Sequential Data
  • 20,000 groups
  • each group contains 5-10 records
  • 10 data columns: 7 numeric, 3 categorical

Evaluation

  • CSV submissions are parsed using pandas.read_csv() and checked for expected structure & size
  • Evaluated using the Synthetic Data Quality Assurance toolkit
  • Compared against the released training set and a hidden holdout set (same size, non-overlapping, from the same source)
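
A submission can be pre-checked locally in the same spirit. This sketch uses a toy 4x3 CSV in place of the real 100,000 x 80 flat-data file; the expected shape and column names are placeholders, not the competition's actual schema:

```python
import io
import pandas as pd

# Expected structure for the toy example (the real flat-data challenge
# expects 100,000 rows and 80 columns).
expected_rows, expected_cols = 4, 3

# Stand-in for reading a submission file with pandas.read_csv()
csv_text = "a,b,c\n1,x,0.5\n2,y,0.1\n3,z,0.9\n4,x,0.2\n"
submission = pd.read_csv(io.StringIO(csv_text))

assert submission.shape == (expected_rows, expected_cols), "unexpected size"
assert list(submission.columns) == ["a", "b", "c"], "unexpected columns"
print("structure OK")
```

Catching size or column mismatches before submitting avoids spending one of the week's limited submission credits on a parse failure.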

Submission

MOSTLY AI Prize

Citation

If you use this dataset in your research, please cite:

@dataset{mostlyaiprize,
 author = {MOSTLY AI},
 title = {MOSTLY AI Prize Dataset},
 year = {2025},
 url = {https://www.mostlyaiprize.com/},
}