Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge
For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
Flat Data
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical

Sequential Data
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical
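The "not significantly closer to the released samples than to the holdout" criterion is commonly checked with a distance-to-closest-record (DCR) comparison. The sketch below is illustrative only, runs on random stand-in data, and is not the competition's official evaluation code:

```python
import numpy as np

def dcr(synthetic, reference):
    """Distance from each synthetic row to its closest record in `reference`."""
    # Pairwise Euclidean distances via broadcasting: shape (n_syn, n_ref)
    d = np.linalg.norm(synthetic[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 5))     # stand-in for the released originals
holdout = rng.normal(size=(1000, 5))   # stand-in for the unreleased holdout
synthetic = rng.normal(size=(500, 5))  # stand-in for a submission

# A well-generalizing generator should land near 0.5: synthetic rows are
# no closer to the training records than to the holdout records.
share_closer_to_train = (dcr(synthetic, train) < dcr(synthetic, holdout)).mean()
print(f"share closer to train: {share_closer_to_train:.2f}")
```

A share far above 0.5 would suggest the generator is memorizing released samples rather than generalizing.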
If you use this dataset in your research, please cite:
@dataset{mostlyaiprize,
author = {MOSTLY AI},
title = {MOSTLY AI Prize Dataset},
year = {2025},
url = {https://www.mostlyaiprize.com/},
}
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Observability data, consisting of structured logs, metrics, and traces, is rapidly emerging as the foundation of modern DevOps practice, enabling real-time insight into system health and performance.
Significance of Data Collected During Dataset Development - Authenticity and Representativeness The synthetic dataset created for this study accurately represents realistic runtime events, encompassing successful operations, transient failures, and severe exceptions across several service components and programming languages (Java, Python, Go, etc.). The dataset simulates real-world logging diversity by integrating various log formats: structured (JSON), semi-structured (logfmt/bracketed), and unstructured (console errors, stack traces), enhancing the model's robustness and transferability.
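For illustration, the three format families might look like the following invented log lines (these are not samples from the dataset):

```python
import json

# Structured: a machine-parseable JSON log line
structured = json.dumps({"level": "ERROR", "service": "AuthService",
                         "msg": "invalid session token", "lang": "python"})

# Semi-structured: logfmt-style key=value pairs
semi_structured = 'level=error service=OrderProcessor msg="timeout after 30s"'

# Unstructured: a raw console stack trace
unstructured = ("Exception in thread main java.lang.NullPointerException\n"
                "    at OrderProcessor.run(OrderProcessor.java:42)")

parsed = json.loads(structured)
print(parsed["service"])  # only the JSON variant parses directly
```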
Multilingual and Multi-Component Scope Logs were deliberately annotated with language identifiers (e.g., Python, JavaScript, C#) and microservice names (AuthService, OrderProcessor), enabling Custom ChatGPT to discern correlations between language-specific issue patterns and their likely causes or potential solutions. This renders the dataset particularly helpful in multilingual contexts.
Introduction of Edge Cases and Anomalies To ensure meaningful model behavior, the dataset incorporates edge scenarios such as:
Null pointer dereferences, timeout exceptions, memory errors, and invalid user session tokens. These anomalies were systematically introduced to cover a broad range of failure patterns, allowing GPT to reason about targeted test generation.
Structured Data for Optimization of Large Language Models The dataset comprises metadata fields including:
component, language, severity level, timestamp, and session/user identification. This allows the LLM to perform conditional reasoning, context filtering, and test case relevance scoring, which are essential for prioritization tasks.
Customization Without Training In contrast to conventional ML pipelines that require retraining on this data, our methodology uses the dataset for prompt engineering and functional context embedding, thus maintaining both model efficacy and cost-effectiveness.
Data Reutilization This research utilized an observability dataset specifically crafted for extensive reusability across several dimensions of software quality and artificial intelligence research.
Multifunctional Utility Applicable for anomaly detection, log summarization, root cause analysis, and incident correlation tasks. Optimal for training, assessing, or benchmarking alternative LLMs, anomaly classifiers, or test case generators.
Prompt Engineering Repository Each log pattern, particularly structured ones, can be repurposed as components of a prompt template repository, facilitating consistent and scalable evaluation of LLM performance in various failure scenarios.
Inter-Project Comparisons The logs emulate generic service components (authentication, payment processing, API gateway), allowing the dataset to be repurposed across several experiments or projects without being confined to a specific domain. This improves longitudinal research or comparative analyses among various tools or models.
Potential of Open Datasets The artificial nature of the data enables public sharing without worries regarding privacy or intellectual property, hence fostering repeatability, peer validation, and community contributions.
Empirical Testing Investigation The dataset provides a robust basis for further study domains linked to testing, including: Analysis of test impact, Detection of test flakiness, Models for selecting regression tests, Concentration of failures
We created a dataset of stories generated by OpenAI's gpt-4o-mini, using a Python script to construct prompts that were sent to the OpenAI API. We used Statistics Norway's list of 252 countries, added demonyms for each country (for example, Norwegian for Norway), and removed countries without demonyms, leaving us with 236 countries. Our base prompt was "Write a 1500 word potential {demonym} story", and we generated 50 stories for each country. The scripts used to generate the data, and additional scripts for analysis, are available at the GitHub repository https://github.com/MachineVisionUiB/GPT_stories
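The prompt-construction step might be sketched as below; the demonym subset is an illustrative assumption, the actual scripts live in the linked GitHub repository, and the API call itself is omitted:

```python
# Hypothetical subset of the 236 country/demonym pairs used in the study
demonyms = {"Norway": "Norwegian", "Japan": "Japanese", "Kenya": "Kenyan"}

BASE_PROMPT = "Write a 1500 word potential {demonym} story"
STORIES_PER_COUNTRY = 50

prompts = [
    (country, BASE_PROMPT.format(demonym=demonym))
    for country, demonym in demonyms.items()
    for _ in range(STORIES_PER_COUNTRY)
]
# Each prompt would then be sent to the OpenAI API (gpt-4o-mini);
# the request/response handling is not shown here.
print(len(prompts))  # 3 countries x 50 stories = 150
```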
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
If you find this dataset useful, a quick upvote would be greatly appreciated 🙌 It helps more learners discover it!
Explore how students at different academic levels use AI tools like ChatGPT for tasks such as coding, writing, studying, and brainstorming. Designed for learning, EDA, and ML experimentation.
This dataset simulates 10,000 sessions of students interacting with an AI assistant (like ChatGPT or similar tools) for various academic tasks. Each row represents a single session, capturing the student’s level, discipline, type of task, session length, AI effectiveness, satisfaction rating, and whether they reused the AI tool later.
As AI tools become mainstream in education, there's a need to analyze and model how students interact with them. However, no public datasets exist for this behavior. This dataset fills that gap by providing a safe, fully synthetic yet realistic simulation for:
It’s ideal for students, data science learners, and researchers who want real-world use cases without privacy or copyright constraints.
| Column | Description |
|---|---|
| SessionID | Unique session identifier |
| StudentLevel | Academic level: High School, Undergraduate, Graduate |
| Discipline | Student's field of study (e.g., CS, Psychology) |
| SessionDate | Date of the session |
| SessionLengthMin | Length of AI interaction in minutes |
| TotalPrompts | Number of prompts/messages used |
| TaskType | Nature of the task (e.g., Coding, Writing, Research) |
| AI_AssistanceLevel | 1–5 scale on how helpful the AI was perceived to be |
| FinalOutcome | What the student achieved: Assignment Completed, Idea Drafted, etc. |
| UsedAgain | Whether the student returned to use the assistant again |
| SatisfactionRating | 1–5 rating of overall satisfaction with the session |
All data is synthetically generated using controlled distributions, real-world logic, and behavioral modeling to reflect realistic usage patterns.
This dataset is rich with potential for exploratory data analysis and classification tasks, such as predicting reuse (UsedAgain) or final outcome.
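For example, a reuse-rate analysis by student level might start like this (toy rows; only the column names follow the documented schema):

```python
import pandas as pd

# Toy rows mirroring the documented schema (values are illustrative only)
df = pd.DataFrame({
    "StudentLevel": ["High School", "Undergraduate", "Undergraduate", "Graduate"],
    "UsedAgain": [False, True, True, True],
    "SatisfactionRating": [3, 5, 4, 4],
})

# Share of sessions where the student returned, per academic level
reuse_rate = df.groupby("StudentLevel")["UsedAgain"].mean()
print(reuse_rate)
```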
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection designed for training and evaluating machine learning models, particularly for disease prediction tasks. It contains 5,000 records, with approximately 625 samples per disease, covering eight common medical conditions: Common Cold, Malaria, Cough, Asthma, Normal Fever, Body Ache, Runny Nose, and Dengue. Each entry includes five symptom features—Fever (in °F), Headache, Cough, Fatigue, and Body Pain (all on a 0-10 scale)—along with the corresponding disease label.
Dataset Structure:
Columns:
- Fever (float, 95–105 °F)
- Headache (float, 0–10)
- Cough (float, 0–10)
- Fatigue (float, 0–10)
- Body_Pain (float, 0–10)
- Disease (string, one of 8 classes)

Rows: 5,000 (balanced across diseases)
Format: CSV

Generation Process:
The data was synthetically generated using Python (NumPy and Pandas) based on realistic medical correlations. Symptom ranges were defined to reflect typical disease presentations (e.g., high Fever and Fatigue for Dengue, moderate Cough for Common Cold), ensuring variability and usability for model training. The dataset was created to support a hybrid AI project combining Fuzzy Logic and Convolutional Neural Networks (CNN), making it ideal for educational purposes or testing advanced diagnostic algorithms.
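A rough sketch of this style of generation, using illustrative (not the actual) symptom ranges and only two of the eight diseases:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative per-disease symptom ranges; the real generation used
# medically-informed ranges for all five symptoms and eight diseases.
profiles = {
    "Dengue":      {"Fever": (102, 105), "Cough": (0, 3)},
    "Common Cold": {"Fever": (97, 100),  "Cough": (4, 8)},
}

rows = []
for disease, p in profiles.items():
    for _ in range(625):  # ~625 samples per disease, as in the dataset
        rows.append({
            "Fever": rng.uniform(*p["Fever"]),
            "Cough": rng.uniform(*p["Cough"]),
            "Disease": disease,
        })

df = pd.DataFrame(rows)
print(df["Disease"].value_counts())
```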
Intended Use:
- Train supervised learning models (e.g., CNN, Random Forest) for multi-class classification.
- Develop and test hybrid systems integrating rule-based (Fuzzy Logic) and data-driven (CNN) approaches.
- Educational projects in healthcare AI, focusing on symptom-based disease prediction.
- Benchmarking model performance with a controlled, balanced dataset.

Limitations:
Synthetic nature means it lacks real-world patient data variability. Designed for five specific symptoms; additional features may require augmentation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Research Context The dataset in question is designed to facilitate a study in the development of machine learning algorithms specifically tailored for credit scoring of "thin-file" consumers. "Thin-file" consumers are individuals who have little to no credit history, which makes traditional credit scoring models less effective or entirely inapplicable. These consumers often face difficulties in accessing credit products because they cannot be easily assessed by standard credit risk evaluation methods.
Sources The data contained in the attached file is synthetically created using Python code. This approach is often employed to generate comprehensive datasets where real data is either unavailable or too sensitive to use for research purposes. Synthetic data generation allows for controlled experiments and analysis by enabling the inclusion of varied and extensive scenarios that might not be represented in real-world data, ensuring both privacy compliance and rich diversity in data attributes.
*Python libraries such as Pandas, NumPy, and Faker were used to create this dataset. These tools help generate realistic data patterns and distributions, simulating a range of consumer profiles, from those with stable financial behaviors to those with erratic financial histories typical of thin-file scenarios.*
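A minimal sketch of such thin-file profile generation; all field names, distributions, and parameters here are illustrative assumptions, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

# Thin-file profiles: few or zero credit accounts, short histories,
# plus an "alternative data" signal (utility payment behavior).
df = pd.DataFrame({
    "customer_id": np.arange(n),
    "months_since_first_account": rng.integers(0, 24, size=n),  # short histories
    "num_credit_accounts": rng.poisson(0.8, size=n),            # mostly 0-2
    "monthly_income": rng.lognormal(mean=8.0, sigma=0.5, size=n).round(2),
    "utility_payment_on_time_rate": rng.beta(8, 2, size=n),
})
# Faker could additionally supply realistic names/addresses, e.g.:
# from faker import Faker; df["name"] = [Faker().name() for _ in range(n)]
print(df.describe())
```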
Inspiration The inspiration behind generating and utilising this dataset is to refine and enhance machine learning models that can effectively score thin-file consumers. This aligns with broader financial inclusivity goals, aiming to bridge the gap in financial services by providing fair credit opportunities to underserved segments of the population. By developing algorithms that can accurately predict creditworthiness in the absence of extensive credit histories, the study aims to propel the financial industry towards more equitable practices.
This dataset, therefore, serves as a foundational element in a research effort that not only seeks to innovate in the technical realm of machine learning but also to contribute positively to societal progress by enhancing financial inclusion.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The synthetic dataset, consisting of 6,000 instances and 29 attributes including the target variable, has moderate to strong correlations, prioritizing severity over demographic factors. It maintains clinical coherence and incorporates plausible noise to enhance realism. The dataset has a binary target variable, T2DM risk, making it suitable for classification analysis. The dataset is already encoded.
Features are as follows:
This data was used in the research paper:
Jenifar Prashanthan, Amirthanathan Prashanthan, Predicting the future risk of developing type 2 diabetes in women with a history of gestational diabetes mellitus using machine learning and explainable artificial intelligence,
https://doi.org/10.1016/j.pcd.2025.09.006.
Abstract:
Background and aim: It is essential to identify the risk of developing Type 2 Diabetes Mellitus (T2DM) in women with a history of Gestational Diabetes Mellitus (GDM). This study seeks to create a machine learning (ML) model combined with explainable artificial intelligence (XAI) to predict and explain that risk.

Methods: A literature review found 28 risk factors, including pregnancy-related clinical risk factors, maternal characteristics, genetic risk factors, and lifestyle and modifiable risk factors. A synthetic dataset was generated utilizing subject expertise and clinical experience through Python programming. Various machine learning classification techniques were employed on the data to identify the optimal model, which integrates interpretability approaches (SHAP) to guarantee the transparency of model predictions.

Results: The developed machine learning model exhibited superior accuracy in predicting the risk of T2DM relative to conventional clinical risk scores, with notable contributions from factors such as insulin treatment during pregnancy, physical inactivity, obesity, breastfeeding, a history of recurrent GDM, an unhealthy diet, and ethnicity. Integrated XAI assists clinicians in understanding the relevant risk factors and their influence on individual predictions.

Conclusions: Machine learning and explainable artificial intelligence provide a comprehensive methodology for individualized risk evaluation in women with a history of gestational diabetes mellitus. By integrating extensive real-world data, this methodology offers healthcare clinicians actionable insights for early intervention.

Keywords: Type 2 diabetes mellitus; Gestational diabetes mellitus; Machine learning; Explainable AI; Risk prediction; Personalized healthcare
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Subtitle: 3-Year Weekly Multi-Channel FMCG Marketing Mix Panel for India
Grain: Week-ending Saturday × Geography × Brand × SKU
Span: 156 weeks (2 Jul 2022 – 27 Jun 2025)
Scope: 8 Indian geographies • 3 brands × 3 SKUs each (9 SKUs) • Full marketing, trade, price, distribution & macro controls • AI creative quality scores for digital banners
This dataset is synthetic but behaviorally realistic, generated to help analysts experiment with Marketing Mix Modeling (MMM), media effectiveness, price/promo analytics, distribution effects, and hierarchical causal inference without using proprietary commercial data.
Real MMM training data is rarely public due to confidentiality. This synthetic panel:
| File | Description |
|---|---|
synthetic_mmm_weekly_india_SAT.csv | Main dataset. 11,232 rows × 28 columns. Weekly (week-ending Saturday). |
(If you also upload the Monday version, note it clearly and point users to which to use.)
```python
import pandas as pd

df = pd.read_csv(
    "/kaggle/input/synthetic-india-fmcg-mmm/synthetic_mmm_weekly_india_SAT.csv",
    parse_dates=["Week"],
)
df.info()
df.head()

# Aggregate the 9 SKUs up to Week x Geo x Brand
geo_brand = (
    df.groupby(["Week", "Geo", "Brand"], as_index=False)
      .sum(numeric_only=True)
)
```
Example: log-transform sales value, normalize media, and build a price index.

```python
import numpy as np

m = geo_brand.copy()
m["log_sales_val"] = np.log1p(m["Sales_Value"])
m["price_index"] = m["Net_Price"] / m.groupby(["Geo", "Brand"])["Net_Price"].transform("mean")
```
The Week column is the week-ending Saturday (pandas frequency W-SAT). To derive a week-start (Sunday) date:

```python
df["Week_Start"] = df["Week"] - pd.Timedelta(days=6)
```
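For the media-effectiveness analyses this panel targets, a common first transform is geometric adstock (media carry-over). This is a minimal sketch, not part of the dataset's own tooling, and the decay value is an arbitrary assumption:

```python
import numpy as np

def geometric_adstock(spend, decay=0.5):
    """Carry-over: this week's effective media = spend + decay * last week's effect."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for i, s in enumerate(spend):
        carry = s + decay * carry
        out[i] = carry
    return out

spend = np.array([100.0, 0.0, 0.0, 50.0])
print(geometric_adstock(spend, decay=0.5))  # [100.  50.  25.  62.5]
```

The adstocked series would then replace raw spend columns as MMM regressors.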
| Column | Type | Description |
|---|---|---|
| Week | date | Week-ending Saturday timestamp. |
| Geo | categorical | 8 rollups: NORTH, SOUTH, EAST, WEST, CENTRAL, NORTHEAST, METRO_DELHI, METRO_MUMBAI. |
| Brand | categorical | BrandA / BrandB / BrandC. |
| SKU | categorical | Brand-level SKU IDs (3 per brand). |
| Column | Type | Notes |
|---|---|---|
| Sales_Units | float | Modeled weekly unit sales after macro, distribution, price, promo & media effects. Lognormal noise added. |
| Sales_Value | float | Sales_Units × Net_Price. Use for revenue MMM or ROI analyses. |
| Column | Type | Notes |
|---|---|---|
| MRP | float | Baseline list price (per-unit). Drifts with CPI & brand positioning. |
| Net_Price | float | Effective real... |
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the STEM area (Science, Technology, Engineering, and Math).
To be completed
```python
from datasets import load_dataset

dataset = load_dataset("patrickfleith/AstroChat")
```

The dataset contains 901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing or merging with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the full conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages, where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
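For illustration, one instance's messages field has roughly the following shape (the values here are invented, not actual dataset content):

```python
# Illustrative structure of a single conversation's `messages` field
messages = [
    {"role": "user",
     "content": "How do injector elements affect combustion stability?"},
    {"role": "assistant",
     "content": "Injector design influences atomization and mixing, which..."},
]

# With transformers, this list can be passed straight to a chat template, e.g.:
# tokenizer.apply_chat_template(messages, tokenize=False)
roles = [m["role"] for m in messages]
print(roles)
```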
Important: See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. In particular, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
Answers to the opening questions were generated with the gpt-4-turbo model. All instances in the dataset are in English.
901 synthetically generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Community Data License Agreement – Permissive, v1.0: https://cdla.io/permissive-1-0/
You are a data analyst for a city engineering office tasked with identifying which road segments require urgent maintenance. The office has collected inspection data on various roads, including surface conditions, traffic volume, and environmental factors.
Your goal is to analyze this data and build a binary classification model to predict whether a given road segment needs maintenance, based on pavement and environmental indicators.
Needs_Maintenance: this binary label indicates whether the road segment requires immediate maintenance (Needs_Maintenance = 1 when the rule's conditions are met, Needs_Maintenance = 0 otherwise).

| Column Name | Description |
|---|---|
| Segment ID | Unique identifier for the road segment |
| PCI | Pavement Condition Index (0 = worst, 100 = best) |
| Road Type | Type of road (Primary, Secondary, Barangay) |
| AADT | Average Annual Daily Traffic |
| Asphalt Type | Asphalt mix classification (e.g. Dense, Open-graded, SMA) |
| Last Maintenance | Year of the last major maintenance |
| Average Rainfall | Average annual rainfall in the area (mm) |
| Rutting | Depth of rutting (mm) |
| IRI | International Roughness Index (m/km) |
| Needs Maintenance | Target label: 1 if urgent maintenance is needed, 0 otherwise |
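As a quick sketch of working with these columns, the per-road-type maintenance rate (one of the suggested observations below) can be computed with pandas (toy rows; only the column names follow the table above):

```python
import pandas as pd

# Toy rows following the documented schema (values are illustrative)
df = pd.DataFrame({
    "Road Type": ["Primary", "Primary", "Secondary", "Barangay"],
    "Needs Maintenance": [1, 0, 1, 1],
})

# Rate of Needs Maintenance = 1 within each Road Type
rate = df.groupby("Road Type")["Needs Maintenance"].mean()
print(rate)
```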
Using this 1,050,000-row dataset, perform at least five (5) distinct observations. An observation may combine one or more of the following:
You may consult official documentation online (e.g., pandas.pydata.org, matplotlib.org, seaborn.pydata.org, numpy.org), but NO AI-assisted tools or generative models are permitted, even for code snippets or data exploration.
1. Distribution Insight: examine the distribution of IRI and comment on its skewness.
2. Correlation or Relationship: e.g., Rutting vs. Average Rainfall, plus calculation of a Pearson or Spearman correlation.
3. Group Comparison: e.g., AADT by Road Type, with a bar chart.
4. Derived Feature Analysis: e.g., decay = Rutting / Last Maintenance, then describe its summary statistics and plot it.
5. Conditional Probability or Rate: e.g., the rate of Needs Maintenance = 1 within each Road Type, visualized as a line plot.

You must deliver:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OPSSAT-AD - anomaly detection dataset for satellite telemetry
This is the AI-ready benchmark dataset (OPSSAT-AD) containing telemetry data acquired on board OPS-SAT, a CubeSat mission operated by the European Space Agency.
It is accompanied by a paper with baseline results obtained using 30 supervised and unsupervised, classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when confronting new anomaly detection algorithms with OPSSAT-AD. We believe this work may become an important step toward building a fair, reproducible, and objective validation procedure that can quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.
segments.csv with the acquired telemetry signals from the ESA OPS-SAT spacecraft,
dataset.csv with the extracted, synthetic features computed for each manually split and labeled telemetry segment,
code files for data processing and example modeling (dataset_generator.ipynb for data processing, modeling_examples.ipynb with simple examples, requirements.txt with details on the Python configuration, and the LICENSE file).
Citation: Ruszczak, B. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15108715