Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge
For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
Flat Data
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical

Sequential Data
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical
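The "not significantly closer to the released samples than to the holdout" criterion is commonly checked with a distance-to-closest-record (DCR) comparison. The sketch below is illustrative only, runs on random stand-in data, and is not the competition's official evaluation code:

```python
import numpy as np

def dcr(synthetic, reference):
    """Distance from each synthetic row to its closest record in `reference`."""
    # Pairwise Euclidean distances via broadcasting: shape (n_syn, n_ref)
    d = np.linalg.norm(synthetic[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 5))     # stand-in for the released originals
holdout = rng.normal(size=(1000, 5))   # stand-in for the unreleased holdout
synthetic = rng.normal(size=(500, 5))  # stand-in for a submission

# A well-generalizing generator should land near 0.5: synthetic rows are
# no closer to the training records than to the holdout records.
share_closer_to_train = (dcr(synthetic, train) < dcr(synthetic, holdout)).mean()
print(f"share closer to train: {share_closer_to_train:.2f}")
```

A share far above 0.5 would suggest the generator is memorizing released samples rather than generalizing.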
If you use this dataset in your research, please cite:
@dataset{mostlyaiprize,
author = {MOSTLY AI},
title = {MOSTLY AI Prize Dataset},
year = {2025},
url = {https://www.mostlyaiprize.com/},
}
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Observability data, consisting of structured logs, metrics, and traces, is rapidly emerging as the foundation of modern DevOps practice, enabling real-time insight into system health and performance.
Significance of Data Collected During Dataset Development - Authenticity and Representativeness The synthetic dataset created for this study accurately represents realistic runtime events, encompassing successful operations, transient failures, and severe exceptions across several service components and programming languages (Java, Python, Go, etc.). The dataset simulates real-world logging diversity by integrating various log formats: structured (JSON), semi-structured (logfmt/bracketed), and unstructured (console errors, stack traces), enhancing the model's robustness and transferability.
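For illustration, the three format families might look like the following invented log lines (these are not samples from the dataset):

```python
import json

# Structured: a machine-parseable JSON log line
structured = json.dumps({"level": "ERROR", "service": "AuthService",
                         "msg": "invalid session token", "lang": "python"})

# Semi-structured: logfmt-style key=value pairs
semi_structured = 'level=error service=OrderProcessor msg="timeout after 30s"'

# Unstructured: a raw console stack trace
unstructured = ("Exception in thread main java.lang.NullPointerException\n"
                "    at OrderProcessor.run(OrderProcessor.java:42)")

parsed = json.loads(structured)
print(parsed["service"])  # only the JSON variant parses directly
```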
Multilingual and Multi-Component Scope Logs were deliberately annotated with language identifiers (e.g., Python, JavaScript, C#) and microservice names (AuthService, OrderProcessor), enabling Custom ChatGPT to discern correlations between language-specific issue patterns and their likely causes or potential solutions. This renders the dataset particularly helpful in multilingual contexts.
Introduction of Edge Cases and Anomalies To ensure meaningful model behavior, the dataset incorporates edge scenarios such as:
Null pointer dereferences, timeout exceptions, memory errors, and invalid user session tokens. These anomalies were systematically introduced to cover a broad range of failure patterns, allowing GPT to reason about targeted test generation.
Structured Data for Optimization of Large Language Models The dataset comprises metadata fields including:
component, language, severity level, timestamp, and session/user identification. This allows the LLM to perform conditional reasoning, context filtering, and test case relevance scoring, which are essential for prioritization tasks.
Customization Without Training In contrast to conventional ML pipelines that require retraining on this data, our methodology uses the dataset for prompt engineering and functional context embedding, thus maintaining both model efficacy and cost-effectiveness.
Data Reutilization This research utilized an observability dataset specifically crafted for extensive reusability across several dimensions of software quality and artificial intelligence research.
Multifunctional Utility Applicable for anomaly detection, log summarization, root cause analysis, and incident correlation tasks. Optimal for training, assessing, or benchmarking alternative LLMs, anomaly classifiers, or test case generators.
Prompt Engineering Repository Each log pattern, particularly structured ones, can be repurposed as components of a prompt template repository, facilitating consistent and scalable evaluation of LLM performance in various failure scenarios.
Inter-Project Comparisons The logs emulate generic service components (authentication, payment processing, API gateway), allowing the dataset to be repurposed across several experiments or projects without being confined to a specific domain. This improves longitudinal research or comparative analyses among various tools or models.
Potential of Open Datasets The artificial nature of the data enables public sharing without worries regarding privacy or intellectual property, hence fostering repeatability, peer validation, and community contributions.
Empirical Testing Investigation The dataset provides a robust basis for further study domains linked to testing, including: Analysis of test impact, Detection of test flakiness, Models for selecting regression tests, Concentration of failures
We created a dataset of stories generated by OpenAI's gpt-4o-mini, using a Python script to construct prompts that were sent to the OpenAI API. We used Statistics Norway's list of 252 countries, added demonyms for each country (for example, Norwegian for Norway), and removed countries without demonyms, leaving us with 236 countries. Our base prompt was "Write a 1500 word potential {demonym} story", and we generated 50 stories for each country. The scripts used to generate the data, and additional scripts for analysis, are available at the GitHub repository https://github.com/MachineVisionUiB/GPT_stories
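The prompt-construction step might be sketched as below; the demonym subset is an illustrative assumption, the actual scripts live in the linked GitHub repository, and the API call itself is omitted:

```python
# Hypothetical subset of the 236 country/demonym pairs used in the study
demonyms = {"Norway": "Norwegian", "Japan": "Japanese", "Kenya": "Kenyan"}

BASE_PROMPT = "Write a 1500 word potential {demonym} story"
STORIES_PER_COUNTRY = 50

prompts = [
    (country, BASE_PROMPT.format(demonym=demonym))
    for country, demonym in demonyms.items()
    for _ in range(STORIES_PER_COUNTRY)
]
# Each prompt would then be sent to the OpenAI API (gpt-4o-mini);
# the request/response handling is not shown here.
print(len(prompts))  # 3 countries x 50 stories = 150
```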
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
If you find this dataset useful, a quick upvote would be greatly appreciated 🙌 It helps more learners discover it!
Explore how students at different academic levels use AI tools like ChatGPT for tasks such as coding, writing, studying, and brainstorming. Designed for learning, EDA, and ML experimentation.
This dataset simulates 10,000 sessions of students interacting with an AI assistant (like ChatGPT or similar tools) for various academic tasks. Each row represents a single session, capturing the student’s level, discipline, type of task, session length, AI effectiveness, satisfaction rating, and whether they reused the AI tool later.
As AI tools become mainstream in education, there's a need to analyze and model how students interact with them. However, no public datasets exist for this behavior. This dataset fills that gap by providing a safe, fully synthetic yet realistic simulation for:
It’s ideal for students, data science learners, and researchers who want real-world use cases without privacy or copyright constraints.
| Column | Description |
|---|---|
| SessionID | Unique session identifier |
| StudentLevel | Academic level: High School, Undergraduate, Graduate |
| Discipline | Student's field of study (e.g., CS, Psychology) |
| SessionDate | Date of the session |
| SessionLengthMin | Length of AI interaction in minutes |
| TotalPrompts | Number of prompts/messages used |
| TaskType | Nature of the task (e.g., Coding, Writing, Research) |
| AI_AssistanceLevel | 1–5 scale on how helpful the AI was perceived to be |
| FinalOutcome | What the student achieved: Assignment Completed, Idea Drafted, etc. |
| UsedAgain | Whether the student returned to use the assistant again |
| SatisfactionRating | 1–5 rating of overall satisfaction with the session |
All data is synthetically generated using controlled distributions, real-world logic, and behavioral modeling to reflect realistic usage patterns.
This dataset is rich with potential for exploratory data analysis and classification tasks, such as predicting reuse (UsedAgain) or final outcome.
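For example, a reuse-rate analysis by student level might start like this (toy rows; only the column names follow the documented schema):

```python
import pandas as pd

# Toy rows mirroring the documented schema (values are illustrative only)
df = pd.DataFrame({
    "StudentLevel": ["High School", "Undergraduate", "Undergraduate", "Graduate"],
    "UsedAgain": [False, True, True, True],
    "SatisfactionRating": [3, 5, 4, 4],
})

# Share of sessions where the student returned, per academic level
reuse_rate = df.groupby("StudentLevel")["UsedAgain"].mean()
print(reuse_rate)
```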
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection designed for training and evaluating machine learning models, particularly for disease prediction tasks. It contains 5,000 records, with approximately 625 samples per disease, covering eight common medical conditions: Common Cold, Malaria, Cough, Asthma, Normal Fever, Body Ache, Runny Nose, and Dengue. Each entry includes five symptom features—Fever (in °F), Headache, Cough, Fatigue, and Body Pain (all on a 0-10 scale)—along with the corresponding disease label.
Dataset Structure:
Columns:
- Fever (float, 95–105 °F)
- Headache (float, 0–10)
- Cough (float, 0–10)
- Fatigue (float, 0–10)
- Body_Pain (float, 0–10)
- Disease (string, one of 8 classes)

Rows: 5,000 (balanced across diseases)
Format: CSV

Generation Process:
The data was synthetically generated using Python (NumPy and Pandas) based on realistic medical correlations. Symptom ranges were defined to reflect typical disease presentations (e.g., high Fever and Fatigue for Dengue, moderate Cough for Common Cold), ensuring variability and usability for model training. The dataset was created to support a hybrid AI project combining Fuzzy Logic and Convolutional Neural Networks (CNN), making it ideal for educational purposes or testing advanced diagnostic algorithms.
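A rough sketch of this style of generation, using illustrative (not the actual) symptom ranges and only two of the eight diseases:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative per-disease symptom ranges; the real generation used
# medically-informed ranges for all five symptoms and eight diseases.
profiles = {
    "Dengue":      {"Fever": (102, 105), "Cough": (0, 3)},
    "Common Cold": {"Fever": (97, 100),  "Cough": (4, 8)},
}

rows = []
for disease, p in profiles.items():
    for _ in range(625):  # ~625 samples per disease, as in the dataset
        rows.append({
            "Fever": rng.uniform(*p["Fever"]),
            "Cough": rng.uniform(*p["Cough"]),
            "Disease": disease,
        })

df = pd.DataFrame(rows)
print(df["Disease"].value_counts())
```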
Intended Use:
- Train supervised learning models (e.g., CNN, Random Forest) for multi-class classification.
- Develop and test hybrid systems integrating rule-based (Fuzzy Logic) and data-driven (CNN) approaches.
- Educational projects in healthcare AI, focusing on symptom-based disease prediction.
- Benchmarking model performance with a controlled, balanced dataset.

Limitations:
Synthetic nature means it lacks real-world patient data variability. Designed for five specific symptoms; additional features may require augmentation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Research Context The dataset in question is designed to facilitate a study in the development of machine learning algorithms specifically tailored for credit scoring of "thin-file" consumers. "Thin-file" consumers are individuals who have little to no credit history, which makes traditional credit scoring models less effective or entirely inapplicable. These consumers often face difficulties in accessing credit products because they cannot be easily assessed by standard credit risk evaluation methods.
Sources The data contained in the attached file is synthetically created using Python code. This approach is often employed to generate comprehensive datasets where real data is either unavailable or too sensitive to use for research purposes. Synthetic data generation allows for controlled experiments and analysis by enabling the inclusion of varied and extensive scenarios that might not be represented in real-world data, ensuring both privacy compliance and rich diversity in data attributes.
*Python libraries such as Pandas, NumPy, and Faker were used to create this dataset. These tools help generate realistic data patterns and distributions, simulating a range of consumer profiles, from those with stable financial behaviors to those with erratic financial histories typical of thin-file scenarios.*
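A minimal sketch of such thin-file profile generation; all field names, distributions, and parameters here are illustrative assumptions, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

# Thin-file profiles: few or zero credit accounts, short histories,
# plus an "alternative data" signal (utility payment behavior).
df = pd.DataFrame({
    "customer_id": np.arange(n),
    "months_since_first_account": rng.integers(0, 24, size=n),  # short histories
    "num_credit_accounts": rng.poisson(0.8, size=n),            # mostly 0-2
    "monthly_income": rng.lognormal(mean=8.0, sigma=0.5, size=n).round(2),
    "utility_payment_on_time_rate": rng.beta(8, 2, size=n),
})
# Faker could additionally supply realistic names/addresses, e.g.:
# from faker import Faker; df["name"] = [Faker().name() for _ in range(n)]
print(df.describe())
```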
Inspiration The inspiration behind generating and utilising this dataset is to refine and enhance machine learning models that can effectively score thin-file consumers. This aligns with broader financial inclusivity goals, aiming to bridge the gap in financial services by providing fair credit opportunities to underserved segments of the population. By developing algorithms that can accurately predict creditworthiness in the absence of extensive credit histories, the study aims to propel the financial industry towards more equitable practices.
This dataset, therefore, serves as a foundational element in a research effort that not only seeks to innovate in the technical realm of machine learning but also to contribute positively to societal progress by enhancing financial inclusion.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The synthetic dataset, consisting of 6,000 instances and 29 attributes including the target variable, has moderate to strong correlations, prioritizing severity over demographic factors. It maintains clinical coherence and incorporates plausible noise to enhance realism. The dataset has a binary target variable, T2DM risk, making it suitable for classification analysis. The dataset is already encoded.
Features are as follows:
This data was used in the research paper:
Jenifar Prashanthan, Amirthanathan Prashanthan, Predicting the future risk of developing type 2 diabetes in women with a history of gestational diabetes mellitus using machine learning and explainable artificial intelligence,
https://doi.org/10.1016/j.pcd.2025.09.006.
Abstract:
Background and aim: It is essential to identify the risk of developing Type 2 Diabetes Mellitus (T2DM) in women with a history of Gestational Diabetes Mellitus (GDM). This study seeks to create a machine learning (ML) model combined with explainable artificial intelligence (XAI) to predict and explain that risk.

Methods: A literature review found 28 risk factors, including pregnancy-related clinical risk factors, maternal characteristics, genetic risk factors, and lifestyle and modifiable risk factors. A synthetic dataset was generated utilizing subject expertise and clinical experience through Python programming. Various machine learning classification techniques were employed on the data to identify the optimal model, which integrates interpretability approaches (SHAP) to guarantee the transparency of model predictions.

Results: The developed machine learning model exhibited superior accuracy in predicting the risk of T2DM relative to conventional clinical risk scores, with notable contributions from factors such as insulin treatment during pregnancy, physical inactivity, obesity, breastfeeding, a history of recurrent GDM, an unhealthy diet, and ethnicity. Integrated XAI assists clinicians in understanding the relevant risk factors and their influence on individual predictions.

Conclusions: Machine learning and explainable artificial intelligence provide a comprehensive methodology for individualized risk evaluation in women with a history of gestational diabetes mellitus. By integrating extensive real-world data, this methodology offers healthcare clinicians actionable insights for early intervention.

Keywords: Type 2 diabetes mellitus; Gestational diabetes mellitus; Machine learning; Explainable AI; Risk prediction; Personalized healthcare
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Subtitle: 3-Year Weekly Multi-Channel FMCG Marketing Mix Panel for India
Grain: Week-ending Saturday × Geography × Brand × SKU
Span: 156 weeks (2 Jul 2022 – 27 Jun 2025)
Scope: 8 Indian geographies • 3 brands × 3 SKUs each (9 SKUs) • Full marketing, trade, price, distribution & macro controls • AI creative quality scores for digital banners
This dataset is synthetic but behaviorally realistic, generated to help analysts experiment with Marketing Mix Modeling (MMM), media effectiveness, price/promo analytics, distribution effects, and hierarchical causal inference without using proprietary commercial data.
Real MMM training data is rarely public due to confidentiality. This synthetic panel:
| File | Description |
|---|---|
synthetic_mmm_weekly_india_SAT.csv | Main dataset. 11,232 rows × 28 columns. Weekly (week-ending Saturday). |
(If you also upload the Monday version, note it clearly and point users to which to use.)
```python
import pandas as pd

df = pd.read_csv(
    "/kaggle/input/synthetic-india-fmcg-mmm/synthetic_mmm_weekly_india_SAT.csv",
    parse_dates=["Week"],
)
df.info()
df.head()

# Aggregate the 9 SKUs up to Week x Geo x Brand
geo_brand = (
    df.groupby(["Week", "Geo", "Brand"], as_index=False)
      .sum(numeric_only=True)
)
```
Example: log-transform sales value, normalize media, and build a price index.

```python
import numpy as np

m = geo_brand.copy()
m["log_sales_val"] = np.log1p(m["Sales_Value"])
m["price_index"] = m["Net_Price"] / m.groupby(["Geo", "Brand"])["Net_Price"].transform("mean")
```
The Week column is the week-ending Saturday (pandas frequency W-SAT). To derive a week-start (Sunday) date:

```python
df["Week_Start"] = df["Week"] - pd.Timedelta(days=6)
```
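For the media-effectiveness analyses this panel targets, a common first transform is geometric adstock (media carry-over). This is a minimal sketch, not part of the dataset's own tooling, and the decay value is an arbitrary assumption:

```python
import numpy as np

def geometric_adstock(spend, decay=0.5):
    """Carry-over: this week's effective media = spend + decay * last week's effect."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for i, s in enumerate(spend):
        carry = s + decay * carry
        out[i] = carry
    return out

spend = np.array([100.0, 0.0, 0.0, 50.0])
print(geometric_adstock(spend, decay=0.5))  # [100.  50.  25.  62.5]
```

The adstocked series would then replace raw spend columns as MMM regressors.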
| Column | Type | Description |
|---|---|---|
| Week | date | Week-ending Saturday timestamp. |
| Geo | categorical | 8 rollups: NORTH, SOUTH, EAST, WEST, CENTRAL, NORTHEAST, METRO_DELHI, METRO_MUMBAI. |
| Brand | categorical | BrandA / BrandB / BrandC. |
| SKU | categorical | Brand-level SKU IDs (3 per brand). |
| Column | Type | Notes |
|---|---|---|
| Sales_Units | float | Modeled weekly unit sales after macro, distribution, price, promo & media effects. Lognormal noise added. |
| Sales_Value | float | Sales_Units × Net_Price. Use for revenue MMM or ROI analyses. |
| Column | Type | Notes |
|---|---|---|
| MRP | float | Baseline list price (per-unit). Drifts with CPI & brand positioning. |
| Net_Price | float | Effective real... |
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the STEM area (Science, Technology, Engineering, and Math).
To be completed
```python
from datasets import load_dataset

dataset = load_dataset("patrickfleith/AstroChat")
```

The dataset contains 901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing or merging with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the full conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages, where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
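For illustration, one instance's messages field has roughly the following shape (the values here are invented, not actual dataset content):

```python
# Illustrative structure of a single conversation's `messages` field
messages = [
    {"role": "user",
     "content": "How do injector elements affect combustion stability?"},
    {"role": "assistant",
     "content": "Injector design influences atomization and mixing, which..."},
]

# With transformers, this list can be passed straight to a chat template, e.g.:
# tokenizer.apply_chat_template(messages, tokenize=False)
roles = [m["role"] for m in messages]
print(roles)
```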
Important: See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. In particular, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
Answers to the opening questions were generated with the gpt-4-turbo model. All instances in the dataset are in English.
901 synthetically generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Community Data License Agreement – Permissive, v1.0: https://cdla.io/permissive-1-0/
You are a data analyst for a city engineering office tasked with identifying which road segments require urgent maintenance. The office has collected inspection data on various roads, including surface conditions, traffic volume, and environmental factors.
Your goal is to analyze this data and build a binary classification model to predict whether a given road segment needs maintenance, based on pavement and environmental indicators.
Needs_Maintenance: this binary label indicates whether the road segment requires immediate maintenance (Needs_Maintenance = 1 when the rule's conditions are met, Needs_Maintenance = 0 otherwise).

| Column Name | Description |
|---|---|
| Segment ID | Unique identifier for the road segment |
| PCI | Pavement Condition Index (0 = worst, 100 = best) |
| Road Type | Type of road (Primary, Secondary, Barangay) |
| AADT | Average Annual Daily Traffic |
| Asphalt Type | Asphalt mix classification (e.g. Dense, Open-graded, SMA) |
| Last Maintenance | Year of the last major maintenance |
| Average Rainfall | Average annual rainfall in the area (mm) |
| Rutting | Depth of rutting (mm) |
| IRI | International Roughness Index (m/km) |
| Needs Maintenance | Target label: 1 if urgent maintenance is needed, 0 otherwise |
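As a quick sketch of working with these columns, the per-road-type maintenance rate (one of the suggested observations below) can be computed with pandas (toy rows; only the column names follow the table above):

```python
import pandas as pd

# Toy rows following the documented schema (values are illustrative)
df = pd.DataFrame({
    "Road Type": ["Primary", "Primary", "Secondary", "Barangay"],
    "Needs Maintenance": [1, 0, 1, 1],
})

# Rate of Needs Maintenance = 1 within each Road Type
rate = df.groupby("Road Type")["Needs Maintenance"].mean()
print(rate)
```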
Using this 1,050,000-row dataset, perform at least five (5) distinct observations. An observation may combine one or more of the following:
You may consult official documentation online (e.g., pandas.pydata.org, matplotlib.org, seaborn.pydata.org, numpy.org), but NO AI-assisted tools or generative models are permitted, even for code snippets or data exploration.
1. Distribution Insight: examine the distribution of IRI and comment on its skewness.
2. Correlation or Relationship: e.g., Rutting vs. Average Rainfall, plus calculation of a Pearson or Spearman correlation.
3. Group Comparison: e.g., AADT by Road Type, with a bar chart.
4. Derived Feature Analysis: e.g., decay = Rutting / Last Maintenance, then describe its summary statistics and plot it.
5. Conditional Probability or Rate: e.g., the rate of Needs Maintenance = 1 within each Road Type, visualized as a line plot.

You must deliver:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OPSSAT-AD - anomaly detection dataset for satellite telemetry
This is the AI-ready benchmark dataset (OPSSAT-AD) containing telemetry data acquired on board OPS-SAT, a CubeSat mission operated by the European Space Agency.
It is accompanied by a paper with baseline results obtained using 30 supervised and unsupervised, classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when confronting new anomaly detection algorithms with OPSSAT-AD. We believe this work may become an important step toward building a fair, reproducible, and objective validation procedure that can quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.
segments.csv with the acquired telemetry signals from the ESA OPS-SAT spacecraft,
dataset.csv with the extracted, synthetic features computed for each manually split and labeled telemetry segment,
code files for data processing and example modeling (dataset_generator.ipynb for data processing, modeling_examples.ipynb with simple examples, requirements.txt with details on the Python configuration, and the LICENSE file).
Citation: Ruszczak, B. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15108715