100+ datasets found

PASTA Data
kaggle.com
Updated Dec 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 10, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Google Research
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset contains human rater trajectories used in paper: "Preference Adaptive and Sequential Text-to-Image Generation".

We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" 'instead of "a temple"). At each turn, raters then select the column of images preferred most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

See our paper for a comprehensive description of the rater study.

Citation

Please cite our paper if you use it in your work.
Open Images
kaggle.com
opendatalab.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Labeled datasets are useful in machine learning research.

Content

This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

Tables: 1) annotations_bbox 2) dict 3) images 4) labels

Update Frequency: Quarterly

Querying BigQuery Tables

Fork this kernel to get started.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

https://cloud.google.com/bigquery/public-data/openimages

APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

Banner Photo by Mattias Diesel from Unsplash.

Inspiration

Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?
Data from: Spam email Dataset
kaggle.com
Updated Sep 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
_w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 1, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
_w1998
License
http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html
Description
Dataset Name: Spam Email Dataset

Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

Columns:

text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.

Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
Customer Shopping Trends Dataset
kaggle.com
Updated Oct 5, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 5, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sourav Banerjee
Description
Context

The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

Content

This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

Dataset Glossary (Column-wise)

Customer ID - Unique identifier for each customer

Age - Age of the customer

Gender - Gender of the customer (Male/Female)

Item Purchased - The item purchased by the customer

Category - Category of the item purchased

Purchase Amount (USD) - The amount of the purchase in USD

Location - Location where the purchase was made

Size - Size of the purchased item

Color - Color of the purchased item

Season - Season during which the purchase was made

Review Rating - Rating given by the customer for the purchased item

Subscription Status - Indicates if the customer has a subscription (Yes/No)

Shipping Type - Type of shipping chosen by the customer

Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)

Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)

Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction

Payment Method - Customer's most preferred payment method

Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)

Structure of the Dataset

https://i.imgur.com/6UEqejq.png" alt="">

Acknowledgement

This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

Cover Photo by: Freepik

Thumbnail by: Clothing icons created by Flat Icons - Flaticon
ICR-integer-data
kaggle.com
Updated May 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
raddar (2023). ICR-integer-data [Dataset]. https://www.kaggle.com/datasets/raddar/icr-integer-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 27, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
raddar
Description
The dataset contains https://www.kaggle.com/competitions/icr-identify-age-related-conditions competition dataset transformed into integerized data. The common denominator is found for each column. Distribution of even/odd numbers were performed to identify if some values should be a fraction.

Columns 'FL' and 'GL' were untouched, probably float by nature.

Please refer to notebook for exact transformations: https://www.kaggle.com/code/raddar/convert-icr-data-to-integers
Datasets for federated learning
kaggle.com
Updated Dec 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
wonghoitin (2022). Datasets for federated learning [Dataset]. https://www.kaggle.com/datasets/wonghoitin/datasets-for-federated-learning
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 29, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
wonghoitin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Federated learning is to build machine learning models based on data sets that are distributed across multiple devices while preventing data leakage.(Q. Yang et al. 2019)

source:

smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain

heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain

water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain

customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain

insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain

credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain

income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain

machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license: CC0: Public Domain

skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)

score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️
kaggle.com
Updated Mar 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sandeep SD (2025). 🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️ [Dataset]. https://www.kaggle.com/datasets/sandeep1080/bassburst
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 30, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sandeep SD
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
🌟 Enjoying the Dataset? 🌟

If this dataset helped you uncover new insights or make your day a little brighter. Thanks a ton for checking it out! Let’s keep those insights rolling! 🔥📈

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23961675%2Ff3761bd2d7ee460ad464de8f25634f63%2Fsteve-johnson-z6LlNgsDeug-unsplash.jpg?generation=1740481184467263&alt=media" alt="">

Dataset Description:

This dataset contains website conversion data for Bluetooth speaker sales. The dataset tracks user sessions on different landing page variants, with the primary goal of analyzing conversion rates, user behavior, and other factors influencing sales. It includes detailed user engagement metrics such as time spent, pages visited, device type, sign-in methods, and geographical information.

Use Case:

This dataset can be used for various analytical tasks including:

A/B testing and multivariate analysis to compare landing page designs.
User segmentation by demographics (age, gender, location, etc.).
Conversion rate optimization (CRO) analysis.
Predictive modeling for conversion likelihood based on session characteristics.
Revenue and payment analysis.
Car Price Prediction Challenge
kaggle.com
Updated Jul 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Deep Contractor (2022). Car Price Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 6, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Deep Contractor
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Assignment

Your notebooks must contain the following steps:

Perform data cleaning and pre-processing.

What steps did you use in this process and how did you clean your data.

Perform exploratory data analysis on the given dataset.

Explain each and every graphs that you make.

Train a ml-model and evaluate it using different metrics.

Why did you choose that particular model? What was the accuracy?

Hyperparameter optimization and feature selection is a plus.

Model deployment and use of ml-flow is a plus.

Perform model interpretation and show feature importance for your model.

Provide some explanation for the above point.

Future steps. Note: try to have your notebooks as presentable as possible.

Dataset Description

CSV file - 19237 rows x 18 columns (Includes Price Columns as Target)

Attributes

ID Price: price of the care(Target Column) Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags

Confused or have any doubts in the data column values? Check the dataset discussion tab!
Data from: AI & Computer Vision Dataset
kaggle.com
Updated Apr 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Khushi Yadav (2025). AI & Computer Vision Dataset [Dataset]. https://www.kaggle.com/datasets/khushikyad001/ai-and-computer-vision-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 1, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Khushi Yadav
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains metadata related to three categories of AI and computer vision applications:

Handwritten Math Solutions: Metadata on images of handwritten math problems with step-by-step solutions.

Multi-lingual Street Signs: Road sign images in various languages, with translations.

Security Camera Anomalies: Surveillance footage metadata distinguishing between normal and suspicious activities.

The dataset is useful for machine learning, image recognition, OCR (Optical Character Recognition), anomaly detection, and AI model training.
📣 Ad Click Prediction Dataset
kaggle.com
Updated Sep 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ciobanu Marius (2024). 📣 Ad Click Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/marius2303/ad-click-prediction-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 7, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ciobanu Marius
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
About

This dataset provides insights into user behavior and online advertising, specifically focusing on predicting whether a user will click on an online advertisement. It contains user demographic information, browsing habits, and details related to the display of the advertisement. This dataset is ideal for building binary classification models to predict user interactions with online ads.

Features

id: Unique identifier for each user.

full_name: User's name formatted as "UserX" for anonymity.

age: Age of the user (ranging from 18 to 64 years).

gender: The gender of the user (categorized as Male, Female, or Non-Binary).

device_type: The type of device used by the user when viewing the ad (Mobile, Desktop, Tablet).

ad_position: The position of the ad on the webpage (Top, Side, Bottom).

browsing_history: The user's browsing activity prior to seeing the ad (Shopping, News, Entertainment, Education, Social Media).

time_of_day: The time when the user viewed the ad (Morning, Afternoon, Evening, Night).

click: The target label indicating whether the user clicked on the ad (1 for a click, 0 for no click).

Goal

The objective of this dataset is to predict whether a user will click on an online ad based on their demographics, browsing behavior, the context of the ad's display, and the time of day. You will need to clean the data, understand it and then apply machine learning models to predict and evaluate data. It is a really challenging request for this kind of data. This data can be used to improve ad targeting strategies, optimize ad placement, and better understand user interaction with online advertisements.
Iris Species
kaggle.com
zip
Updated Sep 27, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
Explore at:
zip(3687 bytes)Available download formats
Dataset updated
Sep 27, 2016
Dataset authored and provided by
UCI Machine Learning
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

Id

SepalLengthCm

SepalWidthCm

PetalLengthCm

PetalWidthCm

Species
Mental Health Dataset
kaggle.com
Updated Mar 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bhavik Jikadara (2024). Mental Health Dataset [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/mental-health-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 18, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bhavik Jikadara
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.

Benefits of using this dataset:

Insight into Mental Health: The dataset provides valuable insights into mental health by analyzing linguistic patterns, sentiment, and psychological indicators in text data. Researchers and data scientists can gain a better understanding of how mental health issues manifest in online communication.

Predictive Modeling: With a wide range of features, including sentiment analysis scores and psychological indicators, the dataset offers opportunities for developing predictive models to identify or predict mental health outcomes based on textual data. This can be useful for early intervention and support.

Community Engagement: Mental health is a topic of increasing importance, and this dataset can foster community engagement on platforms like Kaggle. Data enthusiasts, researchers, and mental health professionals can collaborate to analyze the data and develop solutions to address mental health challenges.

Data-driven Insights: By analyzing the dataset, users can uncover correlations and patterns between linguistic features, sentiment, and mental health indicators. These insights can inform interventions, policies, and support systems aimed at promoting mental well-being.

Educational Resource: The dataset can serve as a valuable educational resource for teaching and learning about mental health analytics, sentiment analysis, and text mining techniques. It provides a real-world dataset for students and practitioners to apply data science skills in a meaningful context.
Gemma-Data Science Agent- Instruct- Dataset
kaggle.com
Updated Apr 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 2, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
ian cecil akoto
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

Dataset Details Columns: Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums. Answer: The corresponding answer to the generated question. Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers. Subtitle: Subtitle or additional information related to the Kaggle competition or topic. Title: Title of the Kaggle competition or topic. Sources and Inspiration

Sources:

Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more. Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers. Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset. Inspiration:

The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users. Dataset Specifics Total Records: [Specify the total number of question-answer pairs in the dataset] Format: CSV (Comma Separated Values) Size: [Specify the size of the dataset in MB or GB] License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0] Download Link: [Provide a link to download the dataset] Acknowledgments We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
10K rewritten texts dataset/LLM Prompt Recovery
kaggle.com
zip
Updated Apr 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aisha AL Mahmoud (2024). 10K rewritten texts dataset/LLM Prompt Recovery [Dataset]. https://www.kaggle.com/datasets/aishaalmahmoud/10k-rewritten-texts-datasetllm-prompt-recovery
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 8, 2024
Authors
Aisha AL Mahmoud
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
About 10000 rewritten texts using Gemma 7b-it, the original texts from column "Support" in file train.csv from dataset SciQ (Scientific Question Answering)

if you find it useful, upvote it

Stroke Risk Prediction Dataset based on Literature

kaggle.com

Updated Mar 1, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Mahatir Ahmed Tusher (2025). Stroke Risk Prediction Dataset based on Literature [Dataset]. http://doi.org/10.34740/kaggle/dsv/10892812

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.34740/kaggle/dsv/10892812

Dataset updated

Mar 1, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Mahatir Ahmed Tusher

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Stroke Risk Prediction Dataset (Version 2)

Medically Validated, Age-Accurate, and Balanced
Samples: 35,000 | Features: 16 | Targets: 2 (Binary + Regression)

📌 Overview

This dataset is designed for predicting stroke risk using symptoms, demographics, and medical literature-inspired risk modeling. Version 2 significantly improves upon Version 1 by incorporating age-dependent symptom probabilities, gender-specific risk modifiers, and medically validated feature engineering.

Key Enhancements in Version 2:

Age-Accurate Risk Modeling:
- Stroke risk now follows a sigmoidal curve (sharp increase after age 50), reflecting real-world epidemiological trends.
- Symptom probabilities (e.g., hypertension, chest pain) scale with age (see Medical Validity).
Gender-Specific Risk:
- Males under 60 have 1.5× higher risk, while females over 60 have 1.8× higher risk (post-menopausal hormonal changes).
Balanced and Expanded Data:
- 35,000 samples (vs. 10,000 in Version 1) to improve model generalizability and capture rare symptom combinations.
- 50% at-risk (stroke risk ≥50%) and 50% not-at-risk (stroke risk <50%).

📊 Dataset Statistics

Column	Type	Description
`age`	Integer	Age (18–90)
`gender`	String	Male/Female
`chest_pain`	Binary	1 = Present, 0 = Absent
`shortness_of_breath`	Binary	1 = Present, 0 = Absent
`irregular_heartbeat`	Binary	1 = Present, 0 = Absent
`fatigue_weakness`	Binary	1 = Present, 0 = Absent
`dizziness`	Binary	1 = Present, 0 = Absent
`swelling_edema`	Binary	1 = Present, 0 = Absent
`neck_jaw_pain`	Binary	1 = Present, 0 = Absent
`excessive_sweating`	Binary	1 = Present, 0 = Absent
`persistent_cough`	Binary	1 = Present, 0 = Absent
`nausea_vomiting`	Binary	1 = Present, 0 = Absent
`high_blood_pressure`	Binary	1 = Present, 0 = Absent
`chest_discomfort`	Binary	1 = Present, 0 = Absent
`cold_hands_feet`	Binary	1 = Present, 0 = Absent
`snoring_sleep_apnea`	Binary	1 = Present, 0 = Absent
`anxiety_doom`	Binary	1 = Present, 0 = Absent
`at_risk`	Binary	Target for classification (1 = At Risk, 0 = Not At Risk)
`stroke_risk_percentage`	Float	Target for regression (0–100%)

Age distribution in Version 2 vs. Version 1
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21100322%2F6317df05bc7526268853e24a5ce831ba%2FAge%20Distribution%20Plot.png?generation=1740875866152537&alt=media" alt="">

🔬 Medical Validity

This dataset is grounded in peer-reviewed medical literature, with symptom probabilities, risk weights, and demographic relationships directly derived from clinical guidelines and epidemiological studies. Below is a detailed breakdown of how medical knowledge was translated into dataset parameters:

1. Age-Dependent Symptom Probabilities

The prevalence of symptoms increases with age, reflecting real-world clinical observations. Probabilities are calibrated using population-level data from medical literature:

Hypertension (High Blood Pressure)

Probability by Age: 10% (18–30), 25% (31–50), 45% (51–70), 60% (71–90).
Source: WHO Global Report on Stroke (2023) identifies hypertension as the leading modifiable stroke risk factor, with prevalence rising from ~12% in adults <30 to ~65% in adults >70.
Clinical Basis: Arterial stiffness and cumulative vascular damage over time explain the age-dependent increase (Chapter 4, Harrison’s Principles of Internal Medicine).

Chest Pain

Probability by Age: 5% (18–30), 15% (31–50), 25% (51–70), 35% (71–90).
Source: The Stroke Book (Cambridge Medicine) notes that chest pain is rare in young adults but becomes prevalent in older populations due to atherosclerosis and coronary artery disease.
Clinical Basis: Atherosclerotic plaque buildup accelerates after age ...

Machine Learning Job Postings in the US

kaggle.com
opendatabay.com

Updated Apr 20, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ivan Kumeyko (2025). Machine Learning Job Postings in the US [Dataset]. https://www.kaggle.com/datasets/ivankmk/thousand-ml-jobs-in-usa

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Apr 20, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Ivan Kumeyko

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

United States

Description

This dataset contains 1,000 job postings for Machine Learning-related roles across the United States, scraped between late 2024 and early 2025. The data was collected directly from company career pages and job boards, focusing on full job descriptions and associated company information.

Column Descriptions

Column	Description
`job_posted_date`	The date the job was posted (format: YYYY-MM-DD).
`company_address_locality`	The city or locality of the job or company.
`company_address_region`	The U.S. state or region where the job is located.
`company_name`	The name of the company posting the job.
`company_website`	The official website of the company.
`company_description`	A short description or mission statement of the company.
`job_description_text`	The full job description text as listed in the original posting.
`seniority_level`	The required seniority level (e.g., Internship, Entry level, Mid-Senior).
`job_title`	The full job title listed in the posting.

LLM - Detect AI Generated Text Dataset
kaggle.com
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
sunil thite
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
In this Dataset contains both AI Generated Essay and Human Written Essay for Training Purpose This dataset challenge is to to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

Dataset contains more than 28,000 essay written by student and AI generated.

Features : 1. text : Which contains essay text 2. generated : This is target label . 0 - Human Written Essay , 1 - AI Generated Essay
SMILES DataSet for Analysis & Prediction Dataset
kaggle.com
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 15, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yan Maksi
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F9770082%2Fb234dd748f233e4d3ef1d72d048828b5%2FMastering%20Drug%20Design.jpg?generation=1686502308761641&alt=media" alt="">

Read this article to get unlock the wonderful world Deep Reinforcement Learning for Drug Design

ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.

The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.
Heart Disease Prediction Dataset
kaggle.com
Updated Jul 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Krish Ujeniya (2024). Heart Disease Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/krishujeniya/heart-diseae
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 28, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Krish Ujeniya
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

This dataset contains medical data used for predicting heart disease. The data includes various attributes such as age, sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), and ST depression induced by exercise relative to rest (oldpeak).

Columns

age: Age of the patient (in years) sex: Sex of the patient (1 = male, 0 = female) cp: Chest pain type (1-4) trestbps: Resting blood pressure (in mm Hg on admission to the hospital) chol: Serum cholesterol in mg/dl fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) restecg: Resting electrocardiographic results (0-2) thalach: Maximum heart rate achieved exang: Exercise-induced angina (1 = yes; 0 = no) oldpeak: ST depression induced by exercise relative to rest
Book-Crossing Dataset
kaggle.com
zip
Updated Sep 7, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
somnambWl (2019). Book-Crossing Dataset [Dataset]. https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset
Explore at:
zip(17632108 bytes)Available download formats
Dataset updated
Sep 7, 2019
Authors
somnambWl
Description
Book-Crossing dataset mined by Cai-Nicolas Ziegler

Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):

PDF

Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.

Further information and the original dataset can be found at the original webpage.

Changes to the dataset:

Location removed as it comes in different formats not in default (city, state, country).

Transferred from ISO-8859-1 to UTF-8

Manually fixed a few rows with incorrect number of columns

Note:

out of 278859 users:

only 99053 rated at least 1 book

only 43385 rated at least 2 books.

only 12306 rated at least 10 books.

out of 271379 books:

only 270171 are rated at least once.

only 124513 have at least 2 ratings.

only 17480 have at least 10 ratings.

Facebook

Twitter

Click to copy link

Link copied

Cite

Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data

PASTA Data

Data used in paper "Preference Adaptive and Sequential Text-to-Image Generation"

Explore at:

398 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Dec 10, 2024

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Google Research

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

This dataset contains human rater trajectories used in paper: "Preference Adaptive and Sequential Text-to-Image Generation".

We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" 'instead of "a temple"). At each turn, raters then select the column of images preferred most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

See our paper for a comprehensive description of the rater study.

Citation

Please cite our paper if you use it in your work.

Clear search

Close search

Google apps

Main menu

PASTA Data

Citation

Open Images

Context

Content

Querying BigQuery Tables

Acknowledgements

Inspiration

Data from: Spam email Dataset

Customer Shopping Trends Dataset

Context

Content

Dataset Glossary (Column-wise)

Structure of the Dataset

Acknowledgement

ICR-integer-data

Datasets for federated learning

🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️

Car Price Prediction Challenge

Assignment

Dataset Description

Attributes

Data from: AI & Computer Vision Dataset

📣 Ad Click Prediction Dataset

Iris Species

Mental Health Dataset

Benefits of using this dataset:

Gemma-Data Science Agent- Instruct- Dataset

10K rewritten texts dataset/LLM Prompt Recovery

Stroke Risk Prediction Dataset based on Literature

Stroke Risk Prediction Dataset (Version 2)

📌 Overview

Key Enhancements in Version 2:

📊 Dataset Statistics

🔬 Medical Validity

1. Age-Dependent Symptom Probabilities

Hypertension (High Blood Pressure)

Chest Pain

Machine Learning Job Postings in the US

Column Descriptions

LLM - Detect AI Generated Text Dataset

SMILES DataSet for Analysis & Prediction Dataset

Heart Disease Prediction Dataset

Overview

Columns

Book-Crossing Dataset

PASTA Data

Data used in paper "Preference Adaptive and Sequential Text-to-Image Generation"

Citation