100+ datasets found
  1. PASTA Data

    • kaggle.com
    Updated Dec 10, 2024
    Cite
    Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google Research
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains the human rater trajectories used in the paper "Preference Adaptive and Sequential Text-to-Image Generation".

    We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

    At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" instead of "a temple"). At each turn, raters then select the column of images they prefer most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.
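
    The session rules described above (five turns, four columns per turn, prompts and critiques of at most 12 words) can be sketched as a small validity check. This is only an illustration: the `Session` and `Turn` classes and their field names are hypothetical and do not reflect the actual file layout of the dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_WORDS = 12    # word limit for the initial prompt and for critiques (from the description)
NUM_TURNS = 5     # turns per session (from the description)
NUM_COLUMNS = 4   # 16 images arranged in four columns per turn (from the description)


@dataclass
class Turn:
    chosen_column: int              # hypothetical field: 0-based index of the preferred column
    critique: Optional[str] = None  # hypothetical field: optional free-text critique


@dataclass
class Session:
    initial_prompt: str
    turns: List[Turn] = field(default_factory=list)


def validate(session: Session) -> List[str]:
    """Return a list of rule violations; an empty list means the session is consistent."""
    problems = []
    if len(session.initial_prompt.split()) > MAX_WORDS:
        problems.append("initial prompt exceeds 12 words")
    if len(session.turns) != NUM_TURNS:
        problems.append("session does not have exactly five turns")
    for index, turn in enumerate(session.turns):
        if not 0 <= turn.chosen_column < NUM_COLUMNS:
            problems.append(f"turn {index}: column index out of range")
        if turn.critique is not None and len(turn.critique.split()) > MAX_WORDS:
            problems.append(f"turn {index}: critique exceeds 12 words")
    return problems
```

    A session that follows the protocol yields an empty list from `validate`; anything else is reported as a readable message per violated rule.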

    See our paper for a comprehensive description of the rater study.

    Citation

    Please cite our paper if you use it in your work.

  2. Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Google (http://google.com/)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: 1) annotations_bbox 2) dict 3) images 4) labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from GitHub: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under the CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?

  3. Data from: Spam email Dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    _w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
    Explore at:
    Croissant
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    _w1998
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Dataset Name: Spam Email Dataset

    Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

    Columns:

    text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

    spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.

    Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
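
    As a first look at data with this layout, the label balance can be checked before any model is trained. The sketch below assumes the column names given in the description (`text`, `spam_or_not`); the actual file header may differ, and the three sample rows are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Invented sample rows in the layout described above; not taken from the dataset.
SAMPLE_CSV = '''text,spam_or_not
"Subject: win a free prize now",1
"Subject: meeting notes attached",0
"Subject: cheap loans, act today",1
'''


def label_counts(csv_file):
    """Count how many rows carry each binary label (1 = spam, 0 = not spam)."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["spam_or_not"]) for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
spam_share = counts[1] / sum(counts.values())
print(f"spam: {counts[1]}, not spam: {counts[0]}, spam share: {spam_share:.2f}")
```

    To run this against a downloaded file, pass an open file handle instead of the `io.StringIO` sample.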

  4. Customer Shopping Trends Dataset

    • kaggle.com
    Updated Oct 5, 2023
    + more versions
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Explore at:
    Croissant
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)

    Structure of the Dataset

    https://i.imgur.com/6UEqejq.png

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  5. ICR-integer-data

    • kaggle.com
    Updated May 27, 2023
    Cite
    raddar (2023). ICR-integer-data [Dataset]. https://www.kaggle.com/datasets/raddar/icr-integer-data
    Explore at:
    Croissant
    Dataset updated
    May 27, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    raddar
    Description

    This dataset contains the https://www.kaggle.com/competitions/icr-identify-age-related-conditions competition data converted to integers. A common denominator was found for each column, and the distribution of even and odd values was examined to identify whether some values should be fractions.

    Columns 'FL' and 'GL' were left untouched; they are probably floats by nature.

    Please refer to notebook for exact transformations: https://www.kaggle.com/code/raddar/convert-icr-data-to-integers
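
    The idea of finding a common denominator per column can be illustrated with a short sketch. This is not the notebook's code: it is one plausible approach, assuming every value in a column is a multiple of some rational unit, and the helper names `common_unit` and `integerize` are hypothetical.

```python
from fractions import Fraction
from math import gcd


def common_unit(values, max_denominator=10**6):
    """Largest rational step such that every value is an integer multiple of it."""
    unit = Fraction(0)
    for value in values:
        # Approximate the float by a nearby fraction with a bounded denominator.
        frac = Fraction(value).limit_denominator(max_denominator)
        # gcd of two fractions a/b and c/d is gcd(a*d, c*b) / (b*d).
        unit = Fraction(
            gcd(unit.numerator * frac.denominator, frac.numerator * unit.denominator),
            unit.denominator * frac.denominator,
        )
    return unit


def integerize(values, max_denominator=10**6):
    """Divide every value by the column's common unit to obtain integers."""
    unit = common_unit(values, max_denominator)
    if unit == 0:
        return [0 for _ in values]
    return [
        int(Fraction(value).limit_denominator(max_denominator) / unit)
        for value in values
    ]
```

    For a column such as `[0.5, 1.5, 2.0]` the common unit is 1/2, so the integerized column is `[1, 3, 4]`. Columns that are floats by nature will not have a meaningful common unit, which matches the note about 'FL' and 'GL' being left as they are.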

  6. Datasets for federated learning

    • kaggle.com
    Updated Dec 29, 2022
    Cite
    wonghoitin (2022). Datasets for federated learning [Dataset]. https://www.kaggle.com/datasets/wonghoitin/datasets-for-federated-learning
    Explore at:
    Croissant
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    wonghoitin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Federated learning builds machine learning models from data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al. 2019).
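
    One common way to combine models trained on separate devices is federated averaging (FedAvg), where only model parameters leave each device and the server averages them weighted by local sample counts. The sketch below illustrates that averaging step only; it is not tied to any of the datasets listed here, and the function name is hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average client parameter vectors, weighted by each client's sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(weights[j] * size for weights, size in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


# Two clients with 1 and 3 local samples; raw data never leaves the clients.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```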

    source:

    1. smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain

    2. heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain

    3. water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain

    4. customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain

    5. insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain

    6. credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain

    7. income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain

    8. machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain

    9. skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)

    10. score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain

  7. 🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️

    • kaggle.com
    Updated Mar 30, 2025
    Cite
    Sandeep SD (2025). 🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️ [Dataset]. https://www.kaggle.com/datasets/sandeep1080/bassburst
    Explore at:
    Croissant
    Dataset updated
    Mar 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sandeep SD
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🌟 Enjoying the Dataset? 🌟

    If this dataset helped you uncover new insights or made your day a little brighter, thanks a ton for checking it out! Let’s keep those insights rolling! 🔥📈

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23961675%2Ff3761bd2d7ee460ad464de8f25634f63%2Fsteve-johnson-z6LlNgsDeug-unsplash.jpg?generation=1740481184467263&alt=media

    Dataset Description:

    This dataset contains website conversion data for Bluetooth speaker sales. The dataset tracks user sessions on different landing page variants, with the primary goal of analyzing conversion rates, user behavior, and other factors influencing sales. It includes detailed user engagement metrics such as time spent, pages visited, device type, sign-in methods, and geographical information.

    Use Case:

    This dataset can be used for various analytical tasks including:

    • A/B testing and multivariate analysis to compare landing page designs.
    • User segmentation by demographics (age, gender, location, etc.).
    • Conversion rate optimization (CRO) analysis.
    • Predictive modeling for conversion likelihood based on session characteristics.
    • Revenue and payment analysis.
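
    The A/B comparison mentioned in the use cases above starts with a conversion rate per landing-page variant. The sketch below assumes each session record has a variant label and a 0/1 conversion flag; the field names `variant` and `converted` and the sample sessions are hypothetical, not taken from the dataset.

```python
from collections import defaultdict

# Invented session records for illustration only.
sessions = [
    {"variant": "A", "converted": 1},
    {"variant": "A", "converted": 0},
    {"variant": "B", "converted": 1},
    {"variant": "B", "converted": 1},
]


def conversion_rates(records):
    """Return the share of converting sessions for each landing-page variant."""
    totals = defaultdict(int)
    conversions = defaultdict(int)
    for record in records:
        totals[record["variant"]] += 1
        conversions[record["variant"]] += record["converted"]
    return {variant: conversions[variant] / totals[variant] for variant in totals}


rates = conversion_rates(sessions)
print(rates)
```

    With real data, the per-variant rates would normally be followed by a significance test before drawing conclusions about which design performs better.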

  8. Car Price Prediction Challenge

    • kaggle.com
    Updated Jul 6, 2022
    Cite
    Deep Contractor (2022). Car Price Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge
    Explore at:
    Croissant
    Dataset updated
    Jul 6, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deep Contractor
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Assignment

    Your notebooks must contain the following steps:

    • Perform data cleaning and pre-processing.
      • What steps did you use in this process, and how did you clean your data?
    • Perform exploratory data analysis on the given dataset.
      • Explain each graph that you make.
    • Train an ML model and evaluate it using different metrics.
      • Why did you choose that particular model? What was the accuracy?
    • Hyperparameter optimization and feature selection are a plus.
    • Model deployment and use of MLflow are a plus.
    • Perform model interpretation and show feature importance for your model.
      • Provide some explanation for the above point.
    • Future steps.

    Note: try to make your notebooks as presentable as possible.

    Dataset Description

    CSV file - 19237 rows x 18 columns (includes the Price column as the target)

    Attributes

    • ID
    • Price: price of the car (target column)
    • Levy
    • Manufacturer
    • Model
    • Prod. year
    • Category
    • Leather interior
    • Fuel type
    • Engine volume
    • Mileage
    • Cylinders
    • Gear box type
    • Drive wheels
    • Doors
    • Wheel
    • Color
    • Airbags

    Confused or have any doubts in the data column values? Check the dataset discussion tab!

  9. Data from: AI & Computer Vision Dataset

    • kaggle.com
    Updated Apr 1, 2025
    Cite
    Khushi Yadav (2025). AI & Computer Vision Dataset [Dataset]. https://www.kaggle.com/datasets/khushikyad001/ai-and-computer-vision-dataset
    Explore at:
    Croissant
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Khushi Yadav
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains metadata related to three categories of AI and computer vision applications:

    Handwritten Math Solutions: Metadata on images of handwritten math problems with step-by-step solutions.

    Multi-lingual Street Signs: Road sign images in various languages, with translations.

    Security Camera Anomalies: Surveillance footage metadata distinguishing between normal and suspicious activities.

    The dataset is useful for machine learning, image recognition, OCR (Optical Character Recognition), anomaly detection, and AI model training.

  10. 📣 Ad Click Prediction Dataset

    • kaggle.com
    Updated Sep 7, 2024
    Cite
    Ciobanu Marius (2024). 📣 Ad Click Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/marius2303/ad-click-prediction-dataset
    Explore at:
    Croissant
    Dataset updated
    Sep 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ciobanu Marius
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    About

    This dataset provides insights into user behavior and online advertising, specifically focusing on predicting whether a user will click on an online advertisement. It contains user demographic information, browsing habits, and details related to the display of the advertisement. This dataset is ideal for building binary classification models to predict user interactions with online ads.

    Features

    • id: Unique identifier for each user.
    • full_name: User's name formatted as "UserX" for anonymity.
    • age: Age of the user (ranging from 18 to 64 years).
    • gender: The gender of the user (categorized as Male, Female, or Non-Binary).
    • device_type: The type of device used by the user when viewing the ad (Mobile, Desktop, Tablet).
    • ad_position: The position of the ad on the webpage (Top, Side, Bottom).
    • browsing_history: The user's browsing activity prior to seeing the ad (Shopping, News, Entertainment, Education, Social Media).
    • time_of_day: The time when the user viewed the ad (Morning, Afternoon, Evening, Night).
    • click: The target label indicating whether the user clicked on the ad (1 for a click, 0 for no click).

    Goal

    The objective of this dataset is to predict whether a user will click on an online ad based on their demographics, browsing behavior, the context of the ad's display, and the time of day. You will need to clean the data, understand it, and then apply machine learning models to make and evaluate predictions, which is a challenging task for this kind of data. This data can be used to improve ad targeting strategies, optimize ad placement, and better understand user interaction with online advertisements.
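
    Before a binary classifier can be trained on the features listed above, the categorical columns need a numeric encoding. The sketch below shows one possible one-hot encoding using the category values given in the feature list; the `encode` helper is hypothetical, and encoding a missing or unknown value as all zeros is an assumption, not something the dataset prescribes.

```python
# Category values as listed in the feature descriptions above.
CATEGORIES = {
    "gender": ["Male", "Female", "Non-Binary"],
    "device_type": ["Mobile", "Desktop", "Tablet"],
    "ad_position": ["Top", "Side", "Bottom"],
    "browsing_history": ["Shopping", "News", "Entertainment", "Education", "Social Media"],
    "time_of_day": ["Morning", "Afternoon", "Evening", "Night"],
}


def encode(row):
    """Turn one row (a dict) into a numeric feature vector: age plus one-hot groups."""
    features = [float(row["age"])]
    for column, values in CATEGORIES.items():
        # A missing or unrecognized value leaves the whole group at zero (assumption).
        features.extend(1.0 if row.get(column) == value else 0.0 for value in values)
    return features


example = {
    "age": 30,
    "gender": "Male",
    "device_type": "Mobile",
    "ad_position": "Top",
    "browsing_history": "News",
    "time_of_day": "Night",
}
vector = encode(example)
print(len(vector), vector)
```

    The `id` and `full_name` columns are left out on purpose, since identifiers carry no predictive signal; the `click` column is the label and would be kept separately.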

  11. Iris Species

    • kaggle.com
    zip
    Updated Sep 27, 2016
    + more versions
    Cite
    UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
    Explore at:
    Available download formats: zip (3687 bytes)
    Dataset updated
    Sep 27, 2016
    Dataset authored and provided by
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

    It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

    The columns in this dataset are:

    • Id
    • SepalLengthCm
    • SepalWidthCm
    • PetalLengthCm
    • PetalWidthCm
    • Species

    Sepal Width vs. Sepal Length

  12. Mental Health Dataset

    • kaggle.com
    Updated Mar 18, 2024
    Cite
    Bhavik Jikadara (2024). Mental Health Dataset [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/mental-health-dataset
    Explore at:
    Croissant
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhavik Jikadara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.

    Benefits of using this dataset:

    • Insight into Mental Health: The dataset provides valuable insights into mental health by analyzing linguistic patterns, sentiment, and psychological indicators in text data. Researchers and data scientists can gain a better understanding of how mental health issues manifest in online communication.
    • Predictive Modeling: With a wide range of features, including sentiment analysis scores and psychological indicators, the dataset offers opportunities for developing predictive models to identify or predict mental health outcomes based on textual data. This can be useful for early intervention and support.
    • Community Engagement: Mental health is a topic of increasing importance, and this dataset can foster community engagement on platforms like Kaggle. Data enthusiasts, researchers, and mental health professionals can collaborate to analyze the data and develop solutions to address mental health challenges.
    • Data-driven Insights: By analyzing the dataset, users can uncover correlations and patterns between linguistic features, sentiment, and mental health indicators. These insights can inform interventions, policies, and support systems aimed at promoting mental well-being.
    • Educational Resource: The dataset can serve as a valuable educational resource for teaching and learning about mental health analytics, sentiment analysis, and text mining techniques. It provides a real-world dataset for students and practitioners to apply data science skills in a meaningful context.

  13. Gemma-Data Science Agent- Instruct- Dataset

    • kaggle.com
    Updated Apr 2, 2024
    Cite
    ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset
    Explore at:
    Croissant
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ian cecil akoto
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

    Dataset Details

    Columns:

    • Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
    • Answer: The corresponding answer to the generated question.
    • Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
    • Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
    • Title: Title of the Kaggle competition or topic.

    Sources and Inspiration

    Sources:

    • Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
    • Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
    • Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

    Inspiration:

    The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

    Dataset Specifics

    • Total Records: [Specify the total number of question-answer pairs in the dataset]
    • Format: CSV (Comma Separated Values)
    • Size: [Specify the size of the dataset in MB or GB]
    • License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
    • Download Link: [Provide a link to download the dataset]

    Acknowledgments

    We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.

  14. 10K rewritten texts dataset/LLM Prompt Recovery

    • kaggle.com
    zip
    Updated Apr 8, 2024
    Cite
    Aisha AL Mahmoud (2024). 10K rewritten texts dataset/LLM Prompt Recovery [Dataset]. https://www.kaggle.com/datasets/aishaalmahmoud/10k-rewritten-texts-datasetllm-prompt-recovery
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 8, 2024
    Authors
    Aisha AL Mahmoud
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    About 10,000 texts rewritten using Gemma 7b-it. The original texts come from the "Support" column of the file train.csv in the SciQ (Scientific Question Answering) dataset.

    If you find it useful, upvote it.

  15. Stroke Risk Prediction Dataset based on Literature

    • kaggle.com
    Updated Mar 1, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Stroke Risk Prediction Dataset based on Literature [Dataset]. http://doi.org/10.34740/kaggle/dsv/10892812
    Explore at:
    Croissant
    Dataset updated
    Mar 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mahatir Ahmed Tusher
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Stroke Risk Prediction Dataset (Version 2)

    Medically Validated, Age-Accurate, and Balanced
    Samples: 35,000 | Features: 16 | Targets: 2 (Binary + Regression)

    📌 Overview

    This dataset is designed for predicting stroke risk using symptoms, demographics, and medical literature-inspired risk modeling. Version 2 significantly improves upon Version 1 by incorporating age-dependent symptom probabilities, gender-specific risk modifiers, and medically validated feature engineering.

    Key Enhancements in Version 2:

    1. Age-Accurate Risk Modeling:

      • Stroke risk now follows a sigmoidal curve (sharp increase after age 50), reflecting real-world epidemiological trends.
      • Symptom probabilities (e.g., hypertension, chest pain) scale with age (see Medical Validity).
    2. Gender-Specific Risk:

      • Males under 60 have 1.5× higher risk, while females over 60 have 1.8× higher risk (post-menopausal hormonal changes).
    3. Balanced and Expanded Data:

      • 35,000 samples (vs. 10,000 in Version 1) to improve model generalizability and capture rare symptom combinations.
      • 50% at-risk (stroke risk ≥50%) and 50% not-at-risk (stroke risk <50%).
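
    The enhancements listed above can be sketched in a few lines. Only the 50% at-risk threshold and the 1.5× and 1.8× gender modifiers come from the description; the logistic midpoint and steepness are illustrative assumptions, since the exact curve parameters are not stated, and all function names are hypothetical.

```python
import math

MIDPOINT_AGE = 50.0  # assumption: centre of the "sharp increase after age 50"
STEEPNESS = 0.15     # assumption: controls how sharp the increase is


def age_risk_factor(age: float) -> float:
    """Sigmoidal (logistic) age factor between 0 and 1, rising around MIDPOINT_AGE."""
    return 1.0 / (1.0 + math.exp(-STEEPNESS * (age - MIDPOINT_AGE)))


def gender_modifier(gender: str, age: float) -> float:
    """Risk multiplier as read from the description: males under 60 and females over 60."""
    if gender == "Male" and age < 60:
        return 1.5
    if gender == "Female" and age > 60:
        return 1.8
    return 1.0


def at_risk_label(stroke_risk_percentage: float) -> int:
    """Binary target: 1 when stroke risk is at least 50%, otherwise 0."""
    return 1 if stroke_risk_percentage >= 50 else 0
```

    How the age factor, the gender modifier, and the symptom columns are combined into `stroke_risk_percentage` is not specified in the description, so that step is deliberately left out.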

    📊 Dataset Statistics

    Column | Type | Description
    age | Integer | Age (18–90)
    gender | String | Male/Female
    chest_pain | Binary | 1 = Present, 0 = Absent
    shortness_of_breath | Binary | 1 = Present, 0 = Absent
    irregular_heartbeat | Binary | 1 = Present, 0 = Absent
    fatigue_weakness | Binary | 1 = Present, 0 = Absent
    dizziness | Binary | 1 = Present, 0 = Absent
    swelling_edema | Binary | 1 = Present, 0 = Absent
    neck_jaw_pain | Binary | 1 = Present, 0 = Absent
    excessive_sweating | Binary | 1 = Present, 0 = Absent
    persistent_cough | Binary | 1 = Present, 0 = Absent
    nausea_vomiting | Binary | 1 = Present, 0 = Absent
    high_blood_pressure | Binary | 1 = Present, 0 = Absent
    chest_discomfort | Binary | 1 = Present, 0 = Absent
    cold_hands_feet | Binary | 1 = Present, 0 = Absent
    snoring_sleep_apnea | Binary | 1 = Present, 0 = Absent
    anxiety_doom | Binary | 1 = Present, 0 = Absent
    at_risk | Binary | Target for classification (1 = At Risk, 0 = Not At Risk)
    stroke_risk_percentage | Float | Target for regression (0–100%)

    Age distribution in Version 2 vs. Version 1
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21100322%2F6317df05bc7526268853e24a5ce831ba%2FAge%20Distribution%20Plot.png?generation=1740875866152537&alt=media

    🔬 Medical Validity

    This dataset is grounded in peer-reviewed medical literature, with symptom probabilities, risk weights, and demographic relationships directly derived from clinical guidelines and epidemiological studies. Below is a detailed breakdown of how medical knowledge was translated into dataset parameters:

    1. Age-Dependent Symptom Probabilities

    The prevalence of symptoms increases with age, reflecting real-world clinical observations. Probabilities are calibrated using population-level data from medical literature:

    Hypertension (High Blood Pressure)

    • Probability by Age: 10% (18–30), 25% (31–50), 45% (51–70), 60% (71–90).
    • Source: WHO Global Report on Stroke (2023) identifies hypertension as the leading modifiable stroke risk factor, with prevalence rising from ~12% in adults <30 to ~65% in adults >70.
    • Clinical Basis: Arterial stiffness and cumulative vascular damage over time explain the age-dependent increase (Chapter 4, Harrison’s Principles of Internal Medicine).

    Chest Pain

    • Probability by Age: 5% (18–30), 15% (31–50), 25% (51–70), 35% (71–90).
    • Source: The Stroke Book (Cambridge Medicine) notes that chest pain is rare in young adults but becomes prevalent in older populations due to atherosclerosis and coronary artery disease.
    • Clinical Basis: Atherosclerotic plaque buildup accelerates after age ...
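The age-banded probabilities quoted above for hypertension and chest pain can be expressed as a small lookup. This is a sketch under assumptions: the function names and the Bernoulli sampling step are illustrative and not taken from the dataset's actual generator. Only the band edges and percentages come from the description.

```python
import random

AGE_BANDS = [(18, 30), (31, 50), (51, 70), (71, 90)]

# Probabilities per age band, as quoted in the description.
SYMPTOM_PROBABILITY = {
    "high_blood_pressure": [0.10, 0.25, 0.45, 0.60],
    "chest_pain": [0.05, 0.15, 0.25, 0.35],
}

def symptom_probability(symptom, age):
    """Look up the probability of a symptom for an integer age in 18-90."""
    for (low, high), p in zip(AGE_BANDS, SYMPTOM_PROBABILITY[symptom]):
        if low <= age <= high:
            return p
    raise ValueError("age outside the documented 18-90 range")

def sample_symptom(symptom, age, rng):
    """Draw a 0/1 symptom flag (assumed sampling scheme: one Bernoulli trial)."""
    return 1 if rng.random() < symptom_probability(symptom, age) else 0
```

Passing a seeded `random.Random` instance as `rng` keeps the draws reproducible.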
  16. Machine Learning Job Postings in the US

    • kaggle.com
    • opendatabay.com
    Updated Apr 20, 2025
    Cite
    Ivan Kumeyko (2025). Machine Learning Job Postings in the US [Dataset]. https://www.kaggle.com/datasets/ivankmk/thousand-ml-jobs-in-usa
    Dataset updated
    Apr 20, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ivan Kumeyko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This dataset contains 1,000 job postings for Machine Learning-related roles across the United States, scraped between late 2024 and early 2025. The data was collected directly from company career pages and job boards, focusing on full job descriptions and associated company information.

    Column Descriptions

    Column                   | Description
    job_posted_date          | The date the job was posted (format: YYYY-MM-DD).
    company_address_locality | The city or locality of the job or company.
    company_address_region   | The U.S. state or region where the job is located.
    company_name             | The name of the company posting the job.
    company_website          | The official website of the company.
    company_description      | A short description or mission statement of the company.
    job_description_text     | The full job description text as listed in the original posting.
    seniority_level          | The required seniority level (e.g., Internship, Entry level, Mid-Senior).
    job_title                | The full job title listed in the posting.
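Because `job_posted_date` is documented as YYYY-MM-DD, it can be parsed with the standard library. The sketch below counts postings per month. The function name and the list-of-dicts input are assumptions for illustration.

```python
from collections import Counter
from datetime import date

def postings_per_month(rows):
    """Count postings by (year, month) using the documented YYYY-MM-DD date format."""
    counts = Counter()
    for row in rows:
        posted = date.fromisoformat(row["job_posted_date"])
        counts[(posted.year, posted.month)] += 1
    return counts
```

The same pattern works on rows read with `csv.DictReader`, since each row is then a dict keyed by column name.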
  17. LLM - Detect AI Generated Text Dataset

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil thite
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

    The dataset contains more than 28,000 essays, both student-written and AI-generated.

    Features:

    • text: the essay text
    • generated: the target label (0 = human-written essay, 1 = AI-generated essay)

  18. SMILES DataSet for Analysis & Prediction Dataset

    • kaggle.com
    Updated Jun 15, 2023
    Cite
    Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yan Maksi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F9770082%2Fb234dd748f233e4d3ef1d72d048828b5%2FMastering%20Drug%20Design.jpg?generation=1686502308761641&alt=media

    Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design.

    ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

    The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as the negative base-10 logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein; a higher pKd means tighter binding.
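As a worked example of the pKd definition, the conversion from a dissociation constant is a one-liner. The function name is hypothetical, and the sketch assumes Kd is expressed in molar units (mol/L), which is the usual convention but is not stated in the description.

```python
import math

def pkd_from_kd(kd_molar):
    """Convert a dissociation constant (assumed to be in mol/L) to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)
```

Under that assumption, a nanomolar binder (Kd = 1e-9 M) has a pKd of about 9 and a micromolar binder (Kd = 1e-6 M) about 6, so larger values indicate stronger binding.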

    The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.

  19. Heart Disease Prediction Dataset

    • kaggle.com
    Updated Jul 28, 2024
    Cite
    Krish Ujeniya (2024). Heart Disease Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/krishujeniya/heart-diseae
    Dataset updated
    Jul 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Krish Ujeniya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains medical data used for predicting heart disease. The data includes various attributes such as age, sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), and ST depression induced by exercise relative to rest (oldpeak).

    Columns

    • age: Age of the patient (in years)
    • sex: Sex of the patient (1 = male, 0 = female)
    • cp: Chest pain type (1-4)
    • trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
    • chol: Serum cholesterol in mg/dl
    • fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    • restecg: Resting electrocardiographic results (0-2)
    • thalach: Maximum heart rate achieved
    • exang: Exercise-induced angina (1 = yes; 0 = no)
    • oldpeak: ST depression induced by exercise relative to rest

  20. Book-Crossing Dataset

    • kaggle.com
    zip
    Updated Sep 7, 2019
    Cite
    somnambWl (2019). Book-Crossing Dataset [Dataset]. https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset
    Explore at:
    zip (17632108 bytes)
    Available download formats
    Dataset updated
    Sep 7, 2019
    Authors
    somnambWl
    Description

    Book-Crossing dataset mined by Cai-Nicolas Ziegler

    Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):

    • PDF

    • Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.

    Further information and the original dataset can be found at the original webpage.

    Changes to the dataset:

    • Location removed, because it appears in inconsistent formats rather than the default (city, state, country).
    • Converted from ISO-8859-1 to UTF-8.
    • Manually fixed a few rows with an incorrect number of columns.

    Note:

    • out of 278859 users:
      • only 99053 rated at least 1 book
      • only 43385 rated at least 2 books.
      • only 12306 rated at least 10 books.
    • out of 271379 books:
      • only 270171 are rated at least once.
      • only 124513 have at least 2 ratings.
      • only 17480 have at least 10 ratings.
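The counts above suggest filtering out sparse users and books before modelling. A minimal sketch of such a filter follows. The function name, the `(user_id, isbn, rating)` tuple layout, and the default thresholds are assumptions for illustration, and the filter is a single pass, so removing rows can leave some users or books slightly below the threshold afterwards.

```python
from collections import Counter

def filter_min_ratings(ratings, min_user=10, min_book=10):
    """Keep ratings whose user and book each appear at least the given number of times.

    ratings is a list of (user_id, isbn, rating) tuples (assumed layout).
    Counts are taken once on the unfiltered data; the filter is not iterated.
    """
    user_counts = Counter(user for user, _, _ in ratings)
    book_counts = Counter(book for _, book, _ in ratings)
    return [
        r for r in ratings
        if user_counts[r[0]] >= min_user and book_counts[r[1]] >= min_book
    ]
```

Repeating the call until the output stops shrinking would give a stricter filter in which every remaining user and book meets the threshold.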
Cite
Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data

PASTA Data

Data used in paper "Preference Adaptive and Sequential Text-to-Image Generation"

Explore at:
398 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Dec 10, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Google Research
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

This dataset contains human rater trajectories used in paper: "Preference Adaptive and Sequential Text-to-Image Generation".

We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" instead of "a temple"). At each turn, raters then select the column of images they prefer most; they are instructed to select a column based on the quality of the best image in that column with respect to their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

See our paper for a comprehensive description of the rater study.

Citation

Please cite our paper if you use it in your work.
