100+ datasets found

LLM: 7 prompt training dataset
kaggle.com
Updated Nov 15, 2023
Cite
Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 15, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Carl McBride Ellis
License
https://cdla.io/sharing-1-0/
Description
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14,247; LLM texts: 3,004
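
The counts above imply a noticeably imbalanced label distribution. As a rough sketch, the share of LLM texts and one common "balanced" class-weighting scheme can be computed as follows; the weighting formula is an illustrative choice and is not something the dataset description prescribes.

```python
# Counts quoted in the Version 4 description.
human_texts = 14247
llm_texts = 3004

total = human_texts + llm_texts
llm_share = llm_texts / total

# "Balanced" class weights: total / (n_classes * class_count).
# This scheme is an assumption for illustration, not part of the dataset.
n_classes = 2
weights = {
    "human": total / (n_classes * human_texts),
    "llm": total / (n_classes * llm_texts),
}

print(f"total texts: {total}")
print(f"LLM share: {llm_share:.1%}")
print(f"class weights: {weights}")
```

The minority (LLM) class receives the larger weight, which is one way to keep a classifier from simply favouring the human label.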

See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts

Version 3: "The RDizzl3 Seven"
File: train_essays_RDizzl3_seven_v1.csv

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"A Cowboy Who Rode the Waves"

"Driverless cars"

How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

Namely:

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"Seeking multiple opinions"

"Phones and driving"

This dataset is a derivative of the datasets

LLM Generated Essays for the Detect AI Comp! by Radek Osmulski

persuade corpus 2.0 provided by Nicholas Broad

daigt data - llama 70b and falcon180b by Nicholas Broad

Hello, Claude! 1000 essays from Anthropic... by Darragh

as well as the original competition training dataset

Version 1: This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
‘Kaggle Competitions Top 100’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 14, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Kaggle Competitions Top 100’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-competitions-top-100-961d/latest
Dataset updated
Feb 14, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Kaggle Competitions Top 100’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-competitions-top-100 on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Context

This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.

Content

100 rows and 13 columns. Column descriptions are listed below.

User : Name of the user

Tier : Grandmaster, Master or Expert

Company/School : Company/School info of the user if mentioned

Country : Country info of the user if mentioned

Competitions_Num : Number of competitions joined

Competitions_Gold : Number of competitions gold medals won

Competitions_Silver : Number of competitions silver medals won

Competitions_Bronze : Number of competitions bronze medals won

Datasets_Num : Number of public datasets

Notebooks_Num : Number of public notebooks

Discussions_Num : Number of topics/comments posted

Points : Total points

Profile : Link of Kaggle profile
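
As a small illustration of how the medal columns above could be combined, the following sketch sums gold, silver and bronze medals per user. The two rows are made-up sample values, and the exact header spelling in the real CSV is an assumption taken from the column list.

```python
import csv
import io

# Hypothetical two-row sample using column names from the list above; values are invented.
sample_csv = """User,Tier,Competitions_Num,Competitions_Gold,Competitions_Silver,Competitions_Bronze,Points
user_a,Grandmaster,50,10,5,2,100000
user_b,Master,20,1,4,6,40000
"""

MEDAL_COLUMNS = ("Competitions_Gold", "Competitions_Silver", "Competitions_Bronze")


def medal_totals(csv_text):
    """Return {user: total medals}, summed over the three medal columns."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["User"]] = sum(int(row[col]) for col in MEDAL_COLUMNS)
    return totals


totals = medal_totals(sample_csv)
print(totals)
```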

Acknowledgements

Data from Kaggle. Image from Smartcat.

If you're reading this, please upvote.

--- Original source retains full ownership of the source dataset ---
Agentic_AI_Applications_2025
kaggle.com
Updated May 10, 2025
Cite
Hajra Amir (2025). Agentic_AI_Applications_2025 [Dataset]. https://www.kaggle.com/datasets/hajraamir21/agentic-ai-applications-2025
Dataset updated
May 10, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Hajra Amir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset provides a comprehensive overview of various Agentic AI (autonomous AI) applications across multiple industries in 2025. It contains detailed records of how AI is being utilized to automate complex tasks, improve efficiency, and generate measurable outcomes. The dataset is designed to help researchers, data scientists, and businesses understand the current state and potential of Agentic AI in different sectors.

Dataset Features:

Industry: The sector where Agentic AI is applied (e.g., Healthcare, Finance, Manufacturing).

Application Area: The specific task or function performed by the AI agent (e.g., Fraud Detection, Predictive Maintenance).

AI Agent Name: The name of the AI system or agent deployed (e.g., HealthAI Monitor, FinSecure Agent).

Task Description: A brief description of the AI's function or role.

Technology Stack: The technologies powering the AI (e.g., Machine Learning, NLP, Computer Vision).

Outcome Metrics: The measurable impact of the AI deployment (e.g., 30% reduction in ER visits).

Deployment Year: The year the AI system was deployed (ranging from 2023 to 2025).

Geographical Region: The region where the AI application is implemented (e.g., North America, Asia, Europe).
scnu-ai-challenge-dataset-with-pred_support_facts
kaggle.com
Updated Apr 24, 2024
Cite
czy111 (2024). scnu-ai-challenge-dataset-with-pred_support_facts [Dataset]. https://www.kaggle.com/datasets/czy111/scnu-ai-challenge-dataset-with-pred-support-facts/code
Dataset updated
Apr 24, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
czy111
Description
Dataset

This dataset was created by czy111

Contents
‘Kaggle Competitions Ranking’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Kaggle Competitions Ranking’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-competitions-ranking-f15f/7682e95e/?iid=003-169&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Kaggle Competitions Ranking’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-competitions-ranking on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

This dataset contains the Kaggle competitions ranking.

Content

5000 rows and 8 columns. Column descriptions are listed below.

Rank : Rank of the user

Tier : Grandmaster, Master or Expert

Username : Name of the user

Join Date : Year of join

Gold Medals : Number of gold medals

Silver Medals : Number of silver medals

Bronze Medals : Number of bronze medals

Points : Total points

Acknowledgements

Data from Kaggle. Image from Olympics.

If you're reading this, please upvote.

--- Original source retains full ownership of the source dataset ---
‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-top-1000-kaggle-datasets-658b/b992f64b/?iid=004-457&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.

--- Dataset description provided by original source is as follows ---

From wiki

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]

Source: Kaggle

--- Original source retains full ownership of the source dataset ---
LLM - Detect AI Generated Text Dataset
kaggle.com
Updated Nov 8, 2023
Cite
sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
Dataset updated
Nov 8, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
sunil thite
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

The dataset contains more than 28,000 essays, both student-written and AI-generated.

Features:

1. text: contains the essay text

2. generated: the target label (0 = human-written essay, 1 = AI-generated essay)
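
A minimal sketch of how the two features described above might be read and tallied, using only the Python standard library. The three rows are invented stand-ins; the real file name and contents are not given in this listing.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature sample with the two documented columns; rows are invented.
sample_csv = '''text,generated
"Cars have shaped how cities grow.",0
"In conclusion, the electoral college is a multifaceted topic.",1
"I think phones are a distraction when driving.",0
'''


def label_counts(csv_text):
    """Count rows per value of the `generated` target label (0 = human, 1 = AI)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[int(row["generated"])] += 1
    return counts


counts = label_counts(sample_csv)
print(f"human-written: {counts[0]}, AI-generated: {counts[1]}")
```

Checking the label counts first is a cheap way to see how balanced the two classes are before training a detector.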
AI Training Dataset Report
datainsightsmarket.com
doc, pdf, ppt
Updated Apr 30, 2025
Cite
Data Insights Market (2025). AI Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-training-dataset-1501897
Available download formats: pdf, doc, ppt
Dataset updated
Apr 30, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The AI training dataset market is experiencing robust growth, driven by the increasing adoption of artificial intelligence across diverse sectors. The market's expansion is fueled by the urgent need for high-quality data to train sophisticated AI models capable of handling complex tasks. Key application areas, such as autonomous vehicles in the automotive industry, advanced medical diagnosis in healthcare, and personalized experiences in retail and e-commerce, are significantly contributing to this market's upward trajectory. The prevalence of text, image/video, and audio data types further diversifies the market, offering opportunities for specialized dataset providers. While the market faces challenges like data privacy concerns and the high cost of data annotation, the overall trajectory remains positive, with a projected Compound Annual Growth Rate (CAGR) exceeding 20% for the forecast period (2025-2033). This growth is further supported by advancements in deep learning techniques that demand increasingly larger and more diverse datasets for optimal performance. Leading companies like Google, Amazon, and Microsoft are actively investing in this space, expanding their dataset offerings and fostering competition within the market. Furthermore, the emergence of specialized data annotation providers caters to the specific needs of various industries, ensuring accurate and reliable data for AI model development. The geographic distribution of the market reveals strong presence in North America and Europe, driven by early adoption of AI technologies and the presence of major technology players. However, Asia Pacific is projected to witness significant growth in the coming years, propelled by increasing digitalization and a burgeoning AI ecosystem in countries like China and India. Government initiatives promoting AI development in various regions are also expected to stimulate demand for high-quality training datasets. 
While challenges related to data security and ethical considerations remain, the long-term outlook for the AI training dataset market is exceptionally promising, fueled by the continued evolution of artificial intelligence and its increasing integration into various aspects of modern life. The market segmentation by application and data type allows for granular analysis and targeted investments for businesses operating in this rapidly expanding sector.
mlcourse.ai - Dota 2 - winner prediction Dataset
kaggle.com
zip
Updated Sep 8, 2019
Cite
Sushma Biswas (2019). mlcourse.ai - Dota 2 - winner prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sushmabiswas/mlcourseai-dota-2-winner-prediction-dataset
Available download formats: zip (759868828 bytes)
Dataset updated
Sep 8, 2019
Authors
Sushma Biswas
Description
Context

Hello! I am currently taking the mlcourse.ai course, and this dataset was required as part of one of its in-class Kaggle competitions. The data is originally hosted on git, but I like to have my data right here on Kaggle, hence this dataset.

If you find this dataset useful, do upvote. Thank you and happy learning!

Content

This dataset contains 6 files in total.

1. Sample_submission.csv

2. Train_features.csv

3. Test_features.csv

4. Train_matches.jsonl

5. Train_targets.csv

6. Test_matches.jsonl
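
Two of the files listed above use the .jsonl (JSON Lines) format, where each line holds one JSON object. The sketch below shows one way to read such a file; the in-memory sample and its field names (`match_id`, `duration`) are invented for illustration and are not taken from the actual match files.

```python
import io
import json

# Hypothetical stand-in for a .jsonl file: one JSON object per line.
# Field names are assumptions made for this example only.
sample_jsonl = io.StringIO(
    '{"match_id": "a1", "duration": 1830}\n'
    '{"match_id": "b2", "duration": 2412}\n'
)


def read_jsonl(handle):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


matches = list(read_jsonl(sample_jsonl))
print(len(matches), "matches read")
```

For a file on disk, the same function can be passed an open file handle; reading line by line avoids loading a large file into memory at once.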

Acknowledgements

All of the data in this dataset is originally hosted on git and the same can also be found on the in-class competition's 'data' page here.

Inspiration

to be updated.
Community-Driven Model Service Platform Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 4, 2025
Cite
Data Insights Market (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/community-driven-model-service-platform-507803
Available download formats: pdf, doc, ppt
Dataset updated
Jun 4, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The community-driven model service platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and expanding at a compound annual growth rate (CAGR) of 10.1% from 2025 to 2033. This surge is driven by several key factors. The increasing accessibility of machine learning models, fueled by platforms like Kaggle, GitHub, and Hugging Face, is lowering the barrier to entry for developers and researchers. The collaborative nature of these platforms fosters innovation and accelerates model development, leading to a wider adoption of AI solutions across various industries. Furthermore, the growing demand for specialized and customized AI models is pushing businesses to leverage community-driven platforms, where they can find pre-trained models or collaborate on developing tailored solutions, thereby reducing development time and costs. The trend towards open-source models and the rise of model zoos contribute significantly to this market expansion. While challenges exist, such as ensuring model quality, security, and addressing potential biases, the overall market trajectory remains strongly positive. The market's segmentation likely includes various model types (e.g., image recognition, natural language processing, time series analysis), deployment options (cloud-based, on-premise), and target industries (healthcare, finance, retail). Leading players, such as Kaggle, GitHub, Hugging Face, TensorFlow Hub, Model Zoo, DrivenData, and Cortex, are actively shaping the market landscape through continuous innovation and community engagement. The geographical distribution of the market is likely to reflect the global concentration of AI expertise and technological infrastructure, with regions like North America and Europe holding significant market shares initially, followed by rapid expansion in Asia and other developing regions as digital infrastructure improves. 
Future growth will hinge on continued technological advancements, further integration with cloud platforms, and the development of robust governance frameworks to address ethical concerns surrounding AI model development and deployment.
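
The description quotes a 2025 market size and a CAGR but not the resulting 2033 value. As a back-of-the-envelope sketch, compounding the quoted figures gives the implied end-of-period size; the result is plain arithmetic on the two stated numbers, not a figure from the report.

```python
def project_value(start_value, cagr, years):
    """Compound `start_value` at annual rate `cagr` for `years` years."""
    return start_value * (1 + cagr) ** years


start_2025 = 35.14      # USD billions, as quoted in the description
cagr = 0.101            # 10.1% per year, as quoted
years = 2033 - 2025     # forecast period length

projected_2033 = project_value(start_2025, cagr, years)
print(f"Implied 2033 market size: ${projected_2033:.1f} billion")
```

Under these assumptions the market would roughly double over the eight-year forecast period.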
‘HR data, Predict changing jobs (competition form)’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘HR data, Predict changing jobs (competition form)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-hr-data-predict-changing-jobs-competition-form-1d9b/a230c863/?iid=013-955&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘HR data, Predict changing jobs (competition form)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kukuroo3/hr-data-predict-change-jobscompetition-form on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context This dataset was taken from link and separated into competition format. The label for the test data is provided in the form of a function.

--- Original source retains full ownership of the source dataset ---
‘Covid-19 Prevent secondary transmission’ analyzed by Analyst-2
analyst-2.ai
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Covid-19 Prevent secondary transmission’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-prevent-secondary-transmission-f6b3/14be25d0/?iid=001-812&v=presentation
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Covid-19 Prevent secondary transmission’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mpwolke/cusersmarildownloadssecondcsv on 14 February 2022.

--- Dataset description provided by original source is as follows ---

Context

Covid-19 data collected in the CORD 19 Challenge https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv

Content

Studies subject: Secondary Transmission of Covid-19

Acknowledgements

Allen Institute for AI: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv

David Mezzetti: https://www.kaggle.com/davidmezzetti/cord-19-task-csv-exports/data

Inspiration

Covid-19 Pandemic.

--- Original source retains full ownership of the source dataset ---
olympiad-math-contest-llama3-20k
huggingface.co
Updated Jun 1, 2024
Cite
Kevin Amiri (2024). olympiad-math-contest-llama3-20k [Dataset]. https://huggingface.co/datasets/kevin009/olympiad-math-contest-llama3-20k
Dataset updated
Jun 1, 2024
Authors
Kevin Amiri
Description
AMC/AIME Mathematics Problem and Solution Dataset

Dataset Details

Dataset Name: AMC/AIME Mathematics Problem and Solution Dataset

Version: 1.0

Release Date: 2024-06-1

Authors: Kevin Amiri

Intended Use

Primary Use: The dataset is created and intended for research and an AI Mathematical Olympiad Kaggle competition.

Intended Users: Researchers in AI & mathematics or science.

Dataset Composition

Number of Examples: 20,300 problems and solution sets… See the full description on the dataset page: https://huggingface.co/datasets/kevin009/olympiad-math-contest-llama3-20k.
Align Key Points Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 27, 2025
Cite
Data Insights Market (2025). Align Key Points Report [Dataset]. https://www.datainsightsmarket.com/reports/align-key-points-531365
Available download formats: ppt, doc, pdf
Dataset updated
May 27, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global market for [Insert Market Name Here – e.g., AI-powered computer vision] is experiencing robust growth, projected to reach $[Estimate Market Size in 2025, e.g., 15 Billion] in value by 2025. A Compound Annual Growth Rate (CAGR) of [Estimate CAGR, e.g., 25%] from 2025 to 2033 indicates a substantial expansion to an estimated $[Estimate Market Size in 2033, e.g., 75 Billion] by the end of the forecast period. Key drivers include the increasing adoption of AI across diverse industries like automotive, healthcare, and security, fueled by advancements in deep learning and improved data processing capabilities. Emerging trends, such as the rise of edge computing and the development of more sophisticated image recognition algorithms, are further propelling market expansion. However, challenges remain. High implementation costs associated with AI technologies and the need for substantial data sets for effective model training could hinder widespread adoption. Furthermore, concerns around data privacy and security, particularly regarding the ethical implications of facial recognition technologies, represent significant restraints. Market segmentation reveals a strong presence of players like ULUCU, Roboflow, Oosto, MathWorks, GitHub, Qualcomm Developer Network, Coursera, IFSEC Insider, Kaggle, and Thales, indicating a competitive landscape. These companies cater to different segments based on their offerings and target applications, contributing to the diverse growth patterns observed across the market. Regional analysis (data assumed to be available but unspecified in the prompt; regional distributions will vary but a logical breakdown needs to be presented) would reveal varied growth trajectories depending upon technological adoption rates and regulatory landscapes.
Car Damages Kaggle Dataset
universe.roboflow.com
zip
Updated Feb 16, 2025
Cite
AI Proyect (2025). Car Damages Kaggle Dataset [Dataset]. https://universe.roboflow.com/ai-proyect/car-damages-kaggle
Available download formats: zip
Dataset updated
Feb 16, 2025
Dataset authored and provided by
AI Proyect
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Car Damages Polygons
Description
Car Damages Kaggle

Overview

Car Damages Kaggle is a dataset for instance segmentation tasks. It contains Car Damages annotations for 814 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Explainable AI (XAI) Drilling Dataset
kaggle.com
Updated Aug 24, 2023
Cite
Raphael Wallsberger (2023). Explainable AI (XAI) Drilling Dataset [Dataset]. https://www.kaggle.com/datasets/raphaelwallsberger/xai-drilling-dataset
Dataset updated
Aug 24, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Raphael Wallsberger
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is part of the following publication at the TransAI 2023 conference: R. Wallsberger, R. Knauer, S. Matzka; "Explainable Artificial Intelligence in Mechanical Engineering: A Synthetic Dataset for Comprehensive Failure Mode Analysis" DOI: http://dx.doi.org/10.1109/TransAI60598.2023.00032

This is the original XAI Drilling dataset, optimized for XAI purposes; it can be used to evaluate explanations produced by XAI algorithms. The dataset comprises 20,000 data points, i.e., drilling operations, stored as rows, 10 features, one binary main failure label, and 4 binary subgroup failure modes, stored in columns. The main failure rate is about 5.0 % for the whole dataset. The features that constitute this dataset are as follows:

ID: Every data point in the dataset is uniquely identifiable, thanks to the ID feature. This ensures traceability and easy referencing, especially when analyzing specific drilling scenarios or anomalies.

Cutting speed vc (m/min): The cutting speed is a pivotal parameter in drilling, influencing the efficiency and quality of the drilling process. It represents the speed at which the drill bit's cutting edge moves through the material.

Spindle speed n (1/min): This feature captures the rotational speed of the spindle or drill bit, respectively.

Feed f (mm/rev): Feed denotes the depth the drill bit penetrates into the material with each revolution. There is a balance between speed and precision, with higher feeds leading to faster drilling but potentially compromising hole quality.

Feed rate vf (mm/min): The feed rate is a measure of how quickly the material is fed to the drill bit. It is a determinant of the overall drilling time and influences the heat generated during the process.

Power Pc (kW): The power consumption during drilling can be indicative of the efficiency of the process and the wear state of the drill bit.

Cooling (%): Effective cooling is paramount in drilling, preventing overheating and reducing wear. This ordinal feature captures the cooling level applied, with four distinct states representing no cooling (0%), partial cooling (25% and 50%), and high to full cooling (75% and 100%).

Material: The type of material being drilled can significantly influence the drilling parameters and outcomes. This dataset encompasses three primary materials: C45K hot-rolled heat-treatable steel (EN 1.0503), cast iron GJL (EN GJL-250), and aluminum-silicon (AlSi) alloy (EN AC-42000), each presenting its unique challenges and considerations. The three materials are represented as “P (Steel)” for C45K, “K (Cast Iron)” for cast iron GJL and “N (Non-ferrous metal)” for AlSi alloy.

Drill Bit Type: Different materials often require specialized drill bits. This feature categorizes the type of drill bit used, ensuring compatibility with the material and optimizing the drilling process. It consists of three categories, which are based on the DIN 1836: “N” for C45K, “H” for cast iron and “W” for AlSi alloy [5].

Process time t (s): This feature captures the full duration of each drilling operation, providing insights into efficiency and potential bottlenecks.

Main failure: This binary feature indicates if any significant failure on the drill bit occurred during the drilling process. A value of 1 flags a drilling process that encountered issues, which in this case is true when any of the subgroup failure modes are 1, while 0 indicates a successful drilling operation without any major failures.

Subgroup failures:

- Build-up edge failure (215x): Represented as a binary feature, a build-up edge failure indicates the occurrence of material accumulation on the cutting edge of the drill bit due to a combination of low cutting speeds and insufficient cooling. A value of 1 signifies the presence of this failure mode, while 0 denotes its absence.

- Compression chips failure (344x): This binary feature captures the formation of compressed chips during drilling, resulting from the factors high feed rate, inadequate cooling and using an incompatible drill bit. A value of 1 indicates the occurrence of at least two of the three factors above, while 0 suggests a smooth drilling operation without compression chips.

- Flank wear failure (278x): A binary feature representing the wear of the drill bit's flank due to a combination of high feed rates and low cutting speeds. A value of 1 indicates significant flank wear, affecting the drilling operation's accuracy and efficiency, while 0 denotes a wear-free operation.

- Wrong drill bit failure (300x): As a binary feature, it indicates the use of an inappropriate drill bit for the material being drilled. A value of 1 signifies a mismatch, leading to potential drilling issues, while 0 indicates the correct drill bit usage.
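
The relationships stated in the feature descriptions can be sketched in a few lines. The example below encodes two rules from the text (main failure is 1 when any subgroup failure is 1; compression chips failure is 1 when at least two of its three factors are present) plus the standard machining relation vf = f * n for feed rate. The class and field names are illustrative and are not the dataset's column headers, and the thresholds that define "high" feed rate or "inadequate" cooling are not given here, so the factors are passed in as booleans.

```python
from dataclasses import dataclass


@dataclass
class DrillingRun:
    """One drilling operation; field names are illustrative assumptions."""
    spindle_speed_n: float   # 1/min
    feed_f: float            # mm/rev
    build_up_edge: int       # subgroup failure flags, each 0 or 1
    compression_chips: int
    flank_wear: int
    wrong_drill_bit: int

    def feed_rate_vf(self) -> float:
        """Feed rate in mm/min via the standard relation vf = f * n."""
        return self.feed_f * self.spindle_speed_n

    def main_failure(self) -> int:
        """1 if any subgroup failure mode is 1, else 0."""
        flags = (self.build_up_edge, self.compression_chips,
                 self.flank_wear, self.wrong_drill_bit)
        return int(any(flags))


def compression_chips_flag(high_feed_rate: bool, inadequate_cooling: bool,
                           incompatible_bit: bool) -> int:
    """1 when at least two of the three contributing factors are present."""
    return int(sum((high_feed_rate, inadequate_cooling, incompatible_bit)) >= 2)


# Made-up example values, not a row from the dataset.
run = DrillingRun(spindle_speed_n=800.0, feed_f=0.25, build_up_edge=0,
                  compression_chips=1, flank_wear=0, wrong_drill_bit=0)
print(run.feed_rate_vf(), run.main_failure())
```

Because several failure modes can occur in the same operation, the subgroup counts (215 + 344 + 278 + 300) need not sum to the number of main failures.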
‘Gufhtugu Publications Dataset Challenge’ analyzed by Analyst-2
analyst-2.ai
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Gufhtugu Publications Dataset Challenge’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-gufhtugu-publications-dataset-challenge-0764/0bd8674f/?iid=006-565&v=presentation
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Gufhtugu Publications Dataset Challenge’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/gufhtugu-publications-dataset-challenge on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Context

This is a one-of-its-kind book sales dataset from Pakistan. It contains 20,000 book orders from January 2019 to January 2021. The data was collected from the merchant (Gufhtugu Publications www.Gufhtugu.com), which is a partner in this research study. There is a dire need for such a dataset to learn about Pakistan’s emerging e-commerce potential, and I hope this will help many startups in many ways.

Content

Geography: Pakistan

Time period: 01/2019 – 01/2021

Unit of analysis: E-Commerce Orders

Dataset: The dataset contains detailed information of 200,000 online book orders in Pakistan from January 2019 to January 2021. It contains order number, order status (completed, cancelled, returned), order date and time, book name and city address. This is the most detailed dataset about e-commerce orders in Pakistan that you can find in the Public domain.

Variables: The dataset contains order number, order status, book name, order date, order time and city of the customer.

Size: 1.5 MB

File Type: CSV

Acknowledgements

I would like to thank all the startups that are trying to make their mark in Pakistan despite the unavailability of research data. Thanks to Gufhtugu Publications (www.Gufhtugu.com) for allowing me to run this challenge.

Inspiration

I’d like to invite my fellow Kagglers to use machine learning and data science to help me explore these ideas:

• What is the best-selling book?
• Visualize order status frequency
• Find a correlation between date and time with order status
• Find a correlation between city and order status
• Find any hidden patterns that are counter-intuitive for a layman
• Can we predict number of orders, or book names in advance?

--- Original source retains full ownership of the source dataset ---
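The first two ideas in the list above (best-selling book and order status frequency) could be explored along the following lines. This is a minimal sketch using only the Python standard library. The column names (`order_status`, `book_name`, `city`) and the sample rows are hypothetical stand-ins, since the real file's header is not shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file's column names and values may differ.
SAMPLE_CSV = """order_number,order_status,book_name,city
1,completed,Book A,Karachi
2,returned,Book B,Lahore
3,completed,Book A,Lahore
4,cancelled,Book C,Karachi
"""


def summarize_orders(csv_text):
    """Count orders per status and find the most frequently ordered book.

    Note: this counts every order regardless of status; filtering to
    completed orders only would be an equally reasonable choice.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    status_counts = Counter(row["order_status"] for row in rows)
    book_counts = Counter(row["book_name"] for row in rows)
    best_seller = book_counts.most_common(1)[0][0]
    return status_counts, best_seller


status_counts, best_seller = summarize_orders(SAMPLE_CSV)
print(dict(status_counts))
print(best_seller)
```

To run this against the actual download, the in-memory string would be replaced by the opened CSV file and the column names adjusted to match its header.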
arena-human-preference-55k
huggingface.co
Updated Jun 29, 2025
Cite
LMArena (2025). arena-human-preference-55k [Dataset]. https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k
Dataset updated
Jun 29, 2025
Dataset authored and provided by
LMArena
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset for the Kaggle competition on predicting human preference in Chatbot Arena battles. The training dataset includes over 55,000 real-world user and LLM conversations and user preferences across over 70 state-of-the-art LLMs, such as GPT-4, Claude 2, Llama 2, Gemini, and Mistral models. Each sample represents a battle in which two LLMs answer the same question, with a user label of either prefer model A, prefer model B, tie, or tie (both bad).

Citation

Please cite the… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k.
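The four-way label described above (prefer model A, prefer model B, tie, tie both bad) lends itself to a simple per-model tally. The sketch below is an illustration only: the record layout, the label strings (`prefer_a`, `prefer_b`, `tie`, `tie_both_bad`) and the model names are hypothetical and do not reflect the dataset's actual schema. Ties of either kind are counted as battles played but not as wins, which is one of several possible conventions.

```python
from collections import defaultdict

# Hypothetical battle records; field names and label strings are assumptions.
battles = [
    {"model_a": "model-x", "model_b": "model-y", "label": "prefer_a"},
    {"model_a": "model-y", "model_b": "model-z", "label": "tie"},
    {"model_a": "model-x", "model_b": "model-z", "label": "prefer_b"},
    {"model_a": "model-y", "model_b": "model-x", "label": "tie_both_bad"},
]


def win_rates(records):
    """Return wins divided by battles played for each model."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for record in records:
        games[record["model_a"]] += 1
        games[record["model_b"]] += 1
        if record["label"] == "prefer_a":
            wins[record["model_a"]] += 1
        elif record["label"] == "prefer_b":
            wins[record["model_b"]] += 1
        # Both kinds of tie add a battle for each side but no win.
    return {model: wins[model] / games[model] for model in games}


rates = win_rates(battles)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.2f}")
```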
‘StockX Sneaker Data Contest’ analyzed by Analyst-2
analyst-2.ai
Updated Nov 13, 2021
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘StockX Sneaker Data Contest’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-stockx-sneaker-data-contest-ae17/5fc3e134/?iid=010-160&v=presentation
Dataset updated
Nov 13, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘StockX Sneaker Data Contest’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hudsonstuck/stockx-data-contest on 29 August 2021.

--- Dataset description provided by original source is as follows ---

Context

This dataset is from the StockX 2019 Data Contest.

Content

Currently the dataset consists of a single file of sales provided by StockX: ~10000 shoe sales from 50 different models (Nike x Off-White and Yeezy).

In the coming weeks more data will be added, including the estimated number of pairs released for each model and other information that might be useful for making predictions. Additionally, some of the data types will be modified to make numerical analysis easier.

Inspiration

What shoes are most popular?

Which shoes have the best/worst profit margins?

What factors affect profit margin?

Is it possible to predict the sale price of a shoe at a given time? (i.e. when should I sell?)

--- Original source retains full ownership of the source dataset ---
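The question about best and worst profit margins could be approached as sketched below. The description does not define "profit margin", so this example assumes it means (sale price minus retail price) divided by retail price. The field names (`model`, `sale_price`, `retail_price`) and all values are hypothetical.

```python
from collections import defaultdict


def profit_margin(sale_price, retail_price):
    """Assumed definition: resale markup relative to the retail price."""
    if retail_price <= 0:
        raise ValueError("retail_price must be positive")
    return (sale_price - retail_price) / retail_price


# Hypothetical sales records, purely for illustration.
sales = [
    {"model": "shoe-a", "sale_price": 300.0, "retail_price": 200.0},
    {"model": "shoe-a", "sale_price": 250.0, "retail_price": 200.0},
    {"model": "shoe-b", "sale_price": 150.0, "retail_price": 200.0},
]

# Collect the margins of every sale under its shoe model.
margins_by_model = defaultdict(list)
for sale in sales:
    margin = profit_margin(sale["sale_price"], sale["retail_price"])
    margins_by_model[sale["model"]].append(margin)

# Average margin per model, then pick the extremes.
average_margin = {
    model: sum(values) / len(values)
    for model, values in margins_by_model.items()
}
best_model = max(average_margin, key=average_margin.get)
worst_model = min(average_margin, key=average_margin.get)
print(best_model, worst_model)
```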
AI vs. Human-Generated Images
kaggle.com
Updated Jan 22, 2025
Cite
Alessandra Sala (2025). AI vs. Human-Generated Images [Dataset]. https://www.kaggle.com/datasets/alessandrasala79/ai-vs-human-generated-dataset
Dataset updated
Jan 22, 2025
Dataset provided by
Kaggle http://kaggle.com/
Authors
Alessandra Sala
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Official dataset for the 2025 Women in AI Kaggle Competition: https://www.kaggle.com/competitions/detect-ai-vs-human-generated-images

The dataset consists of authentic images sampled from the Shutterstock platform across various categories, including a balanced selection where one-third of the images feature humans. These authentic images are paired with their equivalents generated using state-of-the-art generative models. This structured pairing enables a direct comparison between real and AI-generated content, providing a robust foundation for developing and evaluating image authenticity detection systems.

LLM: 7 prompt training dataset

(for use in the LLM - Detect AI Generated Text competition)

