100+ datasets found
  1. LLM: 7 prompt training dataset (for use in the LLM - Detect AI Generated Text competition)

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adds the data from "LLM-generated essay using PaLM from Google Gen-AI", kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4,900 LLM-generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of several other datasets, as well as of the original competition training dataset.

    • Version 1: This dataset is composed of 13,712 human texts and 1,165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
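
    A minimal sketch for loading one of these CSVs and checking the human/LLM split (the label column name is an assumption; verify it against the actual file schema):

    import pandas as pd

    # Load the Version 4 file and count texts per class.
    # The column name "label" (0 = human, 1 = LLM) is an assumption -- check df.columns first.
    df = pd.read_csv("train_essays_RDizzl3_seven_v2.csv")
    print(df.shape)
    print(df["label"].value_counts())  # expected roughly 14,247 human vs 3,004 LLM texts
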
  2. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training the model and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))  # render the dataset README contents
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into separate subsets. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG]

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading the YOLOv7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    [Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67]

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except Exception:
      wandb.login(anonymous='must')
      print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
            'Use the label name WANDB. Get your W&B access token from: https://wandb.ai/authorize')


    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    [Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png]

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG]

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training a custom YOLOv7 model from the pretrained checkpoint

    Here, I am able to pass a number of arguments:

    • img: define the input image size
    • batch: determine the batch size
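
    As a rough sketch (not necessarily the author's exact command), a training cell using the repository cloned in Step 1 and the dataset downloaded in Step 2 might look like the following; flag names can differ between YOLOv7 versions, so check python train.py --help first:

    # Hypothetical training invocation -- verify flags against the cloned repo before running
    !python train.py --img-size 640 --batch-size 16 --epochs 30 \
        --data {dataset.location}/data.yaml --weights yolov7.pt \
        --name yolov7-car-person-custom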

  3. Code Vulnerabilities Dataset

    • kaggle.com
    Updated Feb 4, 2025
    Cite
    Ziya (2025). Code Vulnerabilities Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/code-vulnerabilities-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This Code Vulnerabilities Dataset is designed for training and evaluating AI-driven models for identifying cybersecurity vulnerabilities in software codebases. The dataset consists of 1000 rows of synthetic code snippets, each labeled with one of two common vulnerabilities: SQL Injection (SQLi) and Cross-Site Scripting (XSS). The data contains the following columns:

    • Code Snippet: A synthetic code pattern that demonstrates either an SQL Injection or XSS vulnerability.
    • Vulnerability Type: A label indicating the type of vulnerability in the code (SQLi or XSS).
    • Location: Simulated line numbers (representing where the vulnerability might appear in the code).
    • Preprocessed Tokens: A list of tokenized words from the code snippet, formatted for NLP model processing.

    This dataset is particularly useful for training machine learning and natural language processing (NLP) models to detect vulnerabilities automatically in software systems, aiming to improve the accuracy of vulnerability detection and reduce the need for manual inspection.
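
    A minimal baseline sketch under stated assumptions (the CSV file name and exact column headers are guesses; check the dataset page before running):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Hypothetical file/column names based on the description above
    df = pd.read_csv("code_vulnerabilities.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["Code Snippet"], df["Vulnerability Type"], test_size=0.2, random_state=42)

    vec = TfidfVectorizer(token_pattern=r"\S+")  # treat whitespace-separated code tokens as features
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    print(classification_report(y_test, clf.predict(vec.transform(X_test))))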

  4. issues-kaggle-notebooks

    • huggingface.co
    Updated Aug 12, 2025
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks

      Description
    

    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

  5. OpenAI HumanEval (Coding Challenges & Unit-tests)

    • kaggle.com
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). OpenAI HumanEval (Coding Challenges & Unit-tests) [Dataset]. https://www.kaggle.com/datasets/thedevastator/handcrafted-dataset-for-code-generation-models
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenAI HumanEval (Coding Challenges & Unit-tests)

    164 programming problems, each with a function signature, docstring, body, and unit tests

    Source

    Huggingface Hub: link

    About this dataset

    The OpenAI HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models. The entry point for each problem is the prompt, making it an ideal dataset for testing natural language processing and machine learning models' ability to generate Python programs from scratch

    How to use the dataset

    To use this dataset, simply download the zip file and extract it. The resulting directory will contain the following files:

    • canonical_solution.py: The solution to the problem. (String)
    • entry_point.py: The entry point for the problem. (String)
    • prompt.txt: The prompt for the problem. (String)
    • test.py: The unit tests for the problem

    Research Ideas

    • The dataset could be used to develop a model that generates programs from natural language.
    • The dataset could be used to develop a model that completes or debugs programs.
    • The dataset could be used to develop a model that writes unit tests for programs

    Acknowledgements

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: test.csv

    • prompt: A natural language description of the programming problem. (String)
    • canonical_solution: The correct Python code solution to the problem. (String)
    • test: A set of unit tests that the generated code must pass in order to be considered correct. (String)
    • entry_point: The starting point for the generated code. (String)
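
    A minimal sketch for checking one problem's canonical solution against its unit tests, assuming the columns listed above (HumanEval's test field defines a check(candidate) function):

    import pandas as pd

    df = pd.read_csv("test.csv")
    row = df.iloc[0]

    namespace = {}
    exec(row["prompt"] + row["canonical_solution"], namespace)  # defines the solution function
    exec(row["test"], namespace)                                # defines check(candidate)
    namespace["check"](namespace[row["entry_point"]])           # raises AssertionError on failure
    print("Problem 0 passed its unit tests")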

  6. Kaggle LLMSE Dataset

    • kaggle.com
    Updated Oct 18, 2023
    Cite
    Haoquan Fang (2023). Kaggle LLMSE Dataset [Dataset]. https://www.kaggle.com/datasets/hqfang/kaggle-llmse-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Haoquan Fang
    Description

    deberta-billy is trained locally by @hqfang primarily using @radek1's notebook.

    deberta-lora-lindsey is trained locally by @lindseywei using the LoRA technique.

    deberta-openbook-eric-088 comes from @yuekaixueirc's dataset.

    deberta-openbook-eric-0897 comes from @yuekaixueirc's dataset.

    deberta-openbook-eric-0916 comes from @yuekaixueirc's dataset.

    54k_with_context_v1.csv was created by dropping duplicates from @cdeotte's 60k training data (all_12_with_context2.csv) in this dataset.

    54k.csv was created by dropping the context column from the 54k_with_context_v1.csv.

    val_with_context_v1.csv was created by adding a context column to @itsuki9180's validation dataset.

  7. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.
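
    A minimal sketch of joining these tables as described above (column names follow the description; verify them against the actual CSV headers):

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels = pd.read_csv("kernels_meta.csv")
    competitions = pd.read_csv("competitions_meta.csv")
    preds = pd.read_csv("data_with_preds.csv")

    # Attach automatic semantic-type predictions via code_blocks_index (row index of code_blocks)
    blocks = code_blocks.merge(preds, left_index=True, right_on="code_blocks_index", how="left")

    # Snippets -> kernels (kernel_id) -> competitions (comp_name)
    full = (blocks
            .merge(kernels, on="kernel_id", how="left")
            .merge(competitions, on="comp_name", how="left"))
    print(full.shape)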

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural language descriptions of the competitions are retrieved with the help of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score, but with a leaderboard position (rank).

    Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  8. Data from: training-results

    • kaggle.com
    Updated Jan 29, 2025
    Cite
    Emanuele Carelli (2025). training-results [Dataset]. https://www.kaggle.com/datasets/emanuelecarelli/training-results/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Emanuele Carelli
    Description

    Dataset

    This dataset was created by Emanuele Carelli

    Contents

  9. Credit Card Fraud Detection

    • test.researchdata.tuwien.ac.at
    • zenodo.org
    • +1more
    csv, json, pdf +2
    Updated Apr 28, 2025
    Cite
    Ajdina Grizhja (2025). Credit Card Fraud Detection [Dataset]. http://doi.org/10.82556/yvxj-9t22
    Explore at:
    text/markdown, csv, pdf, txt, json (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Ajdina Grizhja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Below is a draft DMP-style description of this credit-card fraud detection experiment:

    1. Dataset Description

    Research Domain
    This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.

    Purpose
    The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.

    Data Sources
    We used the publicly available credit-card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284,807 transactions, of which 492 are fraudulent.

    Method of Dataset Preparation

    1. Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo’s requirements.

    2. Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).

    3. Splitting: Programmatically derived three subsets—training (70%), validation (15%), test (15%)—using range‐based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.

    4. Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non‐feature identifiers (actionnr, merchant_id).

    5. Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
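
    A minimal sketch of steps 4 and 5 above (flag conversion, dropping identifiers, and fitting the classifier); the local file path is a placeholder and the column names follow the dataset structure listed below:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("data/creditcard_training.csv")  # hypothetical local copy of the training split

    flag_cols = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
    df[flag_cols] = (df[flag_cols] == "Y").astype(int)               # "Y"/"N" -> 1/0

    X = df.drop(columns=["actionnr", "merchant_id", "isfradulent"])  # drop non-feature identifiers
    y = df["isfradulent"]

    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
    clf.fit(X, y)
    # Tuning on the validation split and evaluation on the test split follow the same loading pattern.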

    2. Technical Details

    Dataset Structure

    • The raw data is a single CSV with columns:

      • actionnr (integer transaction ID)

      • merchant_id (string)

      • average_amount_transaction_day (float)

      • transaction_amount (float)

      • is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)

      • total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)

    Naming Conventions

    • All columns use lowercase snake_case.

    • Subsets are named creditcard_training, creditcard_validation, creditcard_test in DBRepo.

    • Files in the code repo follow a clear structure:

      ├── data/         # local copies only; raw data lives in DBRepo 
      ├── notebooks/Task.ipynb 
      ├── models/rf_model_v1.joblib 
      ├── outputs/        # confusion_matrix.png, roc_curve.png, predictions.csv 
      ├── README.md 
      ├── requirements.txt 
      └── codemeta.json 
      

    Required Software

    • Python 3.9+

    • pandas, numpy (data handling)

    • scikit-learn (modeling, metrics)

    • matplotlib (visualizations)

    • dbrepo‐client.py (DBRepo API)

    • requests (TU WRD API)

    Additional Resources

    3. Further Details

    Data Limitations

    • Highly imbalanced: only ~0.17% of transactions are fraudulent.

    • Anonymized PCA features (V1–V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.

    • Time‐bounded: only covers two days of transactions, may not capture seasonal patterns.

    Licensing and Attribution

    • Raw data: CC-0 (per Kaggle terms)

    • Code & notebooks: MIT License

    • Model artifacts & outputs: CC-BY 4.0

    • TU WRD records include ORCID identifiers for the author.

    Recommended Uses

    • Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.

    • Educational purposes: demonstrating model‐training pipelines, FAIR data practices.

    • Extension: adding time‐series or deep‐learning models.

    Known Issues

    • Possible temporal leakage if date/time features not handled correctly.

    • Model performance may degrade on live data due to concept drift.

    • Binary flags may oversimplify nuanced transaction outcomes.

  10. model_training

    • kaggle.com
    Updated Mar 17, 2019
    Cite
    Zhi Yang Tan (2019). model_training [Dataset]. https://www.kaggle.com/datasets/tanzy96/model-training/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 17, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zhi Yang Tan
    Description

    Dataset

    This dataset was created by Zhi Yang Tan

    Contents

  11. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    csv, text/markdown, json, bin (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
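
    A minimal sketch that combines the train and store tables described above and prepares a few simple features (file names assume local CSV copies of the competition data; adjust paths as needed):

    import pandas as pd

    train = pd.read_csv("train.csv", parse_dates=["Date"], low_memory=False)
    store = pd.read_csv("store.csv")

    df = train.merge(store, on="Store", how="left")
    df = df[df["Open"] == 1]                 # closed days contribute no sales
    df["Month"] = df["Date"].dt.month        # simple seasonal feature
    df["CompetitionDistance"] = df["CompetitionDistance"].fillna(df["CompetitionDistance"].median())
    print(df[["Store", "Date", "Sales", "Promo", "StateHoliday", "CompetitionDistance"]].head())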

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  12. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from testing and used it as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a tesseract script to run text extraction from detected text rows. This is included in our code archive as text_recognition_multipro.py.

    We used a Java tool provided by Falk Böschen and adapted it to our file structure. We include it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  13. Gtsdb German Traffic Sign Detection Benchmark Dataset

    • universe.roboflow.com
    • kaggle.com
    zip
    Updated Jul 6, 2022
    Cite
    Mohamed Traore (2022). Gtsdb German Traffic Sign Detection Benchmark Dataset [Dataset]. https://universe.roboflow.com/mohamed-traore-2ekkp/gtsdb---german-traffic-sign-detection-benchmark/model/3
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 6, 2022
    Dataset authored and provided by
    Mohamed Traore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Signs Bounding Boxes
    Description

    This project was created by downloading the GTSDB German Traffic Sign Detection Benchmark dataset from Kaggle (https://www.kaggle.com/datasets/safabouguezzi/german-traffic-sign-detection-benchmark-gtsdb) and importing the annotated training set files (images and annotation files) to Roboflow.

    The annotation files were adjusted to conform to the YOLO Keras TXT format prior to upload, as the original format did not include a label map file.

    v1 contains the original imported images, without augmentations. This is the version to download and import to your own project if you'd like to add your own augmentations.

    v2 contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "FAST" model.

    v3 contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "ACCURATE" model.

  14. Synthetic Dyslexia Handwriting Dataset (YOLO-Format)

    • zenodo.org
    zip
    Updated Feb 11, 2025
    Cite
    Nora Fink (2025). Synthetic Dyslexia Handwriting Dataset (YOLO-Format) [Dataset]. http://doi.org/10.5281/zenodo.14852659
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nora Fink
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description
    This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset “A-Z Handwritten Alphabets in .csv format” [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).

    In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:

    • 78,275 images labeled as Normal
    • 52,196 images labeled as Reversal
    • 8,029 images labeled as Corrected

    Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a “text line” style on a black background, providing YOLO-compatible .txt annotations that specify bounding boxes for each letter.

    Key Points of the Synthetic Generation Process

    1. Letter-Level Source Data
      Individual characters were sampled from the original image sets.
    2. Randomized Layout
      Letters are randomly assembled into words and lines, ensuring a wide variety of visual arrangements.
    3. Bounding Box Labels
      Each character is assigned a bounding box with (x, y, width, height) in YOLO format.
    4. Class Annotations
      Classes include 0 = Normal, 1 = Reversal, and 2 = Corrected.
    5. Preservation of Visual Characteristics
      Letters retain their key dyslexia-relevant features (e.g., reversals).
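
    A minimal sketch for reading one YOLO-format annotation file from this dataset and converting the normalized boxes to pixel coordinates (the image/label paths are placeholders; class ids follow the list above):

    from pathlib import Path
    from PIL import Image

    CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}

    img = Image.open("images/sample_0001.png")          # hypothetical image path
    W, H = img.size

    for line in Path("labels/sample_0001.txt").read_text().splitlines():
        cls, x_c, y_c, w, h = line.split()
        x_c, y_c, w, h = float(x_c) * W, float(y_c) * H, float(w) * W, float(h) * H
        left, top = x_c - w / 2, y_c - h / 2            # YOLO stores centre + size, normalized to [0, 1]
        print(CLASS_NAMES[int(cls)], round(left), round(top), round(w), round(h))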

    Historical References & Credits

    If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:

    • M. S. A. B. Rosli, I. S. Isa, S. A. Ramlan, S. N. Sulaiman and M. I. F. Maruzuki, "Development of CNN Transfer Learning for Dyslexia Handwriting Recognition," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 194–199, doi: 10.1109/ICCSCE52189.2021.9530971.
    • N. S. L. Seman, I. S. Isa, S. A. Ramlan, W. Li-Chih and M. I. F. Maruzuki, "Notice of Removal: Classification of Handwriting Impairment Using CNN for Potential Dyslexia Symptom," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 188–193, doi: 10.1109/ICCSCE52189.2021.9530989.
    • Isa, Iza Sazanita. CNN Comparisons Models On Dyslexia Handwriting Classification / Iza Sazanita Isa … [et Al.]. Universiti Teknologi MARA Cawangan Pulau Pinang, 2021.
    • Isa, I. S., Rahimi, W. N. S., Ramlan, S. A., & Sulaiman, S. N. (2019). Automated detection of dyslexia symptom based on handwriting image for primary school children. Procedia Computer Science, 163, 440–449.

    References to Original Data Sources

    [1] P. J. Grother, “NIST Special Database 19,” NIST, 2016. [Online]. Available:
    https://www.nist.gov/srd/nist-special-database-19

    [2] S. Patel, “A-Z Handwritten Alphabets in .csv format,” Kaggle, 2017. [Online]. Available:
    https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format

    Usage & Citation

    Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.

    Password Note (Original Data)

    The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.

  15. Equity in Healthcare Clean DataSets

    • kaggle.com
    Updated Feb 21, 2024
    Cite
    Anopsy (2024). Equity in Healthcare Clean DataSets [Dataset]. https://www.kaggle.com/datasets/anopsy/equity-in-healthcare-clean-datasets
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anopsy
    Description

    This dataset is based on train and test dataset from this competition: https://www.kaggle.com/competitions/widsdatathon2024-challenge1 .

    What did I change?

    1. I dropped 2 columns that contained too little data.
    2. Using machine learning, I imputed "payer_type", "patient_race" and "bmi".
    3. Using "patient_zip3", I filled missing values in "patient_state", "Region" and "Division".
    4. Using SimpleImputer, I imputed a few missing numeric values in "Ozone", "PM2.5" and other columns (see the sketch below).
    5. I created some new features, based on demographic features, that may be a bit more informative.
    6. I tokenized the 'breast_cancer_diagnosis_desc' column.

    If you're interested in how I did that, check these notebooks: https://www.kaggle.com/code/anopsy/ml-for-missing-values for "bmi"; and for the new features, check this: https://www.kaggle.com/code/anopsy/fe-and-xgb-on-clean-data
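
    A minimal sketch of step 4 above using scikit-learn's SimpleImputer (the file name is a placeholder and the exact air-quality column names should be checked against the dataset):

    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.read_csv("train_clean.csv")          # hypothetical cleaned training file
    numeric_cols = ["Ozone", "PM2.5"]            # assumed column names; verify with df.columns

    imputer = SimpleImputer(strategy="median")
    df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    print(df[numeric_cols].isna().sum())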

    According to the description of the original dataset, it's a "39k record dataset (split into training and test sets) representing patients and their characteristics (age, race, BMI, zip code), their diagnosis and treatment information (breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, … etc.), their geo (zip-code level) demographic data (income, education, rent, race, poverty, …etc), as well as toxic air quality data (Ozone, PM25 and NO2)."

  16. finance-alpaca

    • huggingface.co
    Updated Apr 7, 2023
    + more versions
    Cite
    Gaurang Bharti (2023). finance-alpaca [Dataset]. http://doi.org/10.57967/hf/2557
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 7, 2023
    Authors
    Gaurang Bharti
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is a combination of Stanford's Alpaca (https://github.com/tatsu-lab/stanford_alpaca) and FiQA (https://sites.google.com/view/fiqa/), with another 1.3k pairs custom-generated using GPT-3.5. A script for tuning with PEFT/LoRA through Kaggle's (https://www.kaggle.com) free resources is at https://www.kaggle.com/code/gbhacker23/wealth-alpaca-lora, and a GitHub repo with performance analyses, training and data generation scripts, and inference notebooks is at https://github.com/gaurangbharti1/wealth-alpaca… See the full description on the dataset page: https://huggingface.co/datasets/gbharti/finance-alpaca.
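
    A minimal sketch for loading the dataset from the Hugging Face Hub (the split name and field names are assumptions; check the dataset page):

    from datasets import load_dataset

    ds = load_dataset("gbharti/finance-alpaca", split="train")
    print(ds.column_names)   # typically Alpaca-style fields such as instruction/input/output
    print(ds[0])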

  17. Python Programming Questions Dataset

    • kaggle.com
    Updated Mar 8, 2024
    Cite
    Bhavesh Mittal (2024). Python Programming Questions Dataset [Dataset]. https://www.kaggle.com/datasets/bhaveshmittal/python-programming-questions-dataset/versions/1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhavesh Mittal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Welcome to an exceptional dataset meticulously crafted for training state-of-the-art language models such as Gemma, Llama 2, Orca, and more.

    Dataset Highlights

    • Challenging Questions: Immerse your language models in various Python programming questions designed to stimulate cognitive growth.
    • Real-world Inputs: Provide your models with authentic input scenarios, ensuring they are well-equipped to handle practical coding challenges.
    • Accurate Answers: Sharpen the precision of your language models by exposing them to meticulously crafted Python code solutions.

    How to Get Started

    • Download: Grab a copy of the dataset and inject new life into your language models.
    • Build Brilliance: Watch your LLMs evolve as they engage with the challenging questions and nuanced coding scenarios.
    • Share & Collaborate: Join the Kaggle community to discuss, share insights, and collaborate with fellow enthusiasts.

    Unleash the full potential of your language models with this dataset. Elevate your LLM training experience and witness unprecedented growth in language understanding and coding prowess. Happy coding!

  18. training file

    • kaggle.com
    Updated Aug 2, 2021
    + more versions
    Cite
    ytl0623 (2021). training file [Dataset]. https://www.kaggle.com/ytl0623/training-file/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ytl0623
    Description

    Dataset

    This dataset was created by ytl0623

    Contents

  19. training_df

    • kaggle.com
    Updated Mar 21, 2025
    + more versions
    Cite
    Sharon26 (2025). training_df [Dataset]. https://www.kaggle.com/datasets/sharon26/training-df/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sharon26
    Description

    Dataset

    This dataset was created by Sharon26

    Contents

  20. training-images-curie

    • kaggle.com
    Updated Feb 5, 2024
    Cite
    Emily de Oliveira Santos (2024). training-images-curie [Dataset]. https://www.kaggle.com/datasets/emilyyui/training-images-curie/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Emily de Oliveira Santos
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Emily de Oliveira Santos

    Released under CC0: Public Domain

    Contents
