License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains data associated with the publication: Organic Reaction Mechanism Classification with Machine Learning
- Trained AI reduced models
- Python files to run predictions
- Python files to train the model
- Template for inputting kinetics for predictions
- Data used in case studies
Unpack the data file and follow the instructions in the publication's Supporting Information.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of nearly 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small-molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
This dataset released by OpenAI, HumanEval, offers a unique opportunity for developers and researchers to accurately evaluate their code generation models in a safe environment. It includes 164 handcrafted programming problems written by engineers and researchers from OpenAI, specifically designed to test the correctness and scalability of code generation models. Written in Python, these programming problems contain docstrings and comments full of natural English text, which can be difficult for computers to comprehend. Each programming problem also includes a function signature, a body, and several unit tests. Placed under the MIT License, this HumanEval dataset is ideal for any practitioner looking to judge the efficacy of their machine-generated code with trusted results!
The first step is to explore the data by viewing the included columns. This guide focuses on four key columns: prompt, canonical_solution, test, and entry_point.
- The prompt column contains natural English text describing the programming problem.
- The canonical_solution column holds the correct solution to each programming problem, as hand-crafted by the OpenAI researchers and engineers who built the dataset.
- The test column contains unit tests designed to check for correctness when debugging or evaluating code generated by neural networks or other automated tools.
- The entry_point column names the function that serves as the entry point into each program.
With this information, you can use the dataset in your own projects, from building case studies for specific AI algorithms to developing tools that evaluate machine-generated code against benchmarks like HumanEval.
- Training code generation models in a limited and supervised environment.
- Benchmarking the performance of existing code generation models, as HumanEval consists of both the canonical solution for each problem and unit tests that can be used to evaluate model accuracy.
- Using Natural Language Processing (NLP) algorithms on the docstrings and comments within HumanEval to develop better natural language understanding for programming contexts
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: test.csv

| Column name | Description |
|:---|:---|
| prompt | A description of the programming problem. (String) |
| canonical_solution | The expected solution to the programming problem. (String) |
| test | Unit tests to verify the accuracy of the solution. (String) |
| entry_point | The entry point for running the unit tests. (String) |
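The four columns are designed to work together: the prompt and canonical_solution concatenate into a complete function, and the test column's check function exercises the entry point. A minimal sketch, using a hypothetical row that mimics the schema (not real dataset content):

```python
# Hypothetical row mimicking the HumanEval schema (not an actual dataset record)
row = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}

def run_humaneval_row(row):
    """Execute prompt + solution, then run the row's unit tests on the entry point."""
    namespace = {}
    exec(row["prompt"] + row["canonical_solution"], namespace)  # define the function
    exec(row["test"], namespace)                                # define check()
    namespace["check"](namespace[row["entry_point"]])           # raises on failure
    return True

run_humaneval_row(row)  # passes silently when the solution is correct
```

The same pattern can score model-generated code by substituting a candidate completion for canonical_solution.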
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
This software is a Python module for estimating uncertainty in the predictions of machine learning models, using bootstrapping and residual bootstrapping. It is designed to interface with scikit-learn, but any Python package exposing a similar estimator interface should work.
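The module's exact API is not shown here, but the underlying idea can be sketched with plain NumPy: refit a simple model on bootstrap resamples and report the spread of its predictions. This is illustrative only; the described package wraps scikit-learn estimators rather than the hand-rolled linear fit used below.

```python
import numpy as np

def bootstrap_prediction_std(X, y, X_new, n_boot=200, seed=0):
    """Estimate prediction uncertainty by refitting a linear model on
    bootstrap resamples and taking the std of the resulting predictions."""
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([np.ones(len(X)), X])        # add intercept column
    Xn = np.column_stack([np.ones(len(X_new)), X_new])
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))    # resample with replacement
        coef, *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        preds.append(Xn @ coef)
    return np.std(preds, axis=0)

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=50)
sigma = bootstrap_prediction_std(X, y, np.array([[5.0]]))  # uncertainty at x = 5
```

Residual bootstrapping follows the same loop, but resamples the fitted model's residuals instead of the data rows.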
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for Fake News Classification using NLP & Machine Learning. It contains labeled fake and real news articles, sourced from credible datasets. It is optimized for text analysis, deep learning models, and AI research.
1️⃣ Load the dataset in Python using Pandas
```python
import pandas as pd

df = pd.read_csv("fake_news_data.csv")
```
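2️⃣ Inspect the class balance. A sketch using a tiny in-memory stand-in, since the CSV is not bundled here and its actual column names (assumed below to be text and label) may differ:

```python
import pandas as pd
from io import StringIO

# Tiny in-memory stand-in for fake_news_data.csv (column names are assumptions)
csv = StringIO(
    'text,label\n'
    '"Markets rally on strong jobs report",real\n'
    '"Aliens endorse presidential candidate",fake\n'
)
df = pd.read_csv(csv)

print(df["label"].value_counts())  # check that fake/real classes are balanced
```

A skewed class balance is worth knowing before training, since it affects both the choice of metric and any resampling strategy.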
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
AI Python Project is a dataset for object detection tasks - it contains Vehicles FTNK annotations for 2,999 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public-supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:
- PS_HUC12_Tot_2000_2020.csv - estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_GW_2000_2020.csv - estimated monthly public-supply groundwater use for 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_SW_2000_2020.csv - estimated monthly public-supply surface water use for 2000-2020 by HUC12, in million gallons per day
Notes: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.
- STAT_PS_HUC12_Tot_2000_2020.csv - statistics by HUC12 for the estimated monthly public-supply total water use from 2000-2020
- STAT_PS_HUC12_GW_2000_2020.csv - statistics by HUC12 for the estimated monthly public-supply groundwater use for 2000-2020
- STAT_PS_HUC12_SW_2000_2020.csv - statistics by HUC12 for the estimated monthly public-supply surface water use for 2000-2020
- public_water_use_model.zip - input datasets, scripts, and output datasets for the public-supply water use machine learning model
- version_history_MLmodel.txt - a txt file describing changes in this version
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
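The preprocessing steps described above can be sketched as follows. Column names, the interpolation choice, and the IQR-based outlier rule are illustrative assumptions, not the script's actual values:

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the plant and weather tables (column names assumed)
quality = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "turbidity": [5.0, np.nan, 4.8, 250.0, 5.2, 4.9],   # 250.0 is a sensor glitch
})
weather = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "rainfall_mm": [0.0, 1.2, 0.0, 14.5, 0.3, 0.0],
})

quality["turbidity"] = quality["turbidity"].interpolate()   # fill missing values
q1, q3 = quality["turbidity"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = quality["turbidity"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = quality[mask].merge(weather, on="date")             # drop outliers, join weather
```

The resulting frame, with gaps filled, the outlier row dropped, and weather features attached, is the kind of table that feeds directly into model training.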
Git archive containing Python modules and resources used to generate machine-learning models used in the "Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada" project. This software is licensed as free to use, modify, and distribute with attribution. Full license details are included within the archive. See "documentation.zip" for setup instructions and file trees annotated with module descriptions.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Exploring the creation of a unique dataset of synthetic influencer profiles using AI technologies, including OpenAI's GPT-3.5.
Privacy policy: https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Training large language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines
Affordable and ready-to-use
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and Python code used for AOD prediction with the DustNet model, a machine-learning/AI-based forecasting system.
Model input data and code
Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with machine learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter notebook 'DustNet_model_code.ipynb'. A subfolder with normalised data, split into training/validation/testing sets, is also provided, along with Python code for two additional ML-based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.
Model output data and code
This dataset was constructed by running 'DustNet_model_code.ipynb' (see above). It consists of 1095 days of forecast AOD data (2020-2022) from CAMS, the DustNet model, naïve prediction (persistence), and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of the predictions. It is intended for quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.
*datasets are NumPy arrays (v1.23) created in Python v3.8.18.
**all ML models were created with Keras in Python v3.10.10.
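The two simple baselines mentioned above, persistence and gridded climatology, can be sketched on a toy daily AOD array. Shapes and values below are illustrative stand-ins, not the archived MODIS data:

```python
import numpy as np

# Toy daily AOD series with layout (days, lat, lon), standing in for the MODIS arrays
rng = np.random.default_rng(0)
aod = rng.random((365, 4, 5)).astype("float32")

persistence = aod[:-1]           # naive forecast: tomorrow equals today
target = aod[1:]                 # the day actually being predicted
climatology = aod.mean(axis=0)   # gridded long-term mean, broadcast over days

mae_persistence = np.abs(persistence - target).mean()
mae_climatology = np.abs(climatology - target).mean()
```

A learned forecaster such as DustNet is only interesting to the extent that it beats both of these baselines on the held-out 2020-2022 period.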
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🌐 Bilingual Coding Q&A Dataset
📊 Dataset Description
A comprehensive bilingual (English-Hindi) dataset containing 25,151 high-quality question-answer pairs focused on programming concepts, particularly Python, machine learning, and AI. This dataset was used to fine-tune coding assistant models and contains over 7 million tokens of training data.
Dataset Statistics

| Metric | Value |
|:---|:---|
| Total Examples | 25,151 Q&A pairs |
| Total Lines | 250,320+ |

… See the full description on the dataset page: https://huggingface.co/datasets/convaiinnovations/bilingual-coding-qa-dataset.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
CoT Reasoning Python General Query: Enhancing Python Understanding through Chain of Thought Reasoning
Description: Explore Python programming and general computing queries with the "CoT_Reasoning_Python_General_Query" dataset. This open-source resource (MIT licensed) provides a carefully curated collection of question-and-answer pairs designed to train AI models in understanding and reasoning about a… See the full description on the dataset page: https://huggingface.co/datasets/moremilk/CoT_Reasoning_Python_General_Query.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown(open("../input/Car-Person-v2-Roboflow/README.roboflow.txt").read()))  # render the README contents
In this notebook, I have processed the images with Roboflow, because the COCO-formatted dataset had images of different dimensions and was not split into train/validation/test sets. To train a custom YOLOv7 model we need to recognize the objects in the dataset, so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except:
    anonymous = 'must'
    wandb.login(anonymous=anonymous)
    print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
          'Use the label name WANDB. Get your W&B access token from https://wandb.ai/authorize')
wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine
Privacy policy: https://crawlfeeds.com/privacy_policy
Large Walmart Products Dataset is an essential resource for businesses, analysts, and developers seeking detailed insights into Walmart’s vast product catalog. This dataset includes extensive information on Walmart products, such as product names, descriptions, prices, categories, brand information, ratings, and customer reviews.
With Walmart being one of the largest retailers globally, this dataset provides a unique opportunity to study consumer trends, perform competitive pricing analysis, and develop e-commerce solutions. For startups and established businesses, the dataset is ideal for market research, inventory management insights, and enhancing product discovery mechanisms.
AI and machine learning practitioners can use this dataset to build recommendation systems, predictive pricing algorithms, and sentiment analysis models. Its structured format ensures smooth integration with Python, R, and other data analytics tools, making it user-friendly for data visualization and predictive modeling.
Walmart Products Dataset is also an invaluable resource for retail analysts and e-commerce marketers aiming to optimize product positioning or analyze buying behaviors. Its broad coverage across categories like groceries, electronics, fashion, and home essentials provides a holistic view of Walmart’s inventory.
Key Features:
Whether you're developing an AI-driven product search engine or conducting a pricing strategy study, the Large Walmart Products Dataset equips you with the data you need to succeed in a competitive market.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CodeLLMExp is a comprehensive, large-scale, multi-language, and multi-vulnerability dataset created to advance research into the security of AI-generated code. It is specifically designed to train and evaluate machine learning models, such as Large Language Models (LLMs), on the joint tasks of Automated Vulnerability Localization (AVL) and Explainable AI (XAI).
The dataset was constructed through a rigorous pipeline that involved sourcing prompts from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios), employing seed augmentation to ensure coverage of under-represented Common Weakness Enumerations (CWEs), and using a chain of LLMs to generate vulnerable code snippets. This raw data was then automatically evaluated for quality by an "LLM-as-judge" (validated against human experts with a Spearman correlation of 0.8545) and enriched with structured annotations.
CodeLLMExp covers three of the most widely used programming languages: Python, Java, and C. It contains 10,400 high-quality examples across Python (44.3%), Java (29.6%), and C (26.1%). It focuses on 29 distinct CWEs, including the complete CWE Top 25 Most Dangerous Software Errors (2024). Each record in the dataset provides a vulnerable code snippet, the precise line number of the flaw, a structured explanation (root cause, impact, mitigation), and a fixed version of the code.
By providing richly annotated data for detection, classification, localization, and explanation, CodeLLMExp enables the development of more robust and transparent security analysis tools. It facilitates research into LLM adaptation strategies (e.g., prompting, fine-tuning, Retrieval-Augmented Generation), automated program repair, and the inherent security patterns of code produced by AI.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is for running the code from this site: https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8.
This is how to show a picture from the training set: display(Image('../input/cat-and-dog/training_set/training_set/dogs/dog.423.jpg'))
From the test set: display(Image('../input/cat-and-dog/test_set/test_set/cats/cat.4453.jpg'))
See an example of using this dataset. https://www.kaggle.com/tongpython/nattawut-5920421014-cat-vs-dog-dl
The Fish Detection AI project aims to improve the efficiency of fish monitoring around marine energy facilities to comply with regulatory requirements. Despite advancements in computer vision, there is limited focus on sonar images, identifying small fish with unlabeled data, and methods for underwater fish monitoring for marine energy. A YOLO (You Only Look Once) computer vision model was developed using the Eyesea dataset (optical) and sonar images from Alaska Fish and Games to identify fish in underwater environments. Supervised methods were used within YOLO to detect fish based on training with labeled fish data. These trained models were then applied to different unseen datasets, aiming to reduce the need for labeling datasets and training new models for various locations. Additionally, hyper-image analysis and various image preprocessing methods were explored to enhance fish detection. In this research we achieved enhanced YOLO performance compared to a published article (Xu, Matzner 2018) that used earlier YOLO versions for fish object identification. Specifically, we achieved a best mean Average Precision (mAP) of 0.68 on the Eyesea optical dataset using YOLO v8 (medium-sized model), surpassing the previous YOLO v3 benchmarks from that publication. We further demonstrated up to 0.65 mAP on unseen sonar domains by leveraging a hyper-image approach (stacking consecutive frames), showing promising cross-domain adaptability.

This submission of data includes:
- The best-performing trained YOLO model neural network weights, which can be applied to do object detection (PyTorch .pt files), found in the Yolo_models_downloaded zip file
- A documentation file explaining the upload and the goals of each of experiments 1-5 ("Yolo_Object_Detection_How_To_Document.docx")
- Coding files: five sub-folders of Python, shell, and YAML files used to run experiments 1-5, plus a separate folder for YOLO models, each in its own zip file named after the experiment
- Sample data structures (sample1 and sample2, each in its own zip file) showing how the raw data should be structured after running our provided code on the raw downloaded data
- A link to the article we were replicating (Xu, Matzner 2018)
- A link to the YOLO documentation site from the original creators of the model (Ultralytics)
- A link to the downloadable EyeSea dataset from PNNL (instructions on how to download and format the data to replicate these experiments are in the How To document)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example of application of the CPM RO-Crate profile, which integrates the Common Provenance Model (CPM), and the Process Run Crate profile.
As the CPM is a groundwork for the ISO 23494 Biotechnology — Provenance information model for biological material and data provenance standards series development, the resulting profile and the example is intended to be presented at one of the ISO TC275 WG5 regular meetings, and will become an input for the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard development.
Description of the AI pipeline
The goal of the AI pipeline whose execution is described in the dataset is to train an AI model to detect the presence of carcinoma cells in high-resolution human prostate images. The pipeline is implemented as a set of Python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. In particular, the AI pipeline consists of the following three general parts:
Image data preprocessing. Goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. As the model is not able to process the entire high resolution images, the preprocessing step of the pipeline splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. The background patches are filtered out and the remaining tissue patches are labeled according to the provided pathologists’ annotations.
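The patch-splitting and background-filtering step might look roughly like this. This is a minimal sketch on a plain array with assumed patch sizes and thresholds; the real pipeline operates on pyramidal WSIs and labels patches from pathologists' annotations:

```python
import numpy as np

def extract_patches(wsi, patch=64, stride=48, bg_threshold=220):
    """Split an image array into overlapping patches and drop near-white
    background patches, keeping only tissue patches."""
    patches = []
    h, w = wsi.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = wsi[y:y + patch, x:x + patch]
            if p.mean() < bg_threshold:   # keep tissue, skip bright background
                patches.append(p)
    return patches

# Synthetic grayscale "slide": white background with a darker tissue region
rng = np.random.default_rng(0)
wsi = np.full((256, 256), 255, dtype=np.uint8)
wsi[64:192, 64:192] = rng.integers(80, 160, (128, 128), dtype=np.uint8)
tissue_patches = extract_patches(wsi)
```

Overlapping strides trade storage for coverage: a cell cluster cut by one patch boundary is usually intact in a neighbouring patch.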
AI model training. Goal of this step is to train the AI model using the training dataset generated in the previous step of the pipeline. Result of this step is a trained AI model.
AI model evaluation. Goal of this step is to evaluate the trained model performance on a dataset which was not provided to the model during the training. Results of this step are statistics describing the AI model performance.
In addition to the above, execution of the steps results in generation of log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, timestamps, etc. As suggested by the CPM, the logfiles and additional metadata present on the filesystem are then used by a provenance generation step that transforms available information into the CPM compliant data structures, and serializes them into files.
Finally, all these artifacts are packed together in an RO-Crate.
For the purpose of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In real world execution, the input dataset would consist of terabytes of data. In this example, we have selected a representative image for each of the input dataset parts. As a result, the only difference between the real world application and this example would be that the resulting real world crate would contain more input files.
Description of the RO-Crate
Process Run Crate related aspects
The Process Run Crate profile can be used to pack artifacts of a computational workflow of which individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and that the pipeline execution is not managed centrally by a workflow engine, the process run crate can be applied.
Each of the computational steps is expressed within the crate’s ro-crate-metadata.json file as a pair of elements: 1) SW used to create files; 2) specific execution of that SW. In particular, we use the SoftwareSourceCode type to indicate the executed python scripts and the CreateAction type to indicate actual executions.
As a result, the crate contains the following seven “executables”:
Three Python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.
Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM compliant provenance files. The fourth one is a meta provenance generation script.
For each of the executables, their execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type. As a result, seven create-actions are present in the resulting crate.
Input dataset, intermediate results, configuration files and resulting provenance files are expressed according to the underlying RO Crate specification.
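As a sketch, one such script/execution pair could be expressed in ro-crate-metadata.json along the following lines. The identifiers and paths are hypothetical, and the fragment is built with Python here purely for illustration:

```python
import json

# Hypothetical fragment pairing a script (SoftwareSourceCode) with its
# execution (CreateAction), in the spirit of the Process Run Crate profile.
graph = [
    {
        "@id": "scripts/preprocess.py",
        "@type": "SoftwareSourceCode",
        "name": "WSI preprocessing script",
    },
    {
        "@id": "#run-preprocessing",
        "@type": "CreateAction",
        "instrument": {"@id": "scripts/preprocess.py"},   # the SW used
        "object": [{"@id": "data/input/"}],               # inputs consumed
        "result": [{"@id": "data/patches/"}],             # outputs produced
    },
]
metadata = json.dumps(
    {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph},
    indent=2,
)
```

The instrument/object/result links are what let a consumer of the crate reconstruct which script produced which files, without a central workflow engine.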
CPM RO-Crate related aspects
The main purpose of the CPM RO-Crate profile is to enable identification of the CPM compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific file types for such files: CPMProvenanceFile, and CPMMetaProvenanceFile.
In this case, the RO Crate contains three CPM Compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated as a result of the three provenance generation scripts that use available log files and additional information to generate the CPM compliant files. In terms of the CPM, the provenance generation scripts are implementing the concept of provenance finalization event. The three provenance generation scripts are assigned SoftwareSourceCode type, and have corresponding executions expressed in the crate using the CreateAction type.
Remarks
The resulting RO-Crate packs artifacts of an execution of the AI pipeline. The scripts that implement individual steps of the pipeline and provenance generation are not included in the crate directly. The implementation scripts are hosted on GitHub and only referenced from the crate's ro-crate-metadata.json file at their remote location.
The input image files included in this RO-Crate are coming from the Camelyon16 dataset.