100+ datasets found
  1. Trained AI model and associated files

    • figshare.manchester.ac.uk
    zip
    Updated May 30, 2023
    Cite
    Igor Larrosa (2023). Trained AI model and associated files [Dataset]. http://doi.org/10.48420/16965271.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    University of Manchester
    Authors
    Igor Larrosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains data associated with the publication: Organic Reaction Mechanism Classification with Machine Learning

    • Trained AI full model
    • Trained AI reduced models

    • python files to run predictions

    • python files to train model

    • template for inputting kinetics for predictions

    • data used in case studies

    Unpack the data file and follow the instructions in the publication's Supporting Information.

  2. MISATO - Machine learning dataset for structure-based drug discovery

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 25, 2023
    Cite
    Till Siebenmorgen; Filipe Menezes; Sabrina Benassou; Erinc Merdivan; Stefan Kesselheim; Marie Piraud; Fabian J. Theis; Michael Sattler; Grzegorz M. Popowicz (2023). MISATO - Machine learning dataset for structure-based drug discovery [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7711952
    Explore at:
    Dataset updated
    May 25, 2023
    Dataset provided by
    Forschungszentrum Jülich (http://www.fz-juelich.de/)
    Helmholtz Zentrum München (https://www.helmholtz-munich.de/)
    Helmholtz Munich, Computational Health Center, Institute of Computational Biology, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany.
    Helmholtz Munich, Molecular Targets and Therapeutics Center, Institute of Structural Biology, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany.
    Authors
    Till Siebenmorgen; Filipe Menezes; Sabrina Benassou; Erinc Merdivan; Stefan Kesselheim; Marie Piraud; Fabian J. Theis; Michael Sattler; Grzegorz M. Popowicz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine the protonation states of proteins and small-molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.

  3. OpenAI HumanEval Code Gen

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Cite
    The Devastator (2023). OpenAI HumanEval Code Gen [Dataset]. https://www.kaggle.com/datasets/thedevastator/openai-humaneval-code-gen
    Explore at:
    zip (45602 bytes); available download formats
    Dataset updated
    Nov 27, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenAI HumanEval Code Gen

    Handcrafted Python Programming Problems for Accurate Model Evaluation

    By Huggingface Hub [source]

    About this dataset

    This dataset released by OpenAI, HumanEval, offers a unique opportunity for developers and researchers to accurately evaluate their code generation models in a safe environment. It includes 164 handcrafted programming problems written by engineers and researchers from OpenAI, specifically designed to test the correctness and scalability of code generation models. Written in Python, these programming problems include docstrings and comments full of natural English text, which can be difficult for computers to comprehend. Each programming problem also includes a function signature, a body, and several unit tests. Placed under the MIT License, the HumanEval dataset is ideal for any practitioner looking to judge the efficacy of their machine-generated code with trusted results.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    The first step is to explore the data included in the set by viewing its columns. This guide focuses on four key columns: prompt, canonical_solution, test and entry_point.
    - The prompt column contains natural English text describing the programming problem.
    - The canonical_solution column holds the correct solution to each programming problem, as determined by the OpenAI researchers and engineers who hand-crafted the dataset.
    - The test column contains unit tests designed to check for correctness when debugging or evaluating code generated by neural networks or other automated tools.
    - The entry_point column contains code for an entry point into each program, which can be used as a starting point when solving any programming problem from this dataset.

    With this information we can begin using the dataset in our own projects, from building new case studies for specific AI algorithms to developing automated programs that generate source code based on datasets like HumanEval.
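    As a quick sketch of that first step (assuming the test.csv layout listed in the Columns section below, downloaded locally):

    ```python
    import pandas as pd

    # Columns per the table below: prompt, canonical_solution, test, entry_point.
    df = pd.read_csv("test.csv")
    print(df.columns.tolist())
    print(len(df), "problems")  # the description above says 164 handcrafted problems

    # Inspect one problem: the natural-language prompt and the reference solution.
    row = df.iloc[0]
    print(row["prompt"])
    print(row["canonical_solution"])
    print(row["entry_point"])
    ```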

    Research Ideas

    • Training code generation models in a limited and supervised environment.
    • Benchmarking the performance of existing code generation models, as HumanEval consists of both the canonical solution for each problem and unit tests that can be used to evaluate model accuracy.
    • Using Natural Language Processing (NLP) algorithms on the docstrings and comments within HumanEval to develop better natural language understanding for programming contexts

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: test.csv

    | Column name | Description |
    |:---|:---|
    | prompt | A description of the programming problem. (String) |
    | canonical_solution | The expected solution to the programming problem. (String) |
    | test | Unit tests to verify the accuracy of the solution. (String) |
    | entry_point | The entry point for running the unit tests. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  4. ml_uncertainty: A Python module for estimating uncertainty in predictions of...

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Sep 30, 2025
    Cite
    National Institute of Standards and Technology (2025). ml_uncertainty: A Python module for estimating uncertainty in predictions of machine learning models [Dataset]. https://catalog.data.gov/dataset/ml-uncertainty-a-python-module-for-estimating-uncertainty-in-predictions-of-machine-learni
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This software is a Python module for estimating uncertainty in the predictions of machine learning models. It calculates uncertainties using bootstrapping and residual bootstrapping, and it is intended to interface with scikit-learn, although any Python package that exposes a similar interface should work.
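    The module's own API is not documented in this listing, so the snippet below is only a generic sketch of the residual-bootstrapping idea it describes, built directly on scikit-learn rather than on ml_uncertainty itself:

    ```python
    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LinearRegression

    def residual_bootstrap_intervals(model, X, y, X_new, n_boot=500, alpha=0.05, seed=0):
        """Rough prediction intervals via residual bootstrapping (generic sketch)."""
        rng = np.random.default_rng(seed)
        fitted = model.fit(X, y)
        y_hat = fitted.predict(X)
        residuals = y - y_hat
        preds = []
        for _ in range(n_boot):
            # Resample residuals, build a pseudo-response, refit, and predict.
            y_boot = y_hat + rng.choice(residuals, size=len(y), replace=True)
            preds.append(clone(model).fit(X, y_boot).predict(X_new))
        preds = np.asarray(preds)
        lower = np.percentile(preds, 100 * alpha / 2, axis=0)
        upper = np.percentile(preds, 100 * (1 - alpha / 2), axis=0)
        return lower, upper

    # Toy example on synthetic data.
    rng = np.random.default_rng(1)
    X = rng.random((200, 2))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)
    lo, hi = residual_bootstrap_intervals(LinearRegression(), X, y, X[:3])
    print(np.round(lo, 2), np.round(hi, 2))
    ```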

  5. Fake vs Real News Dataset for NLP

    • kaggle.com
    Updated Jun 11, 2025
    Cite
    DEEPAK POLISETTI (2025). Fake vs Real News Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/deepakpolisetti/fake-vs-real-news-dataset-for-nlp
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DEEPAK POLISETTI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📰 Fake News Detection Dataset 🔥

    📌 Overview

    This dataset is designed for Fake News Classification using NLP & Machine Learning. It contains labeled fake and real news articles, sourced from credible datasets. It is optimized for text analysis, deep learning models, and AI research.

    🏗️ Dataset Structure

    • Fake.csv → Contains fake news articles 🏴
    • True.csv → Contains real news articles ✅
    • fake_news_data.csv → Merged dataset for AI models ⚡

    🚀 Usage

    1️⃣ Load the dataset in Python using Pandas

    ```python
    import pandas as pd

    df = pd.read_csv("fake_news_data.csv")
    ```
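    A minimal baseline classifier could then look like the sketch below (continuing from the df loaded above). The text and label column names are assumptions, since the merged file's schema is not listed here, so check df.columns and adjust:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Assumed column names -- inspect df.columns and rename if necessary.
    texts, labels = df["text"], df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)

    vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
    ```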

  6. Ai Python Project Dataset

    • universe.roboflow.com
    zip
    Updated Jan 18, 2024
    Cite
    AI project (2024). Ai Python Project Dataset [Dataset]. https://universe.roboflow.com/ai-project-4hlrl/ai-python-project
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 18, 2024
    Dataset authored and provided by
    AI project
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Vehicles FTNK Bounding Boxes
    Description

    AI Python Project

    ## Overview
    
    AI Python Project is a dataset for object detection tasks - it contains Vehicles FTNK annotations for 2,999 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
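    For a programmatic download, the Roboflow Python client can pull the project directly. This is a sketch: the workspace and project slugs come from the dataset URL above, while the version number and export format are assumptions to verify on the project page.

    ```python
    from roboflow import Roboflow

    # API key from your Roboflow account settings.
    rf = Roboflow(api_key="YOUR_API_KEY")

    # Workspace/project slugs from the dataset URL; version and format are assumed.
    project = rf.workspace("ai-project-4hlrl").project("ai-python-project")
    dataset = project.version(1).download("yolov8")
    print(dataset.location)  # local folder containing the exported dataset
    ```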
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  7. Machine learning model that estimates total monthly and annual per capita...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 8, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-that-estimates-total-monthly-and-annual-per-capita-public-supply-wa
    Explore at:
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public supply water use for the period 2000-2020. This data release contains model input feature datasets, python codes used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day

    Note: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.

    • STAT_PS_HUC12_Tot_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply total water use from 2000-2020
    • STAT_PS_HUC12_GW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply groundwater use for 2000-2020
    • STAT_PS_HUC12_SW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply surface water use for 2000-2020
    • public_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the public supply water use machine learning model
    • version_history_MLmodel.txt - a txt file describing changes in this version
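    For example, the HUC12 estimate files can be inspected with pandas; this is a minimal sketch, and the exact column layout is not documented in this listing, so check the header after loading:

    ```python
    import pandas as pd

    # Monthly total public-supply water use estimates by HUC12
    # (million gallons per day, per the file description above).
    totals = pd.read_csv("PS_HUC12_Tot_2000_2020.csv")

    print(totals.columns.tolist())  # column layout isn't documented here, so inspect it
    print(totals.head())
    ```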

  8. Python Script for Cleaning Alum Dataset

    • search.dataone.org
    • hydroshare.org
    Updated Oct 18, 2025
    Cite
    saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e
    Explore at:
    Dataset updated
    Oct 18, 2025
    Dataset provided by
    Hydroshare
    Authors
    saikumar payyavula; Jeff Sadler
    Description

    This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
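    The script itself is not reproduced in this listing, but the steps it describes (imputing missing values, removing outliers, merging water quality and weather data) could be sketched roughly as follows; file names, column names, and thresholds are illustrative assumptions only:

    ```python
    import pandas as pd

    # Illustrative file and column names -- the actual script's inputs may differ.
    quality = pd.read_csv("water_quality.csv", parse_dates=["date"])
    weather = pd.read_csv("weather.csv", parse_dates=["date"])

    # Handle missing values: interpolate short gaps, then drop what remains.
    quality = quality.set_index("date").interpolate(limit=3).dropna().reset_index()

    # Remove outliers in the alum dose column with a simple IQR rule.
    q1, q3 = quality["alum_dose"].quantile([0.25, 0.75])
    iqr = q3 - q1
    quality = quality[quality["alum_dose"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Merge water quality and weather on date to build the model-ready table.
    merged = quality.merge(weather, on="date", how="inner")
    merged.to_csv("alum_model_ready.csv", index=False)
    ```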

  9. Data from: Python Codebase and Jupyter Notebooks - Applications of Machine...

    • datasets.ai
    • gdr.openei.org
    • +3more
    57
    Updated Oct 27, 2022
    Cite
    Department of Energy (2022). Python Codebase and Jupyter Notebooks - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. https://datasets.ai/datasets/python-codebase-and-jupyter-notebooks-applications-of-machine-learning-techniques-to-geoth
    Explore at:
    57; available download formats
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    Department of Energy
    Area covered
    Great Basin
    Description

    Git archive containing Python modules and resources used to generate machine-learning models used in the "Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada" project. This software is licensed as free to use, modify, and distribute with attribution. Full license details are included within the archive. See "documentation.zip" for setup instructions and file trees annotated with module descriptions.

  10. SynthFluencers: AI-Generated Influencers

    • kaggle.com
    zip
    Updated Jan 21, 2024
    Cite
    AnthonyTherrien (2024). SynthFluencers: AI-Generated Influencers [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/synthetic-influencer-backstory/code
    Explore at:
    zip (21280993 bytes); available download formats
    Dataset updated
    Jan 21, 2024
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Introduction

    Background

    Exploring the creation of a unique dataset of synthetic influencer profiles using AI technologies, including OpenAI's GPT-3.5.

    Methodology

    Data Generation Process

    1. Influencer Profile Generation: Profiles are generated with demographic details like age, gender, etc.
    2. Location Allocation: Randomly assigning U.S. states or Canadian provinces based on population distribution.
    3. GPT-3.5 Integration: Generating detailed backstories for each influencer profile using OpenAI's GPT-3.5-turbo-instruct model.

    Dataset Overview

    Structure

    • The dataset contains profiles with attributes like Name, Age, Sex, Lifestyle, Country of Origin, State or Province, Education Level, MBTI Personality and Backstory.

    Applications and Use Cases

    Potential Uses

    • Market Research: Understanding influencer dynamics in different niches.
    • AI Training: Enhancing the realism and diversity of AI-generated personas.
    • Social Media Strategy: Informing content creation and marketing strategies.

    Analysis and Insights

    Statistical Breakdown

    • Distribution of influencers across various lifestyles and locations.
    • Correlation between attractiveness ratings and lifestyle niches.

    Key Insights

    • Predominant trends in influencer personas based on demographics and location.

    Challenges and Limitations

    Ethical Considerations

    • The impact of synthetic influencers on real-world perceptions and digital marketing.

    Limitations of AI

    • Challenges in capturing the full depth of human characteristics and experiences.

    Conclusion

    Summary

    • This dataset provides a unique lens into the world of synthetic influencers, blending AI creativity with insights into social media dynamics.

  11. Medium articles dataset

    • crawlfeeds.com
    • kaggle.com
    json, zip
    Updated Aug 26, 2025
    Cite
    Crawl Feeds (2025). Medium articles dataset [Dataset]. https://crawlfeeds.com/datasets/medium-articles-dataset
    Explore at:
    json, zipAvailable download formats
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Buy Medium Articles Dataset – 500K+ Published Articles in JSON Format

    Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.

    Request the full dataset here: Medium datasets

    Check out the sample dataset in CSV.

    Use Cases:

    • Training language models (LLMs)

    • Analyzing content trends and engagement

    • Sentiment and text classification

    • SEO research and author profiling

    • Academic or commercial research

    Why Choose This Dataset?

    • High-volume, cleanly structured JSON

    • Ideal for developers, researchers, and data scientists

    • Easy integration with Python, R, SQL, and other data pipelines

    • Affordable and ready-to-use

  12. DustNet - structured data and Python code to reproduce the model,...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Jul 7, 2024
    Cite
    Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Kingston University
    University of Exeter
    Authors
    Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and Python code used for AOD prediction with the DustNet model - a machine learning/AI-based forecasting approach.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables*, ready to reproduce the DustNet model results or for similar forecasting with machine learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised data split into training/validation/testing sets is also provided, along with Python code for two additional ML-based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.
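    A rough sketch of working with the provided files; the array and model file names below are placeholders, since the real names are documented in ‘DustNet_model_code.ipynb’:

    ```python
    import numpy as np
    import tensorflow as tf

    # Placeholder file names -- see DustNet_model_code.ipynb for the actual ones.
    aod = np.load("modis_aod_2003_2022.npy")       # daily MODIS AOD grids
    era5 = np.load("era5_selected_variables.npy")  # matching ERA5 predictor fields
    print(aod.shape, era5.shape)

    # Load one of the archived pre-trained models (saved as TensorFlow files).
    model = tf.keras.models.load_model("DustNet_pretrained")
    model.summary()
    ```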

    Model output data and code

    This dataset was constructed by running ‘DustNet_model_code.ipynb’ (see above). It consists of 1095 days of forecast AOD data (2020-2022) from CAMS, the DustNet model, a naïve prediction (persistence) and gridded climatology. The ground-truth raw AOD data from MODIS is provided for comparison and statistical analysis of the predictions. It is intended for quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.

  13. bilingual-coding-qa-dataset

    • huggingface.co
    Updated Jan 15, 2025
    Cite
    Convai Innovations (2025). bilingual-coding-qa-dataset [Dataset]. https://huggingface.co/datasets/convaiinnovations/bilingual-coding-qa-dataset
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    Convai Innovations
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🌐 Bilingual Coding Q&A Dataset

      📊 Dataset Description
    

    A comprehensive bilingual (English-Hindi) dataset containing 25,151 high-quality question-answer pairs focused on programming concepts, particularly Python, machine learning, and AI. This dataset was used to fine-tune coding assistant models and contains over 7 million tokens of training data.

      Dataset Statistics
    

    | Metric | Value |
    |:---|:---|
    | Total Examples | 25,151 Q&A pairs |
    | Total Lines | 250,320+ |

    … See the full description on the dataset page: https://huggingface.co/datasets/convaiinnovations/bilingual-coding-qa-dataset.
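    The dataset can be pulled straight from the Hugging Face Hub with the datasets library; split and field names are whatever the dataset card defines, so inspect the first record:

    ```python
    from datasets import load_dataset

    ds = load_dataset("convaiinnovations/bilingual-coding-qa-dataset")
    print(ds)  # shows the available splits and their sizes

    # Look at one example from the first split to see the field names.
    first_split = next(iter(ds.values()))
    print(first_split[0])
    ```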

  14. CoT_Reasoning_Python_General_Query

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    Matt R. Wesney (2025). CoT_Reasoning_Python_General_Query [Dataset]. https://huggingface.co/datasets/moremilk/CoT_Reasoning_Python_General_Query
    Explore at:
    Dataset updated
    Apr 16, 2025
    Authors
    Matt R. Wesney
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CoT Reasoning Python General Query: Enhancing Python Understanding through Chain of Thought Reasoning

    Description: Explore Python programming and general computing queries with the "CoT_Reasoning_Python_General_Query" dataset. This open-source resource (MIT licensed) provides a carefully curated collection of question-and-answer pairs designed to train AI models in understanding and reasoning about a… See the full description on the dataset page: https://huggingface.co/datasets/moremilk/CoT_Reasoning_Python_General_Query.

  15. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into the required format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except:
      wandb.login(anonymous='must')
      print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
         'Use the label name WANDB. Get your W&B access token from: https://wandb.ai/authorize')

    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine

  16. Walmart Products Dataset

    • crawlfeeds.com
    csv, zip
    Updated Dec 17, 2024
    Cite
    Crawl Feeds (2024). Walmart Products Dataset [Dataset]. https://crawlfeeds.com/datasets/walmart-products-dataset-sept-2022
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Dec 17, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Large Walmart Products Dataset is an essential resource for businesses, analysts, and developers seeking detailed insights into Walmart’s vast product catalog. This dataset includes extensive information on Walmart products, such as product names, descriptions, prices, categories, brand information, ratings, and customer reviews.

    With Walmart being one of the largest retailers globally, this dataset provides a unique opportunity to study consumer trends, perform competitive pricing analysis, and develop e-commerce solutions. For startups and established businesses, the dataset is ideal for market research, inventory management insights, and enhancing product discovery mechanisms.

    AI and machine learning practitioners can use this dataset to build recommendation systems, predictive pricing algorithms, and sentiment analysis models. Its structured format ensures smooth integration with Python, R, and other data analytics tools, making it user-friendly for data visualization and predictive modeling.

    Walmart Products Dataset is also an invaluable resource for retail analysts and e-commerce marketers aiming to optimize product positioning or analyze buying behaviors. Its broad coverage across categories like groceries, electronics, fashion, and home essentials provides a holistic view of Walmart’s inventory.

    Key Features:

    • Extensive Product Information: Details on pricing, discounts, availability, and ratings.
    • Diverse Applications: Suitable for AI models, trend analysis, and market research.
    • Retail Insights: Explore consumer preferences and popular product trends.

    Whether you're developing an AI-driven product search engine or conducting a pricing strategy study, the Large Walmart Products Dataset equips you with the data you need to succeed in a competitive market.

  17. CodeLLMExp: An Annotated Dataset for Automated Vulnerability Localization...

    • data.mendeley.com
    Updated Nov 7, 2025
    Cite
    Omer FOTSO (2025). CodeLLMExp: An Annotated Dataset for Automated Vulnerability Localization and Explanation in AI-Generated Code [Dataset]. http://doi.org/10.17632/wxmnyrp668.1
    Explore at:
    Dataset updated
    Nov 7, 2025
    Authors
    Omer FOTSO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CodeLLMExp is a comprehensive, large-scale, multi-language, and multi-vulnerability dataset created to advance research into the security of AI-generated code. It is specifically designed to train and evaluate machine learning models, such as Large Language Models (LLMs), on the joint tasks of Automated Vulnerability Localization (AVL) and Explainable AI (XAI).

    The dataset was constructed through a rigorous pipeline that involved sourcing prompts from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios), employing seed augmentation to ensure coverage of under-represented Common Weakness Enumerations (CWEs), and using a chain of LLMs to generate vulnerable code snippets. This raw data was then automatically evaluated for quality by an "LLM-as-judge" (validated against human experts with a Spearman correlation of 0.8545) and enriched with structured annotations.

    CodeLLMExp covers three of the most widely used programming languages: Python, Java and C. It contains 10,400 high-quality examples across Python (44.3%), Java (29.6%), and C (26.1%). It focuses on 29 distinct CWEs, including the complete CWE Top 25 Most Dangerous Software Errors (2024). Each record in the dataset provides a vulnerable code snippet, the precise line number of the flaw, a structured explanation (root cause, impact, mitigation), and a fixed version of the code.

    By providing richly annotated data for detection, classification, localization, and explanation, CodeLLMExp enables the development of more robust and transparent security analysis tools. It facilitates research into LLM adaptation strategies (e.g., prompting, fine-tuning, Retrieval-Augmented Generation), automated program repair, and the inherent security patterns of code produced by AI.

  18. Cat and Dog

    • kaggle.com
    zip
    Updated Apr 26, 2018
    Cite
    SchubertSlySchubert (2018). Cat and Dog [Dataset]. https://www.kaggle.com/tongpython/cat-and-dog
    Explore at:
    zip (228487605 bytes); available download formats
    Dataset updated
    Apr 26, 2018
    Authors
    SchubertSlySchubert
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is for running the code from this site: https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8.

    This is how to show a picture from the training set: display(Image('../input/cat-and-dog/training_set/training_set/dogs/dog.423.jpg'))

    From the test set: display(Image('../input/cat-and-dog/test_set/test_set/cats/cat.4453.jpg'))
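    A self-contained version of the two snippets above (the display and Image helpers come from IPython.display):

    ```python
    from IPython.display import Image, display

    # Show one training image and one test image using the dataset paths above.
    display(Image('../input/cat-and-dog/training_set/training_set/dogs/dog.423.jpg'))
    display(Image('../input/cat-and-dog/test_set/test_set/cats/cat.4453.jpg'))
    ```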

    See an example of using this dataset. https://www.kaggle.com/tongpython/nattawut-5920421014-cat-vs-dog-dl

  19. Fish Detection AI, Optic and Sonar-trained Object Detection Models

    • catalog.data.gov
    • data.openei.org
    • +1more
    Updated May 22, 2025
    + more versions
    Cite
    Water Power Technology Office (2025). Fish Detection AI, Optic and Sonar-trained Object Detection Models [Dataset]. https://catalog.data.gov/dataset/fish-detection-ai-optic-and-sonar-trained-object-detection-models
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    Water Power Technology Office
    Description

    The Fish Detection AI project aims to improve the efficiency of fish monitoring around marine energy facilities to comply with regulatory requirements. Despite advancements in computer vision, there is limited focus on sonar images, identifying small fish with unlabeled data, and methods for underwater fish monitoring for marine energy. A YOLO (You Only Look Once) computer vision model was developed using the Eyesea dataset (optical) and sonar images from Alaska Fish and Game to identify fish in underwater environments. Supervised methods were used within YOLO to detect fish based on training with labeled fish data. These trained models were then applied to different unseen datasets, aiming to reduce the need for labeling datasets and training new models for various locations. Additionally, hyper-image analysis and various image preprocessing methods were explored to enhance fish detection.

    In this research we achieved enhanced YOLO performance compared to a published article (Xu, Matzner 2018) that used earlier YOLO versions for fish object identification. Specifically, we achieved a best mean Average Precision (mAP) of 0.68 on the Eyesea optical dataset using YOLO v8 (medium-sized model), surpassing the YOLO v3 benchmarks from that publication. We further demonstrated up to 0.65 mAP on unseen sonar domains by leveraging a hyper-image approach (stacking consecutive frames), showing promising cross-domain adaptability.

    This submission of data includes:

    • The best-performing trained YOLO model neural network weights, which can be applied to do object detection (PyTorch files, .pt). These are found in the Yolo_models_downloaded zip file.
    • A documentation file explaining the upload and the goals of each of experiments 1-5, as detailed in the Word document "Yolo_Object_Detection_How_To_Document.docx".
    • Coding files: 5 sub-folders of Python, shell, and YAML files that were used to run experiments 1-5, as well as a separate folder for YOLO models. Each of these is found in its own zip file, named after each experiment.
    • Sample data structures (sample1 and sample2, each with their own zip file) to show how the raw data should be structured after running our provided code on the raw downloaded data.
    • A link to the article that we were replicating (Xu, Matzner 2018).
    • A link to the YOLO documentation site from the original creators of that model (Ultralytics).
    • A link to the downloadable EyeSea dataset from PNNL (instructions on how to download and format the data in the right way to replicate these experiments are found in the How To Word document).
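    The released .pt weights can be loaded for inference with the Ultralytics YOLO package; this is a minimal sketch, and the weight and image file names are placeholders for files in the Yolo_models_downloaded archive and your own imagery:

    ```python
    from ultralytics import YOLO

    # Placeholder path: substitute one of the .pt files from the
    # Yolo_models_downloaded archive described above.
    model = YOLO("best_eyesea_yolov8m.pt")

    # Run detection on an underwater frame and inspect the predicted boxes.
    results = model.predict("example_frame.png", conf=0.25)
    for r in results:
        print(r.boxes.xyxy, r.boxes.conf)
    ```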

  20. Data from: Packing provenance using CPM RO-Crate profile

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 29, 2023
    Cite
    Wittner, Rudolf; Gallo, Matej; Leo, Simone; Soiland-Reyes, Stian (2023). Packing provenance using CPM RO-Crate profile [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7676923
    Explore at:
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    The University of Manchester
    CRS4
    Masaryk University
    Authors
    Wittner, Rudolf; Gallo, Matej; Leo, Simone; Soiland-Reyes, Stian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example of application of the CPM RO-Crate profile, which integrates the Common Provenance Model (CPM), and the Process Run Crate profile.

    As the CPM is a groundwork for the ISO 23494 Biotechnology — Provenance information model for biological material and data provenance standards series development, the resulting profile and the example are intended to be presented at one of the ISO TC275 WG5 regular meetings, and will become an input for the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard development.

    Description of the AI pipeline

    The goal of the AI pipeline whose execution is described in the dataset is to train an AI model to detect the presence of carcinoma cells in high resolution human prostate images. The pipeline is implemented as a set of python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. In particular, the AI pipeline consists of the following three general parts:

    Image data preprocessing. Goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. As the model is not able to process the entire high resolution images, the preprocessing step of the pipeline splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. The background patches are filtered out and the remaining tissue patches are labeled according to the provided pathologists’ annotations.

    AI model training. Goal of this step is to train the AI model using the training dataset generated in the previous step of the pipeline. Result of this step is a trained AI model.

    AI model evaluation. Goal of this step is to evaluate the trained model performance on a dataset which was not provided to the model during the training. Results of this step are statistics describing the AI model performance.

    In addition to the above, execution of the steps results in generation of log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, timestamps, etc. As suggested by the CPM, the logfiles and additional metadata present on the filesystem are then used by a provenance generation step that transforms available information into the CPM compliant data structures, and serializes them into files.

    Finally, all these artifacts are packed together in an RO-Crate.

    For the purpose of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In real world execution, the input dataset would consist of terabytes of data. In this example, we have selected a representative image for each of the input dataset parts. As a result, the only difference between the real world application and this example would be that the resulting real world crate would contain more input files.

    Description of the RO-Crate

    Process Run Crate related aspects

    The Process Run Crate profile can be used to pack artifacts of a computational workflow of which individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and that the pipeline execution is not managed centrally by a workflow engine, the process run crate can be applied.

    Each of the computational steps is expressed within the crate’s ro-crate-metadata.json file as a pair of elements: 1) the software used to create files; 2) a specific execution of that software. In particular, we use the SoftwareSourceCode type to indicate the executed Python scripts and the CreateAction type to indicate actual executions.

    As a result, the crate contains the following seven “executables”:

    Three python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.

    Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM compliant provenance files. The fourth one is a meta provenance generation script.

    For each of the executables, their execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type. As a result, seven create-actions are present in the resulting crate.
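    As an illustration only (not an excerpt from the actual crate), such a script/execution pair can be sketched as two linked JSON-LD entities, with the CreateAction's instrument pointing at the SoftwareSourceCode entity; the IDs, names, paths and timestamp below are hypothetical:

    ```python
    import json

    # Hypothetical sketch of a SoftwareSourceCode / CreateAction pair in the spirit
    # of the Process Run Crate profile; all identifiers and values are made up.
    preprocessing_script = {
        "@id": "https://example.org/scripts/preprocess_wsi.py",
        "@type": "SoftwareSourceCode",
        "name": "WSI preprocessing script",
        "programmingLanguage": "Python",
    }

    preprocessing_run = {
        "@id": "#preprocessing-run-1",
        "@type": "CreateAction",
        "name": "Execution of the WSI preprocessing step",
        "instrument": {"@id": preprocessing_script["@id"]},  # the software that was run
        "object": [{"@id": "input_wsi/"}],                   # inputs consumed
        "result": [{"@id": "patches/train/"}, {"@id": "patches/test/"}],  # outputs produced
        "endTime": "2023-02-01T12:00:00Z",
    }

    print(json.dumps([preprocessing_script, preprocessing_run], indent=2))
    ```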

    Input dataset, intermediate results, configuration files and resulting provenance files are expressed according to the underlying RO Crate specification.

    CPM RO-Crate related aspects

    The main purpose of the CPM RO-Crate profile is to enable identification of the CPM compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific file types for such files: CPMProvenanceFile, and CPMMetaProvenanceFile.

    In this case, the RO Crate contains three CPM Compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated as a result of the three provenance generation scripts that use available log files and additional information to generate the CPM compliant files. In terms of the CPM, the provenance generation scripts are implementing the concept of provenance finalization event. The three provenance generation scripts are assigned SoftwareSourceCode type, and have corresponding executions expressed in the crate using the CreateAction type.

    Remarks

    The resulting RO-Crate packs artifacts of an execution of the AI pipeline. The scripts that implement the individual pipeline steps and the provenance generation are not included in the crate directly: they are hosted on GitHub and only referenced, at their remote location, from the crate’s ro-crate-metadata.json file.

    The input image files included in this RO-Crate are coming from the Camelyon16 dataset.
