100+ datasets found
  1. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display

    # Markdown(filename=...) renders the file's contents; passing the path as a
    # plain string would only display the path itself
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and had not been split into train/validation/test sets. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading the YOLOv7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    


    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except Exception:
      wandb.login(anonymous='must')
      print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
            'Use the label name WANDB. Get your W&B access token from here: https://wandb.ai/authorize')


    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training a custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine batch size
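
    A minimal sketch of what such a training cell could look like (the epoch count, batch size, and run name are illustrative assumptions, not values taken from this notebook):

      !python train.py --img-size 640 --batch-size 16 --epochs 55 --data {dataset.location}/data.yaml --weights 'yolov7.pt' --device 0 --name yolov7-car-person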

  2. SVG Code Generation Sample Training Data

    • kaggle.com
    Updated May 3, 2025
    Cite
    Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vinothkumar Sekar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

    The dataset is generated in two steps using the GPT-4o model.

    • In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

     
    prompt=f""" I am participating in an SVG code generation competition.
      
       The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
      
       - Descriptions are generic and do not contain brand names, trademarks, or personal names.
       - No descriptions include people, even in generic terms.
       - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
       - Categories cover various domains, with some overlap between public and private test sets.
      
       To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
      
       Requirements:
       - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
       - Ensure **diversity and creativity** across topics.
       - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
       - Avoid duplication or overly similar phrasing.
      
       Example topics:
                     a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
      
       Please return the 100 topics in csv format.
       """
     
    • In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate the SVG.
     
      prompt = f"""
          Generate SVG code to visually represent the following text description, while respecting the given constraints.
          
          Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
          Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
          
    
          Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
          Focus on a clear and concise representation of the input description within the given limitations. 
          Always give the complete SVG code with nothing omitted. Never use an ellipsis.
    
          The code is scored based on similarity to the description, visual question answering, and aesthetic components.
          Please generate a detailed svg code accordingly.
    
          input description: {text}
          """
     

    The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
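
    A minimal sketch of that scoring-and-filtering step (the SigLIP checkpoint name is an assumption based on the public Hugging Face releases, rendering the SVG to a PIL image, e.g. via cairosvg, is assumed to happen beforehand, and the competition's own sanitizer is not reproduced here):

      import torch
      from transformers import AutoModel, AutoProcessor

      # a public SigLIP checkpoint; the write-up does not say which variant was used
      processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
      model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

      def siglip_score(image, description):
          # image: a PIL.Image rendered from the sanitized SVG
          inputs = processor(text=[description], images=image,
                             padding="max_length", return_tensors="pt")
          with torch.no_grad():
              logits = model(**inputs).logits_per_image
          # SigLIP uses a sigmoid (not softmax) objective, so each pair scores independently
          return torch.sigmoid(logits)[0, 0].item()

      # candidates: (svg_code, rendered_image, description) triples produced earlier
      kept = [svg for svg, img, desc in candidates if siglip_score(img, desc) > 0.5]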

  3. Trojan Detection Software Challenge - image-classification-feb2021-train

    • catalog.data.gov
    Updated Mar 14, 2025
    Cite
    National Institute of Standards and Technology (2025). Trojan Detection Software Challenge - image-classification-feb2021-train [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-round-4-train-dataset-170c9
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Round 4 Train Dataset

    The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1008 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  4. Coast Train--Labeled imagery for training and evaluation of data-driven...

    • data.usgs.gov
    • catalog.data.gov
    Updated Aug 31, 2024
    Cite
    Phillipe Wernette; Daniel Buscombe; Jaycee Favela; Sharon Fitzpatrick; Evan Goldstein; Nicholas Enwright; Erin Dunand (2024). Coast Train--Labeled imagery for training and evaluation of data-driven models for image segmentation [Dataset]. http://doi.org/10.5066/P91NP87I
    Explore at:
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Phillipe Wernette; Daniel Buscombe; Jaycee Favela; Sharon Fitzpatrick; Evan Goldstein; Nicholas Enwright; Erin Dunand
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 2008 - Dec 31, 2020
    Description

    Coast Train is a library of images of coastal environments, annotations, and corresponding thematic label masks (or ‘label images’) collated for the purposes of training and evaluating machine learning (ML), deep learning, and other models for image segmentation. It includes image sets from both geospatial satellite, aerial, and UAV imagery and orthomosaics, as well as non-geospatial oblique and nadir imagery. Images include a diverse range of coastal environments from the U.S. Pacific, Gulf of Mexico, Atlantic, and Great Lakes coastlines, consisting of time-series of high-resolution (≤1m) orthomosaics and satellite image tiles (10–30m). Each image, image annotation, and labelled image is available as a single NPZ zipped file. NPZ files follow the following naming convention: {datasource}_{numberofclasses}_{threedigitdatasetversion}.zip, where {datasource} is the source of the original images (for example, NAIP, Landsat 8, Sentinel 2), {numberofclasses} is the number of classes us ...
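
    A minimal sketch of inspecting one of these files with NumPy (the file name below is hypothetical, following the naming convention above, and assumes the .zip has been unpacked to an .npz):

      import numpy as np

      # hypothetical name following {datasource}_{numberofclasses}_{threedigitdatasetversion}
      with np.load("naip_6_000.npz", allow_pickle=True) as archive:
          for key in archive.files:
              # print each stored array's name and shape (or type, for non-arrays)
              print(key, getattr(archive[key], "shape", type(archive[key])))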

  5. Dataset, splits, models, and scripts for the QM descriptors prediction

    • zenodo.org
    • explore.openaire.eu
    application/gzip
    Updated Apr 4, 2024
    Cite
    Shih-Cheng Li; Haoyang Wu; Angiras Menon; Kevin A. Spiekermann; Yi-Pei Li; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shih-Cheng Li; Haoyang Wu; Angiras Menon; Kevin A. Spiekermann; Yi-Pei Li; William H. Green
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

    Below are descriptions of the available scripts:

    1. atom_bond_descriptors.sh: Trains atom/bond targets.
    2. atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.
    3. dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
    4. dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.
    5. energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
    6. energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.
    7. get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
    8. csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.

    Below is the procedure for running the ml-QM-GNN on your own dataset (sketched as commands after this list):

    1. Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
    2. Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
    3. Utilize csv2pkl.py to convert the data from predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).
    4. Run Chemprop to train your models using the additional predicted features supported here.
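
    A sketch of that procedure as a single Python driver (whether the individual scripts take extra arguments is not documented here; consult each script before running):

      import subprocess

      steps = [
          ["python", "get_constraints.py"],                  # 1. constraints file for prediction
          ["bash", "atom_bond_descriptors_predict.sh"],      # 2a. atom/bond descriptors
          ["bash", "dipole_quadrupole_moments_predict.sh"],  # 2b. dipole/quadrupole moments
          ["bash", "energy_gaps_IP_EA_predict.sh"],          # 2c. energy gaps, IP, EA
          ["python", "csv2pkl.py"],                          # 3. RBF-expanded .pkl feature files
      ]
      for cmd in steps:
          subprocess.run(cmd, check=True)  # stop if any stage fails
      # 4. then train with Chemprop, passing the generated .pkl files as extra features
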
  6. AI Training Dataset Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 6, 2025
    Cite
    Archive Market Research (2025). AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-training-dataset-market-5881
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    global
    Variables measured
    Market Size
    Description

    The AI Training Dataset Market size was valued at USD 2124.0 million in 2023 and is projected to reach USD 8593.38 million by 2032, exhibiting a CAGR of 22.1% during the forecast period. An AI training dataset is a collection of data used to train machine learning models. It typically includes labeled examples, where each data point has an associated output label or target value. The quality and quantity of this data are crucial for the model's performance. A well-curated dataset ensures the model learns relevant features and patterns, enabling it to generalize effectively to new, unseen data. Training datasets can encompass various data types, including text, images, audio, and structured data. The driving forces behind this growth include:

  7. Data from: Predicting Critical Properties and Acentric Factors of Fluids...

    • acs.figshare.com
    zip
    Updated Jul 24, 2023
    Cite
    Sayandeep Biswas; Yunsie Chung; Josephine Ramirez; Haoyang Wu; William H. Green (2023). Predicting Critical Properties and Acentric Factors of Fluids Using Multitask Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.3c00546.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    ACS Publications
    Authors
    Sayandeep Biswas; Yunsie Chung; Josephine Ramirez; Haoyang Wu; William H. Green
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Knowledge of critical properties, such as critical temperature, pressure, density, as well as acentric factor, is essential to calculate thermo-physical properties of chemical compounds. Experiments to determine critical properties and acentric factors are expensive and time intensive; therefore, we developed a machine learning (ML) model that can predict these molecular properties given the SMILES representation of a chemical species. We explored directed message passing neural network (D-MPNN) and graph attention network as ML architecture choices. Additionally, we investigated featurization with additional atomic and molecular features, multitask training, and pretraining using estimated data to optimize model performance. Our final model utilizes a D-MPNN layer to learn the molecular representation and is supplemented by Abraham parameters. A multitask training scheme was used to train a single model to predict all the critical properties and acentric factors along with boiling point, melting point, enthalpy of vaporization, and enthalpy of fusion. The model was evaluated on both random and scaffold splits where it shows state-of-the-art accuracies. The extensive data set of critical properties and acentric factors contains 1144 chemical compounds and is made available in the public domain together with the source code that can be used for further exploration.

  8. APARENT2 Training Data and Models

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Nov 14, 2022
    Cite
    Johannes Linder (2022). APARENT2 Training Data and Models [Dataset]. http://doi.org/10.5281/zenodo.7317446
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Linder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed training data for the APARENT2 model (measurements from the random MPRA and designed oligo pool originally published by Bogard et al., 2019; see https://doi.org/10.1016/j.cell.2019.04.046 for reference). This repository also contains the APARENT2 model file. For more information on the training procedure, see the Genome Biology article "Deciphering the impact of genetic variation on human polyadenylation using APARENT2" (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02799-4). Two versions of the model are available:

    (a) aparent_all_libs_resnet_no_clinvar_wt_ep_5.h5: The originally trained APARENT2 model.
    (b) aparent_all_libs_resnet_no_clinvar_wt_ep_5_var_batch_size_inference_mode_no_drop.h5: Identical weights and predictions as model (a), but the normalization layers have been set to inference mode and the dropout layers have been removed (thus making it compatible with the scrambler pipeline).
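
    A minimal sketch of loading the inference-mode model file (b), assuming a Keras/TensorFlow environment; whether any custom objects must be registered is not documented here:

      from tensorflow import keras

      # inference-mode variant: normalization layers frozen, dropout removed
      model = keras.models.load_model(
          "aparent_all_libs_resnet_no_clinvar_wt_ep_5_var_batch_size_inference_mode_no_drop.h5",
          compile=False,  # predictions only; the original training objective is not needed
      )
      model.summary()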

  9. Accident Detection Model Dataset

    • universe.roboflow.com
    zip
    Updated Apr 8, 2024
    Cite
    Accident detection model (2024). Accident Detection Model Dataset [Dataset]. https://universe.roboflow.com/accident-detection-model/accident-detection-model/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Accident detection model
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Accident Bounding Boxes
    Description

    Accident-Detection-Model

    Accident Detection Model is made using YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3,200+ images; these images were annotated on Roboflow.

    Problem Statement

    • Road accidents are a major problem in India, with thousands of people losing their lives and many more suffering serious injuries every year.
    • According to the Ministry of Road Transport and Highways, India witnessed around 4.5 lakh road accidents in 2019, which resulted in the deaths of more than 1.5 lakh people.
    • The age range that is most severely hit by road accidents is 18 to 45 years old, which accounts for almost 67 percent of all accidental deaths.

    Accidents survey

    https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png

    Literature Survey

    • Sreyan Ghosh (Mar 2019): the goal is to develop a system using a deep learning convolutional neural network trained to identify video frames as accident or non-accident.
    • Deeksha Gour (Sep 2019): uses computer vision technology, neural networks, deep learning, and various approaches and algorithms to detect objects.

    Research Gap

    • Lack of real-world data - we trained the model on more than 3,200 images.
    • Large interpretability time and space needed - we use Google Colab to reduce the interpretability time and space required.
    • Outdated versions of previous works - we are using the latest version of YOLO, v8.

    Proposed methodology

    • We are using YOLOv8 to train our custom dataset, which has 3,200+ images collected from different platforms.
    • After training with 25 iterations, this model is ready to detect an accident with a significant probability.

    Model Set-up

    Preparing Custom dataset

    • We have collected 1,200+ images from different sources like YouTube, Google Images, Kaggle.com, etc.
    • Then we annotated all of them individually on a tool called Roboflow.
    • During annotation we marked the images with no accident as NULL, and we drew a box on the site of the accident in the images that contain one.
    • Then we divided the dataset into train, val, and test splits in the ratio of 8:1:1.
    • At the final step we downloaded the dataset in YOLOv8 format.
      #### Using Google Colab
    • We are using Google Colaboratory to code this model because Colab provides a GPU, which is faster than most local environments.
    • You can use Jupyter notebooks, which let you blend code, text, and visualisations in a single document, to write and run Python code in Google Colab.
    • Users can run individual code cells in Jupyter notebooks and quickly view the results, which is helpful for experimenting and debugging. They also support visualisations built with well-known frameworks like Matplotlib, Seaborn, and Plotly.
    • In Google Colab, first of all we changed the runtime from TPU to GPU.
    • We cross-checked it by running the command '!nvidia-smi'.
      #### Coding
    • First of all, we installed YOLOv8 with the command '!pip install ultralytics==8.0.20'.
    • Further, we checked the YOLOv8 setup with 'from ultralytics import YOLO' and 'from IPython.display import display, Image'.
    • Then we connected and mounted our Google Drive account with 'from google.colab import drive' and "drive.mount('/content/drive')".
    • Then we ran our main command to start the training process: '%cd /content/drive/MyDrive/Accident Detection model' followed by '!yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True'.
    • After the training we ran commands to test and validate our model: '!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml' and '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images'.
    • Further, to get results from any video or image we ran this command: '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source="/content/drive/MyDrive/Accident-Detection-model/data/testing1.jpg/mp4"'.
    • The results are stored in the runs/detect/predict folder. These steps are collected in the code cell below.
      Hence our model is trained, validated and tested to be able to detect accidents in any video or image.
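
    For reference, here are those Colab commands collected into one place (paths and values exactly as given above):

      !pip install ultralytics==8.0.20
      from ultralytics import YOLO
      from IPython.display import display, Image
      from google.colab import drive
      drive.mount('/content/drive')

      %cd /content/drive/MyDrive/Accident Detection model
      !yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True
      !yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml
      !yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images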

    Challenges I ran into

    I ran into three major problems while making this model:

    • I had difficulty saving the results to a folder. As YOLOv8 is the latest version, it is still under development, so I read some blogs and referred to Stack Overflow, and learned that the new v8 needs an extra argument, 'save=True', to save the results to a folder.
    • I was facing a problem on the CVAT website because I was not sure what
  10. Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata

    • datarade.ai
    .csv
    Updated Jul 18, 2023
    Cite
    WIRESTOCK (2023). Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata [Dataset]. https://datarade.ai/data-products/wirestock-s-ai-ml-image-training-data-4-5m-files-with-metadata-wirestock
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jul 18, 2023
    Dataset provided by
    Wirestock, Inc.
    Authors
    WIRESTOCK
    Area covered
    Estonia, Georgia, Belarus, Swaziland, Pakistan, New Caledonia, Sudan, Peru, Jersey, Chile
    Description

    Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata: This data product is a unique offering in the realm of AI/ML training data. What sets it apart is the sheer volume and diversity of the dataset, which includes 4.5 million files spanning across 20 different categories. These categories range from Animals/Wildlife and The Arts to Technology and Transportation, providing a rich and varied dataset for AI/ML applications.

    The data is sourced from Wirestock's platform, where creators upload and sell their photos, videos, and AI art online. This means that the data is not only vast but also constantly updated, ensuring a fresh and relevant dataset for your AI/ML needs. The data is collected in a GDPR-compliant manner, ensuring the privacy and rights of the creators are respected.

    The primary use-cases for this data product are numerous. It is ideal for training machine learning models for image recognition, improving computer vision algorithms, and enhancing AI applications in various industries such as retail, healthcare, and transportation. The diversity of the dataset also means it can be used for more niche applications, such as training AI to recognize specific objects or scenes.

    This data product fits into Wirestock's broader data offering as a key resource for AI/ML training. Wirestock is a platform for creators to sell their work, and this dataset is a collection of that work. It represents the breadth and depth of content available on Wirestock, making it a valuable resource for any company working with AI/ML.

    The core benefits of this dataset are its volume, diversity, and quality. With 4.5 million files, it provides a vast resource for AI training. The diversity of the dataset, spanning 20 categories, ensures a wide range of images for training purposes. The quality of the images is also high, as they are sourced from creators selling their work on Wirestock.

    In terms of how the data is collected, creators upload their work to Wirestock, where it is then sold on various marketplaces. This means the data is sourced directly from creators, ensuring a diverse and unique dataset. The data includes both the images themselves and associated metadata, providing additional context for each image.

    The different image categories included in this dataset are Animals/Wildlife, The Arts, Backgrounds/Textures, Beauty/Fashion, Buildings/Landmarks, Business/Finance, Celebrities, Education, Emotions, Food Drinks, Holidays, Industrial, Interiors, Nature Parks/Outdoor, People, Religion, Science, Signs/Symbols, Sports/Recreation, Technology, Transportation, Vintage, Healthcare/Medical, Objects, and Miscellaneous. This wide range of categories ensures a diverse dataset that can cater to a variety of AI/ML applications.

  11. DCASE 2024 Challenge Task 2 Additional Training Dataset

    • data.niaid.nih.gov
    Updated May 15, 2024
    Cite
    Takashi, Endo (2024). DCASE 2024 Challenge Task 2 Additional Training Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11183283
    Explore at:
    Dataset updated
    May 15, 2024
    Dataset provided by
    Yohei, Kawaguchi
    Noboru, Harada
    Sannino, Roberto
    Kota, Dohi
    Tomoya, Nishida
    Keisuke, Imoto
    Albertini, Davide
    Augusti, Filippo
    Daisuke, Niizumi
    Takashi, Endo
    Pradolini, Simone
    Harsh, Purohit
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is the "additional training dataset" for the DCASE 2024 Challenge Task 2.

    The data consists of the normal/anomalous operating sounds of nine types of real/toy machines. Each recording is a single-channel audio that includes both a machine's operating sound and environmental noise. The duration of recordings varies from 6 to 10 seconds. The following nine types of real/toy machines are used in this task:

    3DPrinter

    AirCompressor

    BrushlessMotor

    HairDryer

    HoveringDrone

    RoboticArm

    Scanner

    ToothBrush

    ToyCircuit

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is the follow-up from DCASE 2020 Task 2 to DCASE 2023 Task 2. The task this year is to develop an ASD system that meets the following five requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)
    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.

    2. Detect anomalies regardless of domain shifts (domain generalization task)
    In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2 and DCASE 2023 Task 2.

    3. Train a model for a completely new machine type
    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning. This requirement is the same as in DCASE 2023 Task 2.

    4. Train a model using a limited number of machines from its machine type
    While sounds from multiple machines of the same machine type can be used to enhance the detection performance, it is often the case that only a limited number of machines are available for a machine type. In such a case, the system should be able to train models using a few machines from a machine type. This requirement is the same as in DCASE 2023 Task 2.

    5. Train a model both with or without attribute information
    While additional attribute information can help enhance the detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.

    The last requirement is newly introduced in DCASE 2024 Task 2.

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    "Machine type" indicates the type of machine, which in the additional training dataset is one of nine: 3D-printer, air compressor, brushless motor, hair dryer, hovering drone, robotic arm, document scanner (scanner), toothbrush, and Toy circuit.

    A section is defined as a subset of the dataset for calculating performance metrics.

    The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.

    Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.

    Dataset

    This dataset consists of nine machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on a separate Zenodo page as an "evaluation dataset" for the DCASE 2024 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training and (ii) ten clips of normal sounds in the target domain for training. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The given reference labels for each training clip include machine type, section index, normal/anomaly information, and attributes regarding conditions other than normal/anomaly. The machine type is given by the directory name. The section index is given by the respective file name. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by the respective file names. Note that for machine types that have their attribute information hidden, the attribute information in each file name is labeled only as "noAttributes". Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, the names of parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
    

    For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
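
    For illustration, one row could look like the following (the parameter name and value are invented placeholders, not actual labels from this dataset):

      section_00_source_train_normal_0001_spd_2.wav, spd, 2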

    Recording procedure

    Normal/anomalous operating sounds of machines and its related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. For simplifying the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

    Directory structure

    • /eval_data

      • /raw
        • /3DPrinter
          • /train (only normal clips)
            • /section_00_source_train_normal_0001_.wav
            • ...
            • /section_00_source_train_normal_0990_.wav
            • /section_00_target_train_normal_0001_.wav
            • ...
            • /section_00_target_train_normal_0010_.wav
          • attributes_00.csv (attribute csv for section 00)
        • /AirCompressor (The other machine types have the same directory structure as 3DPrinter.)
        • /BrushlessMotor
        • /HairDryer
        • /HoveringDrone
        • /RoboticArm
        • /Scanner
        • /ToothBrush
        • /ToyCircuit

    Baseline system

    The baseline system is available on the GitHub repository. The baseline systems provide a simple entry-level approach that gives a reasonable performance on the Task 2 dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Condition of use

    This dataset was created jointly by Hitachi, Ltd., NTT Corporation and STMicroelectronics and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Contact

    If there is any problem, please contact us:

    Tomoya Nishida, tomoya.nishida.ax@hitachi.com

    Keisuke Imoto, keisuke.imoto@ieee.org

    Noboru Harada, noboru@ieee.org

    Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp

    Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com

  12. Training, test data and model parameters.

    • figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Salvatore Cosentino; Mette Voldby Larsen; Frank Møller Aarestrup; Ole Lund (2023). Training, test data and model parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0077302.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Salvatore Cosentino; Mette Voldby Larsen; Frank Møller Aarestrup; Ole Lund
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training, test data and model parameters. The last 3 columns show the MinORG, LT and HT parameters used to create the pathogenicity families and build the model for each of the 10 models. Zthr is a threshold value, calculated for each model at the cross validation phase, which is used, given the final prediction score, to decide if the input organisms will be predicted as pathogenic or non-pathogenic. The parameters for each model are chosen after 5-fold cross-validation tests.

  13. Data sources used by companies for training AI models South Korea 2024

    • statista.com
    Updated May 27, 2025
    Cite
    Statista (2025). Data sources used by companies for training AI models South Korea 2024 [Dataset]. https://www.statista.com/statistics/1452822/south-korea-data-sources-for-training-artificial-intelligence-models/
    Explore at:
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Sep 2024 - Nov 2024
    Area covered
    South Korea
    Description

    As of 2024, customer data was the leading source of information used to train artificial intelligence (AI) models in South Korea, with nearly ** percent of surveyed companies answering that way. About ** percent responded that they used public sector support initiatives.

  14. Complete Blood Count (CBC)

    • kaggle.com
    Updated Aug 1, 2024
    Cite
    Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Noukhez
    Description

    Dataset Description:

    This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

    Dataset Breakdown:

    • Training Images: 300
    • Validation Images: 60
    • Test Images: 60
    • Annotations: Detailed annotations included for all images

    Overview:

    The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

    Data Composition:

    1. Training Set:

      • Contains 300 images
      • These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.
    2. Validation Set:

      • Contains 60 images
      • Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.
    3. Test Set:

      • Contains 60 images
      • Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.

    Annotations:

    Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

    Key Features:

    • High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.
    • Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.
    • Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

    Applications:

    This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include:

    • Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights.
    • Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders.
    • Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

    Usage Notes:

    • Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.
    • Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.
    • Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.
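
    A minimal sketch of that recommended evaluation step using scikit-learn (the label arrays below are placeholders standing in for real annotations and model predictions):

      from sklearn.metrics import accuracy_score, precision_recall_fscore_support

      y_true = [0, 1, 1, 0, 1]  # placeholder ground-truth labels from the annotations
      y_pred = [0, 1, 0, 0, 1]  # placeholder model predictions

      precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
      print(f"accuracy={accuracy_score(y_true, y_pred):.3f} "
            f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")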

    Conclusion:

    This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.

  15. Supplemental Synthetic Images (outdated)

    • figshare.com
    zip
    Updated May 7, 2021
    Cite
    Duke Bass Connections Deep Learning for Rare Energy Infrastructure 2020-2021 (2021). Supplemental Synthetic Images (outdated) [Dataset]. http://doi.org/10.6084/m9.figshare.13546643.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2021
    Dataset provided by
    figshare
    Authors
    Duke Bass Connections Deep Learning for Rare Energy Infrastructure 2020-2021
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This is a set of synthetic overhead imagery of wind turbines that was created with CityEngine. There are corresponding labels that provide the class, x and y coordinates, and height and width (YOLOv3 format) of the ground truth bounding boxes for each wind turbine in the images. These labels are named similarly to the images (e.g. image.png will have the label titled image.txt).

    Use

    This dataset is meant as supplementation to training an object detection model on overhead images of wind turbines. It can be added to the training set of an object detection model to potentially improve performance when using the model on real overhead images of wind turbines.

    Why

    This dataset was created to examine the utility of adding synthetic imagery to the training set of an object detection model to improve performance on rare objects. Since wind turbines are both very rare in number and sparse, this makes acquiring data very costly. This synthetic imagery is meant to solve this issue by automating the generation of new training data. The use of synthetic imagery can also be applied to the issue of cross-domain testing, where the model lacks training data on a particular region and consequently struggles when used on that region.

    Method

    The process for creating the dataset involved selecting background images from NAIP imagery available on Earth OnDemand. These images were randomly selected from these geographies: forest, farmland, grasslands, water, urban/suburban, mountains, and deserts. No consideration was put into whether the background images would seem realistic. This is because we wanted to see if this would help the model become better at detecting wind turbines regardless of their context (which would help when using the model on novel geographies). Then, a script was used to select these at random and uniformly generate 3D models of large wind turbines over the image and then position the virtual camera to save four 608x608 pixel images. This process was repeated with the same random seed, but with no background image and the wind turbines colored as black. Next, these black and white images were converted into ground truth labels by grouping the black pixels in the images.
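
    A minimal sketch of that final labeling step, reduced to a single bounding box (the real pipeline groups black pixels per turbine; the file name and binarization threshold are assumptions):

      import numpy as np
      from PIL import Image

      # render where turbines are black on a white background (hypothetical file name)
      mask = np.array(Image.open("image_turbines_black.png").convert("L")) < 128

      ys, xs = np.nonzero(mask)
      if xs.size:
          h, w = mask.shape
          x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
          # YOLOv3 label format: class x_center y_center width height, normalized to [0, 1]
          print(f"0 {(x0 + x1) / 2 / w:.6f} {(y0 + y1) / 2 / h:.6f} "
                f"{(x1 - x0) / w:.6f} {(y1 - y0) / h:.6f}")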

  16. Protein structures predicted using DMPfold2, plus training data

    • rdr.ucl.ac.uk
    application/x-gzip
    Updated May 31, 2023
    Cite
    Shaun Kandathil; Andy Lau; Joe Greener; David Jones (2023). Protein structures predicted using DMPfold2, plus training data [Dataset]. http://doi.org/10.5522/04/14979990.v3
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    University College London
    Authors
    Shaun Kandathil; Andy Lau; Joe Greener; David Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises predicted protein structures from the paper "Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins". Structures were predicted using DMPfold2.

    BFD_1.3M.hdf5 contains all the models from the set of 1.3M that were generated. The models can be retrieved from this file using the provided hdf5_extract.py script and the list of IDs in bfdfold_1.3M_target_ids.csv.

    Also provided are tarballs of the models and sequence alignments for the 5193 Pfam families modelled in the paper, as well as for the set of 255 Pfams with released structures used for comparisons against DMPfold1 and C-I-TASSER.

    train_data.tar.bz2 contains the data used to train the DMPfold2 neural network. Further scripts and instructions are available on the associated GitHub page: https://github.com/psipred/DMPfold2
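
    A minimal sketch of peeking inside the HDF5 archive with h5py to discover its layout (the provided hdf5_extract.py script, together with bfdfold_1.3M_target_ids.csv, is the authoritative route for retrieving models):

      import h5py

      with h5py.File("BFD_1.3M.hdf5", "r") as archive:
          for i, key in enumerate(archive):  # iterate top-level groups/datasets
              print(key, archive[key])
              if i == 4:                     # show only the first few entries
                  break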

  17. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
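
    A minimal sketch of loading the three splits with pandas (file names follow the convention stated above; the label column name is a placeholder, since it is not documented here):

      import pandas as pd

      train = pd.read_csv("train_data.csv")
      val = pd.read_csv("validation_data.csv")
      test = pd.read_csv("test_data.csv")

      # "target" is a placeholder label column name
      X_train, y_train = train.drop(columns=["target"]), train["target"]
      print(train.shape, val.shape, test.shape)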

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  18. Trained AI model and associated files

    • figshare.manchester.ac.uk
    zip
    Updated May 30, 2023
    Cite
    Igor Larrosa (2023). Trained AI model and associated files [Dataset]. http://doi.org/10.48420/16965271.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    University of Manchester
    Authors
    Igor Larrosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains data associated to publication: Organic Reaction Mechanism Classification with Machine Learning

    • Trained AI full model
    • Trained AI reduced models
    • Python files to run predictions
    • Python files to train the model
    • Template for inputting kinetics for predictions
    • Data used in case studies

    Unpack the data file and follow the instructions in the publication's Supporting Information.

  19. Data from: Data augmentation for disruption prediction via robust surrogate...

    • dataverse.harvard.edu
    • osti.gov
    Updated Aug 31, 2024
    Cite
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The goal of this work is to generate large, statistically representative datasets for training machine learning models for disruption prediction, given data from only a few existing discharges. Such a comprehensive training database is important for achieving satisfactory and reliable prediction results with artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity; the method can therefore also be used when the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to that of the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
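
    The "coloring transformation" mentioned in the abstract is a standard trick: sample each dimension independently, then multiply by a Cholesky factor of the desired covariance to impose cross-dimension correlations. A toy numpy illustration of just that step (not the authors' code; the covariance matrix below is made up):

import numpy as np

rng = np.random.default_rng(0)

# Desired cross-dimension covariance (hypothetical 3x3 example)
target_cov = np.array([[1.0, 0.6, 0.3],
                       [0.6, 1.0, 0.5],
                       [0.3, 0.5, 1.0]])

# Step 1: draw independent ("white") samples for each dimension
white = rng.standard_normal((10_000, 3))

# Step 2: "color" them with the Cholesky factor L, so cov(white @ L.T) ~ L @ L.T = target_cov
L = np.linalg.cholesky(target_cov)
colored = white @ L.T

print(np.round(np.cov(colored, rowvar=False), 2))  # approximately target_cov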

  20. R

    Game Event Detection Training Data #1 Dataset

    • universe.roboflow.com
    zip
    Updated Nov 16, 2022
    Cite
    gamingeventds (2022). Game Event Detection Training Data #1 Dataset [Dataset]. https://universe.roboflow.com/gamingeventds/game-event-detection-training-data-1
    Available download formats: zip
    Dataset updated
    Nov 16, 2022
    Dataset authored and provided by
    gamingeventds
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Objects Bounding Boxes
    Description

    Game Event Detection Training Data #1

    ## Overview
    
    Game Event Detection Training Data #1 is a dataset for object detection tasks - it contains Objects annotations for 1,508 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    

Step 2: Assemble Our Dataset

[Image: the computer vision cycle (source: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png)]

In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects we want to detect, and the dataset needs to be in YOLOv7 format.
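
For orientation, YOLO-style annotations store one .txt file per image, one object per line, in the form class x_center y_center width height, with coordinates normalized to [0, 1]. A tiny parsing sketch (the sample line itself is made up):

# One line of a YOLO-format label file: a class index plus a normalized bounding box
sample = "1 0.481 0.535 0.412 0.622"  # hypothetical values

cls, xc, yc, w, h = sample.split()
print(f"class={cls}, center=({float(xc):.3f}, {float(yc):.3f}), size={float(w):.3f} x {float(h):.3f}")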

In Roboflow, we can choose between two paths:

Version v2 (Aug 12, 2022) looks like this:

[Image: the v2 dataset version in Roboflow (source: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG)]

from kaggle_secrets import UserSecretsClient
from roboflow import Roboflow

# Read the Roboflow API key from Kaggle secrets and download version 2 of this dataset in YOLOv7 format
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")  # dataset.location holds the local folder path

Step 3: Training Custom pretrained YOLOv7 model

Here, I am able to pass a number of arguments:

  • img: define input image size
  • batch: determine batch size
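
A representative training invocation, sketched under the assumption that the standard flags from the YOLOv7 repository's train.py are used (the epoch count and run name here are illustrative, not the notebook's exact values):

!python train.py --img-size 640 640 \
         --batch-size 16 \
         --epochs 30 \
         --data {dataset.location}/data.yaml \
         --weights 'yolov7.pt' \
         --name yolov7-car-person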
