22 datasets found
  1. Cats&Dogs (Pickle)

    • kaggle.com
    Updated Feb 27, 2020
    Cite
    FLuzmano (2020). Cats&Dogs (Pickle) [Dataset]. https://www.kaggle.com/fariziluzman/catsdogs-pickle/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    FLuzmano
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Dataset

    This dataset was created by FLuzmano

    Released under CC0: Public Domain

    Contents

    CNN

    For Google colab practice

  2. deaplearninexamAU2024

    • kaggle.com
    Updated Dec 5, 2024
    Cite
    christoffer fuglkjær (2024). deaplearninexamAU2024 [Dataset]. https://www.kaggle.com/datasets/christofferfuglkjr/deeplearninexam
    Explore at:
    Croissant
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    christoffer fuglkjær
    Description

    This is just a re-uploaded version of https://www.kaggle.com/datasets/ubitquitin/geolocation-geoguessr-images-50k?resource=download, but with the GeoGuessr UI cropped out and the countries sorted into regions. This dataset is only used to make reloading training data in Google Colab faster (a download sketch follows).
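
    A minimal sketch of pulling this reupload into a Colab session with the Kaggle API is shown below. It assumes kaggle.json credentials are already configured (as in the Edge-IIoTset instructions later on this page); the dataset slug comes from the citation URL above.

    !pip install -q kaggle
    # Download the dataset by its slug and unpack it into a local folder
    !kaggle datasets download -d christofferfuglkjr/deeplearninexam
    !unzip -q deeplearninexam.zip -d data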

  3. EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Explore at:
    Dataset updated
    Oct 16, 2023
    Description

    ABSTRACT In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, digital twins, the ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from more than 10 types of IoT devices, such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, a water level detection sensor, a pH sensor meter, a soil moisture sensor, a heart rate sensor, and a flame sensor. In addition, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, information gathering, man-in-the-middle attacks, injection attacks, and malware attacks. We also extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from the 1176 features found. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please visit the Kaggle link for updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.

    The details of the Edge-IIoTset dataset were published in the following paper. For academic/public use of these datasets, the authors have to cite the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:

    from google.colab import files

    !pip install -q kaggle

    files.upload()

    !mkdir ~/.kaggle

    !cp kaggle.json ~/.kaggle/

    !chmod 600 ~/.kaggle/kaggle.json

    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"

    !unzip DNN-EdgeIIoT-dataset.csv.zip

    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the dataset's CSV file into a Pandas DataFrame:

    import pandas as pd

    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3: Exploring some of the DataFrame's contents:

    df.head(5)

    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (columns, duplicated rows, NaN, null values):

    from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                    "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp",
                    "http.request.uri.query", "tcp.options", "tcp.payload", "tcp.srcport",
                    "tcp.dstport", "udp.port", "mqtt.msg"]

    df.drop(drop_columns, axis=1, inplace=True)

    df.dropna(axis=0, how='any', inplace=True)

    df.drop_duplicates(subset=None, keep="first", inplace=True)

    df = shuffle(df)

    df.isna().sum()

    print(df['Attack_type'].value_counts())

    Step 5: Categorical data encoding (dummy encoding):

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn import preprocessing

    def encode_text_dummy(df, name):
        dummies = pd.get_dummies(df[name])
        for x in dummies.columns:
            dummy_name = f"{name}-{x}"
            df[dummy_name] = dummies[x]
        df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df,'http.request.method')

    encode_text_dummy(df,'http.referer')

    encode_text_dummy(df,"http.request.version")

    encode_text_dummy(df,"dns.qry.name.len")

    encode_text_dummy(df,"mqtt.conack.flags")

    encode_text_dummy(df,"mqtt.protoname")

    encode_text_dummy(df,"mqtt.topic")

    Step 6: Creation of the preprocessed dataset:

    df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
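
    The imports in Step 5 (train_test_split, StandardScaler) point to a train/test split and feature scaling as the natural next step. The original instructions stop at Step 6, so the following is only a hedged sketch of that continuation, assuming Attack_type (used in Step 3) is the label column:

    X = df.drop('Attack_type', axis=1)
    y = df['Attack_type']

    # Stratified 80/20 split, then standardize features using training statistics only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)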

    For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023

  4. Accident Detection Model Dataset

    • universe.roboflow.com
    zip
    Updated Apr 8, 2024
    Cite
    Accident detection model (2024). Accident Detection Model Dataset [Dataset]. https://universe.roboflow.com/accident-detection-model/accident-detection-model/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Accident detection model
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Accident Bounding Boxes
    Description

    Accident-Detection-Model

    The Accident Detection Model is built using YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3200+ images, which were annotated on Roboflow.

    Problem Statement

    • Road accidents are a major problem in India, with thousands of people losing their lives and many more suffering serious injuries every year.
    • According to the Ministry of Road Transport and Highways, India witnessed around 4.5 lakh road accidents in 2019, which resulted in the deaths of more than 1.5 lakh people.
    • The age range that is most severely hit by road accidents is 18 to 45 years old, which accounts for almost 67 percent of all accidental deaths.

    Accidents survey

    Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png

    Literature Survey

    • Sreyan Ghosh (Mar 2019): the goal is to develop a system using a deep learning convolutional neural network trained to classify video frames as accident or non-accident.
    • Deeksha Gour (Sep 2019): uses computer vision technology, neural networks, deep learning, and various approaches and algorithms to detect objects.

    Research Gap

    • Lack of real-world data - we trained the model on more than 3200 images.
    • Large interpretability time and space needed - we use Google Colab to reduce the time and space required.
    • Outdated versions in previous works - we are using the latest version, YOLOv8.

    Proposed methodology

    • We are using YOLOv8 to train on our custom dataset of 3200+ images, collected from different platforms.
    • After training for 25 iterations, the model is ready to detect an accident with a significant probability.

    Model Set-up

    Preparing Custom dataset

    • We collected 1200+ images from different sources such as YouTube, Google Images, and Kaggle.com.
    • Then we annotated all of them individually on a tool called Roboflow.
    • During annotation, we marked images with no accident as NULL and drew a bounding box around the accident site in images containing one.
    • Then we divided the dataset into train, val, and test splits in the ratio 8:1:1.
    • As a final step, we downloaded the dataset in YOLOv8 format.
      Using Google Colab
    • We are using Google Colaboratory to code this model because Colab provides a GPU, which is faster than a typical local environment.
    • With Google Colab you can write and run Python code in Jupyter notebooks, which let you blend code, text, and visualisations in a single document.
    • Users can run individual code cells in Jupyter notebooks and quickly view the results, which is helpful for experimenting and debugging. They also enable visualisations that use well-known frameworks like Matplotlib, Seaborn, and Plotly.
    • In Google Colab, we first changed the runtime type to GPU.
    • We cross-checked this by running the command '!nvidia-smi'.
      Coding
    • First, we installed YOLOv8 with the command '!pip install ultralytics==8.0.20'.
    • We then imported it with 'from ultralytics import YOLO' and 'from IPython.display import display, Image'.
    • Then we connected and mounted our Google Drive account with 'from google.colab import drive' followed by 'drive.mount('/content/drive')'.
    • Then we ran the main training command: '%cd /content/drive/MyDrive/Accident Detection model' and '!yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True'.
    • After training, we ran commands to validate and test our model: '!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml' and '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images'.
    • To get results from any video or image, we ran this command: '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source="/content/drive/MyDrive/Accident-Detection-model/data/testing1.jpg/mp4"'.
    • The results are stored in the runs/detect/predict folder.
      Hence our model is trained, validated and tested to be able to detect accidents on any video or image. The commands above are consolidated in the sketch that follows.
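
    For reference, here is a hedged, consolidated sketch of the Colab workflow described in the bullets above. The project path, data.yaml name, epoch count, and confidence threshold are taken from those bullets; 'save=True' is added per the note in the Challenges section below.

    # Install and import YOLOv8 (Ultralytics)
    !pip install ultralytics==8.0.20
    from ultralytics import YOLO
    from IPython.display import display, Image

    # Mount Google Drive and move into the project folder listed above
    from google.colab import drive
    drive.mount('/content/drive')
    %cd "/content/drive/MyDrive/Accident Detection model"

    # Train, validate, and run prediction with the same arguments as in the bullets
    !yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True
    !yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml
    !yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images save=True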

    Challenges I ran into

    I ran into three main problems while making this model:

    • I had difficulty saving the results to a folder; since YOLOv8 is the latest version, it is still under development. After reading some blogs and referring to Stack Overflow, I learned that in the new v8 we need to pass an extra argument, 'save=True', which let me save my results in a folder.
    • I was facing a problem on the CVAT website because I was not sure what
  5. part1_dataSorted_Diversevul_llama2_dataset

    • huggingface.co
    Updated Mar 19, 2024
    Cite
    Atharva Prashant Pawar (2024). part1_dataSorted_Diversevul_llama2_dataset [Dataset]. https://huggingface.co/datasets/atharvapawar/part1_dataSorted_Diversevul_llama2_dataset
    Explore at:
    Croissant
    Dataset updated
    Mar 19, 2024
    Authors
    Atharva Prashant Pawar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset : part1_dataSorted_Diversevul_llama2_dataset

      dataset lines : 2768

      Kaggle Notebook (for dataset splitting) : https://www.kaggle.com/code/mrappplg/securix-diversevul-dataset

      Google Colab Notebook : https://colab.research.google.com/drive/1z6fLQrcMSe1-AVMHp0dp6uDr4RtVIOzF?usp=sharing
    
  6. hagrid-sample-120k-384p

    • huggingface.co
    Updated Jul 3, 2023
    + more versions
    Cite
    Christian Mills (2023). hagrid-sample-120k-384p [Dataset]. https://huggingface.co/datasets/cj-mills/hagrid-sample-120k-384p
    Explore at:
    Croissant
    Dataset updated
    Jul 3, 2023
    Authors
    Christian Mills
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 127,331 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.

      Original Authors:
    

    Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani

      Original Dataset Links
    

    GitHub, Kaggle Datasets Page

      Object Classes
    

    ['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-120k-384p.
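
    A minimal sketch of loading this sample in Colab with the Hugging Face datasets library; the 'train' split name is an assumption, so check the dataset card for the exact splits and features.

    from datasets import load_dataset

    # Assumes a 'train' split; see the dataset card for the actual configuration
    ds = load_dataset("cj-mills/hagrid-sample-120k-384p", split="train")
    print(ds)        # dataset summary (number of rows, column names)
    print(ds[0])     # first example (image plus annotation fields)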

  7. Robust Shelf Monitoring Dataset

    • universe.roboflow.com
    zip
    Updated Dec 14, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shelf Monitoring (2022). Robust Shelf Monitoring Dataset [Dataset]. https://universe.roboflow.com/shelf-monitoring/robust-shelf-monitoring/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 14, 2022
    Dataset authored and provided by
    Shelf Monitoring
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Stock Of Products In Shelf Bounding Boxes
    Description

    Robust Shelf Monitoring

    We aim to build a robust shelf monitoring system that helps storekeepers maintain accurate inventory details, re-stock items efficiently and on time, and tackle the problem of misplaced items, where an item is accidentally placed at a different location. Our product aims to serve as a store manager that alerts the owner about items that need re-stocking and about misplaced items.

    Training the model:

    • Unzip the labelled dataset from Kaggle and store it in your Google Drive.
    • Follow the tutorial and update the training parameters in the custom-yolov4-detector.cfg file in the /darknet/cfg/ directory.
    • filters = (number of classes + 5) * 3 for each yolo layer.
    • max_batches = (number of classes) * 2000 (both formulas are illustrated in the sketch after this list).
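
    As a quick aid, here is a minimal Python sketch of those two configuration formulas. The class count of 13 is only an illustrative assumption, not a value taken from this dataset.

    num_classes = 13                        # hypothetical example; use your own class count
    filters = (num_classes + 5) * 3         # filters for the conv layer before each yolo layer
    max_batches = num_classes * 2000        # suggested number of training iterations
    print(filters, max_batches)             # -> 54 26000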

    Steps to run the prediction colab notebook:

    1. Install the required dependencies: pymongo and dnspython.
    2. Clone the darknet repository and the required Python scripts.
    3. Mount the Google Drive containing the weight file.
    4. Copy the pre-trained weight file to the yolo content directory.
    5. Run the detect.py script to perform the prediction.

    Presenting the predicted result

    The detect.py script has an option to send SMS notifications to the shopkeepers. We have built a front-end for building the phone book that collects the shopkeepers' details. It also displays the latest prediction result and the model accuracy.
  8. gld20GB

    • kaggle.com
    Updated Sep 24, 2020
    Cite
    JkReddy (2020). gld20GB [Dataset]. https://www.kaggle.com/jkreddy/gld20gb
    Explore at:
    Croissant
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JkReddy
    Description

    Context

    It took a very long time (weeks) to make this dataset, and it gave me extensive data engineering experience. I used both GitHub and GCP for storage, and both Kaggle and Colab, to prepare this dataset. It would have been more useful to everyone had I done this much earlier.

    Content

    All images from the original set are included. To reduce the dataset size, all images have been resized to a minimum dimension of (224, 320) using the TensorFlow resize API (a minimal sketch of such a resize is shown below).
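
    The exact resizing pipeline is not included in the listing, so the following is only a minimal sketch of this kind of resize with TensorFlow; the target size (224, 320) follows the description above.

    import tensorflow as tf

    def resize_image(path, target_size=(224, 320)):
        # Read, decode, and resize a single image to the target dimensions
        data = tf.io.read_file(path)
        image = tf.io.decode_jpeg(data, channels=3)
        return tf.image.resize(image, target_size)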

    Acknowledgements

    I extensively used Stack Overflow to find the best solutions for many data engineering tasks; thanks to all those who solved these issues earlier.

    Inspiration

    The original dataset size of 99GB cannot be used in Colab to train the custom model.

  9. xView1 dataset yolov5

    • kaggle.com
    Updated Nov 29, 2023
    Cite
    Luigi Scotto Rosato (2023). xView1 dataset yolov5 [Dataset]. https://www.kaggle.com/datasets/luigiscottorosato/xview1-dataset-yolov5
    Explore at:
    Croissant
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luigi Scotto Rosato
    Description

    xView1 Adapted for YOLOv5 in Colab

    Overview:

    This dataset is a modified version of the xView1 dataset, specifically tailored for seamless integration with YOLOv5 in Google Colab. The xView1 dataset originally consists of high-resolution satellite imagery labeled for object detection tasks. In this adapted version, we have preprocessed the data and organized it to facilitate easy usage with YOLOv5, a popular deep learning framework for object detection.

    Dataset Contents:

    Images: The dataset includes a collection of high-resolution satellite images covering diverse geographic locations. These images have been resized and preprocessed to align with the requirements of YOLOv5, ensuring efficient training and testing.

    Annotations:

    Object annotations are provided for each image, specifying the bounding boxes and class labels of various objects present in the scenes. The annotations are formatted to match the YOLOv5 input specifications.

    Usage Instructions:

    1. Download the dataset files, including images and annotations.
    2. Clone the YOLOv5 repository in Colab.
    3. Move dataset files (train.txt and val.txt) to the yolov5 directory.
    4. Use the provided .yaml for training (a minimal command sketch follows this list).
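
    As a rough illustration only (not the dataset author's exact notebook), a typical YOLOv5 training run in Colab looks like the following; the file name xview.yaml stands in for the provided .yaml, and the image size, batch size, and epoch count are placeholders.

    !git clone https://github.com/ultralytics/yolov5
    %cd yolov5
    !pip install -r requirements.txt

    # Train on the provided data definition (replace xview.yaml with the supplied .yaml)
    !python train.py --img 640 --batch 16 --epochs 50 --data xview.yaml --weights yolov5s.pt
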
  10. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    Croissant
    Dataset updated
    Oct 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    import pandas as pd

    # load the file and create a pandas dataframe
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
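
    Since the dataset is meant for practicing GroupBy, here is a small illustrative aggregation over the filtered columns; it is not part of the original tutorial, just an example of the kind of query the file supports.

    # Average posted salary range per agency (illustrative only)
    filtered = pd.read_csv('/content/NYC_Jobs_filtered_cols.csv')
    print(filtered.groupby('Agency')[['Salary Range From', 'Salary Range To']].mean().head())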

  11. Sample Posts from the ADHD dataset.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Feb 6, 2025
    + more versions
    Cite
    Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Sample Posts from the ADHD dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. In LastBERT, a customized student BERT model, we significantly lowered the parameter count from BERT base's 110 million to 29 million, resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, F1 score of 85%, precision of 85%, and recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model's capacity to classify degrees of ADHD severity properly, offering a useful tool for mental health professionals to assess and understand material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also underlines the lower computational resources needed for training and deployment, facilitating broader applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in real-world applications.

  12. Nike,Adidas Shoes for Image Classification Dataset

    • kaggle.com
    Updated Jul 24, 2022
    Cite
    Ifeanyi Nneji (2022). Nike,Adidas Shoes for Image Classification Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/3980041
    Explore at:
    Croissant
    Dataset updated
    Jul 24, 2022
    Dataset provided by
    Kaggle
    Authors
    Ifeanyi Nneji
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset can be used to build a CNN model that can classify if a shoe is an Adidas or Nike brand.

    The images were pulled from Bing using bing_image_search from PyPI; 400 images of each class were downloaded, and the dataset was then trimmed to 300 (some unrelated images were removed in the process of compiling the dataset).

    Link to Notebook

  13. hagrid-classification-512p-no-gesture-150k

    • huggingface.co
    Updated Apr 2, 2025
    + more versions
    Cite
    Christian Mills (2025). hagrid-classification-512p-no-gesture-150k [Dataset]. https://huggingface.co/datasets/cj-mills/hagrid-classification-512p-no-gesture-150k
    Explore at:
    Croissant
    Dataset updated
    Apr 2, 2025
    Authors
    Christian Mills
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for "hagrid-classification-512p-no-gesture-150k"

    This dataset contains 153,735 training images from HaGRID (HAnd Gesture Recognition Image Dataset) modified for image classification instead of object detection. The original dataset is 716GB. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.

      Original Authors:
    

    Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-classification-512p-no-gesture-150k.

  14. Tajweed Dataset

    • kaggle.com
    Updated Apr 6, 2025
    Cite
    Ala'a Abdu Saleh Alawdi (2025). Tajweed Dataset [Dataset]. https://www.kaggle.com/datasets/alawdisoft/tajweed-dataset
    Explore at:
    Croissant
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ala'a Abdu Saleh Alawdi
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    The provided code processes a Tajweed dataset, which appears to be a collection of audio recordings categorized by different Tajweed rules (Ikhfa, Izhar, Idgham, Iqlab). Let's break down the dataset's structure and the code's functionality:

    Dataset Structure:

    • Organized by Tajweed Rule and Sheikh: The dataset is structured into directories for each Tajweed rule (e.g., 'Ikhfa', 'Izhar'). Within each rule's directory, there are subdirectories representing different reciters (sheikhs). This hierarchical organization is crucial for creating a structured metadata file and for training machine learning models.
    • Audio Files: The audio files (presumably WAV or other supported formats) are stored within the sheikh's subdirectories. The original filenames are not standardized.
    • Multiple Sheikhs per Rule: The dataset includes multiple recitations for each rule from different sheikhs, offering diversity in pronunciation.
    • Google Drive Storage: The dataset is located on Google Drive, which requires mounting the drive to access the data within a Colab environment.

    Code Functionality (a minimal sketch of the pipeline follows the numbered list below):

    1. Initialization and Imports: The code begins with necessary imports (pandas, pydub) and mounts Google Drive. Pydub is used for audio file format conversion.

    2. Directory Listing: It initially checks if a specified directory exists (for example, Alaa_alhsri/Ikhfa) and lists its files, demonstrating basic file system access.

    3. Metadata Creation: The core of the script is the generation of metadata, which provides essential information about each audio file. The tajweed_paths dictionary maps each Tajweed rule to a list of paths, associating each path with the reciter's name.

      • Iterating through Paths: The code iterates through each Tajweed rule and its corresponding paths.
      • File Listing: Inside each directory, it iterates through the audio files.
      • Metadata Dictionary: For each audio file, it creates a metadata dictionary that includes:
        • global_id: A unique identifier for each audio file.
        • original_filename: The original filename of the audio file.
        • new_filename: A standardized filename that incorporates the Tajweed rule (label), sheikh's ID, audio number, and a global ID.
        • label: The Tajweed rule.
        • sheikh_id: A numerical identifier for each sheikh.
        • sheikh_name: The name of the reciter.
        • audio_number: A sequential number for the audio files within a specific sheikh and Tajweed rule combination.
        • original_path: Full path to the original audio file.
        • new_path: Full path to the intended location for the renamed and potentially converted audio file.
      • Pandas DataFrame: The metadata is collected in a list of dictionaries and then converted into a Pandas DataFrame for easier viewing and processing. This DataFrame is highly informative.
    4. File Renaming and Conversion:

      • File Renaming: (commented out) The code can rename the audio files to the standardized format defined in new_filename and store them in the designated directory.
      • Audio Conversion to WAV: The script then converts any files in the specified directories to .wav format, creating standardized files in a new output_dataset directory. The new filenames are based on the rule, the sheikh, and a counter.
    5. Metadata Export: Finally, the compiled metadata is saved as a CSV file (metadata.csv) in the output directory. This CSV file is crucial for training any machine learning model using this data.
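
    A minimal sketch of the pipeline described above, under the assumption of a Drive layout like /content/drive/MyDrive/Tajweed/<rule>/<sheikh>/ (the actual directory and reciter names in the dataset may differ):

    import os
    import pandas as pd
    from pydub import AudioSegment

    # rule -> list of (sheikh_name, directory) pairs; these paths are illustrative only
    tajweed_paths = {
        "Ikhfa": [("Alaa_alhsri", "/content/drive/MyDrive/Tajweed/Ikhfa/Alaa_alhsri")],
    }
    records, global_id = [], 0
    os.makedirs("output_dataset", exist_ok=True)

    for label, sources in tajweed_paths.items():
        for sheikh_id, (sheikh_name, folder) in enumerate(sources, start=1):
            for audio_number, fname in enumerate(sorted(os.listdir(folder)), start=1):
                global_id += 1
                new_filename = f"{label}_{sheikh_id}_{audio_number}_{global_id}.wav"
                new_path = os.path.join("output_dataset", new_filename)
                # Convert each recording to WAV in the output directory
                AudioSegment.from_file(os.path.join(folder, fname)).export(new_path, format="wav")
                records.append({"global_id": global_id, "original_filename": fname,
                                "new_filename": new_filename, "label": label,
                                "sheikh_id": sheikh_id, "sheikh_name": sheikh_name,
                                "audio_number": audio_number,
                                "original_path": os.path.join(folder, fname),
                                "new_path": new_path})

    # Save the compiled metadata for downstream model training
    pd.DataFrame(records).to_csv("output_dataset/metadata.csv", index=False)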

  15. Banana Classification

    • kaggle.com
    Updated Apr 23, 2024
    Cite
    Atri Thakar (2024). Banana Classification [Dataset]. https://www.kaggle.com/datasets/atrithakar/banana-classification/code
    Explore at:
    Croissant
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Kaggle
    Authors
    Atri Thakar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is a dataset for detecting banana quality using ML. It contains four categories: Unripe, Ripe, Overripe, and Rotten. The dataset provides an enormous number of images, which will help users train an ML model conveniently and easily.

    NOTE: THIS DATASET HAS BEEN PICKED FROM https://universe.roboflow.com/roboflow-universe-projects/banana-ripeness-classification. I WAS FACING DIFFICULTIES WHILE DOWNLOADING THE DATASET DIRECTLY INTO GOOGLE COLAB TO TRAIN MY CNN MODEL AS PART OF A UNIVERSITY PROJECT. ALL CREDITS FOR THIS DATASET, AS FAR AS MY KNOWLEDGE GOES, GO TO ROBOFLOW. I DO NOT INTEND TO TAKE ANY CREDIT MYSELF OR UNETHICALLY CLAIM OWNERSHIP; I JUST UPLOADED THE DATASET HERE FOR MY CONVENIENCE. THANK YOU.

  16. Common Voice Corpus 5.1

    • kaggle.com
    zip
    Updated Sep 15, 2023
    Cite
    Krish Baisoya (2023). Common Voice Corpus 5.1 [Dataset]. https://www.kaggle.com/datasets/krishbaisoya/cv-en-5
    Explore at:
    Available download formats: zip (54,099,708,635 bytes)
    Dataset updated
    Sep 15, 2023
    Authors
    Krish Baisoya
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Common Voice is a corpus of speech data read by users on the Common Voice website, and based upon text from a number of public domain sources like user submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.

    How was it collected?

    In Google Colab, I downloaded the .tar.gz from Common Voice (Mozilla), placed the compressed file in a folder, marked the folder as a dataset, and uploaded it directly (a rough sketch of the download step follows).
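
    A rough sketch of that download step in a Colab cell; the URL below is a placeholder, not the real Common Voice download link, which must be obtained from the Common Voice website.

    !mkdir -p dataset
    # Placeholder URL: replace with the signed download link from commonvoice.mozilla.org
    !wget -O dataset/cv-corpus-5.1-en.tar.gz "https://example.com/cv-corpus-5.1/en.tar.gz"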

  17. Van Gogh vs Tree Oil Painting Ai Analysis

    • kaggle.com
    Updated May 6, 2025
    Cite
    HARONTHAI MONGBUNSRI (2025). Van Gogh vs Tree Oil Painting Ai Analysis [Dataset]. https://www.kaggle.com/datasets/haronthaimongbunsri/van-gogh-vs-tree-oil-painting-ai-analysis/code
    Explore at:
    Croissant
    Dataset updated
    May 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HARONTHAI MONGBUNSRI
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Title: AI model for artwork verification using brushstroke structure, rhythm, and X-ray analysis. Developed from 10 years of research and refined through AI Natural Matching.

    Overview

    This dataset presents an AI model developed specifically for artwork authentication. The goal is to determine whether two paintings—one known and one unknown—may have been created by the same artist, based on brushstroke analysis and structural features.

    This model demonstrates how machine learning can assist in verifying stylistic consistency using visual data, supported by structural scans such as X-ray imaging. The method was built from human-led comparisons and later refined into a formal system of AI-driven brushstroke analysis.

    Method: The 18 Supreme Techniques

    The model uses a set of 18 core visual techniques designed to analyze:

    Stroke rhythm and torque

    Brushstroke flow direction

    Angular deviation and energy pattern

    Layer buildup and underpainting patterns (via X-ray)

    These techniques were developed in Google Colab with AI support, based on over 10 years of image-based manual study. The system is optimized for AI Natural Matching, a visual neural comparison approach that mimics human perception but operates at scale.

    Requirements:

    1. High-resolution candidate image

    2. Reference image (known work by artist)

    3. X-ray scan of candidate image

    4. AI Natural Matching only (SSIM is strictly forbidden)

    This model must use AI Natural Matching. Do not use SSIM or any structural similarity index.

    Dataset Use

    This dataset is intended for researchers, AI developers, and art historians who wish to:

    Validate or disprove potential authorship of a painting

    Test the model using different artist references

    Study visual fingerprinting and stylistic consistency

    Scientific pigment data (XRF, FTIR, SEM) and aging process validation for The Tree Oil Painting are available in a separate dataset. Cross-checking with physical material data is strongly encouraged.

    Licensing and Attribution

    All data is licensed under CC BY 4.0 and freely available for academic, research, and AI development use.

    Model and research developed by Haronthai Mongbunsri (Independent Researcher, 2015–2025) AI structure refined through collaboration with neural tools via Google Colab.

    This dataset is part of an open effort to build transparent, reproducible systems for artwork verification.

    Reference: Scientific Verification Dataset on Hugging Face

    This analysis is built upon scientific pigment data, X-ray, and FTIR results hosted on Hugging Face:

    We strongly recommend reviewing this core dataset to understand the chemical and material basis behind the visual AI analysis.

  18. Generated-images

    • kaggle.com
    Updated Jun 1, 2023
    Cite
    Antoine Bonnet (2023). Generated-images [Dataset]. https://www.kaggle.com/datasets/antoinebonnet2001/generated-images
    Explore at:
    Croissant
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Antoine Bonnet
    Description

    This dataset was created during the challengeV2 of the INF473V course at École Polytechnique. It consists of additional images for the dataset, generated with Stable Diffusion. Code used to generate them: https://colab.research.google.com/drive/1zicIWGK7hd-TH_8tNJ4kgxrrPeHsgZWv?usp=sharing (a generic generation sketch is shown below).
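
    For readers without access to the linked notebook, here is a generic Stable Diffusion generation sketch using the diffusers library. This is only an assumption about the approach; the model id and prompt are placeholders, and the challenge notebook may use different settings.

    import torch
    from diffusers import StableDiffusionPipeline

    # Placeholder model id and prompt
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                                   torch_dtype=torch.float16).to("cuda")
    image = pipe("a photograph of a sample object from the challenge classes").images[0]
    image.save("generated_0.png")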

  19. SemEval2018 tweetDataset

    • kaggle.com
    Updated Aug 18, 2020
    Cite
    BM Abir (2020). SemEval2018 tweetDataset [Dataset]. https://www.kaggle.com/bmabir17/semeval2018-tweetdataset/discussion
    Explore at:
    Croissant
    Dataset updated
    Aug 18, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BM Abir
    Description

    Context

    The Dataset was obtained from the following source http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/SemEval2018-Task1-all-data.zip

    Content

    This dataset contains only the English EL-reg portion of the original dataset. It was preprocessed using the code written in this notebook's section for combining the dataset.

  20. chess object detection + yolov5 for chess

    • kaggle.com
    Updated Mar 27, 2022
    Cite
    Ahmed Haytham (2022). chess object detection + yolov5 for chess [Dataset]. https://www.kaggle.com/ahmedhaytham/chess-object-detection-yolov5-for-chess
    Explore at:
    Croissant
    Dataset updated
    Mar 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Haytham
    Description

    Why is it here?

    1. I just uploaded it here to make it easy for me and others to use.
    2. There is no data similar to it on Kaggle.

    The evaluation of my model is in yolov5/runs/train/exp/.

    Chess Pieces > 416x416_aug

    https://public.roboflow.ai/object-detection/chess-full

    Provided by Roboflow License: Public Domain

    Overview

    This is a dataset of Chess board photos and various pieces. All photos were captured from a constant angle, a tripod to the left of the board. The bounding boxes of all pieces are annotated as follows: white-king, white-queen, white-bishop, white-knight, white-rook, white-pawn, black-king, black-queen, black-bishop, black-knight, black-rook, black-pawn. There are 2894 labels across 292 images.

    Chess example image: https://i.imgur.com/nkjobw1.png

    Follow this tutorial to see an example of training an object detection model using this dataset or jump straight to the Colab notebook.

    Use Cases

    At Roboflow, we built a chess piece object detection model using this dataset.

    ChessBoss demo GIF: https://blog.roboflow.ai/content/images/2020/01/chess-detection-longer.gif

    You can see a video demo of that here. (We did struggle with pieces that were occluded, i.e. the state of the board at the very beginning of a game has many pieces obscured - let us know how your results fare!)

    Using this Dataset

    We're releasing the data free on a public license.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.

