22 datasets found
  1. Cats&Dogs (Pickle)

    • kaggle.com
    Updated Feb 27, 2020
    Cite
    FLuzmano (2020). Cats&Dogs (Pickle) [Dataset]. https://www.kaggle.com/fariziluzman/catsdogs-pickle/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    FLuzmano
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Dataset

    This dataset was created by FLuzmano

    Released under CC0: Public Domain

    Contents

    CNN

    For Google colab practice

  2. deaplearninexamAU2024

    • kaggle.com
    Updated Dec 5, 2024
    Cite
    christoffer fuglkjær (2024). deaplearninexamAU2024 [Dataset]. https://www.kaggle.com/datasets/christofferfuglkjr/deeplearninexam
    Explore at:
    Croissant
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    christoffer fuglkjær
    Description

    This is just a re-uploaded version of https://www.kaggle.com/datasets/ubitquitin/geolocation-geoguessr-images-50k?resource=download, but with the GeoGuessr UI cropped out and the countries sorted into regions. This dataset is only used to make reloading training data in Google Colab faster (a download sketch follows).
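
    A minimal sketch of pulling this reupload into a Colab session with the Kaggle API is shown below. It assumes kaggle.json credentials are already configured (as in the Edge-IIoTset instructions later on this page); the dataset slug comes from the citation URL above.

    !pip install -q kaggle
    # Download the dataset by its slug and unpack it into a local folder
    !kaggle datasets download -d christofferfuglkjr/deeplearninexam
    !unzip -q deeplearninexam.zip -d data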

  3. EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Explore at:
    Dataset updated
    Oct 16, 2023
    Description

    ABSTRACT In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, digital twins, the ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from more than 10 types of IoT devices, such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, a water level detection sensor, a pH sensor meter, a soil moisture sensor, a heart rate sensor, and a flame sensor. In addition, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, information gathering, man-in-the-middle attacks, injection attacks, and malware attacks. We also extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from the 1176 features found. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please visit the Kaggle link for updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.

    The details of the Edge-IIoTset dataset were published in the following paper. For academic/public use of these datasets, the authors have to cite the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:

    from google.colab import files

    !pip install -q kaggle

    files.upload()

    !mkdir ~/.kaggle

    !cp kaggle.json ~/.kaggle/

    !chmod 600 ~/.kaggle/kaggle.json

    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"

    !unzip DNN-EdgeIIoT-dataset.csv.zip

    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the dataset's CSV file into a Pandas DataFrame:

    import pandas as pd

    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3: Exploring some of the DataFrame's contents:

    df.head(5)

    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (columns, duplicated rows, NaN, null values):

    from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                    "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp",
                    "http.request.uri.query", "tcp.options", "tcp.payload", "tcp.srcport",
                    "tcp.dstport", "udp.port", "mqtt.msg"]

    df.drop(drop_columns, axis=1, inplace=True)

    df.dropna(axis=0, how='any', inplace=True)

    df.drop_duplicates(subset=None, keep="first", inplace=True)

    df = shuffle(df)

    df.isna().sum()

    print(df['Attack_type'].value_counts())

    Step 5: Categorical data encoding (dummy encoding):

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn import preprocessing

    def encode_text_dummy(df, name):
        dummies = pd.get_dummies(df[name])
        for x in dummies.columns:
            dummy_name = f"{name}-{x}"
            df[dummy_name] = dummies[x]
        df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df,'http.request.method')

    encode_text_dummy(df,'http.referer')

    encode_text_dummy(df,"http.request.version")

    encode_text_dummy(df,"dns.qry.name.len")

    encode_text_dummy(df,"mqtt.conack.flags")

    encode_text_dummy(df,"mqtt.protoname")

    encode_text_dummy(df,"mqtt.topic")

    Step 6: Creation of the preprocessed dataset:

    df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
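
    The imports in Step 5 (train_test_split, StandardScaler) point to a train/test split and feature scaling as the natural next step. The original instructions stop at Step 6, so the following is only a hedged sketch of that continuation, assuming Attack_type (used in Step 3) is the label column:

    X = df.drop('Attack_type', axis=1)
    y = df['Attack_type']

    # Stratified 80/20 split, then standardize features using training statistics only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)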

    For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023

  4. Accident Detection Model Dataset

    • universe.roboflow.com
    zip
    Updated Apr 8, 2024
    Cite
    Accident detection model (2024). Accident Detection Model Dataset [Dataset]. https://universe.roboflow.com/accident-detection-model/accident-detection-model/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Accident detection model
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Accident Bounding Boxes
    Description

    Accident-Detection-Model

    The Accident Detection Model is built using YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3200+ images, which were annotated on Roboflow.

    Problem Statement

    • Road accidents are a major problem in India, with thousands of people losing their lives and many more suffering serious injuries every year.
    • According to the Ministry of Road Transport and Highways, India witnessed around 4.5 lakh road accidents in 2019, which resulted in the deaths of more than 1.5 lakh people.
    • The age range that is most severely hit by road accidents is 18 to 45 years old, which accounts for almost 67 percent of all accidental deaths.

    Accidents survey

    Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png

    Literature Survey

    • Sreyan Ghosh (Mar 2019): the goal is to develop a system using a deep learning convolutional neural network trained to classify video frames as accident or non-accident.
    • Deeksha Gour (Sep 2019): uses computer vision technology, neural networks, deep learning, and various approaches and algorithms to detect objects.

    Research Gap

    • Lack of real-world data - we trained the model on more than 3200 images.
    • Large interpretability time and space needed - we use Google Colab to reduce the time and space required.
    • Outdated versions in previous works - we are using the latest version, YOLOv8.

    Proposed methodology

    • We are using YOLOv8 to train on our custom dataset of 3200+ images, collected from different platforms.
    • After training for 25 iterations, the model is ready to detect an accident with a significant probability.

    Model Set-up

    Preparing Custom dataset

    • We collected 1200+ images from different sources such as YouTube, Google Images, and Kaggle.com.
    • Then we annotated all of them individually on a tool called Roboflow.
    • During annotation, we marked images with no accident as NULL and drew a bounding box around the accident site in images containing one.
    • Then we divided the dataset into train, val, and test splits in the ratio 8:1:1.
    • As a final step, we downloaded the dataset in YOLOv8 format.
      Using Google Colab
    • We are using Google Colaboratory to code this model because Colab provides a GPU, which is faster than a typical local environment.
    • With Google Colab you can write and run Python code in Jupyter notebooks, which let you blend code, text, and visualisations in a single document.
    • Users can run individual code cells in Jupyter notebooks and quickly view the results, which is helpful for experimenting and debugging. They also enable visualisations that use well-known frameworks like Matplotlib, Seaborn, and Plotly.
    • In Google Colab, we first changed the runtime type to GPU.
    • We cross-checked this by running the command '!nvidia-smi'.
      Coding
    • First, we installed YOLOv8 with the command '!pip install ultralytics==8.0.20'.
    • We then imported it with 'from ultralytics import YOLO' and 'from IPython.display import display, Image'.
    • Then we connected and mounted our Google Drive account with 'from google.colab import drive' followed by 'drive.mount('/content/drive')'.
    • Then we ran the main training command: '%cd /content/drive/MyDrive/Accident Detection model' and '!yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True'.
    • After training, we ran commands to validate and test our model: '!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml' and '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images'.
    • To get results from any video or image, we ran this command: '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source="/content/drive/MyDrive/Accident-Detection-model/data/testing1.jpg/mp4"'.
    • The results are stored in the runs/detect/predict folder.
      Hence our model is trained, validated and tested to be able to detect accidents on any video or image. The commands above are consolidated in the sketch that follows.
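
    For reference, here is a hedged, consolidated sketch of the Colab workflow described in the bullets above. The project path, data.yaml name, epoch count, and confidence threshold are taken from those bullets; 'save=True' is added per the note in the Challenges section below.

    # Install and import YOLOv8 (Ultralytics)
    !pip install ultralytics==8.0.20
    from ultralytics import YOLO
    from IPython.display import display, Image

    # Mount Google Drive and move into the project folder listed above
    from google.colab import drive
    drive.mount('/content/drive')
    %cd "/content/drive/MyDrive/Accident Detection model"

    # Train, validate, and run prediction with the same arguments as in the bullets
    !yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True
    !yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml
    !yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images save=True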

    Challenges I ran into

    I ran into three main problems while making this model:

    • I had difficulty saving the results to a folder; since YOLOv8 is the latest version, it is still under development. After reading some blogs and referring to Stack Overflow, I learned that in the new v8 we need to pass an extra argument, 'save=True', which let me save my results in a folder.
    • I was facing a problem on the CVAT website because I was not sure what
  5. part1_dataSorted_Diversevul_llama2_dataset

    • huggingface.co
    Updated Mar 19, 2024
    Cite
    Atharva Prashant Pawar (2024). part1_dataSorted_Diversevul_llama2_dataset [Dataset]. https://huggingface.co/datasets/atharvapawar/part1_dataSorted_Diversevul_llama2_dataset
    Explore at:
    Croissant
    Dataset updated
    Mar 19, 2024
    Authors
    Atharva Prashant Pawar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset : part1_dataSorted_Diversevul_llama2_dataset

      dataset lines : 2768

      Kaggle Notebook (for dataset splitting) : https://www.kaggle.com/code/mrappplg/securix-diversevul-dataset

      Google Colab Notebook : https://colab.research.google.com/drive/1z6fLQrcMSe1-AVMHp0dp6uDr4RtVIOzF?usp=sharing
    
  6. hagrid-sample-120k-384p

    • huggingface.co
    Updated Jul 3, 2023
    + more versions
    Cite
    Christian Mills (2023). hagrid-sample-120k-384p [Dataset]. https://huggingface.co/datasets/cj-mills/hagrid-sample-120k-384p
    Explore at:
    Croissant
    Dataset updated
    Jul 3, 2023
    Authors
    Christian Mills
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 127,331 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.

      Original Authors:
    

    Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani

      Original Dataset Links
    

    GitHub, Kaggle Datasets Page

      Object Classes
    

    ['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-120k-384p.
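
    A minimal sketch of loading this sample in Colab with the Hugging Face datasets library; the 'train' split name is an assumption, so check the dataset card for the exact splits and features.

    from datasets import load_dataset

    # Assumes a 'train' split; see the dataset card for the actual configuration
    ds = load_dataset("cj-mills/hagrid-sample-120k-384p", split="train")
    print(ds)        # dataset summary (number of rows, column names)
    print(ds[0])     # first example (image plus annotation fields)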

  7. Robust Shelf Monitoring Dataset

    • universe.roboflow.com
    zip
    Updated Dec 14, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shelf Monitoring (2022). Robust Shelf Monitoring Dataset [Dataset]. https://universe.roboflow.com/shelf-monitoring/robust-shelf-monitoring/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 14, 2022
    Dataset authored and provided by
    Shelf Monitoring
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Stock Of Products In Shelf Bounding Boxes
    Description

    Robust Shelf Monitoring

    We aim to build a robust shelf monitoring system that helps storekeepers maintain accurate inventory details, re-stock items efficiently and on time, and tackle the problem of misplaced items, where an item is accidentally placed at a different location. Our product aims to serve as a store manager that alerts the owner about items that need re-stocking and about misplaced items.

    Training the model:

    • Unzip the labelled dataset from Kaggle and store it in your Google Drive.
    • Follow the tutorial and update the training parameters in the custom-yolov4-detector.cfg file in the /darknet/cfg/ directory.
    • filters = (number of classes + 5) * 3 for each yolo layer.
    • max_batches = (number of classes) * 2000 (both formulas are illustrated in the sketch after this list).
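
    As a quick aid, here is a minimal Python sketch of those two configuration formulas. The class count of 13 is only an illustrative assumption, not a value taken from this dataset.

    num_classes = 13                        # hypothetical example; use your own class count
    filters = (num_classes + 5) * 3         # filters for the conv layer before each yolo layer
    max_batches = num_classes * 2000        # suggested number of training iterations
    print(filters, max_batches)             # -> 54 26000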

    Steps to run the prediction colab notebook:

    1. Install the required dependencies: pymongo and dnspython.
    2. Clone the darknet repository and the required Python scripts.
    3. Mount the Google Drive containing the weight file.
    4. Copy the pre-trained weight file to the yolo content directory.
    5. Run the detect.py script to perform the prediction.

    Presenting the predicted result

    The detect.py script has an option to send SMS notifications to the shopkeepers. We have built a front-end for building the phone book that collects the shopkeepers' details. It also displays the latest prediction result and the model accuracy.
  8. gld20GB

    • kaggle.com
    Updated Sep 24, 2020
    Cite
    JkReddy (2020). gld20GB [Dataset]. https://www.kaggle.com/jkreddy/gld20gb
    Explore at:
    Croissant
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JkReddy
    Description

    Context

    It took a very long time (weeks) to make this dataset, and it gave me extensive data engineering experience. I used both GitHub and GCP for storage, and both Kaggle and Colab, to prepare this dataset. It would have been more useful to everyone had I done this much earlier.

    Content

    All images from the original set are included. To reduce the dataset size, all images have been resized to a minimum dimension of (224, 320) using the TensorFlow resize API (a minimal sketch of such a resize is shown below).
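
    The exact resizing pipeline is not included in the listing, so the following is only a minimal sketch of this kind of resize with TensorFlow; the target size (224, 320) follows the description above.

    import tensorflow as tf

    def resize_image(path, target_size=(224, 320)):
        # Read, decode, and resize a single image to the target dimensions
        data = tf.io.read_file(path)
        image = tf.io.decode_jpeg(data, channels=3)
        return tf.image.resize(image, target_size)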

    Acknowledgements

    I extensively used Stack Overflow to find the best solutions for many data engineering tasks; thanks to all those who solved these issues earlier.

    Inspiration

    The original dataset size of 99GB cannot be used in Colab to train the custom model.

  9. xView1 dataset yolov5

    • kaggle.com
    Updated Nov 29, 2023
    Cite
    Luigi Scotto Rosato (2023). xView1 dataset yolov5 [Dataset]. https://www.kaggle.com/datasets/luigiscottorosato/xview1-dataset-yolov5
    Explore at:
    Croissant
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luigi Scotto Rosato
    Description

    xView1 Adapted for YOLOv5 in Colab

    Overview:

    This dataset is a modified version of the xView1 dataset, specifically tailored for seamless integration with YOLOv5 in Google Colab. The xView1 dataset originally consists of high-resolution satellite imagery labeled for object detection tasks. In this adapted version, we have preprocessed the data and organized it to facilitate easy usage with YOLOv5, a popular deep learning framework for object detection.

    Dataset Contents:

    Images: The dataset includes a collection of high-resolution satellite images covering diverse geographic locations. These images have been resized and preprocessed to align with the requirements of YOLOv5, ensuring efficient training and testing.

    Annotations:

    Object annotations are provided for each image, specifying the bounding boxes and class labels of various objects present in the scenes. The annotations are formatted to match the YOLOv5 input specifications.

    Usage Instructions:

    1. Download the dataset files, including images and annotations.
    2. Clone the YOLOv5 repository in Colab.
    3. Move dataset files (train.txt and val.txt) to the yolov5 directory.
    4. Use the provided .yaml for training (a minimal command sketch follows this list).
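
    As a rough illustration only (not the dataset author's exact notebook), a typical YOLOv5 training run in Colab looks like the following; the file name xview.yaml stands in for the provided .yaml, and the image size, batch size, and epoch count are placeholders.

    !git clone https://github.com/ultralytics/yolov5
    %cd yolov5
    !pip install -r requirements.txt

    # Train on the provided data definition (replace xview.yaml with the supplied .yaml)
    !python train.py --img 640 --batch 16 --epochs 50 --data xview.yaml --weights yolov5s.pt
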
  10. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    Croissant
    Dataset updated
    Oct 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    import pandas as pd

    # load the file and create a pandas dataframe
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
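
    Since the dataset is meant for practicing GroupBy, here is a small illustrative aggregation over the filtered columns; it is not part of the original tutorial, just an example of the kind of query the file supports.

    # Average posted salary range per agency (illustrative only)
    filtered = pd.read_csv('/content/NYC_Jobs_filtered_cols.csv')
    print(filtered.groupby('Agency')[['Salary Range From', 'Salary Range To']].mean().head())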

  11. Sample Posts from the ADHD dataset.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Feb 6, 2025
    + more versions
    Cite
    Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Sample Posts from the ADHD dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. In LastBERT, a customized student BERT model, we significantly lowered the parameter count from BERT base's 110 million to 29 million, resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, F1 score of 85%, precision of 85%, and recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model's capacity to classify degrees of ADHD severity properly, offering a useful tool for mental health professionals to assess and understand material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also underlines the lower computational resources needed for training and deployment, facilitating broader applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in real-world applications.

  12. Nike,Adidas Shoes for Image Classification Dataset

    • kaggle.com
    Updated Jul 24, 2022
    Cite
    Ifeanyi Nneji (2022). Nike,Adidas Shoes for Image Classification Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/3980041
    Explore at:
    Croissant
    Dataset updated
    Jul 24, 2022
    Dataset provided by
    Kaggle
    Authors
    Ifeanyi Nneji
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset can be used to build a CNN model that can classify if a shoe is an Adidas or Nike brand.

    The images were pulled from Bing using bing_image_search from PyPI; 400 images of each class were downloaded, and the dataset was then trimmed to 300 (some unrelated images were removed in the process of compiling the dataset).

    Link to Notebook

  13. hagrid-classification-512p-no-gesture-150k

    • huggingface.co
    Updated Apr 2, 2025
    + more versions
    Cite
    Christian Mills (2025). hagrid-classification-512p-no-gesture-150k [Dataset]. https://huggingface.co/datasets/cj-mills/hagrid-classification-512p-no-gesture-150k
    Explore at:
    Croissant
    Dataset updated
    Apr 2, 2025
    Authors
    Christian Mills
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for "hagrid-classification-512p-no-gesture-150k"

    This dataset contains 153,735 training images from HaGRID (HAnd Gesture Recognition Image Dataset) modified for image classification instead of object detection. The original dataset is 716GB. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.

      Original Authors:
    

    Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-classification-512p-no-gesture-150k.

  14. Tajweed Dataset

    • kaggle.com
    Updated Apr 6, 2025
    Cite
    Ala'a Abdu Saleh Alawdi (2025). Tajweed Dataset [Dataset]. https://www.kaggle.com/datasets/alawdisoft/tajweed-dataset
    Explore at:
    Croissant
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ala'a Abdu Saleh Alawdi
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    The provided code processes a Tajweed dataset, which appears to be a collection of audio recordings categorized by different Tajweed rules (Ikhfa, Izhar, Idgham, Iqlab). Let's break down the dataset's structure and the code's functionality:

    Dataset Structure:

    • Organized by Tajweed Rule and Sheikh: The dataset is structured into directories for each Tajweed rule (e.g., 'Ikhfa', 'Izhar'). Within each rule's directory, there are subdirectories representing different reciters (sheikhs). This hierarchical organization is crucial for creating a structured metadata file and for training machine learning models.
    • Audio Files: The audio files (presumably WAV or other supported formats) are stored within the sheikh's subdirectories. The original filenames are not standardized.
    • Multiple Sheikhs per Rule: The dataset includes multiple recitations for each rule from different sheikhs, offering diversity in pronunciation.
    • Google Drive Storage: The dataset is located on Google Drive, which requires mounting the drive to access the data within a Colab environment.

    Code Functionality (a minimal sketch of the pipeline follows the numbered list below):

    1. Initialization and Imports: The code begins with necessary imports (pandas, pydub) and mounts Google Drive. Pydub is used for audio file format conversion.

    2. Directory Listing: It initially checks if a specified directory exists (for example, Alaa_alhsri/Ikhfa) and lists its files, demonstrating basic file system access.

    3. Metadata Creation: The core of the script is the generation of metadata, which provides essential information about each audio file. The tajweed_paths dictionary maps each Tajweed rule to a list of paths, associating each path with the reciter's name.

      • Iterating through Paths: The code iterates through each Tajweed rule and its corresponding paths.
      • File Listing: Inside each directory, it iterates through the audio files.
      • Metadata Dictionary: For each audio file, it creates a metadata dictionary that includes:
        • global_id: A unique identifier for each audio file.
        • original_filename: The original filename of the audio file.
        • new_filename: A standardized filename that incorporates the Tajweed rule (label), sheikh's ID, audio number, and a global ID.
        • label: The Tajweed rule.
        • sheikh_id: A numerical identifier for each sheikh.
        • sheikh_name: The name of the reciter.
        • audio_number: A sequential number for the audio files within a specific sheikh and Tajweed rule combination.
        • original_path: Full path to the original audio file.
        • new_path: Full path to the intended location for the renamed and potentially converted audio file.
      • Pandas DataFrame: The metadata is collected in a list of dictionaries and then converted into a Pandas DataFrame for easier viewing and processing. This DataFrame is highly informative.
    4. File Renaming and Conversion:

      • File Renaming: (commented out) The code can rename the audio files to the standardized format defined in new_filename and store them in the designated directory.
      • Audio Conversion to WAV: The script then converts any files in the specified directories to .wav format, creating standardized files in a new output_dataset directory. The new filenames are based on the rule, the sheikh, and a counter.
    5. Metadata Export: Finally, the compiled metadata is saved as a CSV file (metadata.csv) in the output directory. This CSV file is crucial for training any machine learning model using this data.
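
    A minimal sketch of the pipeline described above, under the assumption of a Drive layout like /content/drive/MyDrive/Tajweed/<rule>/<sheikh>/ (the actual directory and reciter names in the dataset may differ):

    import os
    import pandas as pd
    from pydub import AudioSegment

    # rule -> list of (sheikh_name, directory) pairs; these paths are illustrative only
    tajweed_paths = {
        "Ikhfa": [("Alaa_alhsri", "/content/drive/MyDrive/Tajweed/Ikhfa/Alaa_alhsri")],
    }
    records, global_id = [], 0
    os.makedirs("output_dataset", exist_ok=True)

    for label, sources in tajweed_paths.items():
        for sheikh_id, (sheikh_name, folder) in enumerate(sources, start=1):
            for audio_number, fname in enumerate(sorted(os.listdir(folder)), start=1):
                global_id += 1
                new_filename = f"{label}_{sheikh_id}_{audio_number}_{global_id}.wav"
                new_path = os.path.join("output_dataset", new_filename)
                # Convert each recording to WAV in the output directory
                AudioSegment.from_file(os.path.join(folder, fname)).export(new_path, format="wav")
                records.append({"global_id": global_id, "original_filename": fname,
                                "new_filename": new_filename, "label": label,
                                "sheikh_id": sheikh_id, "sheikh_name": sheikh_name,
                                "audio_number": audio_number,
                                "original_path": os.path.join(folder, fname),
                                "new_path": new_path})

    # Save the compiled metadata for downstream model training
    pd.DataFrame(records).to_csv("output_dataset/metadata.csv", index=False)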

  15. Banana Classification

    • kaggle.com
    Updated Apr 23, 2024
    Cite
    Atri Thakar (2024). Banana Classification [Dataset]. https://www.kaggle.com/datasets/atrithakar/banana-classification/code
    Explore at:
    Croissant
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Kaggle
    Authors
    Atri Thakar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is a dataset for detecting banana quality using ML. It contains four categories: Unripe, Ripe, Overripe, and Rotten. The dataset provides an enormous number of images, which will help users train an ML model conveniently and easily.

    NOTE: THIS DATASET HAS BEEN PICKED FROM https://universe.roboflow.com/roboflow-universe-projects/banana-ripeness-classification. I WAS FACING DIFFICULTIES WHILE DOWNLOADING THE DATASET DIRECTLY INTO GOOGLE COLAB TO TRAIN MY CNN MODEL AS PART OF A UNIVERSITY PROJECT. ALL CREDITS FOR THIS DATASET, AS FAR AS MY KNOWLEDGE GOES, GO TO ROBOFLOW. I DO NOT INTEND TO TAKE ANY CREDIT MYSELF OR UNETHICALLY CLAIM OWNERSHIP; I JUST UPLOADED THE DATASET HERE FOR MY CONVENIENCE. THANK YOU.

  16. Common Voice Corpus 5.1

    • kaggle.com
    zip
    Updated Sep 15, 2023
    Cite
    Krish Baisoya (2023). Common Voice Corpus 5.1 [Dataset]. https://www.kaggle.com/datasets/krishbaisoya/cv-en-5
    Explore at:
    Available download formats: zip (54,099,708,635 bytes)
    Dataset updated
    Sep 15, 2023
    Authors
    Krish Baisoya
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Common Voice is a corpus of speech data read by users on the Common Voice website, and based upon text from a number of public domain sources like user submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.

    How was it collected?

    In Google Colab, I downloaded the .tar.gz from Common Voice (Mozilla), placed the compressed file in a folder, marked the folder as a dataset, and uploaded it directly (a rough sketch of the download step follows).
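
    A rough sketch of that download step in a Colab cell; the URL below is a placeholder, not the real Common Voice download link, which must be obtained from the Common Voice website.

    !mkdir -p dataset
    # Placeholder URL: replace with the signed download link from commonvoice.mozilla.org
    !wget -O dataset/cv-corpus-5.1-en.tar.gz "https://example.com/cv-corpus-5.1/en.tar.gz"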

  17. Van Gogh vs Tree Oil Painting Ai Analysis

    • kaggle.com
    Updated May 6, 2025
    Cite
    HARONTHAI MONGBUNSRI (2025). Van Gogh vs Tree Oil Painting Ai Analysis [Dataset]. https://www.kaggle.com/datasets/haronthaimongbunsri/van-gogh-vs-tree-oil-painting-ai-analysis/code
    Explore at:
    Croissant
    Dataset updated
    May 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HARONTHAI MONGBUNSRI
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Title: AI model for artwork verification using brushstroke structure, rhythm, and X-ray analysis. Developed from 10 years of research and refined through AI Natural Matching.

    Overview

    This dataset presents an AI model developed specifically for artwork authentication. The goal is to determine whether two paintings—one known and one unknown—may have been created by the same artist, based on brushstroke analysis and structural features.

    This model demonstrates how machine learning can assist in verifying stylistic consistency using visual data, supported by structural scans such as X-ray imaging. The method was built from human-led comparisons and later refined into a formal system of AI-driven brushstroke analysis.

    Method: The 18 Supreme Techniques

    The model uses a set of 18 core visual techniques designed to analyze:

    Stroke rhythm and torque

    Brushstroke flow direction

    Angular deviation and energy pattern

    Layer buildup and underpainting patterns (via X-ray)

    These techniques were developed in Google Colab with AI support, based on over 10 years of image-based manual study. The system is optimized for AI Natural Matching, a visual neural comparison approach that mimics human perception but operates at scale.

    Requirements:

    1. High-resolution candidate image

    2. Reference image (known work by artist)

    3. X-ray scan of candidate image

    4. AI Natural Matching only (SSIM is strictly forbidden)

    This model must use AI Natural Matching. Do not use SSIM or any structural similarity index.

    Dataset Use

    This dataset is intended for researchers, AI developers, and art historians who wish to:

    Validate or disprove potential authorship of a painting

    Test the model using different artist references

    Study visual fingerprinting and stylistic consistency

    Scientific pigment data (XRF, FTIR, SEM) and aging process validation for The Tree Oil Painting are available in a separate dataset. Cross-checking with physical material data is strongly encouraged.

    Licensing and Attribution

    All data is licensed under CC BY 4.0 and freely available for academic, research, and AI development use.

    Model and research developed by Haronthai Mongbunsri (Independent Researcher, 2015–2025) AI structure refined through collaboration with neural tools via Google Colab.

    This dataset is part of an open effort to build transparent, reproducible systems for artwork verification.

    Reference: Scientific Verification Dataset on Hugging Face

    This analysis is built upon scientific pigment data, X-ray, and FTIR results hosted on Hugging Face:

    We strongly recommend reviewing this core dataset to understand the chemical and material basis behind the visual AI analysis.

  18. Generated-images

    • kaggle.com
    Updated Jun 1, 2023
    Cite
    Antoine Bonnet (2023). Generated-images [Dataset]. https://www.kaggle.com/datasets/antoinebonnet2001/generated-images
    Explore at:
    Croissant
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Antoine Bonnet
    Description

    This dataset was created during the challengeV2 of the INF473V course at École Polytechnique. It consists of additional images for the dataset, generated with Stable Diffusion. Code used to generate them: https://colab.research.google.com/drive/1zicIWGK7hd-TH_8tNJ4kgxrrPeHsgZWv?usp=sharing (a generic generation sketch is shown below).
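
    For readers without access to the linked notebook, here is a generic Stable Diffusion generation sketch using the diffusers library. This is only an assumption about the approach; the model id and prompt are placeholders, and the challenge notebook may use different settings.

    import torch
    from diffusers import StableDiffusionPipeline

    # Placeholder model id and prompt
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                                   torch_dtype=torch.float16).to("cuda")
    image = pipe("a photograph of a sample object from the challenge classes").images[0]
    image.save("generated_0.png")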

  19. SemEval2018 tweetDataset

    • kaggle.com
    Updated Aug 18, 2020
    Cite
    BM Abir (2020). SemEval2018 tweetDataset [Dataset]. https://www.kaggle.com/bmabir17/semeval2018-tweetdataset/discussion
    Explore at:
    Croissant
    Dataset updated
    Aug 18, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BM Abir
    Description

    Context

    The Dataset was obtained from the following source http://saifmohammad.com/WebDocs/AIT-2018/AIT2018-DATA/SemEval2018-Task1-all-data.zip

    Content

    This dataset contains only the English EL-reg portion of the original dataset. It was preprocessed using the code written in this notebook's section for combining the dataset.

  20. chess object detection + yolov5 for chess

    • kaggle.com
    Updated Mar 27, 2022
    Cite
    Ahmed Haytham (2022). chess object detection + yolov5 for chess [Dataset]. https://www.kaggle.com/ahmedhaytham/chess-object-detection-yolov5-for-chess
    Explore at:
    Croissant
    Dataset updated
    Mar 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Haytham
    Description

    Why is it here?

    1. I just uploaded it here to make it easy for me and others to use.
    2. There is no data similar to it on Kaggle.

    The evaluation of my model is in yolov5/runs/train/exp/.

    Chess Pieces > 416x416_aug

    https://public.roboflow.ai/object-detection/chess-full

    Provided by Roboflow License: Public Domain

    Overview

    This is a dataset of Chess board photos and various pieces. All photos were captured from a constant angle, a tripod to the left of the board. The bounding boxes of all pieces are annotated as follows: white-king, white-queen, white-bishop, white-knight, white-rook, white-pawn, black-king, black-queen, black-bishop, black-knight, black-rook, black-pawn. There are 2894 labels across 292 images.

    Chess example image: https://i.imgur.com/nkjobw1.png

    Follow this tutorial to see an example of training an object detection model using this dataset or jump straight to the Colab notebook.

    Use Cases

    At Roboflow, we built a chess piece object detection model using this dataset.

    ChessBoss demo GIF: https://blog.roboflow.ai/content/images/2020/01/chess-detection-longer.gif

    You can see a video demo of that here. (We did struggle with pieces that were occluded, i.e. the state of the board at the very beginning of a game has many pieces obscured - let us know how your results fare!)

    Using this Dataset

    We're releasing the data free on a public license.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.

