Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed to investigate the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done in a Jupyter notebook:
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains only the COCO 2017 train images (118K images) and a caption annotation JSON file, designed to fit within Google Colab's available disk space of approximately 50GB when connected to a GPU runtime.
If you're using PyTorch on Google Colab, you can easily utilize this dataset as follows:
Manually downloading and uploading the file to Colab can be time-consuming, so it is more efficient to download this data directly into Google Colab. Please ensure you have first added your Kaggle key to Google Colab; you can find more details on this process in Kaggle's API credentials documentation.
from google.colab import userdata
import os
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

# Make the Kaggle credentials stored in Colab secrets available to the Kaggle CLI
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')

# Download the dataset and unzip it
!kaggle datasets download -d seungjunleeofficial/coco2017-image-caption-train
!mkdir "/content/Dataset"
!unzip "coco2017-image-caption-train.zip" -d "/content/Dataset"

# Load the dataset
cap = dset.CocoCaptions(root='/content/Dataset/COCO2017 Image Captioning Train/train2017',
                        annFile='/content/Dataset/COCO2017 Image Captioning Train/captions_train2017.json',
                        transform=transforms.PILToTensor())
You can then use the dataset in the following way:
print(f"Number of samples: {len(cap)}")
img, target = cap[3]
print(img.shape)
print(target)
# Output example: torch.Size([3, 425, 640])
# ['A zebra grazing on lush green grass in a field.', 'Zebra reaching its head down to ground where grass is.',
# 'The zebra is eating grass in the sun.', 'A lone zebra grazing in some green grass.',
# 'A Zebra grazing on grass in a green open field.']
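Because each sample pairs a variable-size image tensor with a variable-length list of captions, the default DataLoader collation will fail; below is a minimal sketch of a custom collate_fn (the batch size is an illustrative assumption).

from torch.utils.data import DataLoader

def collate_captions(batch):
    # Keep images and caption lists as plain lists instead of stacking,
    # since image sizes and caption counts vary between samples
    images, captions = zip(*batch)
    return list(images), list(captions)

loader = DataLoader(cap, batch_size=16, shuffle=True, collate_fn=collate_captions)
images, captions = next(iter(loader))  # 16 images, 16 caption lists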
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 127,331 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Original Authors:
Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani
Original Dataset Links
GitHub · Kaggle Datasets Page
Object Classes
['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-120k-384p.
Please follow the steps below to download and use Kaggle data within Google Colab:
1) Upload your Kaggle credentials:
from google.colab import files
files.upload()
Choose the kaggle.json file that you downloaded.
2) Make a directory named .kaggle and copy the kaggle.json file there:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
3) Change the permissions of the file:
! chmod 600 ~/.kaggle/kaggle.json
4) Check that everything is okay by listing datasets:
! kaggle datasets list
That's all! Then use the unzip command to unzip the data, e.g. to unzip train.zip into a train directory:
! unzip train.zip -d train
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset: part1_dataSorted_Diversevul_llama2_dataset
Dataset lines: 2768
Kaggle Notebook (for dataset splitting): https://www.kaggle.com/code/mrappplg/securix-diversevul-dataset
Google Colab Notebook: https://colab.research.google.com/drive/1z6fLQrcMSe1-AVMHp0dp6uDr4RtVIOzF?usp=sharing
This dataset was created by Kiran Kolte
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 153,735 training images from HaGRID (HAnd Gesture Recognition Image Dataset) modified for image classification instead of object detection. The original dataset is 716GB. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Original Authors:
Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani
Original Dataset Links
GitHub · Kaggle Datasets Page
Readme
Link to video presentation: https://youtu.be/Ybz20H5reBI
Link to Colab: https://colab.research.google.com/drive/1zDY3D8hn8id8kgqX2QR5tmk22LnOccfc?usp=sharing
Link to Kaggle dataset: https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs/data?select=spotify_songs.csv
Dataset Description: This dataset comes from Kaggle: "30000 Spotify Songs". The dataset contains both numeric and categorical variables describing songs available on Spotify. It includes musical… See the full description on the dataset page: https://huggingface.co/datasets/uleeberber/finalproject_spotify.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Accident Detection Model is built with YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed or from any image or video provided. The model is trained on a dataset of 3,200+ images, annotated on Roboflow.
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
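The Ultralytics YOLOv8 API can run such a model on an image, video, or live camera source; below is a hedged sketch (the weights filename and source paths are illustrative assumptions, not the project's actual files).

from ultralytics import YOLO

# Load trained accident-detection weights (path is illustrative)
model = YOLO("accident_detection_best.pt")

# Run inference on a video file; source=0 would use a live webcam instead
results = model.predict(source="crash_clip.mp4", conf=0.5)
for r in results:
    print(r.boxes)  # detected bounding boxes per frame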
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by FLuzmano
Released under CC0: Public Domain
CNN
For Google Colab practice
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We aim to build a robust shelf-monitoring system that helps storekeepers maintain accurate inventory details, re-stock items efficiently and on time, and tackle the problem of misplaced items, where an item is accidentally placed at a different location. Our product aims to serve as a store manager that alerts the owner about items that need re-stocking and about misplaced items.
Place the custom-yolov4-detector.cfg file in the /darknet/cfg/ directory. Set filters = (number of classes + 5) * 3 for each [yolo] layer, and max_batches = (number of classes) * 2000, as illustrated in the sketch below. Run the detect.py script to perform the prediction.
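For illustration, a hypothetical two-class configuration following the formulas above (a sketch, not the project's actual cfg file):

# custom-yolov4-detector.cfg (excerpt, assuming 2 classes)
[convolutional]
filters=21        # (2 classes + 5) * 3, set before each [yolo] layer

[yolo]
classes=2

# in the training section:
max_batches=4000  # 2 classes * 2000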
Presenting the predicted result: the detect.py script has an option to send SMS notifications to the shopkeepers. We built a front-end for maintaining a phone book with the shopkeepers' details; it also displays the latest prediction result and model accuracy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. LastBERT, a customized student BERT model, reduces the parameter count from the 110 million of BERT base to 29 million, yielding a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. On a real-world ADHD dataset, the model achieved an accuracy of 85%, an F1 score of 85%, a precision of 85%, and a recall of 85%. Compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87% and ClinicalBERT achieving 86% across the same metrics. These findings highlight LastBERT's capacity to classify degrees of ADHD severity properly, offering a useful tool for mental health professionals to assess and understand material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also lowers the computational resources needed for training and deployment, enabling broader applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in real-world applications.
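As a hedged illustration of the general knowledge-distillation objective a student model like this is typically trained with (the standard soft-target loss; the temperature and weighting here are assumptions, not the paper's exact recipe):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label cross-entropy against the ground-truth labels
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target KL divergence against the teacher's tempered distribution,
    # scaled by T^2 to keep gradient magnitudes comparable
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd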
Dubai Real Estate – Exploratory Data Analysis (EDA)
Overview
This project presents an Exploratory Data Analysis (EDA) of residential real-estate listings in Dubai. The goal is to identify key factors influencing property prices using statistical exploration, data cleaning, and visual insights. The full analysis was performed in Google Colab. The dataset (dubai_real_estate.csv) is hosted on HuggingFace.
Dataset
Source: Kaggle – Dubai Real Estate Listings
File:… See the full description on the dataset page: https://huggingface.co/datasets/pelegelraz/DubaiRealEstateSalesInsights.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains the 509,323 training images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Object Classes
['call',
'no_gesture',
'dislike',
'fist',
'four',
'like',
'mute',
'ok',
'one',
'palm',
'peace',
'peace_inverted',
'rock',
'stop',
'stop_inverted',
'three',
'three2',
'two_up',
'two_up_inverted']
bboxes: [top-left-X-position, top-left-Y-position, width, height]. The values are normalized to the image dimensions: to convert to pixels, multiply the top-left-X-position and width values by the image width, and multiply the top-left-Y-position and height values by the image height.
| Field | Sample 00005c9c-3548-4a8f-9d0b-2dd4aff37fc9 |
|---|---|
| bboxes | [[0.23925175, 0.28595301, 0.25055143, 0.20777627]] |
| labels | [call] |
| leading_hand | right |
| leading_conf | 1 |
| user_id | 5a389ffe1bed6660a59f4586c7d8fe2770785e5bf79b09334aa951f6f119c024 |
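A minimal sketch of converting one of these normalized boxes to pixel coordinates, following the rule above (the image size used here is an illustrative assumption):

def bbox_to_pixels(bbox, img_width, img_height):
    # bbox = [top-left-x, top-left-y, width, height], all normalized to [0, 1]
    x, y, w, h = bbox
    return [x * img_width, y * img_height, w * img_width, h * img_height]

# Example with the sample row above, assuming a 512x384 (WxH) image
print(bbox_to_pixels([0.23925175, 0.28595301, 0.25055143, 0.20777627], 512, 384))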
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset simulates sales transactions for mobile phones and laptops, including product specifications, customer details, and sales information. It contains 50,000 rows of randomly generated data to help analyze product sales trends, customer purchasing behavior, and regional distribution of sales.
Dataset Overview
Purpose of the Dataset
This dataset can be used for:
✅ Sales Analysis – Understanding product demand and pricing trends.
✅ Customer Behavior Analysis – Identifying buying patterns across locations.
✅ Inventory Management – Monitoring inward and dispatched product movements.
✅ Machine Learning & AI – Predicting sales trends, customer preferences, and stock management.
Key Features in the Dataset
Product Information
Sales & Pricing Details
Customer & Location Details
Technical Specifications
- Core Specification (For Laptops): Includes processor models like i3, i5, i7, i9, Ryzen 3-9.
- Processor Specification (For Mobiles): Includes processors like Snapdragon, Exynos, Apple A-Series, and MediaTek Dimensity.
- RAM: Randomly assigned memory sizes (4GB to 32GB).
- ROM: Storage capacity (64GB to 1TB).
- SSD (For Laptops): Additional storage (256GB to 2TB), "N/A" for mobile phones.
Potential Use Cases:
Business Intelligence Dashboards
Market Trend Analysis
Supply Chain Optimization
Customer Segmentation
Machine Learning Model Training (Sales Prediction, Price Optimization, etc.)
COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark
Paper: https://www.arxiv.org/abs/2507.13405
Repository: https://github.com/corevqa/COREVQA
Demo: https://colab.research.google.com/drive/1SpuTta5tSzktiCo9xN4CtE9P1pmYV0ax
CrowdHuman Dataset Homepage: https://www.crowdhuman.org/
Abstract
Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA)… See the full description on the dataset page: https://huggingface.co/datasets/COREVQA2025/COREVQA.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Indonesian textile craftsmanship has evolved over millennia, transitioning from basic utilitarian weaving techniques around 2500 BC to intricate patterns carrying religious symbolism and social and cultural meaning, with production hubs across regions like Sumatra, Borneo, Java, Celebes, Nusa Tenggara, and Bali. These textiles evolved from utilitarian items to carriers of sacred meanings, divided into secular and sacred cloths, both renowned for their aesthetic beauty. They played a pivotal role in individuals' cultural journeys, symbolizing life stages like maternity, matrimony, and mortality, with designs reflecting religious beliefs and the era's influence. The Batik technique, a hallmark of Indonesian textile artistry, involves creating intricate patterns using a resist-wax method. Traditionally, artisans used a tool called a canting to draw patterns on fabric, a process known as batik tulis (drawn batik). Following the drawing phase, the cloth was dyed using natural dyes, then subjected to the "lorot" process, in which the wax is boiled out of the fabric.
Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19051508%2Fe543b4e91ad5dffe2b54e7f4300cc7b2%2F2024-02-16%2015.09.06%20copy%202.jpg?generation=1708074019154098&alt=media
Batik making is revered for its complexity and demands high craftsmanship, requiring precise hand gestures and mastery of the canting tool. It stands as one of the most challenging pattern-making techniques in textile artistry. [1]
The primary objective of this dataset is to serve as a resource for research, academic, or educational purposes rather than commercial endeavors. The dataset was meticulously compiled to include high-quality images representative of various types of Batik, encompassing the rich diversity of Batik Nusantara or Indonesian Batik from the Aceh to Papua regions.
Andrew has mentioned that the cornerstone of effective machine learning lies in the quality of the data: meticulously curated datasets hold the power to unlock valuable insights and drive meaningful results. In other words, data is more important than models. In contrast, datasets lacking in quality may hinder the learning process and lead to suboptimal outcomes. Therefore, prioritizing data quality is paramount, as it lays the foundation for successful machine learning initiatives [2]. Sebastian likewise notes that the effectiveness of a machine learning algorithm greatly depends on the quality of the data and the richness of the information it encapsulates [3].
This dataset was meticulously collected with the assistance of Ultralytics. The ownership of all images within this dataset belongs to the respective parties, to whom we extend our gratitude for contributing these visually captivating images.
[Dataset creator's name]. ([Year & Month of dataset creation]). [Name of the dataset], [Version of the dataset]. Retrieved [Date Retrieved] from [URL of the dataset].
Comprising 40 raw images per class at a dimension of 224 x 224, this dataset encompasses a wide array of Batik designs, each representing a distinct category. The classes include 'Aceh PintuAceh', 'Bali Barong', 'Bali Merak', 'DKI OndelOndel', 'JawaBarat Megamendung', 'JawaTimur Pring', 'Kalimantan Dayak', 'Lampung Gajah', 'Madura Mataketeran', 'Maluku Pala', 'NTB Lumbung', 'Papua Asmat', 'Papua Cendrawasih', 'Papua Tifa', 'Solo Parang', 'SulawesiSelatan Lontara', 'SumateraBarat Rumah Minang', 'SumateraUtara Boraspati', 'Yogyakarta Kawung', and 'Yogyakarta Parang' [2][3][4][5][6][7]. These classes collectively portray the rich heritage of Batik Nusantara or Batik Indonesia, spanning from the Aceh to Papua regions.
Feel free to explore image augmentation techniques to further enhance the dataset.
Simple example code is available in the Git repository, assuming Colab is used. For reference, the following pre-trained architectures have been added: VGG16, ResNet50, Xception, and MobileNetV2, along with Content-Based Image Retrieval (CBIR), Random Forest, a CNN architecture, and modeling, in addition to the MLP. The code is also available in the Kaggle Dataset Notebooks (Code) tab.
Below are steps to utilise the dataset using either Google Colab or Jupyter Notebook:
1. Begin by downloading the dataset.
2. Upon extraction, you'll find separate folders for training and testing data. Should you require validation data, either manually split a portion (approximately 20%) from the training set and store it separately, or perform an on-the-fly split during coding (see the sketch after this list).
3. If splitting validation data manually, remember to re-zip the dataset after the separation process.
4....
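A minimal sketch of the on-the-fly validation split from step 2, assuming TensorFlow/Keras and an extracted training folder (the directory name is an illustrative assumption):

import tensorflow as tf

# Carve ~20% of the training images off as validation at load time
train_ds = tf.keras.utils.image_dataset_from_directory(
    "batik/train", image_size=(224, 224), validation_split=0.2,
    subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "batik/train", image_size=(224, 224), validation_split=0.2,
    subset="validation", seed=42)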
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the csv file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
         'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
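Since the dataset accompanies a GroupBy tutorial, here is a quick example of the kind of aggregation it supports, using the columns kept above (a sketch, not part of the tutorial itself):

# Average posted starting salary per agency, highest first
avg_start = df.groupby('Agency')['Salary Range From'].mean().sort_values(ascending=False)
print(avg_start.head())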
GNU Lesser General Public License v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
This dataset contains a comprehensive collection of anime entries from MyAnimeList.net, updated to reflect the latest titles as of 2025. It is ideal for performing Exploratory Data Analysis (EDA) and building robust anime recommendation systems, including collaborative filtering, content-based methods, and hybrid approaches.
1. myanimelist_recommender_ready.csv: contains core metadata for each anime, such as mal_id, title, score, members, genres, type, episodes, synopsis, etc.
2. anime_reviews.json (imperfect; a future update is planned): a separate JSON file containing the top 1–10 user reviews for each anime (based on availability), scraped using the Jikan API and stored through Google Firebase.
Review scraping note: I attempted to scrape and save user reviews for each anime using a custom Python script that used Google Colab for execution, stored data directly into Firebase Firestore, and collected up to 10 top reviews per anime using the Jikan API (a sketch of this kind of scraper appears below).
However, after reaching around 9,000 entries, the Colab runtime disconnected. Although I implemented a resume feature to continue scraping from a specific ID, a logic bug introduced incorrect mapping of reviews to anime IDs in Firebase, resulting in misplaced review records.
This is being fixed, and a properly cleaned version of the reviews will be uploaded in a future update.
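A minimal sketch of this kind of scraper, assuming the Jikan v4 REST API (the ID list and resume logic here are illustrative, not the author's actual script):

import time
import requests

def top_reviews(mal_id, limit=10):
    # Jikan v4: fetch reviews for one anime and keep up to `limit` of them
    resp = requests.get(f"https://api.jikan.moe/v4/anime/{mal_id}/reviews")
    resp.raise_for_status()
    return resp.json()["data"][:limit]

for mal_id in [1, 5, 6]:  # illustrative MAL IDs; a real run would resume from the last saved ID
    reviews = top_reviews(mal_id)
    # A correct resume must key each record by mal_id before writing to Firestore,
    # otherwise reviews get mapped to the wrong anime (the bug described above)
    print(mal_id, len(reviews))
    time.sleep(1)  # respect Jikan's rate limits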
### Use Cases:
This dataset is great for:
- Anime recommendation systems (content-based, collaborative, hybrid)
- Natural Language Processing (NLP) on anime reviews
- Clustering anime by genres, type, or user ratings
- Sentiment analysis on review text
- Visualization of anime trends and metadata
Credits:
Data Source: MyAnimeList.net
Scraped via: Jikan REST API
Backend: Firebase Firestore
Runtime: Google Colab
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 125,912 training images from HaGRID (HAnd Gesture Recognition Image Dataset) modified for image classification instead of object detection. This version contains a separate folder with 27,823 sample images containing no gestures, for a total of 153,787 training samples. The original dataset is 716GB. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.