Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed to investigate the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done in a Jupyter notebook:
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains only the COCO 2017 train images (118K images) and a caption annotation JSON file, designed to fit within Google Colab's available disk space of approximately 50GB when connected to a GPU runtime.
If you're using PyTorch on Google Colab, you can easily utilize this dataset as follows:
Manually downloading and uploading the file to Colab can be time-consuming, so it is more efficient to download this data directly into Google Colab. Please ensure you have first added your Kaggle key to Google Colab; you can find more details on this process in Kaggle's API credentials documentation.
from google.colab import userdata
import os
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

# Make the Kaggle credentials stored in Colab secrets available to the Kaggle CLI
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')

# Download the dataset and unzip it
!kaggle datasets download -d seungjunleeofficial/coco2017-image-caption-train
!mkdir "/content/Dataset"
!unzip "coco2017-image-caption-train.zip" -d "/content/Dataset"

# Load the dataset
cap = dset.CocoCaptions(root='/content/Dataset/COCO2017 Image Captioning Train/train2017',
                        annFile='/content/Dataset/COCO2017 Image Captioning Train/captions_train2017.json',
                        transform=transforms.PILToTensor())
You can then use the dataset in the following way:
print(f"Number of samples: {len(cap)}")
img, target = cap[3]
print(img.shape)
print(target)
# Output example: torch.Size([3, 425, 640])
# ['A zebra grazing on lush green grass in a field.', 'Zebra reaching its head down to ground where grass is.',
# 'The zebra is eating grass in the sun.', 'A lone zebra grazing in some green grass.',
# 'A Zebra grazing on grass in a green open field.']
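Because each sample pairs a variable-size image tensor with a variable-length list of captions, the default DataLoader collation will fail; below is a minimal sketch of a custom collate_fn (the batch size is an illustrative assumption).

from torch.utils.data import DataLoader

def collate_captions(batch):
    # Keep images and caption lists as plain lists instead of stacking,
    # since image sizes and caption counts vary between samples
    images, captions = zip(*batch)
    return list(images), list(captions)

loader = DataLoader(cap, batch_size=16, shuffle=True, collate_fn=collate_captions)
images, captions = next(iter(loader))  # 16 images, 16 caption lists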
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 127,331 images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Original Authors:
Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani
Original Dataset Links
GitHub · Kaggle Datasets Page
Object Classes
['call'… See the full description on the dataset page: https://huggingface.co/datasets/cj-mills/hagrid-sample-120k-384p.
Please follow the steps below to download and use Kaggle data within Google Colab:
1) Upload your Kaggle credentials:
from google.colab import files
files.upload()
Choose the kaggle.json file that you downloaded.
2) Make a directory named .kaggle and copy the kaggle.json file there:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
3) Change the permissions of the file:
! chmod 600 ~/.kaggle/kaggle.json
4) Check that everything is okay by listing datasets:
! kaggle datasets list
That's all! Then use the unzip command to unzip the data, e.g. to unzip train.zip into a train directory:
! unzip train.zip -d train
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset: part1_dataSorted_Diversevul_llama2_dataset
Dataset lines: 2768
Kaggle Notebook (for dataset splitting): https://www.kaggle.com/code/mrappplg/securix-diversevul-dataset
Google Colab Notebook: https://colab.research.google.com/drive/1z6fLQrcMSe1-AVMHp0dp6uDr4RtVIOzF?usp=sharing
This dataset was created by Kiran Kolte
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 153,735 training images from HaGRID (HAnd Gesture Recognition Image Dataset) modified for image classification instead of object detection. The original dataset is 716GB. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Original Authors:
Alexander Kapitanov, Andrey Makhlyarchuk, Karina Kvanchiani
Original Dataset Links
GitHub · Kaggle Datasets Page
Readme
Link to video presentation: https://youtu.be/Ybz20H5reBI
Link to Colab: https://colab.research.google.com/drive/1zDY3D8hn8id8kgqX2QR5tmk22LnOccfc?usp=sharing
Link to Kaggle dataset: https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs/data?select=spotify_songs.csv
Dataset Description: This dataset comes from Kaggle: "30000 Spotify Songs". The dataset contains both numeric and categorical variables describing songs available on Spotify. It includes musical… See the full description on the dataset page: https://huggingface.co/datasets/uleeberber/finalproject_spotify.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Accident Detection Model is built with YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed or from any image or video provided. The model is trained on a dataset of 3,200+ images, annotated on Roboflow.
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
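The Ultralytics YOLOv8 API can run such a model on an image, video, or live camera source; below is a hedged sketch (the weights filename and source paths are illustrative assumptions, not the project's actual files).

from ultralytics import YOLO

# Load trained accident-detection weights (path is illustrative)
model = YOLO("accident_detection_best.pt")

# Run inference on a video file; source=0 would use a live webcam instead
results = model.predict(source="crash_clip.mp4", conf=0.5)
for r in results:
    print(r.boxes)  # detected bounding boxes per frame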
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by FLuzmano
Released under CC0: Public Domain
CNN
For Google Colab practice
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We aim to build a robust shelf-monitoring system that helps storekeepers maintain accurate inventory details, re-stock items efficiently and on time, and tackle the problem of misplaced items, where an item is accidentally placed at a different location. Our product aims to serve as a store manager that alerts the owner about items that need re-stocking and about misplaced items.
Place the custom-yolov4-detector.cfg file in the /darknet/cfg/ directory. Set filters = (number of classes + 5) * 3 for each [yolo] layer, and max_batches = (number of classes) * 2000, as illustrated in the sketch below. Run the detect.py script to perform the prediction.
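For illustration, a hypothetical two-class configuration following the formulas above (a sketch, not the project's actual cfg file):

# custom-yolov4-detector.cfg (excerpt, assuming 2 classes)
[convolutional]
filters=21        # (2 classes + 5) * 3, set before each [yolo] layer

[yolo]
classes=2

# in the training section:
max_batches=4000  # 2 classes * 2000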
Presenting the predicted result: the detect.py script has an option to send SMS notifications to the shopkeepers. We built a front-end for maintaining a phone book with the shopkeepers' details; it also displays the latest prediction result and model accuracy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. LastBERT, a customized student BERT model, reduces the parameter count from the 110 million of BERT base to 29 million, yielding a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. On a real-world ADHD dataset, the model achieved an accuracy of 85%, an F1 score of 85%, a precision of 85%, and a recall of 85%. Compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87% and ClinicalBERT achieving 86% across the same metrics. These findings highlight LastBERT's capacity to classify degrees of ADHD severity properly, offering a useful tool for mental health professionals to assess and understand material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also lowers the computational resources needed for training and deployment, enabling broader applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in real-world applications.
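As a hedged illustration of the general knowledge-distillation objective a student model like this is typically trained with (the standard soft-target loss; the temperature and weighting here are assumptions, not the paper's exact recipe):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label cross-entropy against the ground-truth labels
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target KL divergence against the teacher's tempered distribution,
    # scaled by T^2 to keep gradient magnitudes comparable
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd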
Dubai Real Estate – Exploratory Data Analysis (EDA)
Overview
This project presents an Exploratory Data Analysis (EDA) of residential real-estate listings in Dubai. The goal is to identify key factors influencing property prices using statistical exploration, data cleaning, and visual insights. The full analysis was performed in Google Colab. The dataset (dubai_real_estate.csv) is hosted on HuggingFace.
Dataset
Source: Kaggle – Dubai Real Estate Listings
File:… See the full description on the dataset page: https://huggingface.co/datasets/pelegelraz/DubaiRealEstateSalesInsights.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains the 509,323 training images from HaGRID (HAnd Gesture Recognition Image Dataset) downscaled to 384p. The original dataset is 716GB and contains 552,992 1080p images. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.
Object Classes
['call',
'no_gesture',
'dislike',
'fist',
'four',
'like',
'mute',
'ok',
'one',
'palm',
'peace',
'peace_inverted',
'rock',
'stop',
'stop_inverted',
'three',
'three2',
'two_up',
'two_up_inverted']
bboxes: [top-left-X-position, top-left-Y-position, width, height]. The values are normalized to the image dimensions: to convert to pixels, multiply the top-left-X-position and width values by the image width, and multiply the top-left-Y-position and height values by the image height.
| Field | Sample 00005c9c-3548-4a8f-9d0b-2dd4aff37fc9 |
|---|---|
| bboxes | [[0.23925175, 0.28595301, 0.25055143, 0.20777627]] |
| labels | [call] |
| leading_hand | right |
| leading_conf | 1 |
| user_id | 5a389ffe1bed6660a59f4586c7d8fe2770785e5bf79b09334aa951f6f119c024 |
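A minimal sketch of converting one of these normalized boxes to pixel coordinates, following the rule above (the image size used here is an illustrative assumption):

def bbox_to_pixels(bbox, img_width, img_height):
    # bbox = [top-left-x, top-left-y, width, height], all normalized to [0, 1]
    x, y, w, h = bbox
    return [x * img_width, y * img_height, w * img_width, h * img_height]

# Example with the sample row above, assuming a 512x384 (WxH) image
print(bbox_to_pixels([0.23925175, 0.28595301, 0.25055143, 0.20777627], 512, 384))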
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset simulates sales transactions for mobile phones and laptops, including product specifications, customer details, and sales information. It contains 50,000 rows of randomly generated data to help analyze product sales trends, customer purchasing behavior, and regional distribution of sales.
Dataset Overview
Purpose of the Dataset
This dataset can be used for:
✅ Sales Analysis – Understanding product demand and pricing trends.
✅ Customer Behavior Analysis – Identifying buying patterns across locations.
✅ Inventory Management – Monitoring inward and dispatched product movements.
✅ Machine Learning & AI – Predicting sales trends, customer preferences, and stock management.
Key Features in the Dataset
Product Information
Sales & Pricing Details
Customer & Location Details
Technical Specifications
- Core Specification (For Laptops): Includes processor models like i3, i5, i7, i9, Ryzen 3-9.
- Processor Specification (For Mobiles): Includes processors like Snapdragon, Exynos, Apple A-Series, and MediaTek Dimensity.
- RAM: Randomly assigned memory sizes (4GB to 32GB).
- ROM: Storage capacity (64GB to 1TB).
- SSD (For Laptops): Additional storage (256GB to 2TB), "N/A" for mobile phones.
Potential Use Cases:
Business Intelligence Dashboards
Market Trend Analysis
Supply Chain Optimization
Customer Segmentation
Machine Learning Model Training (Sales Prediction, Price Optimization, etc.)
COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark
Paper: https://www.arxiv.org/abs/2507.13405
Repository: https://github.com/corevqa/COREVQA
Demo: https://colab.research.google.com/drive/1SpuTta5tSzktiCo9xN4CtE9P1pmYV0ax
CrowdHuman Dataset Homepage: https://www.crowdhuman.org/
Abstract
Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA)… See the full description on the dataset page: https://huggingface.co/datasets/COREVQA2025/COREVQA.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Indonesian textile craftsmanship has evolved over millennia, transitioning from basic utilitarian weaving techniques around 2500 BC to intricate patterns carrying religious symbolism and social and cultural meaning, with production hubs across regions like Sumatra, Borneo, Java, Celebes, Nusa Tenggara, and Bali. These textiles evolved from utilitarian items to carriers of sacred meanings, divided into secular and sacred cloths, both renowned for their aesthetic beauty. They played a pivotal role in individuals' cultural journeys, symbolizing life stages like maternity, matrimony, and mortality, with designs reflecting religious beliefs and the era's influence. The Batik technique, a hallmark of Indonesian textile artistry, involves creating intricate patterns using a resist-wax method. Traditionally, artisans used a tool called a canting to draw patterns on fabric, a process known as batik tulis (drawn batik). Following the drawing phase, the cloth was dyed using natural dyes, then subjected to the "lorot" process, in which the wax is boiled out of the fabric.
Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19051508%2Fe543b4e91ad5dffe2b54e7f4300cc7b2%2F2024-02-16%2015.09.06%20copy%202.jpg?generation=1708074019154098&alt=media
Batik making is revered for its complexity and demands high craftsmanship, requiring precise hand gestures and mastery of the canting tool. It stands as one of the most challenging pattern-making techniques in textile artistry. [1]
The primary objective of this dataset is to serve as a resource for research, academic, or educational purposes rather than commercial endeavors. The dataset was meticulously compiled to include high-quality images representative of various types of Batik, encompassing the rich diversity of Batik Nusantara or Indonesian Batik from the Aceh to Papua regions.
Andrew has mentioned that the cornerstone of effective machine learning lies in the quality of the data: meticulously curated datasets hold the power to unlock valuable insights and drive meaningful results. In other words, data is more important than models. In contrast, datasets lacking in quality may hinder the learning process and lead to suboptimal outcomes. Therefore, prioritizing data quality is paramount, as it lays the foundation for successful machine learning initiatives [2]. Sebastian likewise notes that the effectiveness of a machine learning algorithm greatly depends on the quality of the data and the richness of the information it encapsulates [3].
This dataset was meticulously collected with the assistance of Ultralytics. The ownership of all images within this dataset belongs to the respective parties, to whom we extend our gratitude for contributing these visually captivating images.
[Dataset creator's name]. ([Year & Month of dataset creation]). [Name of the dataset], [Version of the dataset]. Retrieved [Date Retrieved] from [URL of the dataset].
Comprising 40 raw images per class at a dimension of 224 x 224, this dataset encompasses a wide array of Batik designs, each representing a distinct category. The classes include 'Aceh PintuAceh', 'Bali Barong', 'Bali Merak', 'DKI OndelOndel', 'JawaBarat Megamendung', 'JawaTimur Pring', 'Kalimantan Dayak', 'Lampung Gajah', 'Madura Mataketeran', 'Maluku Pala', 'NTB Lumbung', 'Papua Asmat', 'Papua Cendrawasih', 'Papua Tifa', 'Solo Parang', 'SulawesiSelatan Lontara', 'SumateraBarat Rumah Minang', 'SumateraUtara Boraspati', 'Yogyakarta Kawung', and 'Yogyakarta Parang' [2][3][4][5][6][7]. These classes collectively portray the rich heritage of Batik Nusantara or Batik Indonesia, spanning from the Aceh to Papua regions.
Feel free to explore image augmentation techniques to further enhance the dataset.
Simple example code is available in the Git repository, assuming Colab is used. For reference, the following pre-trained architectures have been added: VGG16, ResNet50, Xception, and MobileNetV2, along with Content-Based Image Retrieval (CBIR), Random Forest, a CNN architecture, and modeling, in addition to the MLP. The code is also available in the Kaggle Dataset Notebooks (Code) tab.
Below are steps to utilise the dataset using either Google Colab or Jupyter Notebook:
1. Begin by downloading the dataset.
2. Upon extraction, you'll find separate folders for training and testing data. Should you require validation data, either manually split a portion (approximately 20%) from the training set and store it separately, or perform an on-the-fly split during coding (see the sketch after this list).
3. If splitting validation data manually, remember to re-zip the dataset after the separation process.
4....
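A minimal sketch of the on-the-fly validation split from step 2, assuming TensorFlow/Keras and an extracted training folder (the directory name is an illustrative assumption):

import tensorflow as tf

# Carve ~20% of the training images off as validation at load time
train_ds = tf.keras.utils.image_dataset_from_directory(
    "batik/train", image_size=(224, 224), validation_split=0.2,
    subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "batik/train", image_size=(224, 224), validation_split=0.2,
    subset="validation", seed=42)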
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the csv file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
         'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
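Since the dataset accompanies a GroupBy tutorial, here is a quick example of the kind of aggregation it supports, using the columns kept above (a sketch, not part of the tutorial itself):

# Average posted starting salary per agency, highest first
avg_start = df.groupby('Agency')['Salary Range From'].mean().sort_values(ascending=False)
print(avg_start.head())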
GNU Lesser General Public License v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
This dataset contains a comprehensive collection of anime entries from MyAnimeList.net, updated to reflect the latest titles as of 2025. It is ideal for performing Exploratory Data Analysis (EDA) and building robust anime recommendation systems, including collaborative filtering, content-based methods, and hybrid approaches.
1. myanimelist_recommender_ready.csv: contains core metadata for each anime, such as mal_id, title, score, members, genres, type, episodes, synopsis, etc.
2. anime_reviews.json (imperfect; a future update is planned): a separate JSON file containing the top 1–10 user reviews for each anime (based on availability), scraped using the Jikan API and stored through Google Firebase.
Review scraping note: I attempted to scrape and save user reviews for each anime using a custom Python script that used Google Colab for execution, stored data directly into Firebase Firestore, and collected up to 10 top reviews per anime using the Jikan API (a sketch of this kind of scraper appears below).
However, after reaching around 9,000 entries, the Colab runtime disconnected. Although I implemented a resume feature to continue scraping from a specific ID, a logic bug introduced incorrect mapping of reviews to anime IDs in Firebase, resulting in misplaced review records.
This is being fixed, and a properly cleaned version of the reviews will be uploaded in a future update.
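A minimal sketch of this kind of scraper, assuming the Jikan v4 REST API (the ID list and resume logic here are illustrative, not the author's actual script):

import time
import requests

def top_reviews(mal_id, limit=10):
    # Jikan v4: fetch reviews for one anime and keep up to `limit` of them
    resp = requests.get(f"https://api.jikan.moe/v4/anime/{mal_id}/reviews")
    resp.raise_for_status()
    return resp.json()["data"][:limit]

for mal_id in [1, 5, 6]:  # illustrative MAL IDs; a real run would resume from the last saved ID
    reviews = top_reviews(mal_id)
    # A correct resume must key each record by mal_id before writing to Firestore,
    # otherwise reviews get mapped to the wrong anime (the bug described above)
    print(mal_id, len(reviews))
    time.sleep(1)  # respect Jikan's rate limits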
### Use Cases:
This dataset is great for:
- Anime recommendation systems (content-based, collaborative, hybrid)
- Natural Language Processing (NLP) on anime reviews
- Clustering anime by genres, type, or user ratings
- Sentiment analysis on review text
- Visualization of anime trends and metadata
Credits:
Data Source: MyAnimeList.net
Scraped via: Jikan REST API
Backend: Firebase Firestore
Runtime: Google Colab
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 125,912 training images from HaGRID (HAnd Gesture Recognition Image Dataset) modified for image classification instead of object detection. This version contains a separate folder with 27,823 sample images containing no gestures, for a total of 153,787 training samples. The original dataset is 716GB. I created this sample for a tutorial so readers can use the dataset in the free tiers of Google Colab and Kaggle Notebooks.