Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle's community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code's author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
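As a minimal sketch of that id-to-path mapping (a hypothetical helper; whether folder names are zero-padded and which file extension applies are assumptions):

```python
def kernel_version_path(version_id: int, ext: str = "ipynb") -> str:
    # Top-level folder groups ids by millions: 123 holds 123,000,000-123,999,999.
    top = version_id // 1_000_000
    # Sub-folder groups by thousands: 123/456 holds 123,456,000-123,456,999.
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}/{version_id}.{ext}"

print(kernel_version_path(123_456_789))  # -> "123/456/123456789.ipynb"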
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
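For the requester-pays bucket, here is a hedged download sketch using the google-cloud-storage client (the billing project id and object path are placeholders; your GCP project is billed for the transfer):

```python
from google.cloud import storage  # pip install google-cloud-storage

BILLING_PROJECT = "your-gcp-project"  # placeholder: a project with billing enabled

client = storage.Client(project=BILLING_PROJECT)
# user_project tells GCS which project to bill for this requester-pays access.
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project=BILLING_PROJECT)
blob = bucket.blob("123/456/123456789.ipynb")  # placeholder object path
blob.download_to_filename("123456789.ipynb")
```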
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.
The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.
In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.
More specifically, the package comprises the following three compressed archives:
KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;
KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;
MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.
Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle
Gholamreza/test-dataset-kaggle is a dataset hosted on Hugging Face and contributed by the HF Datasets community.
This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is clean, simple, and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks: predict total_score from weekly_self_study_hours, attendance_percentage, and class_participation (see the sketch below).
Classification Tasks: predict grade (A-F) using study hours, attendance, and participation.
Model Evaluation Practice.
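As a quick illustration of the regression task, here is a minimal scikit-learn sketch; the file name student_performance.csv is a placeholder, and the column names follow the description above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder file name; columns follow the task description above.
df = pd.read_csv("student_performance.csv")
X = df[["weekly_self_study_hours", "attendance_percentage", "class_participation"]]
y = df["total_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```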
This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
Open Database License (ODbL) (https://choosealicense.com/licenses/odbl/)
Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by Natural Language Processing applied to the data set. Tip: use a pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
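Following the tip above, a minimal loading sketch (the encoding argument is an assumption; copies of this file often ship as latin-1 rather than UTF-8):

```python
import pandas as pd

# Encoding is an assumption: many copies of ner_dataset.csv are latin-1, not UTF-8.
df = pd.read_csv("ner_dataset.csv", encoding="latin-1")
print(df.head())  # expect token and uppercase NER tag label columns
```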
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.
datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.
ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl: A link to the Kaggle profile page of the dataset owner.
ownerUserId: The unique user ID of the dataset owner on Kaggle.
ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName: The name of the dataset creator, which could be different from the owner.
creatorUrl: A link to the Kaggle profile page of the dataset creator.
creatorUserId: The unique user ID of the dataset creator.
scriptCount: The number of scripts (kernels) associated with this dataset.
scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl: The URL of the discussion forum for this dataset, where users can ask questions and share insights.
viewCount: The number of views the dataset page has received on Kaggle.
downloadCount: The number of times the dataset has been downloaded by users.
dateCreated: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated: The date when the dataset was last updated or modified.
voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl: A direct link to download the dataset files.
newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.
usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.
datasetSlug: A URL-friendly version of the dataset name, typically used for URLs.
rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names: A comma-separated string of category names that represent the dataset's classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.
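As a starting point for that kind of analysis, here is a minimal pandas sketch (the csv file name kaggle_datasets_metadata.csv is a placeholder; the columns follow the data dictionary above):

```python
import pandas as pd

# Placeholder file name; columns follow the data dictionary above.
meta = pd.read_csv("kaggle_datasets_metadata.csv")

# Top 3 most-downloaded datasets per license.
top = (meta.sort_values("downloadCount", ascending=False)
           .groupby("licenseName")
           .head(3))
print(top[["ownerName", "licenseName", "downloadCount", "totalVotes"]])
```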
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Kaggle Annotation is a dataset for object detection tasks; it contains Objects annotations for 965 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset can be used to train an Open Book model for Kaggle's LLM Science Exam competition. It was generated by searching and concatenating all publicly shared datasets on Sept 1, 2023.
The context column was generated using Mgoksu's notebook here, with NUM_TITLES=5 and NUM_SENTENCES=20.
The source column indicates where the dataset originated. Below are the sources:
source = 1 & 2: Radek's 6.5k dataset. Discussion here and here, dataset here.
source = 3 & 4: Radek's 15k + 5.9k. Discussion here and here, dataset here.
source = 5 & 6: Radek's 6k + 6k. Discussion here and here, dataset here.
source = 7: Leonid's 1k. Discussion here, dataset here.
source = 8: Gigkpeaeums 3k. Discussion here, dataset here.
source = 9: Anil 3.4k. Discussion here, dataset here.
source = 10, 11, 12: Mgoksu 13k. Discussion here, dataset here.
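A small sketch for slicing rows by provenance; the file name is a placeholder, and only the source column is taken from the description above:

```python
import pandas as pd

df = pd.read_csv("llm_science_exam_openbook.csv")  # placeholder file name
radek_rows = df[df["source"].isin([1, 2])]         # rows from Radek's 6.5k dataset
print(len(radek_rows), "rows from sources 1 & 2")
```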
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset offers insights into how daily technology usage, including social media and screen time, impacts mental health. It captures various behavioral patterns and their correlations with mental health indicators like stress levels, sleep quality, and productivity. Dive in to analyze the relationship between our digital lives and mental wellness!
The data is useful for research, academic projects, or building predictive models to understand trends in mental health influenced by screen time and technology habits.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
The Learning Resources Database is a catalog of interactive tutorials, videos, online classes, finding aids, and other instructional resources on National Library of Medicine (NLM) products and services. Resources may be available for immediate use via a browser or downloadable for use in course management systems.
Dataset Description
It contains 520 rows and 13 variables, as listed below:
- Resource ID: Alphanumeric identifier
- Resource Name: Title of the resource
- Resource URL: Link to the resource
- Description: Brief explanation of the resource
- Archived: Flagged as False for all data points
- Format: Format of the resource, e.g., HTML, PDF, MP4 video, MS Word, PowerPoint
- Type: Type of the resource, e.g., webinar, document, tutorial, slides
- Runtime: Runtime of the resource
- Subject Areas: Topics covered in the resource
- Authoring Organization: Name of the authoring organization
- Intended Audiences: Profile of the intended audience
- Record Modified: Timestamp of the record's last modification
- Resource Revised: Timestamp of the resource's last modification
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Context
Kaggle is one of the largest communities of data scientists and machine learning practitioners in the world, and its platform hosts thousands of datasets covering a wide range of topics and industries. With so many options to choose from, it can be difficult to know where to start or which datasets are worth exploring. That's where this dataset comes in. By scraping information about the top 10,000 datasets on Kaggle, we have created a single source of truth for the most popular and useful datasets on the platform. This dataset is not just a list of names and numbers; it is a valuable tool for data enthusiasts and professionals alike, providing insights into the latest trends and techniques in data science and machine learning.
Column description:
- Dataset_name: Name of the dataset
- Author_name: Name of the author
- Author_id: Kaggle id of the author
- No_of_files: Number of files the author has uploaded
- size: Size of all the files
- Type_of_file: Type of the files, such as csv, json, etc.
- Upvotes: Total upvotes of the dataset
- Medals: Medal of the dataset
- Usability: Usability score of the dataset
- Date: Date on which the dataset was uploaded
- Day: Day on which the dataset was uploaded
- Time: Time at which the dataset was uploaded
- Dataset_link: Kaggle link of the dataset
Acknowledgements
The data has been scraped from the official Kaggle website and is available under a Creative Commons license.
Enjoy & Keep Learning !!!
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. I also tend to write queries faster in SQL Server Management Studio, like 100x faster. So I ported Kaggle's Meta Kaggle dataset into MS SQL Server 2022 database format, created a backup file, and uploaded it here.
Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks), and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done in a Jupyter notebook:
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Note: This is a work in progress, and not all the Kaggle forums are included in this dataset. The remaining forums will be added once I finish solving some issues with the data generators for those forums.
Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping with the Selenium library, and text data is converted to Markdown style using the markdownify package.
This dataset contains information about the discussion's main topic, topic title, comments, votes, medals, and more, and is designed to complement the data available in the Meta Kaggle dataset, specifically for recent discussions. Keep reading for the details.
Because Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping techniques using the Selenium library.
The functions and classes used to scrape the data on Kaggle were stored in a utility script publicly available here. Since JS-generated pages like Kaggle are unstable when scraped, the mentioned script implements capabilities for retrying connections and waiting for elements to appear.
Each forum was scraped using one notebook per forum; those notebooks were then connected to a central notebook that generates this dataset. The discussions are also scraped in parallel to improve speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent to the oldest.
If you need more control over the data you want to research, feel free to import all you need from the utility script mentioned before.
This dataset contains several folders, each named after the discussion forum it contains data about. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder, you'll find two files: a csv file and a json file.
The json file (in Python, represented as a dictionary) is indexed with the ID that Kaggle assigns to the discussion. Each ID is paired with its corresponding discussion, which is represented as a nested dictionary (the discussion dict) containing the following fields:
- title: The title of the main topic.
- content: Content of the main topic.
- tags: List containing the discussion's tags.
- datetime: Date and time at which the discussion was published (in ISO 8601 format).
- votes: Number of votes received by the discussion.
- medal: Medal awarded to the main topic (if any).
- user: User that published the main topic.
- expertise: Publisher's expertise, measured by the Kaggle progression system.
- n_comments: Total number of comments in the current discussion.
- n_appreciation_comments: Total number of appreciation comments in the current discussion.
- comments: Dictionary containing data about the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
  - content: Comment's content.
  - is_appreciation: Whether the comment is an appreciation comment.
  - is_deleted: Whether the comment was deleted.
  - n_replies: Number of replies to the comment.
  - datetime: Date and time at which the comment was published (in ISO 8601 format).
  - votes: Number of votes received by the current comment.
  - medal: Medal awarded to the comment (if any).
  - user: User that published the comment.
  - expertise: Publisher's expertise, measured by the Kaggle progression system.
  - n_deleted: Total number of deleted replies (including self).
  - replies: A dict following this same format.
The csv file, on the other hand, serves as a summary of the json file, with the comment information limited to the hottest and most voted comments.
Note: Only the 'content' field is mandatory for each discussion. The availability of the other fields is subject to the stability of the scraping tasks, which may also affect the update frequency.
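A minimal sketch for walking one forum's json file (the file name inside the folder is a placeholder):

```python
import json

# Placeholder path; each forum folder contains one json and one csv file.
with open("competition-hosting/discussions.json") as f:
    discussions = json.load(f)

for disc_id, disc in discussions.items():
    # Only 'content' is guaranteed; the other fields may be missing.
    title = disc.get("title", "<untitled>")
    votes = disc.get("votes", 0)
    print(f"{disc_id}: {title!r} ({votes} votes, {disc.get('n_comments', 0)} comments)")
```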
This dataset was collected by an edtech startup. The startup teaches entrepreneurial life-skills in an animated, gamified format through its video series to kids in the 6-14 age group. Through its learning management system, the company tracks the progress made by all of its subscribers on the platform. The company records platform content usage activity data and tries to follow up with parents if their child is inactive on the platform. Here's more information about the dataset:
There is some missing data as well. I hope it will be a good dataset for beginners practicing their NLP skills.
Image by Steven Weirather from Pixabay
Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as the problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of code large language models in generating data science programs given natural language instructions. Please read our paper for more details.
Note: This Kaggle dataset only contains the dataset files of Arcade. Refer to our main GitHub repository for detailed instructions to use this dataset.
Below is the structure of its content:
./
├── existing_tasks/                    # Problems derived from existing data science notebooks on Github
│   ├── metadata.json                  # Metadata by `build_existing_tasks_split.py` to reproduce this split.
│   ├── artifacts/                     # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
│   └── derived_datasets/              # Folder for preprocessed datasets used for prompting experiments.
├── new_tasks/
│   ├── dataset.json                   # Original, unpreprocessed dataset
│   ├── kaggle_dataset_provenance.csv  # Metadata of the Kaggle datasets used to build this split.
│   ├── artifacts/                     # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
│   └── derived_datasets/              # Folder for preprocessed datasets used for prompting experiments.
└── checksums.txt                      # Table of MD5 checksums of dataset files.
All the dataset '*.json' files follow the same structure. Each dataset file is a JSON-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:
{
"notebook_name": "Name of the notebook.",
"work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
"annotator": "Anonymized annotator Id."
"turns": [
# A list of natural language to code examples using the current notebook context.
{
"input": "Prompt to a code generation model.",
"turn": {
"intent": {
"value": "Annotated NL intent for the current turn.",
"is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
"cell_idx": "Index of the intent Markdown cell.",
"line_span": "Line span of the intent.",
"not_sure": "Annotation confidence.",
"output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
},
"code": {
"value": "Reference code solution.",
"cell_idx": "Cell index of the code cell containing the solution.",
"num_lines": "Number of lines in the reference solution.",
"line_span": "Line span.",
},
"code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
"delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
"metadata": {
"annotator_id": "Annotator Id",
"num_code_lines": "Metadata, please ignore.",
"utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
},
},
"notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
"metadata": {
# A dict of metadata of this turn.
"context_cells": [ # A list of context cells before the problem.
{
"cell_type": "code|markdown",
"source": "Cell content."
},
],
"delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
# The following fields only occur in datasets inlined with schema descriptions.
"context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
"inten...
This is a cleaned SQL injection dataset, sourced from multiple Kaggle datasets and split into training, validation, and testing subsets with a 6:2:2 ratio. The dataset is intended for use in research focused on detecting SQL injection attacks.
Label: This column represents the label for the SQL injection binary classification, where 1 indicates an SQL injection and 0 indicates a non-SQL injection.
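To reproduce a 6:2:2 split on similar data, a minimal scikit-learn sketch (the file name is a placeholder; the Label column follows the description above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sql_injection.csv")  # placeholder file name
# First carve off 40%, then halve it: 60% train, 20% validation, 20% test.
train, rest = train_test_split(df, test_size=0.4, stratify=df["Label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["Label"], random_state=42)
print(len(train), len(val), len(test))  # approximately a 6:2:2 ratio
```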
This dataset was created for building a chatbot using deep learning and NLP. It can be used as input to train deep learning models, such as neural networks, with NLP techniques to develop a chatbot that understands user conversation patterns and provides appropriate responses.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/)
License information was derived automatically
This dataset contains a collection of 997 featured articles from Wikihow, a collaborative platform that provides how-to guides on a wide range of topics. Each record in the dataset represents one article and includes the following six columns:
- Title: the title of the article.
- Intro: a brief introduction to the article's topic.
- Article Content: the main content of the article, which provides step-by-step instructions on how to do something.
- Co-authors: the number of users who have contributed to the article on Wikihow.
- Updated: the date when the article was last updated on Wikihow.
- Views: the number of views that the article has received on Wikihow.
The data was scraped from Wikihow under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license, which allows for non-commercial use of the data as long as proper attribution is given to the original source.
This dataset can be used for various purposes such as text analysis, natural language processing, and machine learning. Researchers and data analysts can use this dataset to study the characteristics of featured articles on Wikihow, identify patterns or trends in the content or co-authorship, and explore how views and updates correlate with article popularity.